Paper ID: 1631
Title: Action-Conditional Video Prediction using Deep Networks in Atari Games
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
The paper addresses the problem of learning a model of Atari 2600 games (a popular testbed for reinforcement learning algorithms), in other words predicting future frames conditioned on action input.

This is a challenging problem and its solution is a useful tool to build better controllers.

The paper is clear and well-structured, and has convincing experiments (and videos).

The model is a CNN (with a fully-connected layer) followed by multiplicative interactions with an action vector, followed by convolution decoding layers. The recurrent version has an LSTM layer added after the CNN.

The authors evaluate their models both on pixel accuracy (traditional way of evaluating such models) and on usefulness for control (what we really care about).

It would be desirable to include more experimental details about 1) the network architecture (especially the deconvolution part) and 2) the network training procedure. Ideally, code would be made available, but more details in the main text or supplementary would also be fine.

Some comments:

- It is a bit unfortunate that the authors did not try to predict rewards in addition to the next frames. That would have opened the door to using the model for planning (e.g., using UCT), instead of using a trained model-free controller to test the usefulness for control, which is a bit harder to interpret.

- Baselines against which the models are compared are a bit weak, but this is fair enough since there are no obvious candidates to compare against (afaik).

- About the exploration section: is the predictive model learned online to help with exploration? Or is it learned first using data from a regular DQN (uninformed exploration), and then used to direct the exploration of a new controller? If it's the latter, then it's not clear what this achieves, since exploration has already been done to obtain the model. In any case, it is still a bit surprising that this helps in some games.

- The controlled vs uncontrolled dynamics section at the end is interesting.

- The authors might want to take a look at this relevant recent work "DeepMPC: Learning Deep Latent Features for Model Predictive Control" on learning deep predictive models (using also multiplicative interactions with the actions) for control, although this isn't in the visual domain.

Minor things/typos:

- "In Seaquest, new objects appear from the left side or right side randomly, and these are hard to predict." I'm not sure this is completely true, but it certainly looks random.

- line 430: predicing

[Updated score after rebuttal. Other recent papers which learn deep dynamical models from images, though not for Atari games: "From Pixels to Torques: Policy Learning with Deep Dynamical Models" and "Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images".]
Q2: Please summarize your review in 1-2 sentences
A neat paper on learning the dynamics of Atari Games from data. The paper is well-written and has some convincing experiments.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
This paper presents a system for action conditional prediction of video frames using deep convolutional and recurrent neural networks. The authors propose two deep models for encoding the video data using feedforward CNNs or a recurrent neural net, transforming the frame based on the applied action and decoding them back using up-convolution. The system is trained on four video games using the atari emulator and tested for video prediction and efficacy for control using the DQN algorithm.

Quality & Originality:
------------------------
The paper is sound and logical. The proposed deep architectures are similar to previous work, with the added action-dependent transformation. The main novelty comes from the new multiplicative formulation of this transformation, as compared to an additive fully connected layer. Most other parts, including the training, have been proposed in the literature. The system does achieve good results, being able to predict multiple frames into the future.

The experiments compare and contrast the two variants against two other naive baselines which do not consider the actions. A more informative baseline would have been the additive fully connected action transformation layer as this takes into account the action's effect. The authors should try to compare against this.

Other experiments confirm the efficacy of the system for control, based on the DQN algorithm. The informed sampling for the DQN also improves on state-of-the-art results for the DQN, but that is to be expected given a decent enough generative model of the emulator. A good sanity check there would be to use the emulator as the generative model and compare the scores w.r.t. the results from the learned model. Additionally, the actions separate well across the learned representations.

Clarity:
---------
The paper is well written and clear. The figures need to be improved (especially Fig 5.a, where it is hard to see any difference between the two methods), and caption text should be added below the figures to make it easier for the reader to gain context.

Significance:
---------------
The paper proposes a new method for learning representations for video prediction conditioned on the actions. The paper builds on the use of deep auto-encoders for learning representations for control and is relevant to the NIPS community. There has been some prior work on action-based image data prediction in robotics that the authors may wish to cite: Boots et al., Learning Predictive Models of a Depth Camera & Manipulator from Raw Execution Traces, ICRA 2014. This paper also does action-dependent prediction of Kinect data (640 x 480), but over much shorter timeframes and simpler environments.

Overall, the paper is concise and the ideas are well presented. I feel that the work is a bit incremental, without a significant theoretical contribution. Adding more experimental results with better baselines should improve the paper as a whole. I vote for a borderline accept.

Q2: Please summarize your review in 1-2 sentences
This paper presents two deep-architectures for predicting sequences of video frames of the Atari emulator conditioned on the agent's actions and previous frames. The system is tested on a few Atari games and is used with the DQN algorithm to evaluate its efficacy for control. The paper is well written with a few experimental results, but comes across as incremental work. I vote for a borderline accept.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
The paper is easy to read, and its ideas and message are easy to understand in broad strokes. However, I found that the authors sometimes (maybe to improve readability) skip certain details. It is not obvious how important these details are, but their absence makes a quick reproduction of the results hard.

I will point out a few things that I would like the authors to be more clear about:

* Specify the model in more detail! It is not clear how the decoder is symmetric to the encoder. How do you upsample as you go down?

* What learning procedure was used to train the generative model (SGD, SGD+momentum, RMSprop, Adam, Adadelta, Adagrad, to name just a few of the more popular approaches)?

* For the LSTM, was there any clipping used? (Most LSTM models out there rely on gradient clipping for stability.)

* What learning rate, momentum, etc. were used? Were these chosen based on intuition, or did the authors run a grid search (or random sampling of the hyper-parameters)?

* How exactly is the data constructed? Was there any pruning used to make sure that you get enough frames from each stage of a game (the image statistics early on are potentially very different from later, e.g. in Pacman, and might require balancing the dataset)? What is the average episode length (compared to the average length of the movie you can generate)?

* Is the curriculum learning (going from 1 to 3 to 5 frames into the future) necessary? Does it result in more stable generation? Was this validated?

* The authors claim that the action should have a multiplicative interaction, which I think is a reasonable assumption. Has this fact been verified experimentally, though? Given that the task is novel, it is hard to judge how important these bits are.
Q2: Please summarize your review in 1-2 sentences
The paper is well written, though it occasionally lacks details of the exact experimental procedure, which most probably makes the results irreproducible. The results are impressive, and the authors make some interesting observations and analysis about the behavior of their model.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
The authors present a system for video frame prediction in ATARI games taking into account the game actions. The main novelty is a multiplicative transformation layer that selects weights based on a given action vector. To keep the system scalable, a factorization of the weight tensor is used. Encoding and decoding of images is done by a well-established convolutional network architecture.
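
As an illustration of the factorization mentioned above, the following is a minimal sketch, not the authors' code. It assumes the common factored three-way form, in which the encoded features and the action are each projected into a shared factor space, multiplied element-wise, and projected back. All names (`W_enc`, `W_a`, `W_dec`, `b`) and sizes are hypothetical. The check at the end confirms that the factored layer equals a full three-way weight tensor applied to the same inputs.

```python
import numpy as np

def factored_action_transform(h_enc, action, W_enc, W_a, W_dec, b):
    """Action-conditional transformation via a factored three-way interaction.

    h_enc:  (n,) encoded frame features
    action: (a,) one-hot action vector
    W_enc:  (f, n), W_a: (f, a), W_dec: (m, f), b: (m,)
    """
    # Element-wise product of the two projections in factor space.
    factors = (W_enc @ h_enc) * (W_a @ action)
    return W_dec @ factors + b

def full_tensor_transform(h_enc, action, W, b):
    """Unfactored equivalent, with a weight tensor W of shape (m, n, a)."""
    return np.einsum("ijl,j,l->i", W, h_enc, action) + b

rng = np.random.default_rng(0)
n, a, f, m = 6, 3, 4, 5          # hypothetical sizes, chosen only for the demo
W_enc = rng.normal(size=(f, n))
W_a = rng.normal(size=(f, a))
W_dec = rng.normal(size=(m, f))
b = rng.normal(size=m)
h = rng.normal(size=n)
act = np.eye(a)[1]               # one-hot vector for action index 1

# The factored form implicitly defines
# W[i, j, l] = sum_f W_dec[i, f] * W_enc[f, j] * W_a[f, l].
W_full = np.einsum("if,fj,fl->ijl", W_dec, W_enc, W_a)
out_factored = factored_action_transform(h, act, W_enc, W_a, W_dec, b)
out_full = full_tensor_transform(h, act, W_full, b)
```

With these shapes the full tensor needs m*n*a weights, while the factored form needs f*(m + n + a), which is the scalability argument in a nutshell.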

They propose two different architectures: a feed-forward network that gets a fixed number of previous frames as input and a recurrent network that gets only one frame but has an LSTM layer before the transformation layer. (I think at least once in the paper they should say what's an LSTM and what the abbreviation stands for.)

The qualitative assessment of the generated frames as shown by the supplementary videos is convincing. A quantitative evaluation of mean squared pixel error is given in a clear way but it is not surprising that their system beats a linear and a nonlinear predictor that don't take into account the game actions.

Another experiment, in which the generated frames are used instead of the real frames as input for a DQN agent, also shows results as expected. The performance is worse than for real input frames, better than random play, and their system beats the no-action predictors.

An interesting application of this system is to use it during training of an agent to improve exploration. They show that if a DQN agent takes an action that leads to a predicted frame that has least similarity with previously seen frames (rather than a random action), the final performance is significantly improved on some games.
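
The exploration rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: similarity is measured with a Gaussian kernel over squared pixel distance, `predict` stands in for the learned model, and `novelty_seeking_action`, `sigma`, and the toy dynamics are all hypothetical.

```python
import numpy as np

def novelty_seeking_action(frame, actions, predict, recent_frames, sigma=1.0):
    """Return the action whose predicted next frame is least similar to recent frames."""
    def visit_score(pred):
        # Sum of Gaussian-kernel similarities between the prediction and each stored frame.
        sq_dists = [np.sum((pred - f) ** 2) for f in recent_frames]
        return sum(np.exp(-d / (2 * sigma ** 2)) for d in sq_dists)

    scores = [visit_score(predict(frame, a)) for a in actions]
    return actions[int(np.argmin(scores))]

# Toy stand-in for the learned predictive model: the action shifts the "frame".
def toy_predict(frame, action):
    return frame + action

current = np.zeros(2)
recent = [np.array([1.0, 1.0]), np.array([0.9, 1.1])]  # frames near +1 were already visited
chosen = novelty_seeking_action(current, [-1.0, 0.0, 1.0], toy_predict, recent)
```

In this toy setup the agent moves away from the already visited region. In a DQN agent, such a choice would replace the uniformly random action taken during exploration steps.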

Finally it is shown that the system can be used to automatically analyze some of the game dynamics. Game actions that have similar effects can be identified from similarities in the transformation weight matrix. Furthermore the system estimates which pixels of the image are controlled directly by actions and which pixels are uncontrollable game dynamics.

The authors write that "To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs." This may be true, because to my knowledge previous such systems did not have LSTM. I know, however, that there is an old paper by Schmidhuber and Huber from 1991 (International Journal of Neural Systems) where a similar neural predictor learns to predict the next visual input frame of a fovea, given previous input and action. The predictor is embedded in a reinforcement learning system that learns sequences of saccades that lead to desired visual objects defined in a separate goal input. Schmidhuber also had papers at IJCNN 1990 and NIPS 1991 where both the predictor and the action generator were recurrent (but no LSTM yet). Nevertheless, it would be good if the authors could point out what's different in their system, which for example also uses a different RL method.

The results are clear, the paper is well structured and suggested for acceptance after minor revisions as indicated above.

Q2: Please summarize your review in 1-2 sentences
A novel architecture for video frame prediction in ATARI games based on game actions is presented. There are promising results on improved exploration during agent training, as well as a simple analysis of game dynamics.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
This paper is nice to read.

The idea is clear, the implementation looks sound, and the results are good.

I also appreciate that the authors take a moment to highlight limitations (Section 4.1), and that they give both quantitative and qualitative analyses, as well as suggesting hypotheses.

Clearly a lot of work went into this paper, and the analysis seems good and fair.

The authors do not just consider how to do something (generative model), but also why (exploration), and then test this.

The main drawback is that the results on usefulness are a bit inconclusive, and the ultimate general impact of this line of work is a bit unclear.
Q2: Please summarize your review in 1-2 sentences
This is a nice empirical paper about using generative models to hallucinate futures in Atari games. The (empirical, quantitative and qualitative) analyses of the results are nice and comprehensive.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thank you for all your comments. We will revise the paper accordingly.

R1, R5: Regarding details of training and architecture:
To make the results reproducible, we will include more details of training and release our code.

R1, R2, R3: Regarding baselines:
We trained an MLP with 2 hidden layers where actions are concatenated to the first hidden layer, which uses a similar number of parameters as our models. This model performs significantly worse than ours and cannot handle different actions. We believe that this serves as a stronger baseline and thus provides insight into the effectiveness of our architecture.

R2, R5: Regarding multiplicative transformation:
We confirmed empirically that our model consistently performs better than a model that is the same except for actions concatenated into the fully-connected layer of the encoder. When the decoding layer is simply a fully-connected layer, the concatenation model performs significantly worse than our model and fails to differentiate different actions, whereas our multiplicative model handles different actions well. This implies that multiplicative interaction is better at conditioning the transformation, while concatenation requires additional highly non-linear layers to successfully condition the transformation on action variables.

Regarding planning:
We agree that building a reward prediction model for planning is a natural future direction and we have discussed this in the conclusion.

Regarding online training:
As described in Sec. 4, our predictive model is trained offline. We chose offline training to carefully evaluate and compare the two proposed architectures by constructing a fixed set of training/testing data. However, our model is not limited to offline training; it is possible to train our model jointly with DQN by sharing the replay memory. We agree that online training is an interesting next step, but it is beyond the scope of this paper. Instead, we focused on showing that our method can achieve asymptotically higher game scores on some game domains.

We will cite the DeepMPC paper. This work is contemporary and proposes a multiplicative interaction with actions. However, the main difference is that this paper deals with 3-dimensional observations, while our model is dealing with far higher dimensional observations.

Regarding the comments on novelty:
To our knowledge, there have been no deep convolutional architectures that take "actions" into account for high-res video modelling, as suggested by R1 and R3. We also showed a novel application of deep networks in vision-based RL domains.

Regarding informed exploration using the emulator:
We agree that the use of the game emulator itself for informed exploration is a useful baseline. We in fact already implemented such a baseline and included the results in the supp. materials (Table 1). To sum up, exploration using our predictive model achieves almost the same performance as using the emulator except for Ms Pacman. We will make this comparison more salient in the text.

We will cite the paper by Boots et al. as related work. The main difference is that our model is based on deep neural networks, whereas their work is based on a non-parametric variant of predictive state representations. Furthermore, our model deals with more complex spatio-temporal data that involves many local transformations/interactions as well as long-term dependencies.

Thanks for pointing out the work by Schmidhuber and Huber, which we will cite as related work. In addition to the differences you mentioned, our work tackles the "global" video prediction problem (instead of predicting patches in the foveated regions) with deep convolutional/recurrent network architectures and a new way to combine actions (multiplicative interactions). While their model is embedded into the controller, our model is used for better exploration in an RL context.

As stated by R3, our main contribution is to propose a new deep architecture using control variables (with multiplicative interaction) for long-term video prediction. To our knowledge, our paper is the first work that shows very long-term predictions (up to several hundred frames) in high-dimensional videos. In addition, we propose a novel application of deep networks in an RL domain, which improves the state-of-the-art controller. Our paper shows detailed qualitative (video) and quantitative (squared loss and game scores) results as R3 and R6 mentioned.

Regarding incremental (curriculum) training:
We confirmed that the incremental training stabilizes learning. Our models (especially the recurrent model) perform much worse in terms of squared loss if the model is trained on the 5-step prediction objective from scratch.
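
To make the multi-step objective concrete, here is a small sketch of a k-step squared loss in which each prediction is fed back as the next input, evaluated for the horizons 1, 3 and 5 discussed in the reviews. It is an illustration only: `k_step_loss`, `SCHEDULE`, and the toy predictor are hypothetical, and no parameter updates are shown.

```python
import numpy as np

SCHEDULE = (1, 3, 5)  # prediction horizons, increased over the course of training

def k_step_loss(predict, frames, actions, k):
    """Mean squared error over a k-step rollout that feeds predictions back in."""
    x = frames[0]
    total = 0.0
    for t in range(k):
        x = predict(x, actions[t])                  # the prediction becomes the next input
        total += np.mean((x - frames[t + 1]) ** 2)  # compare with the true future frame
    return total / k

# Toy predictor with a small constant bias, so errors accumulate over the rollout.
def toy_predict(frame, action):
    return frame + action + 0.1

frames = [np.full(2, float(t)) for t in range(6)]   # true dynamics: x_{t+1} = x_t + 1
actions = [1.0] * 5
losses = {k: k_step_loss(toy_predict, frames, actions, k) for k in SCHEDULE}
```

Because errors compound along the rollout, the loss grows with the horizon in this toy case, which gives some intuition for why starting directly from the longest horizon can be a harder optimization problem.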

We agree that the predictive models in a control setting are just a beginning, but our work can be potentially useful for improving short-term or long-term planning in RL.