{"title": "Action-Conditional Video Prediction using Deep Networks in Atari Games", "book": "Advances in Neural Information Processing Systems", "page_first": 2863, "page_last": 2871, "abstract": "Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Aracade Learning Environment (ALE), we consider spatio-temporal prediction problems where future (image-)frames are dependent on control variables or actions as well as previous frames. While not composed of natural scenes, frames in Atari games are high-dimensional in size, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. Experimental results show that the proposed architectures are able to generate visually-realistic frames that are also useful for control over approximately 100-step action-conditional futures in some games. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs.", "full_text": "Action-Conditional Video Prediction\nusing Deep Networks in Atari Games\n\nJunhyuk Oh\n\nSatinder Singh\n\nXiaoxiao Guo Honglak Lee\nUniversity of Michigan, Ann Arbor, MI 48109, USA\n\nRichard Lewis\n\n{junhyuk,guoxiao,honglak,rickl,baveja}@umich.edu\n\nAbstract\n\nMotivated by vision-based reinforcement learning (RL) problems, in particular\nAtari games from the recent benchmark Aracade Learning Environment (ALE),\nwe consider spatio-temporal prediction problems where future image-frames de-\npend on control variables or actions as well as previous frames. 
While not composed of natural scenes, frames in Atari games are high-dimensional in size, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. Experimental results show that the proposed architectures are able to generate visually-realistic frames that are also useful for control over approximately 100-step action-conditional futures in some games. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs.

1 Introduction
Over the years, deep learning approaches (see [5, 26] for surveys) have shown great success in many visual perception problems (e.g., [16, 7, 32, 9]). However, modeling videos (building a generative model) is still a very challenging problem because it often involves high-dimensional natural-scene data with complex temporal dynamics. Thus, recent studies have mostly focused on modeling simple video data, such as bouncing balls or small patches, where the next frame is highly predictable given the previous frames [29, 20, 19]. In many applications, however, future frames depend not only on previous frames but also on control or action variables. For example, the first-person view in a vehicle is affected by wheel-steering and acceleration. The camera observation of a robot is similarly dependent on its movement and changes of its camera angle. 
More generally, in vision-based reinforcement learning (RL) problems, learning to predict future images conditioned on actions amounts to learning a model of the dynamics of the agent-environment interaction, an essential component of model-based approaches to RL. In this paper, we focus on Atari games from the Arcade Learning Environment (ALE) [1] as a source of challenging action-conditional video modeling problems. While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional images conditioned by control inputs.
This paper proposes, evaluates, and contrasts two spatio-temporal prediction architectures based on deep networks that incorporate action variables (see Figure 1). Our experimental results show that our architectures are able to generate realistic frames over approximately 100-step action-conditional futures without diverging in some Atari games. 
We show that the representations learned by our architectures 1) approximately capture natural similarity among actions, and 2) discover which objects are directly controlled by the agent's actions and which are only indirectly influenced or not controlled. We evaluated the usefulness of our architectures for control in two ways: 1) by replacing emulator frames with predicted frames in a previously-learned model-free controller (DQN; DeepMind's state of the art Deep-Q-Network for Atari Games [21]), and 2) by using the predicted frames to drive a more informed than random exploration strategy to improve a model-free controller (also DQN).

(a) Feedforward encoding    (b) Recurrent encoding
Figure 1: Proposed Encoding-Transformation-Decoding network architectures.

2 Related Work
Video Prediction using Deep Networks. The problem of video prediction has led to a variety of architectures in deep learning. A recurrent temporal restricted Boltzmann machine (RTRBM) [29] was proposed to learn temporal correlations from sequential data by introducing recurrent connections in RBM. A structured RTRBM (sRTRBM) [20] scaled up RTRBM by learning dependency structures between observations and hidden variables from data. More recently, Michalski et al. [19] proposed a higher-order gated autoencoder that defines multiplicative interactions between consecutive frames and mapping units, and showed that the temporal prediction problem can be viewed as learning and inferring higher-order interactions between consecutive images. Srivastava et al. [28] applied a sequence-to-sequence learning framework [31] to a video domain, and showed that long short-term memory (LSTM) [12] networks are capable of generating video of bouncing handwritten digits. 
In contrast to these previous studies, this paper tackles problems where control variables affect temporal dynamics, and in addition scales up spatio-temporal prediction to larger-size images.
ALE: Combining Deep Learning and RL. Atari 2600 games provide challenging environments for RL because of high-dimensional visual observations, partial observability, and delayed rewards. Approaches that combine deep learning and RL have made significant advances [21, 22, 11]. Specifically, DQN [21] combined Q-learning [36] with a convolutional neural network (CNN) and achieved state-of-the-art performance on many Atari games. Guo et al. [11] used the ALE emulator for making action-conditional predictions with slow UCT [15], a Monte-Carlo tree search method, to generate training data for a fast-acting CNN, which outperformed DQN on several domains. Throughout this paper we will use DQN to refer to the architecture used in [21] (a more recent work [22] used a deeper CNN with more data to produce the currently best-performing Atari game players).
Action-Conditional Predictive Model for RL. The idea of building a predictive model for vision-based RL problems was introduced by Schmidhuber and Huber [27]. They proposed a neural network that predicts the attention region given the previous frame and an attention-guiding action. More recently, Lenz et al. [17] proposed a recurrent neural network with multiplicative interactions that predicts the physical coordinates of a robot. Compared to this previous work, our work is evaluated on much higher-dimensional data with complex dependencies among observations. There have been a few attempts to learn from ALE data a transition-model that makes predictions of future frames. One line of work [3, 4] divides game images into patches and applies a Bayesian framework to predict patch-based observations. 
However, this approach assumes that neighboring patches are enough to predict the center patch, which is not true in Atari games because of many complex interactions. The evaluation in this prior work is 1-step prediction loss; in contrast, here we make and evaluate long-term predictions both for quality of pixels generated and for usefulness to control.
3 Proposed Architectures and Training Method
The goal of our architectures is to learn a function f : x_{1:t}, a_t → x_{t+1}, where x_t and a_t are the frame and action variables at time t, and x_{1:t} are the frames from time 1 to time t. Figure 1 shows our two architectures, each composed of encoding layers that extract spatio-temporal features from the input frames (§3.1), action-conditional transformation layers that transform the encoded features into a prediction of the next frame in high-level feature space by introducing action variables as additional input (§3.2), and finally decoding layers that map the predicted high-level features into pixels (§3.3). Our contributions are in the novel action-conditional deep convolutional architectures for high-dimensional, long-term prediction as well as in the novel use of the architectures in vision-based RL domains.
3.1 Two Variants: Feedforward Encoding and Recurrent Encoding
Feedforward encoding takes a fixed history of previous frames as an input, which is concatenated through channels (Figure 1a), and stacked convolution layers extract spatio-temporal features directly from the concatenated frames. 
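This channel-wise stacking ("early fusion") can be sketched as follows; the history length and frame size match the setup described later in the paper, and everything else is illustrative:

```python
import numpy as np

m, c, h, w = 4, 3, 210, 160   # history length and RGB frame size used in the paper

# m frames, each (c, h, w): stacking along the channel axis lets ordinary
# 2D convolutions see all m frames at once and extract spatio-temporal
# features directly ("early fusion").
frames = [np.zeros((c, h, w)) for _ in range(m)]
stacked = np.concatenate(frames, axis=0)

assert stacked.shape == (m * c, h, w)   # (12, 210, 160)
```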
The encoded feature vector h^enc_t ∈ R^n at time t is:

h^enc_t = CNN(x_{t−m+1:t}),    (1)

where x_{t−m+1:t} ∈ R^{(m×c)×h×w} denotes m frames of h × w pixel images with c color channels. CNN is a mapping from raw pixels to a high-level feature vector using multiple convolution layers and a fully-connected layer at the end, each of which is followed by a non-linearity. This encoding can be viewed as early-fusion [14] (other types of fusions, e.g., late-fusion or 3D convolution [35], can also be applied to this architecture).
Recurrent encoding takes one frame as an input for each time-step and extracts spatio-temporal features using an RNN in which the temporal dynamics is modeled by the recurrent layer on top of the high-level feature vector extracted by convolution layers (Figure 1b). In this paper, an LSTM without peephole connections is used for the recurrent layer as follows:

[h^enc_t, c_t] = LSTM(CNN(x_t), h^enc_{t−1}, c_{t−1}),    (2)

where c_t ∈ R^n is a memory cell that retains information from a deep history of inputs. Intuitively, CNN(x_t) is given as input to the LSTM so that the LSTM captures temporal correlations from high-level spatial features.
3.2 Multiplicative Action-Conditional Transformation
We use multiplicative interactions between the encoded feature vector and the control variables:

h^dec_{t,i} = Σ_{j,l} W_{ijl} h^enc_{t,j} a_{t,l} + b_i,    (3)

where h^enc_t ∈ R^n is an encoded feature, h^dec_t ∈ R^n is an action-transformed feature, a_t ∈ R^a is the action-vector at time t, W ∈ R^{n×n×a} is a 3-way tensor weight, and b ∈ R^n is a bias. When the action a is represented using a one-hot vector, using a 3-way tensor is equivalent to using a different weight matrix for each action. This enables the architecture to model different transformations for different actions. The advantages of multiplicative interactions have been explored in image and text processing [33, 30, 18]. In practice the 3-way tensor is not scalable because of its large number of parameters. Thus, we approximate the tensor by factorizing it into three matrices as follows [33]:

h^dec_t = W^dec (W^enc h^enc_t ⊙ W^a a_t) + b,    (4)

where W^dec ∈ R^{n×f}, W^enc ∈ R^{f×n}, W^a ∈ R^{f×a}, b ∈ R^n, and f is the number of factors. Unlike the 3-way tensor, the above factorization shares the weights between different actions by mapping them to the size-f factors. This sharing may be desirable relative to the 3-way tensor when there are common temporal dynamics in the data across different actions (discussed further in §4.3).
3.3 Convolutional Decoding
It has been recently shown that a CNN is capable of generating an image effectively using upsampling followed by convolution with stride of 1 [8]. Similarly, we use the "inverse" operation of convolution, called deconvolution, which maps a 1 × 1 spatial region of the input to d × d using deconvolution kernels. The effect of s × s upsampling can be achieved without explicitly upsampling the feature map by using a stride of s. 
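The factored transformation of Eq. (4) can be sketched in NumPy. The dimensions, random weights, and the `transform` helper below are illustrative stand-ins for the learned parameters; the sketch also checks that, with a one-hot action, the factorization reduces to a per-action linear map as in Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, a = 64, 64, 18     # illustrative sizes; the paper uses n = f = 2048, up to 18 actions

# Randomly initialized stand-ins for the learned weights of Eq. (4).
# Parameter count is n*f + f*n + f*a instead of the n*n*a of the full tensor.
W_dec = rng.standard_normal((n, f))
W_enc = rng.standard_normal((f, n))
W_a   = rng.standard_normal((f, a))
b     = np.zeros(n)

def transform(h_enc, action):
    """h_dec = W_dec (W_enc h_enc  ⊙  W_a a) + b  -- Eq. (4)."""
    return W_dec @ ((W_enc @ h_enc) * (W_a @ action)) + b

h_enc = rng.standard_normal(n)
action = np.zeros(a)
action[3] = 1.0                     # one-hot action

# For a one-hot action, W_a @ action selects a column of W_a, so the whole
# map equals the per-action matrix W_dec diag(W_a[:, 3]) W_enc -- the
# weight-shared counterpart of one matrix per action in Eq. (3).
per_action_matrix = W_dec @ np.diag(W_a[:, 3]) @ W_enc
assert np.allclose(transform(h_enc, action), per_action_matrix @ h_enc + b)
```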
We found that this operation is more efficient than upsampling followed by convolution because of the smaller number of convolutions with larger stride. In the proposed architecture, the transformed feature vector h^dec is decoded into pixels as follows:

x̂_{t+1} = Deconv(Reshape(h^dec)),    (5)

where Reshape is a fully-connected layer whose hidden units form a 3D feature map, and Deconv consists of multiple deconvolution layers, each of which is followed by a non-linearity except for the last deconvolution layer.
3.4 Curriculum Learning with Multi-Step Prediction
It is almost inevitable for a predictive model to make noisy predictions of high-dimensional images. When the model is trained on a 1-step prediction objective, small prediction errors can compound through time. To alleviate this effect, we use a multi-step prediction objective. More specifically, given the training data D = {(x^(i)_1, a^(i)_1, ..., x^(i)_{T_i}, a^(i)_{T_i})}^N_{i=1}, the model is trained to minimize the average squared error over K-step predictions as follows:

L_K(θ) = (1 / 2K) Σ_i Σ_t Σ^K_{k=1} ‖x̂^(i)_{t+k} − x^(i)_{t+k}‖²,    (6)

where x̂^(i)_{t+k} is a k-step future prediction. Intuitively, the network is repeatedly unrolled through K time steps by using its prediction as an input for the next time-step.
The model is trained in multiple phases based on increasing K, as suggested by Michalski et al. [19]. In other words, the model is trained to predict short-term future frames and fine-tuned to predict longer-term future frames after the previous phase converges. We found that this curriculum learning [6] approach is necessary to stabilize the training. 
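The multi-step objective can be sketched with a stand-in one-step predictor (in the paper, the full encoding-transformation-decoding network plays this role); `predict`, the toy dynamics, and the simplified single-frame history are illustrative assumptions:

```python
import numpy as np

def k_step_loss(predict, frames, actions, K):
    """Squared error over K-step rollouts, Eq. (6) up to overall normalization.

    predict(frame, action) -> next frame; the model is unrolled by feeding
    its own prediction back in as the input for the next step (here the
    history is just the previous frame, for simplicity)."""
    total, count = 0.0, 0
    for t in range(len(frames) - K):
        x = frames[t]
        for k in range(1, K + 1):
            x = predict(x, actions[t + k - 1])            # k-step prediction
            total += 0.5 * np.sum((x - frames[t + k]) ** 2)
            count += 1
    return total / count

# Toy check: dynamics x_{t+1} = x_t + 1; a model that knows the dynamics
# has zero K-step loss, a "copy last frame" model does not.
actions = np.zeros(10)
frames = [np.full((4, 4), float(t)) for t in range(10)]
perfect = lambda x, a: x + 1.0
assert k_step_loss(perfect, frames, actions, K=3) == 0.0
assert k_step_loss(lambda x, a: x, frames, actions, K=3) > 0.0
```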
Stochastic gradient descent with backpropagation through time (BPTT) is used to optimize the parameters of the network.
4 Experiments
In the experiments that follow, we have the following goals for our two architectures: 1) to evaluate the predicted frames in two ways: qualitatively evaluating the generated video, and quantitatively evaluating the pixel-based squared error; 2) to evaluate the usefulness of predicted frames for control in two ways: by replacing the emulator's frames with predicted frames for use by DQN, and by using the predictions to improve exploration in DQN; and 3) to analyze the representations learned by our architectures. We begin by describing the details of the data, model architecture, and baselines.
Data and Preprocessing. We used our replication of DQN to generate game-play video datasets using an ε-greedy policy with ε = 0.3, i.e. DQN is forced to choose a random action with 30% probability. For each game, the dataset consists of about 500,000 training frames and 50,000 test frames with actions chosen by DQN. Following DQN, actions are chosen once every 4 frames, which reduces the video from 60fps to 15fps. The number of actions available in games varies from 3 to 18, and they are represented as one-hot vectors. We used full-resolution RGB images (210 × 160) and preprocessed the images by subtracting mean pixel values and dividing each pixel value by 255.
Network Architecture. Across all game domains, we use the same network architecture as follows. The encoding layers consist of 4 convolution layers and one fully-connected layer with 2048 hidden units. The convolution layers use 64 (8 × 8), 128 (6 × 6), 128 (6 × 6), and 128 (4 × 4) filters with stride of 2. Every layer is followed by a rectified linear function [23]. In the recurrent encoding network, an LSTM layer with 2048 hidden units is added on top of the fully-connected layer. 
The number of factors in the transformation layer is 2048. The decoding layers consist of one fully-connected layer with 11264 (= 128 × 11 × 8) hidden units followed by 4 deconvolution layers. The deconvolution layers use 128 (4 × 4), 128 (6 × 6), 128 (6 × 6), and 3 (8 × 8) filters with stride of 2. For the feedforward encoding network, the last 4 frames are given as an input for each time-step. The recurrent encoding network takes one frame for each time-step, but it is unrolled through the last 11 frames to initialize the LSTM hidden units before making a prediction. Our implementation is based on the Caffe toolbox [13].
Details of Training. We use the curriculum learning scheme above with three phases of increasing prediction-step objectives of 1, 3 and 5 steps, and learning rates of 10^−4, 10^−5, and 10^−5, respectively. RMSProp [34, 10] is used with momentum of 0.9, (squared) gradient momentum of 0.95, and min squared gradient of 0.01. The batch size for each training phase is 32, 8, and 8 for the feedforward encoding network and 4, 4, and 4 for the recurrent encoding network, respectively. When the recurrent encoding network is trained on the 1-step prediction objective, the network is unrolled through 20 steps and predicts the last 10 frames by taking ground-truth images as input. Gradients are clipped at [−0.1, 0.1] before the non-linearity of each LSTM gate, as suggested by [10].
Two Baselines for Comparison. The first baseline is a multi-layer perceptron (MLP) that takes the last frame as input and has 4 hidden layers with 400, 2048, 2048, and 400 units. The action input is concatenated to the second hidden layer. This baseline uses approximately the same number of parameters as the recurrent encoding model. 
The second baseline, no-action feedforward (or naFf), is the same as the feedforward encoding model (Figure 1a) except that the transformation layer consists of one fully-connected layer that does not get the action as input.

Figure 2: Example of predictions over 250 steps in Freeway. The 'Step' and 'Action' columns show the number of prediction steps and the actions taken, respectively. The white boxes indicate the object controlled by the agent. From prediction step 256 to 257 the controlled object crosses the top boundary and reappears at the bottom; this non-linear shift is predicted by our architectures and is not predicted by MLP and naFf. The horizontal movements of the uncontrolled objects are predicted by our architectures and naFf but not by MLP.

(a) Seaquest  (b) Space Invaders  (c) Freeway  (d) QBert  (e) Ms Pacman
Figure 3: Mean squared error over 100-step predictions.

4.1 Evaluation of Predicted Frames
Qualitative Evaluation: Prediction video. The prediction videos of our models and baselines are available in the supplementary material and at the following website: https://sites.google.com/a/umich.edu/junhyuk-oh/action-conditional-video-prediction. As seen in the videos, the proposed models make qualitatively reasonable predictions over 30–500 steps depending on the game. In all games, the MLP baseline quickly diverges, and the naFf baseline fails to predict the controlled object. An example of long-term predictions is illustrated in Figure 2. We observed that both of our models predict complex local translations well, such as the movement of vehicles and the controlled object. They can predict interactions between objects, such as the collision of two objects. Since our architectures effectively extract hierarchical features using CNN, they are able to make a prediction that requires a global context. 
For example, in Figure 2, the model predicts the sudden change of the location of the controlled object (from the top to the bottom) at the 257th step.
However, both of our models have difficulty in accurately predicting small objects, such as bullets in Space Invaders. The reason is that the squared error signal is small when the model fails to predict small objects during training. Another difficulty is in handling stochasticity. In Seaquest, e.g., new objects appear from the left side or right side randomly, and so are hard to predict. Although our models do generate new objects with reasonable shapes and movements (e.g., after appearing they move as in the true frames), the generated frames do not necessarily match the ground-truth.
Quantitative Evaluation: Squared Prediction Error. Mean squared error over 100-step predictions is reported in Figure 3. Our predictive models outperform the two baselines in all domains. However, the gap between our predictive models and the naFf baseline is not large except for Seaquest. This is due to the fact that the object controlled by the action occupies only a small part of the image.

(a) Ms Pacman (28 × 28 cropped)  (b) Space Invaders (90 × 90 cropped)
Figure 4: Comparison between the two encoding models (feedforward and recurrent). (a) The controlled object is moving along a horizontal corridor. As the recurrent encoding model makes a small translation error at the 4th frame, the true position of the object is in the crossroad while the predicted position is still in the corridor. The (true) object then moves upward, which is not possible in the predicted position, and so the predicted object keeps moving right. 
This is less likely to happen in feedforward encoding because its position prediction is more accurate. (b) The objects move down after staying at the same location for the first five steps. The feedforward encoding model fails to predict this movement because it only gets the last four frames as input, while the recurrent model correctly predicts this downward movement.

(a) Seaquest  (b) Space Invaders  (c) Freeway  (d) QBert  (e) Ms Pacman
Figure 5: Game play performance using the predictive model as an emulator. 'Emulator' and 'Rand' correspond to the performance of DQN with true frames and random play, respectively. The x-axis is the number of steps of prediction before re-initialization. The y-axis is the average game score measured from 30 plays.

Qualitative Analysis of Relative Strengths and Weaknesses of Feedforward and Recurrent Encoding. We hypothesize that feedforward encoding can model more precise spatial transformations because its convolutional filters can learn temporal correlations directly from pixels in the concatenated frames. In contrast, convolutional filters in recurrent encoding can learn only spatial features from the one-frame input, and the temporal context has to be captured by the recurrent layer on top of the high-level CNN features without localized information. On the other hand, recurrent encoding is potentially better for modeling arbitrarily long-term dependencies, whereas feedforward encoding is not suitable for long-term dependencies because it requires more memory and parameters as more frames are concatenated into the input.
As evidence, in Figure 4a we show a case where feedforward encoding is better at predicting the precise movement of the controlled object, while recurrent encoding makes a 1-2 pixel translation error. This small error leads to entirely different predicted frames after a few steps. 
Since the feedforward and recurrent architectures are identical except for the encoding part, we conjecture that this result is due to the failure of precise spatio-temporal encoding in recurrent encoding. On the other hand, recurrent encoding is better at predicting when the enemies move in Space Invaders (Figure 4b). This is due to the fact that the enemies move after 9 steps, which is hard for feedforward encoding to predict because it takes only the last four frames as input. We observed similar results showing that feedforward encoding cannot handle long-term dependencies in other games.
4.2 Evaluating the Usefulness of Predictions for Control
Replacing Real Frames with Predicted Frames as Input to DQN. To evaluate how useful the predictions are for playing the games, we implement an evaluation method that uses the predictive model to replace the game emulator. More specifically, a DQN controller that takes the last four frames is first pre-trained using real frames and then used to play the games based on an ε = 0.05-greedy policy where the input frames are generated by our predictive model instead of the game emulator. To evaluate how the depth of predictions influences the quality of control, we re-initialize the predictions using the true last frames after every n steps of prediction for 1 ≤ n ≤ 100. Note that the DQN controller never takes a true frame, just the outputs of our predictive models.
The results are shown in Figure 5. Unsurprisingly, replacing real frames with predicted frames reduces the score. However, in all the games, using the model to repeatedly predict only a few time steps yields a score very close to that of using real frames. For deep predictions, our two architectures produce much better scores than the two baselines, better than would be suggested by the much smaller differences in squared error. The likely cause of this is that our models are better able to predict the movement of the controlled object relative to the baselines, even though such an ability may not always lead to better squared error. In three out of the five games the score remains much better than the score of random play even when using 100 steps of prediction.

Table 1: Average game score of DQN over 100 plays with standard error. The first row and the second row show the performance of our DQN replication with different exploration strategies.

Model                      | Seaquest    | S. Invaders | Freeway    | QBert      | Ms Pacman
DQN - Random exploration   | 13119 (538) | 698 (20)    | 30.9 (0.2) | 3876 (106) | 2281 (53)
DQN - Informed exploration | 13265 (577) | 681 (23)    | 32.2 (0.2) | 8238 (498) | 2522 (57)

(a) Random exploration.  (b) Informed exploration.
Figure 6: Comparison between two exploration methods on Ms Pacman. Each heat map shows the trajectories of the controlled object measured over 2500 steps for the corresponding method.

Figure 7: Cosine similarity between every pair of action factors (see text for details).

Improving DQN via Informed Exploration. To learn control in an RL domain, exploration of actions and states is necessary because without it the agent can get stuck in a bad sub-optimal policy. In DQN, the CNN-based agent was trained using an ε-greedy policy in which the agent chooses either a greedy action or a random action by flipping a coin with probability ε. Such random exploration is a basic strategy that produces sufficient exploration, but can be slower than more informed exploration strategies. 
Thus, we propose an informed exploration strategy that follows the ε-greedy policy, but chooses exploratory actions that lead to a frame that has been visited least often (in the last d time steps), rather than random actions. Implementing this strategy requires a predictive model because the next frame for each possible action has to be considered.
The method works as follows. The most recent d frames are stored in a trajectory memory, denoted D = {x^(i)}^d_{i=1}. The predictive model is used to get the next frame x^(a) for every action a. We estimate the visit-frequency for every predicted frame by summing the similarity between the predicted frame and the d most recent frames stored in the trajectory memory using a Gaussian kernel as follows:

n_D(x^(a)) = Σ^d_{i=1} k(x^(a), x^(i));   k(x, y) = exp(−Σ_j min(max((x_j − y_j)² − δ, 0), 1)/σ),    (7)

where δ is a threshold, and σ is a kernel bandwidth. The trajectory memory size is 200 for QBert and 20 for the other games, δ = 0 for Freeway and 50 for the others, and σ = 100 for all games. For computational efficiency, we trained a new feedforward encoding network on 84 × 84 gray-scaled images as they are used as input for DQN. The details of the network architecture are provided in the supplementary material. Table 1 summarizes the results. The informed exploration improves DQN's performance using our predictive model in three of five games, with the most significant improvement in QBert. Figure 6 shows how the informed exploration strategy improves the initial experience of DQN.
4.3 Analysis of Learned Representations
Similarity among Action Representations. In the factored multiplicative interactions, every action is linearly transformed to f factors (W^a a in Equation 4). In Figure 7 we present the cosine similarity between every pair of action-factors after training in Seaquest. 
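The Figure-7 analysis can be sketched as follows; the random W^a below is an illustrative stand-in for the trained weights (each one-hot action maps to a column of W^a, so the pairwise cosine similarities of the columns are what the figure shows):

```python
import numpy as np

rng = np.random.default_rng(0)
f_dim, num_actions = 64, 18          # illustrative; the paper uses f = 2048

W_a = rng.standard_normal((f_dim, num_actions))   # stand-in for learned W^a

# W_a @ a for a one-hot action a is a column of W_a; compare all columns.
norms = np.linalg.norm(W_a, axis=0)
cos = (W_a.T @ W_a) / np.outer(norms, norms)      # (num_actions, num_actions)

assert cos.shape == (num_actions, num_actions)
assert np.allclose(np.diag(cos), 1.0)             # each action matches itself
```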
'N' and 'F' correspond to 'no-operation' and 'fire'. Arrows correspond to movements with (black) or without (white) 'fire'. There are positive correlations between actions that have the same movement directions (e.g., 'up' and 'up+fire'), and negative correlations between actions that have opposing directions. These results are reasonable and discovered automatically in learning good predictions.
Distinguishing Controlled and Uncontrolled Objects is itself a hard and interesting problem. Bellemare et al. [2] proposed a framework to learn contingent regions of an image affected by agent action, suggesting that contingency awareness is useful for model-free agents. We show that our architectures implicitly learn contingent regions as they learn to predict the entire image.
In our architectures, a factor (f_i = (W^a_{i,:})⊤ a) with higher variance measured over all possible actions, Var(f_i) = E_a[(f_i − E_a[f_i])²], is more likely to transform an image differently depending on actions, and so we assume such factors are responsible for transforming the parts of the image related to actions. We therefore collected the high-variance (referred to as "highvar") factors from the model trained on Seaquest (around 40% of the factors), and collected the remaining factors into a low-variance ("lowvar") subset. Given an image and an action, we did two controlled forward propagations: giving only highvar factors (by setting the other factors to zeros) and vice versa. 
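The factor split can be sketched as follows; the random W^a, the sizes, and the 40% cutoff used as a selection rule are illustrative assumptions (the paper reports that roughly 40% of factors ended up high-variance in Seaquest, not that a fixed quantile was used):

```python
import numpy as np

rng = np.random.default_rng(1)
f_dim, num_actions = 64, 18              # illustrative; the paper uses f = 2048
W_a = rng.standard_normal((f_dim, num_actions))  # stand-in for learned W^a

# For one-hot actions, f = W_a @ a picks a column of W_a, so the variance
# of factor i over all actions, Var_a(f_i) = E_a[(f_i - E_a[f_i])^2],
# is just the variance of row i of W_a.
var = W_a.var(axis=1)

# "highvar" subset: here the top 40% by variance (hypothetical cutoff).
k = int(0.4 * f_dim)
highvar = np.argsort(var)[::-1][:k]
mask = np.zeros(f_dim)
mask[highvar] = 1.0

# A controlled forward pass zeroes the lowvar factors before decoding,
# e.g. h_dec = W_dec @ ((W_enc @ h_enc) * (W_a @ a) * mask) + b,
# and vice versa with (1 - mask) for the "Non-Action" prediction.
assert int(mask.sum()) == k
```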
The results are visualized as 'Action' and 'Non-Action' in Figure 8. Interestingly, given only highvar-factors (Action), the model predicts sharply the movement of the object controlled by actions, while the other parts are mean pixel values. In contrast, given only lowvar-factors (Non-Action), the model predicts the movement of the other objects and the background (e.g., oxygen), and the controlled object stays at its previous location. This result implies that our model learns to distinguish between controlled objects and uncontrolled objects and transform them using disentangled representations (see [25, 24, 37] for related work on disentangling factors of variation).

Figure 8: Distinguishing controlled and uncontrolled objects. Action image shows a prediction given only learned action-factors with high variance; Non-Action image given only low-variance factors.

5 Conclusion
This paper introduced two different novel deep architectures that predict future frames that are dependent on actions and showed qualitatively and quantitatively that they are able to predict visually-realistic and useful-for-control frames over 100-step futures on several Atari game domains. To our knowledge, this is the first paper to show good deep predictions in Atari games. Since our architectures were domain independent we expect that they will generalize to many vision-based RL problems. In future work we will learn models that predict future reward in addition to predicting future frames and evaluate the performance of our architectures in model-based RL.
Acknowledgments. This work was supported by NSF grant IIS-1526059, Bosch Research, and ONR grant N00014-13-1-0762. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.
References
[1] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[2] M. G. Bellemare, J. Veness, and M. Bowling. Investigating contingency awareness using Atari 2600 games. In AAAI, 2012.
[3] M. G. Bellemare, J. Veness, and M. Bowling. Bayesian learning of recursively factored environments. In ICML, 2013.
[4] M. G. Bellemare, J. Veness, and E. Talvitie. Skip context tree switching. In ICML, 2014.
[5] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[6] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[7] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
[8] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[10] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[11] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, 2014.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
[14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei.
Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[15] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML, 2006.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[17] I. Lenz, R. Knepper, and A. Saxena. DeepMPC: Learning deep latent features for model predictive control. In RSS, 2015.
[18] R. Memisevic. Learning to relate images. IEEE TPAMI, 35(8):1829–1846, 2013.
[19] V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent grammar cells. In NIPS, 2014.
[20] R. Mittelman, B. Kuipers, S. Savarese, and H. Lee. Structured recurrent temporal restricted Boltzmann machines. In ICML, 2014.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[23] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[24] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.
[25] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial expression recognition. In ECCV, 2012.
[26] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[27] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2:125–134, 1991.
[28] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[29] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In NIPS, 2009.
[30] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[31] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[33] G. W. Taylor and G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.
[34] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. Coursera, 2012.
[35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[36] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[37] J. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.