{"title": "Shaping Belief States with Generative Environment Models for RL", "book": "Advances in Neural Information Processing Systems", "page_first": 13475, "page_last": 13487, "abstract": "When agents interact with a complex environment, they must form and maintain beliefs about the relevant aspects of that environment. We propose a way to efficiently train expressive generative models in complex environments. We show that a predictive algorithm with an expressive generative model can form stable belief-states in visually rich and dynamic 3D environments. More precisely, we show that the learned representation captures the layout of the environment as well as the position and orientation of the agent. Our experiments show that the model substantially improves data-efficiency on a number of reinforcement learning (RL) tasks compared with strong model-free baseline agents. We find that predicting multiple steps into the future (overshooting), in combination with an expressive generative model, is critical for stable representations to emerge. In practice, using expressive generative models in RL is computationally expensive and we propose a scheme to reduce this computational burden, allowing us to build agents that are competitive with model-free baselines.", "full_text": "Shaping Belief States with Generative Environment\n\nModels for RL\n\nKarol Gregor\n\nDanilo Jimenez Rezende\n\nFrederic Besse\n\nYan Wu\n\nHamza Merzic\n\nA\u00e4ron van den Oord\n\n{karolg, danilor, fbesse, yanwu, hamzamerzic, avdnoord}@google.com\n\nGoogle DeepMind\n\nLondon, UK\n\nAbstract\n\nWhen agents interact with a complex environment, they must form and maintain\nbeliefs about the relevant aspects of that environment. We propose a way to\nef\ufb01ciently train expressive generative models in complex environments. We show\nthat a predictive algorithm with an expressive generative model can form stable\nbelief-states in visually rich and dynamic 3D environments. More precisely, we\nshow that the learned representation captures the layout of the environment as well\nas the position and orientation of the agent. Our experiments show that the model\nsubstantially improves data-ef\ufb01ciency on a number of reinforcement learning (RL)\ntasks compared with strong model-free baseline agents. We \ufb01nd that predicting\nmultiple steps into the future (overshooting), in combination with an expressive\ngenerative model, is critical for stable representations to emerge. In practice, using\nexpressive generative models in RL is computationally expensive and we propose\na scheme to reduce this computational burden, allowing us to build agents that are\ncompetitive with model-free baselines.\n\n1\n\nIntroduction\n\nWe are interested in making agents that can solve a wide range of tasks in complex and dynamic\nenvironments. While tasks may be vastly different from each other, there is a large amount of\nstructure in the world that can be captured and used by the agents in a task-independent manner.\nThis observation is consistent with the view that such general agents must understand the world\naround them [1]. The collection of algorithms that learn representations by exploiting structure in\nthe data that are general enough to support a wide range of downstream tasks is what we refer to as\nunsupervised learning or self-supervised learning. We hypothesize that an ideal unsupervised learning\nalgorithm should use past observations to create a stable representation of the environment. 
That\nis, a representation that captures the global factors of variation of the environment in a temporally\ncoherent way. As an example, consider an agent navigating in a complex landscape. At any given\ntime, only a small part of the environment is observable from the the perspective of the agent. The\nframes that this agent observes can vary signi\ufb01cantly over time, even though the global structure of\nthe environment is relatively static with only a few moving objects. An useful representation of such\nan environment would contain, for example, a map describing the overall layout of the terrain. Our\ngoal is to learn such representations in a general manner.\nPredictive models have long been hypothesized as a general mechanism to produce useful represen-\ntations based on which an agent can perform a wide variety of tasks in partially observed worlds\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f[2, 3]. A formal way of describing agents in partially observed environments is through the notion\nof partially observable Markov decision processes [4, 5] (POMDPs). A key concept in POMDPs is\nthe notion of a belief-state, which can be de\ufb01ned as the suf\ufb01cient statistics of future states [6, 7, 8].\nIn this paper we refer to belief-states as any vector representation that is suf\ufb01cient to predict future\nobservations as in [9, 10].\nA fundamental problem in building useful environment models, which we want to address in this\nwork, is long-term consistency [11, 12]. This problem is characterized by many models\u2019 failure to\nperform coherent long-term predictions, while performing accurate short-term predictions, even in\ntrivial but partially observed environments [11, 13, 14].\nWe argue that this is not merely a model capacity issue. As previous work has shown, globally\ncoherent environment maps can be learned by position conditioned models [15]. We thus propose\nthat this problem is better diagnosed as the failure in model conditioning and a weak objective, which\nwe discuss in more details in Section 2.1.\nThe main contributions of this paper are: 1) We demonstrate that an expressive generative model of\nrich 3D environments can be learned from merely \ufb01rst-person views and actions to capture long-term\nconsistency; 2) we provide an analysis of three different belief-state architectures (LSTM [16],\nLSTM + Kanerva Memory [17] and LSTM + slot-based Memory [18]) on the ability to decode\nthe environment\u2019s map and the agent\u2019s location. 3) we design and perform experiments to test\nthe effects of both overshooting and memory, demonstrating that generative models bene\ufb01t from\nthese components more than deterministic models; 4) we show that training agents that share their\nbelief-state with the generative model have substantially increased data-ef\ufb01ciency compared to a\nstrong model-free baseline [19, 18], without signi\ufb01cantly affecting the training speed; 5) we show\none of the \ufb01rst agents that learns to collect construction materials in order to build complex structures\nfrom a \ufb01rst-person perspective in a 3d environment.\nThe remainder of this paper is organized as follows: in Section 2 we introduce the main components\nof our agent\u2019s architecture and discuss some key challenges in using expressive generative models in\nRL. 
Namely, the problem of conditioning (Section 2.1) and why next-step models are not sufficient (Section 2.2). In Section 3 we discuss related research. Finally, we describe our experiments in Section 4.

2 Methods

In this section we describe our proposed agent and model architectures. Our agent has two main components. The first is a recurrent neural network (RNN), as in [19], which observes frames x_t, processes them through a feed-forward network and aggregates the resulting outputs by updating its recurrent state b_t. From this updated state, the agent core then outputs the action logits, a sampled action and the value-function baseline. The second component is the unsupervised model, which can be: (i) a contrastive loss based on action-conditional CPC [20]; (ii) a deterministic predictive model (similar to [11]); or (iii) an expressive generative predictive model based on ConvDRAW [21]. We also investigate different memory architectures in the form of slot-based memory (as used in the reconstructive memory agent, RMA) [18] and compressive memory (Kanerva) [17]. The unsupervised model consists of a recurrent network, which we refer to as the simulation network or SimCore, that starts from a given belief state s^0_t = b_t at some random time t and then simulates deterministically forward, seeing only the actions of the agent. After simulating for k steps, we use the resulting state s^k_t to predict the distribution of frames p(x_{t+k} | b_t, a_{t...(t+k)}) = p(x_{t+k} | s^k_t) (in cases (ii) and (iii)). A diagram illustrating this agent is provided in Figure 1 and the precise computation steps are provided in Table 1. A concrete example of the computation of the model's loss is provided as pseudo-code in Appendix K.

Evaluating the loss of an expressive generative model for an entire sequence is computationally expensive. We address this by computing the model's loss only at a small random subset of the frames in the sequence, as shown in Table 1.

Figure 1: Diagram of the agent and model. The agent receives observations x from the environment, processes them through a feed-forward residual network (green) and forms a state using a recurrent network (blue), online. This state is a belief state and is used to calculate the policy and value, as well as being the starting point for predictions of the future. These are made using a second recurrent network (orange), the simulation network (SimCore), which simulates into the future seeing only the actions. The simulated state is used as conditioning for a generative model (red) of a future frame.

Agent Core
  Belief State Update:             b_t = RNN_agent(b_{t-1}, a_{t-1}, x_t)
  Value and Policy Logits:         V_t, o_t = f(b_t)
  Action:                          a_t ~ Cat(o_t)
SimCore
  Simulation State Initialization: s^0_t = b_t
  Simulation State Update:         s^{k+1}_t = RNN_simcore(s^k_t, a_{t+k})
  Simulation Starting Times:       t_i ~ unif(1, L_u)
  Likelihood Evaluation Times:     i_k ~ unif(1, min(L_u - t_i, L_o))
  Predictive Loss:                 L = \sum_{i=1}^{N_t} \sum_{k=1}^{N_g} \log p(x_{t_i + i_k} | s^{i_k}_{t_i})

Table 1: Agent and simulation core definitions. The agent's RNN state b_t is what we call the belief state. At the start of a simulation from time t_i, the SimCore's RNN state s^0_{t_i} is initialized to be equal to the belief state, s^0_{t_i} = b_{t_i}. The states of the SimCore RNN s^{i_k}_{t_i} are used at times i_k to condition the model of the frames at times t_i + i_k. Here L_u is the total length of an episode, N_g (typically 2) is the number of points in the future used to evaluate the predictive loss, N_t (typically 6) is the number of random points along the trajectory where we unroll the predictive model, and L_o is the overshoot length (typically 4-32), which is the maximum time-length used to train the predictive model. We choose N_g and N_t small compared to L_o to maintain a low computational cost of the model's loss.
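As a complement to the pseudo-code of Appendix K (not reproduced here), the following is a minimal sketch of the Table 1 loss in PyTorch. The GRU cell and the mean-squared-error frame decoder are simplifications standing in for the paper's SimCore RNN and conditional ConvDRAW frame model, and all names, sizes and defaults are illustrative assumptions rather than the implementation used in the experiments.

import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimCoreLoss(nn.Module):
    """Sketch of the Table 1 predictive loss: roll a simulation RNN forward from a
    belief state, conditioning only on actions, and score a few randomly chosen
    future frames under a conditional frame model."""

    def __init__(self, belief_dim, action_dim, frame_dim, n_starts=6, n_evals=2, overshoot=16):
        super().__init__()
        self.sim_rnn = nn.GRUCell(action_dim, belief_dim)  # stand-in for the SimCore RNN
        self.decoder = nn.Linear(belief_dim, frame_dim)    # stand-in for ConvDRAW: predicts a frame mean
        self.n_starts, self.n_evals, self.overshoot = n_starts, n_evals, overshoot

    def forward(self, beliefs, actions, frames):
        # beliefs: [T, belief_dim] agent belief states b_t for one episode (T >= 2)
        # actions: [T, action_dim] one-hot actions a_t; frames: [T, frame_dim] observations x_t
        T = beliefs.shape[0]
        loss = 0.0
        for _ in range(self.n_starts):                     # N_t random starting points t_i
            t = random.randrange(0, T - 1)
            horizon = min(self.overshoot, T - 1 - t)       # never overshoot past the episode end
            evals = set(random.sample(range(1, horizon + 1), min(self.n_evals, horizon)))
            s = beliefs[t]                                 # s^0_t = b_t
            for k in range(1, horizon + 1):
                s = self.sim_rnn(actions[t + k - 1].unsqueeze(0), s.unsqueeze(0)).squeeze(0)
                if k in evals:                             # -log p(x_{t+k} | s^k_t), Gaussian stand-in
                    loss = loss + F.mse_loss(self.decoder(s), frames[t + k])
        return loss / self.n_starts

Because only N_t starting points and N_g evaluation offsets are scored per episode, the expensive frame model is evaluated a handful of times rather than at every step, which is what keeps the computational burden of the generative loss manageable.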
2.1 Frame generative models and the problem of conditioning

It is known that expressive frame generative models are hard to condition [22, 23, 24, 25]. This is especially problematic for learning belief-states, because it is this conditioning that provides the learning signal for the formation of belief-states. If the generative model ignores the conditioning information, it will not be possible to optimize the belief-states. More precisely, if the generative model fails to use the conditioning we have L = \sum_{i=1}^{N_t} \sum_{k=1}^{N_g} \ln p(x_{t_i + i_k} | s^{i_k}_{t_i}) \approx \sum_{i=1}^{N_t} \sum_{k=1}^{N_g} \ln p(x_{t_i + i_k}), and thus \nabla_{b_t} L \approx 0; consequently, learning the belief-state b_t is not possible.

We observe this problem when experimenting with expressive state-of-the-art deep generative models such as PixelCNN [26], ConvDRAW [21] and RNVP [27]. We found empirically that a modified ConvDRAW in combination with GECO [28] works well in practice, which allows us to learn stable belief-states while maintaining good sample quality. As a result, we can use our model to consistently simulate many steps into the future. More details of the modified ConvDRAW architecture and the GECO optimization experiments are provided in Appendix A and Appendix G, respectively.

2.2 Why is next-step prediction not sufficient?

A theoretically sufficient and conceptually simple way to form a belief state b_t is to train a next-step prediction model p(x_{t+1} | x_1, ..., x_t) = p(x_{t+1} | b_t), where b_t = RNN(b_{t-1}, x_t, a_t) summarizes the past. Under an optimal solution, b_t contains all the information needed to predict the future p(x_{t+1}, x_{t+2}, ... | b_t), because any joint distribution can be factorized as a product of such conditionals: p(x_{t+1}, x_{t+2}, ... | b_t) = p(x_{t+1} | b_t) \times p(x_{t+2} | b_{t+1} = f(x_{t+1}, b_t)) \times ... This reasoning motivates a lot of research using next-step prediction in RL, e.g. [29, 30].

We argue that next-step prediction exacerbates the conditioning problem described in Section 2.1. In a physically realistic environment, the immediate future observation x_{t+1} can be predicted, with high accuracy, as a near-deterministic function of the immediate past observations x_{(t-\tau),...,t}. This intuition can be expressed as p(x_{t+1} | x_{(t-\tau),...,t}, a_{(t-\tau-1),...,(t-1)}, b_{t-\tau}) \approx p(x_{t+1} | x_{(t-\tau),...,t}, a_{(t-\tau-1),...,(t-1)}). That is, the immediate past weakens the dependency on belief-state vectors, resulting in \nabla_{b_{t-\tau}} L \approx 0. Predicting the distant future, in contrast, requires knowledge of the global structure of the environment, encouraging the formation of belief-states that contain that information.

Generative environment models trained with overshooting have been explored in the context of model-based planning [31, 32, 33, 34], but evidence of the effect of overshooting has been restricted to the agent's performance evaluation [33, 31].
While there is some evidence that overshooting\nimproves the ability to predict the long-term future [11], there is no extensive study examining which\naspects of the environment are retained by these models.\nAs noted above, for a given belief-state the entropy of the distribution of target observations increases\nwith the overshoot length (due to partial observability and/or randomness in the environment), going\nfrom a near deterministic (uni-modal) distribution to a highly multi-modal distribution. This leads us\nto hypothesize that deterministic prediction models should bene\ufb01t less from growing the overshooting\nlength compared to generative prediction models. Our experiments below support this hypothesis.\n\n2.3 Belief-states, Localization and Mapping\n\nExtracting a consistent high-level representation of the environment such as the bird\u2019s eye view \"map\"\nfrom merely \ufb01rst-person views and actions in a completely unsupervised manner is a notoriously\ndif\ufb01cult task [12] and a lot of the success in addressing this problem is due to the injection of a\nsubstantial amount of human prior knowledge in the models [35, 36, 37, 38, 39, 40].\nWhile previous work has primarily focused on extracting human-interpretable maps of the environ-\nment, our approach is to decode position, orientation and top down view or layout of the environment\nfrom the agent\u2019s belief-state bt. This decoding process does not interfere with the agent\u2019s training,\nand is not restricted to 2D map-layouts.\nWe use one-layer MLP to predict the discretized position and orientation and a convolutional network\nto predict the top down view (map decoder). When the belief-state bt is learned by an LSTM, it is\ncomposed of the LSTM hidden state ht and the LSTM cell state ct. Since the location and map decoder\nneed access to the full belief-state, we condition these maps on the vector ut = concat(ht, tanh(ct)).\nWhen using the episodic, slot based, RMA memory we \ufb01rst take a number of reads from the memory\nconditioned on the current belief-state bt and concatenate them with ut de\ufb01ned above. For the\nKanerva memory we learn a \ufb01xed set of read vectors and concatenate the retrieved memories with\nut.\n\n3 Other Related Work\n\nThe idea of learning general world models to support decision-making is probably one of the most\npervasive ideas in AI research, [30, 11, 41, 42, 14, 43, 44, 45, 46, 33, 30, 47, 48, 3, 49, 50, 51, 52]. In\nspite of a vast literature supporting the potential advantages of model-based RL, it has been a challenge\nto demonstrate the bene\ufb01ts of model-based agents in visually rich, dynamic, 3D environments. The\nchallenge of model-based RL in rich 3D environments has compelled some researchers to use\nprivileged information such as camera-locations [15], depth information [53], and other ground-truth\nstate variables of the state simulator [54, 49]. On the other hand, some work has provided evidence\nthat we may not need very expressive models to bene\ufb01t to some degree [30].\nOur proposed model could in principle be used in combination with planning algorithms. But\nwe take a step back from planning and focus more on the effect of various model choices on the\nlearned representations and belief states. This approach is similar to having a representation shaped\nvia auxiliary unsupervised losses for a model-free RL agent. 
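To make this analogy concrete, the sketch below shows one way such an auxiliary model loss can share a belief RNN with the policy and value heads. A plain advantage actor-critic loss stands in for the IMPALA/V-trace objective actually used in the paper, the linear encoder stands in for the residual network, and the module sizes, batch fields and loss weight are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class BeliefAgent(nn.Module):
    """Policy, value and the unsupervised model all read the same belief RNN state b_t."""

    def __init__(self, obs_dim, num_actions, belief_dim):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, belief_dim)              # stand-in for the residual CNN
        self.belief_rnn = nn.GRUCell(belief_dim + num_actions, belief_dim)
        self.policy_head = nn.Linear(belief_dim, num_actions)
        self.value_head = nn.Linear(belief_dim, 1)

    def unroll(self, obs, prev_actions):
        # obs: [T, obs_dim] flattened frames, prev_actions: [T, num_actions] one-hot a_{t-1}
        b = obs.new_zeros(1, self.belief_rnn.hidden_size)
        beliefs = []
        for o, a in zip(obs, prev_actions):
            b = self.belief_rnn(torch.cat([self.encoder(o), a]).unsqueeze(0), b)
            beliefs.append(b)
        return torch.cat(beliefs)                                  # [T, belief_dim]


def joint_loss(agent, model_loss_fn, batch, model_weight=1.0):
    """Simple actor-critic loss (stand-in for IMPALA/V-trace) plus the unsupervised
    model loss, both backpropagating into the shared belief RNN."""
    beliefs = agent.unroll(batch["obs"], batch["prev_actions"])
    values = agent.value_head(beliefs).squeeze(-1)                 # [T]
    log_pi = F.log_softmax(agent.policy_head(beliefs), dim=-1)
    log_pi_a = log_pi.gather(1, batch["actions"].unsqueeze(1)).squeeze(1)  # actions: [T] int64
    advantage = (batch["returns"] - values).detach()               # returns: [T] discounted returns
    rl_loss = -(log_pi_a * advantage).mean() + 0.5 * F.mse_loss(values, batch["returns"])
    model_loss = model_loss_fn(beliefs, batch["actions_onehot"], batch["obs"])
    return rl_loss + model_weight * model_loss

Because both terms backpropagate into the same recurrent core, the unsupervised loss shapes the belief state that the policy reads; this shared-belief configuration is the one analyzed in the experiments.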
Combining auxiliary losses with\n\n4\n\n\freinforcement learning is an active area of research and a variety of auxiliary losses have been\nexplored. A non-exhaustive list includes pixel control [55], contrastive predictive coding (CPC) [56],\naction-conditional CPC [20], frame reconstruction [35, 57], next-frame prediction [18, 58, 20, 59]\nand successor representations [60, 61, 62].\nAs in [20, 10, 50] our proposed architecture has a shared belief-state between the generative model\nand the agent\u2019s policy network. The closest paper to our work is [20], where a comparison is made\nbetween action-conditional CPC and next-step prediction using a deterministic next-step model.\nThere are several key differences between this paper and our work: (i) We analyze the decoding of\nthe environment\u2019s map from the belief state; (ii) We show that while next-frame prediction may be\nsuf\ufb01cient to encode position and orientation, it is necessary to combine expressive generative models\nwith overshoot to form a consistent map representation; (iii) We demonstrate that an expressive\ngenerative model can be trained to simulate visually rich 3D environments for several steps in the\nfuture; (iv) We also analyze the impact on RL performance of various model choices. We also discuss\nand propose solutions to the general problem of conditioning expressive generative models.\n\n4 Experiments\n\nWe analyze our agent\u2019s performance with respect to both its ability to represent the environment\n(e.g. knows its position and map layout) and RL performance. Our experiments span three main\naxes of variation: (i) the choice of unsupervised loss for the overshoots (e.g. deterministic prediction,\ngenerative prediction and contrastive); (ii) the choice of overshoot length and (iii) the choice of\narchitecture for the belief-state and simulation RNNs (LSTM [16], LSTM with Kanerva memory\n[17] and LSTM with the memory from the reconstructive memory agent (RMA) [18]). RMA uses a\nslot based memory that stores all past vectors, whereas Kanerva memory updates existing memories\nwith new information in a compressive way, see Appendix B for more details.\nThe agent is trained using the IMPALA framework [19], a variant of policy gradients, see Appendix D\nfor details. The model is trained jointly with the agent, sharing the belief network. We \ufb01nd that the\nrunning speed decreases only by about 20 40% compared to the agent without model. We use Adam\nfor optimization [63]. The detailed choice of various hyperparameters is provided in Appendix F.\nOur experiments are performed using four families of procedural environments: (a) DeepMind-Lab\nlevels [64] and three new environments that we created using the Unity Engine: (b) Random City; (c)\nBlock building environment; (d) Random Terrain.\n\n4.1 Random City\n\nThe Random City is a procedurally generated 3D navigation environment, Figure 2 showing \ufb01rst\nperson view (top row) and the top down view (second row). At the beginning of an episode, a random\nnumber of randomly colored boxes (i.e. \u201cbuildings\u201d) are placed on top of a 2d plane. We used this\nenvironment primarily as a way to analyze how different model architectures affect the formation of\nconsistent belief-states. 
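The analysis that follows relies on the probe decoders described in Section 2.3. A minimal PyTorch sketch of such probes is given here; the map resolution, channel counts and layer sizes are illustrative assumptions, the memory-read concatenation used with RMA and Kanerva is omitted, and the detach call is one way to realize the statement that decoding does not interfere with the agent's training.

import torch
import torch.nn as nn


class BeliefProbes(nn.Module):
    """Probes from Section 2.3: decode discretized position/orientation with one-layer
    MLPs and a top-down map with a small deconvolutional network, all conditioned on
    u_t = concat(h_t, tanh(c_t)) of an LSTM belief state."""

    def __init__(self, belief_dim, n_pos_bins, n_orient_bins, map_channels=3):
        super().__init__()
        u_dim = 2 * belief_dim
        self.pos_head = nn.Linear(u_dim, n_pos_bins)         # discretized position
        self.orient_head = nn.Linear(u_dim, n_orient_bins)   # discretized orientation
        self.map_head = nn.Sequential(                        # 4x4 seed -> 32x32 top-down view
            nn.Linear(u_dim, 64 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (64, 4, 4)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, map_channels, 4, stride=2, padding=1),
        )

    def forward(self, h, c):
        # h, c: [B, belief_dim] LSTM hidden and cell states; detach keeps probe gradients
        # out of the agent, so decoding does not interfere with its training.
        u = torch.cat([h, c.tanh()], dim=-1).detach()
        return self.pos_head(u), self.orient_head(u), self.map_head(u)

Position and orientation are discretized, so a standard classification loss can be used for those heads, while map decoding quality is reported as mean squared error (as in Figure 3).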
We generated training data for the models using a fixed, handcrafted policy that chooses random locations and uses a path-planning policy to move between them, and we analyze the model and the content of the belief state (there is no RL in this experiment).

In the third row of Figure 2 we show the top-down view reconstructions from the belief state (to emphasize, the belief-state was not trained with this information). We see that whenever the agent sees a new building, the building appears on the map, and the map still preserves the other buildings seen so far even if they are not in the agent's current field-of-view. Rows four and five show a later part of an input sequence (when the model has seen a large part of the environment) and a rollout from the model. We see that the model is able to preserve the information in its state and use this information to correctly simulate forward.

Figure 2: Random City environment. Rows: 1. Input sequence to the model, starting from the beginning of the episode. 2. Top-down view (a map). 3. Top-down view decoded from the belief state. The belief state was not trained with this decoding signal, but only from the first-person view (top row). We see that the model is able to fill up the map as it sees new frames. 4. Frames later in the sequence (after 170 steps). 5. Rollout from the model. The model knows what it will see as the agent rotates. See the supplementary video https://youtu.be/dOnvAp_wxv0.

Figure 3 (panels: (a) Position decoding, (b) Map decoding, (c) Map samples, (d) Map MSE vs. overshoot length for each model): The choice of model and overshoot length has a significant impact on the state representation. (a) All models benefit from an increase in the overshoot length with respect to position decoding, with the Contrastive model reaching higher accuracy; (b) the Generative models are the most sensitive to overshoot length with respect to map decoding MSE, and a substantial reduction in map decoding MSE is obtained by using architectures with memory; (c) examples of decoded maps. Each block shows real maps (top row) and decoded maps (bottom row). Top block: Contrastive model samples at overshoot length 1 (MSE of approx. 160); bottom block: Generative + Kanerva at overshoot length 12 (MSE of approx. 117). We can clearly notice the difference in the details for the two models. (d) Effect of overshoot on decoding of the environment's map. This analysis shows that Generative and Generative + Kanerva benefit the most from an increase in overshoot length, in contrast to the Deterministic and Contrastive architectures. In particular, we observe that the Generative + Kanerva architecture is particularly good at forming belief-states that contain a map of the environment.

We analyze the effects of self-supervised loss type, overshoot length and memory architecture on position prediction and map decoding accuracy. The results are shown in Figure 3. We make the following observations: (i) an increase in the overshoot length improves the ability to decode the agent's position and the map layout for all losses (up to a certain length); (ii) the contrastive loss provides the best decoding of the agent's position for all overshoot lengths (Figure 3a); (iii) the generative prediction loss provides the best map decoding and is the most sensitive to the overshoot length with respect to map decoding error (Figure 3d);
and (iv) the combination of the generative model with Kanerva memory provides the best map decoding accuracy.

We also see that the contrastive loss is very good at localization but poor at mapping. This loss is trained to distinguish a given state from others within the simulated sequence and from other elements of the mini-batch. We hypothesize that keeping track of location very accurately allows the network to distinguish a given time point from others, but that in a varied environment it is easy for the network to distinguish one batch element from another without forming a map of the environment.

We also see that Kanerva memory works significantly better than a pure LSTM and the slot-based memory. However, the latter result might be due to a limitation of the method used to analyze the content of the belief state. In fact, it is likely that the information is in the state, since the slot-based memory stores all past vectors, but that it is hard to extract. This also raises an interesting point: what is a belief state? Storing all past data retains all the information a model can have. We suggest that what we are after is a more compact representation that is stored in an easy-to-access way. Kanerva memory aims not only to store past information but to integrate it with already stored information in a compressed way.

Figure 4: The generative SimCore results in substantial data-efficiency gains for agents in DeepMind Lab relative to a strong model-free baseline. We also observe that model-free agents have substantially higher variance in their scores. See the supplementary video https://youtu.be/dOnvAp_wxv0.

Figure 5: The input and the rollout in DeepMind Lab. The agent is able to roll out correctly for many steps and to remember where the rewarding object is (the object in the bottom-right frames).

4.2 DeepMind Lab

DeepMind Lab [64] is a collection of 3D, first-person-view environments where agents perform a diverse set of tasks. We use a subset of DeepMind Lab levels (rat_goal_driven, rat_goal_doors, rat_goal_driven_large and keys_doors_random) to investigate how the addition of the generative prediction loss with overshoot affects the agent's representation, or belief-state, as well as its RL performance.

We compare four agents in the following experiments. The first, termed LSTM, is the standard IMPALA agent with an LSTM recurrent core. The next agent, termed RMA, is the agent of [18], the core of which consists of an LSTM and a slot-based episodic memory. The final two agents, termed LSTM+SimCore and RMA+SimCore, are the same as the LSTM and RMA agents, but with the model loss added.

The results of our experiments are shown in Figure 4. We see that adding the model loss improves performance for both the LSTM and RMA agents. While Kanerva memory helps significantly in the fixed-data regime of Section 4.1, we found it to be unstable in the RL setting; more work is required to solve this problem. We found that using the RMA memory helped substantially in large environments, as shown in Figure 4 (rat_goal_driven_large).

We found that the map reconstruction loss varies significantly during training. This could be due to policy gradients affecting the belief state, to the changing policy, or to changes in the way the model represents the information, with the decoder having a hard time keeping up. We found that longer overshoot lengths perform better than shorter ones, but that did not translate into improved RL performance.
This could\nalso be an artifact of the environment - there are permanent features present on the horizon, and the\nagent does not need to know the map to navigate to the goal. The model is able to correctly rollout for\na number of steps, Figure 5 knowing where the rewarding object is (the object on the bottom right).\n\n4.3 Voxel environment\n\nWe want to create an environment that requires agents to learn complex behaviours to solve tasks. For\nthis, we created a voxel-based, procedural environment with Unity that can be modi\ufb01ed by the agents\nvia building mechanisms, resulting in a large combinatorial space of possible voxel con\ufb01gurations\nand behavioural strategies. See accompanying video for examples of this environment and of learned\nbehaviors.\n\n7\n\n\fFigure 6: Top: Voxel levels. There are four levels: BridgeFood, Cliff, Food and HighFood. For each\nlevel, four views are shown: Early frame \ufb01rst person view, early frame third person view, later frame\n\ufb01rst person view, later frame third person view. The agent sees only the \ufb01rst person view and its goal\nis to pick up yellow blocks, which it needs to get to. The agent has blocks that it can place. The\nagent learns how to build towers (BridgeFood and HighFood) and stairs (Cliff) to climb to the food.\nBottom: Training the agent with SimCore substantially increases data-ef\ufb01ciency. See supplementary\nvideo https://youtu.be/dOnvAp_wxv0.\n\nThe environment consists of blocks of different types and appearances that are placed in a three\ndimensional grid. The agent moves continuously through the environment, obeying Unity engine\u2019s\nphysics. The agent has a simple inventory, can pick up blocks of certain types, place them back\ninto the world and interact with certain blocks. We build four levels, Figure 6 top, where the goal\nis to consume all yellow blocks (\u2018food\u2019). The levels are: Food: Five food blocks placed at random\nlocations in a plane - this is a curriculum level for the agent to quickly learn that yellow blocks give\nreward. HighFood: The same setting, but the food is also placed at random height. If the food is\nslightly high, the agent needs to look up and jump to get it. If the food is even higher, the agent needs\nto place blocks on the ground, jump on them, look up at the food and jump. Cliff: There is a cliff of\nrandom height with food at the top. The agent needs to \ufb01rst pick up blocks and then build structures\nto climb and reach the top of the cliff. Interestingly, the agent learns to pick them up and build stairs\non the side of the cliff. Bridge: The agent needs to get to the other side of a randomly sized gap,\neither by building a bridge or falling down and then building a tower to climb back up. The agent\nlearns the latter. We also trained the agent on more complex versions of the levels, showing rather\ncompetent abilities of building high structures climbing, see Appendix J and the accompanying video.\nWe compared the LSTM and LSTM+SimCore agents (without an episodic memory) on these levels.\nIn this case, one agent is playing all four levels at the same time. From Figure 6 we see that the\nSimCore signi\ufb01cantly improves the performance on all the levels. In addition we found that the\nperformance is much less sensitive to Adam hyper-parameters as well as unroll length. 
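The simulations discussed in this section and shown in Figure 5, Appendix J and the supplementary video are produced by running the SimCore forward on actions alone and decoding a frame at each step. A minimal sketch of this rollout procedure, reusing the same kind of stand-in components as the earlier loss sketch, follows; in the actual system the decoder would be a sample drawn from the conditional ConvDRAW model.

import torch
import torch.nn as nn


@torch.no_grad()
def rollout(sim_rnn: nn.GRUCell, frame_decoder: nn.Module, belief: torch.Tensor,
            actions: torch.Tensor) -> torch.Tensor:
    """Simulate forward from a belief state, seeing only a chosen action sequence.

    belief:  [belief_dim]    the agent's belief state b_t at the current step
    actions: [K, action_dim] one-hot actions to imagine executing
    Returns the K predicted frames; frame_decoder stands in for drawing a sample
    from the conditional ConvDRAW model p(x_{t+k} | s^k_t)."""
    s = belief.unsqueeze(0)                        # s^0_t = b_t
    frames = []
    for a in actions:
        s = sim_rnn(a.unsqueeze(0), s)             # s^{k+1}_t = SimCore(s^k_t, a_{t+k})
        frames.append(frame_decoder(s))            # predicted frame for the next step
    return torch.stack(frames).squeeze(1)          # [K, frame_dim]

A usage example would pass the current belief state taken from the agent together with a planned action sequence (for example, a full rotation) to reproduce rollouts like those in Figure 5; the same simulation network is the one trained by the overshoot loss, so improving long-horizon rollouts and shaping the belief state are two views of the same optimization.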
We also\nfound that the model is able to simulate its movement, building behaviours and block picking, see\n(Appendix J) for samples.\nFinally we tested a map building ability in a more naturalistic, procedurally generated terrain,\nFigure 11. This environment is harder than the city, because it takes signi\ufb01cantly more steps to cross\nthe environment. We also analyzed a simple RL setting of picking up randomly placed blocks. We\nfound that an LSTM agent contains an approximate map, but the information not seen for a while\ngradually fades away. We hope to scale up these experiments in the future.\n\n8\n\n\f5 Discussion\n\nIn this paper we introduced a scheme to train expressive generative environment models with RL\nagents. We found that expressive generative models in combination with overshoot can form stable\nbelief states in 3D environments from \ufb01rst person views, with little prior knowledge about the\nstructure of these environments. We also showed that by sharing the belief-states of the model with\nthe agent we substantially increase the data-ef\ufb01ciency in a variety of RL tasks relative to strong\nbaselines. There are more elements that need to be investigated in the future. First, we found that\ntraining the belief state together with the agent makes it harder to either form a belief state or decode\nthe map from it. This could result from the effect of policy gradients or changing of policy or\nchanging the way the belief is represented. Additionally we aim towards scaling up the system, either\nthrough better training or better use of memory architectures. Finally, it would be good to use the\nmodel not only for representation learning but for planning as well.\n\n9\n\n\fReferences\n[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and\nnew perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798\u2013\n1828, 2013.\n\n[2] Peter Elias. Predictive coding\u2013i. IRE Transactions on Information Theory, 1(1):16\u201324, 1955.\n[3] J\u00fcrgen Schmidhuber. Curious model-building control systems. In [Proceedings] 1991 IEEE\n\nInternational Joint Conference on Neural Networks, pages 1458\u20131463. IEEE, 1991.\n\n[4] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in\n\npartially observable stochastic domains. Arti\ufb01cial intelligence, 101(1-2):99\u2013134, 1998.\n\n[5] Karl J Astrom. Optimal control of markov processes with incomplete state information. Journal\n\nof mathematical analysis and applications, 10(1):174\u2013205, 1965.\n\n[6] Lawrence R Rabiner. A tutorial on hidden markov models and selected applications in speech\n\nrecognition. Proceedings of the IEEE, 77(2):257\u2013286, 1989.\n\n[7] Tommi Jaakkola, Satinder P Singh, and Michael I Jordan. Reinforcement learning algorithm for\npartially observable markov decision problems. In Advances in neural information processing\nsystems, pages 345\u2013352, 1995.\n\n[8] Milos Hauskrecht. Value-function approximations for partially observable markov decision\n\nprocesses. Journal of arti\ufb01cial intelligence research, 13:33\u201394, 2000.\n\n[9] Karol Gregor and Frederic Besse. Temporal difference variational auto-encoder. CoRR,\n\nabs/1806.03107, 2018.\n\n[10] Pol Moreno, Jan Humplik, George Papamakarios, Bernardo Avila Pires, Lars Buesing, Nicolas\nHeess, and Theophane Weber. Neural belief states for partially observed domains. 
NeurIPS\n2018 workshop on Reinforcement Learning under Partial Observability, 2018.\n\n[11] Silvia Chiappa, S\u00e9bastien Racani\u00e8re, Daan Wierstra, and Shakir Mohamed. Recurrent environ-\n\nment simulators. CoRR, abs/1704.02254, 2017.\n\n[12] Marco Fraccaro, Danilo Jimenez Rezende, Yori Zwols, Alexander Pritzel, SM Eslami, and Fabio\nViola. Generative temporal models with spatial memory for partially observed environments.\narXiv preprint arXiv:1804.09401, 2018.\n\n[13] Nal Kalchbrenner, A\u00e4ron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex\nGraves, and Koray Kavukcuoglu. Video pixel networks. In Proceedings of the 34th International\nConference on Machine Learning-Volume 70, pages 1771\u20131779. JMLR. org, 2017.\n\n[14] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-\nIn Advances in neural\n\nconditional video prediction using deep networks in atari games.\ninformation processing systems, pages 2863\u20132871, 2015.\n\n[15] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta\nGarnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene\nrepresentation and rendering. Science, 360(6394):1204\u20131210, 2018.\n\n[16] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation,\n\n9(8):1735\u20131780, 1997.\n\n[17] Yan Wu, Greg Wayne, Alex Graves, and Timothy Lillicrap. The kanerva machine: A generative\n\ndistributed memory. arXiv preprint arXiv:1804.01756, 2018.\n\n[18] Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico\nCarnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by\ntransporting value. arXiv preprint arXiv:1810.06721, 2018.\n\n[19] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward,\nYotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl\nwith importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.\n\n10\n\n\f[20] Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo A. Pires, Toby Pohlen,\n\nand R\u00e9mi Munos. Neural predictive belief representations. CoRR, abs/1811.06407, 2018.\n\n[21] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra.\nTowards conceptual compression. In Advances In Neural Information Processing Systems,\npages 3549\u20133557, 2016.\n\n[22] Rui Liu, Yu Liu, Xinyu Gong, Xiaogang Wang, and Hongsheng Li. Conditional adversarial\n\ngenerative \ufb02ow for controllable image synthesis. arXiv preprint arXiv:1904.01782, 2019.\n\n[23] Alexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy.\n\nFixing a broken elbo. arXiv preprint arXiv:1711.00464, 2017.\n\n[24] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the\npixelcnn with discretized logistic mixture likelihood and other modi\ufb01cations. arXiv preprint\narXiv:1701.05517, 2017.\n\n[25] Ali Razavi, A\u00e4ron van den Oord, Ben Poole, and Oriol Vinyals. Preventing posterior collapse\n\nwith delta-vaes. arXiv preprint arXiv:1901.03416, 2019.\n\n[26] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.\nConditional image generation with pixelcnn decoders. In Advances in neural information\nprocessing systems, pages 4790\u20134798, 2016.\n\n[27] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 
Density estimation using real nvp.\n\narXiv preprint arXiv:1605.08803, 2016.\n\n[28] Danilo Jimenez Rezende and Fabio Viola. Generalized elbo with constrained optimization,\n\ngeco. 2018.\n\n[29] Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P\nReichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and\nquerying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006,\n2018.\n\n[30] David Ha and J\u00fcrgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.\n\n[31] John D Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and\nSergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with\ntrajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.\n\n[32] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee,\nand James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint\narXiv:1811.04551, 2018.\n\n[33] David Silver, Hado P. van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley,\nGabriel Dulac-Arnold, David P. Reichert, Neil C. Rabinowitz, Andr\u00e9 Barreto, and Thomas\nDegris. The predictron: End-to-end learning and planning. In ICML, 2017.\n\n[34] Brandon Amos, Laurent Dinh, Serkan Cabi, Thomas Roth\u00f6rl, Sergio G\u00f3mez Colmenarejo,\nAlistair Muldal, Tom Erez, Yuval Tassa, Nando de Freitas, and Misha Denil. Learning awareness\nmodels. arXiv preprint arXiv:1804.06318, 2018.\n\n[35] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of\ndepth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, pages 1851\u20131858, 2017.\n\n[36] Ganesh Iyer, J Krishna Murthy, Gunshi Gupta, Madhava Krishna, and Liam Paull. Geometric\nIn Proceedings of the IEEE\n\nconsistency for self-supervised end-to-end visual odometry.\nConference on Computer Vision and Pattern Recognition Workshops, pages 267\u2013275, 2018.\n\n[37] Jingwei Zhang, Lei Tai, Joschka Boedecker, Wolfram Burgard, and Ming Liu. Neural slam:\n\nLearning to explore with external memory. arXiv preprint arXiv:1706.09520, 2017.\n\n11\n\n\f[38] Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforce-\n\nment learning. In International Conference on Learning Representations, 2018.\n\n[39] Baris Kayalibay, Atanas Mirchev, Maximilian Soelch, Patrick van der Smagt, and Justin Bayer.\n\nNavigation and planning in latent maps. 2018.\n\n[40] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cogni-\ntive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on\nComputer Vision and Pattern Recognition, pages 2616\u20132625, 2017.\n\n[41] J\u00fcrgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990\u20132010).\n\nIEEE Transactions on Autonomous Mental Development, 2(3):230\u2013247, 2010.\n\n[42] Chris Xie, Sachin Patil, Teodor Moldovan, Sergey Levine, and Pieter Abbeel. Model-based\nreinforcement learning with parametrized physical models and optimism-driven exploration. In\n2016 IEEE international conference on robotics and automation (ICRA), pages 504\u2013511. IEEE,\n2016.\n\n[43] Paul Munro. A dual back-propagation scheme for scalar reward learning. In Ninth Annual\n\nConference of the Cognitive Science Society, pages 165\u2013176, 1987.\n\n[44] Paul J Werbos. 
Learning how the world works: Speci\ufb01cations for predictive networks in robots\nand brains. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics,\nNY, 1987.\n\n[45] Derrick Nguyen and Bernard Widrow. The truck backer-upper: An example of self-learning in\n\nneural networks. In Advanced neural computers, pages 11\u201319. Elsevier, 1990.\n\n[46] Marc Peter Deisenroth and Carl E. Rasmussen. Pilco: A model-based and data-ef\ufb01cient\n\napproach to policy search. In ICML, 2011.\n\n[47] Long-Ji Lin and Tom M Mitchell. Memory approaches to reinforcement learning in non-\n\nMarkovian domains. Citeseer, 1992.\n\n[48] Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, and Ji He. Recurrent\n\nreinforcement learning: a hybrid approach. arXiv preprint arXiv:1509.03044, 2015.\n\n[49] Carlos Diuk, Andre Cohen, and Michael L Littman. An object-oriented representation for\nef\ufb01cient reinforcement learning. In Proceedings of the 25th international conference on Machine\nlearning, pages 240\u2013247. ACM, 2008.\n\n[50] Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep\n\nvariational reinforcement learning for POMDPs. arXiv preprint arXiv:1806.02426, 2018.\n\n[51] Shagun Sodhani, Anirudh Goyal, Tristan Deleu, Yoshua Bengio, Sergey Levine, and Jian\nTang. Learning powerful policies by using consistent dynamics model. arXiv preprint\narXiv:1906.04355, 2019.\n\n[52] Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi\nParikh, and Dhruv Batra. Learning dynamics model in reinforcement learning by incorporating\nthe long term future. arXiv preprint arXiv:1903.01599, 2019.\n\n[53] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino,\nMisha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in\ncomplex environments. arXiv preprint arXiv:1611.03673, 2016.\n\n[54] Guillaume Lample and Devendra Singh Chaplot. Playing fps games with deep reinforcement\n\nlearning. In Thirty-First AAAI Conference on Arti\ufb01cial Intelligence, 2017.\n\n[55] Max Jaderberg, Volodymyr Mnih, Wojciech Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver,\nand Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR,\nabs/1611.05397, 2017.\n\n[56] A\u00e4ron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive\n\npredictive coding. CoRR, abs/1807.03748, 2018.\n\n12\n\n\f[57] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel,\nMatthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot\ntransfer in reinforcement learning. In Proceedings of the 34th International Conference on\nMachine Learning-Volume 70, pages 1480\u20131490. JMLR. org, 2017.\n\n[58] S\u00e9bastien Racani\u00e8re, Th\u00e9ophane Weber, David Reichert, Lars Buesing, Arthur Guez,\nDanilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li,\net al. Imagination-augmented agents for deep reinforcement learning. In Advances in neural\ninformation processing systems, pages 5690\u20135701, 2017.\n\n[59] Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. arXiv preprint\n\narXiv:1611.01779, 2016.\n\n[60] Andr\u00e9 Barreto, Will Dabney, R\u00e9mi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt,\nand David Silver. Successor features for transfer in reinforcement learning. 
In Advances in\nneural information processing systems, pages 4055\u20134065, 2017.\n\n[61] Peter Dayan. Improving generalization for temporal difference learning: The successor repre-\n\nsentation. Neural Computation, 5(4):613\u2013624, 1993.\n\n[62] Tejas D Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J Gershman. Deep successor\n\nreinforcement learning. arXiv preprint arXiv:1606.02396, 2016.\n\n[63] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[64] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich\nK\u00fcttler, Andrew Lefrancq, Simon Green, V\u00edctor Vald\u00e9s, Amir Sadik, et al. Deepmind lab. arXiv\npreprint arXiv:1612.03801, 2016.\n\n[65] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint\n\narXiv:1312.6114, 2013.\n\n[66] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation\nand approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.\n[67] Yan Wu, Gregory Wayne, Karol Gregor, and Timothy Lillicrap. Learning attractor dynamics for\ngenerative memory. In Advances in Neural Information Processing Systems, pages 9379\u20139388,\n2018.\n\n13\n\n\f", "award": [], "sourceid": 7484, "authors": [{"given_name": "Karol", "family_name": "Gregor", "institution": "DeepMind"}, {"given_name": "Danilo", "family_name": "Jimenez Rezende", "institution": "Google DeepMind"}, {"given_name": "Frederic", "family_name": "Besse", "institution": "DeepMind"}, {"given_name": "Yan", "family_name": "Wu", "institution": "DeepMind"}, {"given_name": "Hamza", "family_name": "Merzic", "institution": "DeepMind"}, {"given_name": "Aaron", "family_name": "van den Oord", "institution": "Google Deepmind"}]}