{"title": "Explicit Explore-Exploit Algorithms in Continuous State Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 9377, "page_last": 9387, "abstract": "We present a new model-based algorithm for reinforcement learning (RL) which\nconsists of explicit exploration and exploitation phases, and is applicable in large or\ninfinite state spaces. The algorithm maintains a set of dynamics models consistent\nwith current experience and explores by finding policies which induce high dis-\nagreement between their state predictions. It then exploits using the refined set of\nmodels or experience gathered during exploration. We show that under realizability\nand optimal planning assumptions, our algorithm provably finds a near-optimal\npolicy with a number of samples that is polynomial in a structural complexity\nmeasure which we show to be low in several natural settings. We then give a\npractical approximation using neural networks and demonstrate its performance\nand sample efficiency in practice.", "full_text": "Explicit Explore-Exploit Algorithms in Continuous\n\nState Spaces\n\nMikael Henaff\n\nMicrosoft Research\n\nmihenaff@microsoft.com\n\nAbstract\n\nWe present a new model-based algorithm for reinforcement learning (RL) which\nconsists of explicit exploration and exploitation phases, and is applicable in large or\nin\ufb01nite state spaces. The algorithm maintains a set of dynamics models consistent\nwith current experience and explores by \ufb01nding policies which induce high dis-\nagreement between their state predictions. It then exploits using the re\ufb01ned set of\nmodels or experience gathered during exploration. We show that under realizability\nand optimal planning assumptions, our algorithm provably \ufb01nds a near-optimal\npolicy with a number of samples that is polynomial in a structural complexity\nmeasure which we show to be low in several natural settings. 
We then give a\npractical approximation using neural networks and demonstrate its performance\nand sample ef\ufb01ciency in practice.\n\n1\n\nIntroduction\n\nWhat is a good algorithm for systematically exploring an environment for the purpose of reinforcement\nlearning? A good answer could make the application of deep RL to complex problems [31, 30, 28, 20]\nmuch more sample ef\ufb01cient. In tabular Markov Decision Processes (MDPs) with a small number of\ndiscrete states, model-based algorithms which perform exploration in a provably sample-ef\ufb01cient\nmanner have existed for over a decade [24, 5, 47]. The \ufb01rst of these, known as the Explicit Explore-\nExploit (E3) algorithm [24], progressively builds a model of the environment\u2019s dynamics. At each\nstep, the agent uses this model to plan, either to explore and reach an unknown state, or to exploit\nand maximize its reward within the states it knows well. By actively seeking out unknown states,\nthe algorithm provably learns a near-optimal policy using a number of samples which is at most\npolynomial in the number of states. Many problems of interest, however, have a set of states which is\nin\ufb01nite or extremely large (for example, all images represented with \ufb01nite precision), and in these\nsettings, tabular algorithms are no longer applicable.\nIn this work, we propose a new E3-style algorithm which operates in large or continuous state spaces.\nThe algorithm maintains a set of dynamics models which are consistent with the agent\u2019s current\nexperience, and explores the environment by executing policies designed to induce high disagreement\nbetween their predictions. 
We show that under realizability and optimal planning assumptions, our\nalgorithm provably \ufb01nds a near-optimal policy using a number of samples from the environment\nwhich is independent of the number of states, and is instead polynomial in the rank of the model mis\ufb01t\nmatrix, a structural complexity measure which we show to be low in natural settings such as small\ntabular MDPs, large MDPs with factored transition dynamics [23] and (potentially in\ufb01nite) low rank\nMDPs. We then present a practical version of the algorithm using neural networks, and demonstrate\nits performance and sample ef\ufb01ciency empirically on several problems with large or continuous state\nspaces.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAlgorithm 1 (M, \u03a0, n, \u03b5, \u03c6)\n1: Inputs Initial model set M, policy class \u03a0, number of trajectories n, tolerance \u03b5, model error \u03c6.\n2: M1 \u2190 M\n3: Initialize replay buffer R \u2190 \u2205.\n4: for t = 1, 2, ... do\n5: \u03c0^t_explore = argmax_{\u03c0 \u2208 \u03a0} [vexplore(\u03c0, Mt)]\n6: if vexplore(\u03c0^t_explore, Mt) > \u03b5|A| then\n7: Collect dataset of n trajectories following \u03c0^t_explore, add to replay buffer R\n8: Mt+1 \u2190 UpdateModelSet(Mt, R, \u03c6)\n9: else\n10: Choose any \u02dcM \u2208 Mt\n11: \u03c0exploit = argmax_{\u03c0 \u2208 \u03a0} [vexploit(\u03c0, \u02dcM)]\n12: Halt and return \u03c0exploit\n13: end if\n14: end for\n\n2 Algorithm\nWe consider an episodic, \ufb01nite-horizon MDP setting de\ufb01ned by a tuple (S, A, M*, R*, H). Here\nS is a set of states (which could be large or in\ufb01nite), A is a discrete set of actions, M* is the true\n(unknown) transition model mapping state-action pairs to distributions over next states, R* is the true\nfunction mapping states to rewards in [0, 1], and H is the horizon length. 
For simplicity we assume\nrewards are part of the state and the agent has access to R*, so the task of predicting future rewards is\nincluded in that of predicting future states. A state s \u2208 S at time step h will be denoted by sh.\nThe general form of our algorithm is given by Algorithm 1. At each epoch t, the algorithm maintains\na set of dynamics models Mt which are consistent with the experience accumulated so far, and\nsearches for an exploration policy which will induce high disagreement between their predictions. If\nsuch a policy is found, it is executed and the set of models is updated to re\ufb02ect the new experience.\nOtherwise, the algorithm switches to its exploit phase and searches for a policy which will maximize\nits predicted future rewards.\nLet P^{\u03c0,h}_M(\u00b7) denote the distribution over states at time step h induced by sampling actions from policy\n\u03c0 and transitions from model M, and let D(\u03c0, M, M\u2032, h) = \u03b4(P^{\u03c0,h}_M(\u00b7), P^{\u03c0,h}_{M\u2032}(\u00b7)), where \u03b4 denotes\na distance measure between probability distributions such as KL divergence or total variation. The\nquantity which the exploration policy seeks to maximize at epoch t is given by:\n\nvexplore(\u03c0, Mt) = max_{M,M\u2032 \u2208 Mt} \u2211_{h=1}^{H} D(\u03c0, M, M\u2032, h)\n\nMaximizing this quantity can be viewed as solving a \ufb01ctitious exploration MDP, whose state space\nis the concatenation of |Mt| state vectors in the original MDP, whose transition matrix consists of\na block-diagonal matrix whose blocks are the transition matrices of the models in Mt, and whose\nreward function is the distance measured using \u03b4 between the pairs of components of the state vector\ncorresponding to different models. 
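As a concrete illustration, the disagreement objective vexplore can be computed exactly in a small tabular instance by propagating each model's state distribution and using total variation as the distance measure. A minimal Python sketch under assumed conventions (the function names and the (A, S, S) transition-tensor layout are ours, not from the paper's code):

```python
import numpy as np

def rollout_dist(P, policy, s0, H):
    """State distributions P^{pi,h}_M for h = 0..H under a tabular model.
    P: (A, S, S) transition tensor; policy: list of (S, A) action-prob arrays."""
    d = np.zeros((H + 1, P.shape[1]))
    d[0, s0] = 1.0
    for h in range(H):
        nxt = np.zeros(P.shape[1])
        for a in range(P.shape[0]):
            nxt += (d[h] * policy[h][:, a]) @ P[a]
        d[h + 1] = nxt
    return d

def v_explore(models, policy, s0, H):
    """Max over model pairs of the summed total-variation disagreement
    between the state distributions the models predict under the policy."""
    dists = [rollout_dist(P, policy, s0, H) for P in models]
    best = 0.0
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            tv = 0.5 * np.abs(dists[i] - dists[j]).sum(axis=1)
            best = max(best, tv[1:].sum())  # sum disagreement over h = 1..H
    return best
```

With two candidate models that disagree about an action's effect, any policy taking that action scores a positive disagreement and would be selected in the explore phase; identical models score zero, which triggers the switch to exploitation.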
Importantly, searching for an exploration policy can be done\ninternally by the agent and does not require any environment interaction, which will be key to the\nalgorithm\u2019s sample ef\ufb01ciency.\nOnce the agent can no longer \ufb01nd a policy which induces suf\ufb01cient disagreement between its\ncandidate models in Mt, it chooses a model and computes an exploitation policy using the model\u2019s\npredicted reward:\n\nvexploit(\u03c0, M) = \u2211_{h=1}^{H} \u2211_{sh} P^{\u03c0,h}_M(sh) R*(sh)\n\n3 Sample Complexity Analysis\n\n3.1 Algorithm Instantiation\n\nWe \ufb01rst give an instantiation of Algorithm 1, called DREEM (DisagReement-led Elimination of\nEnvironment Models), for which we will prove sample complexity results. All proofs can be found\nin Appendix A. The algorithm starts with a large set of candidate models M, which is assumed\nto contain the true model, and iteratively eliminates models which are not consistent with the\nexperience gathered through the exploration policy. We will show that the number of samples needed\nto \ufb01nd a near-optimal policy is independent of the number of states, and is instead polynomial in\n|A|, H, log |M|, log |\u03a0|, and the rank of the model mis\ufb01t matrix, a quantity which we de\ufb01ne below\nand which is low in natural settings. 
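The exploitation value defined in Section 2 has an equally direct tabular analogue: dot the model's predicted state distribution with the reward vector at each step. A hedged sketch under the same assumed conventions as above (our naming, not the paper's code):

```python
import numpy as np

def v_exploit(P, R, policy, s0, H):
    """v_exploit(pi, M) = sum_{h=1}^{H} sum_{s_h} P^{pi,h}_M(s_h) R(s_h)
    for a tabular model. P: (A, S, S) transitions; R: (S,) reward vector."""
    d = np.zeros(P.shape[1])
    d[s0] = 1.0
    total = 0.0
    for h in range(H):
        nxt = np.zeros(P.shape[1])
        for a in range(P.shape[0]):
            nxt += (d * policy[h][:, a]) @ P[a]
        d = nxt
        total += d @ R  # reward the model predicts at step h+1
    return total
```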
DREEM is identical to Algorithm 1, with the UpdateModelSet\nsubroutine instantiated as follows:\nAlgorithm 2 UpdateModelSet(Mt, R, \u03c6)\n1: For each M \u2208 Mt, h \u2264 H, compute \u0174(\u03c0^t_explore, M, h) using data from R collected using the last exploration policy \u03c0^t_explore\n2: Mt+1 \u2190 {M \u2208 Mt : \u0174(\u03c0^t_explore, M, h) \u2264 \u03c6 for all h \u2264 H}\n3: Return Mt+1\n\nThe quantity W(\u03c0, M, h) can be thought of as the error of model M in parts of the state space visited\nby \u03c0 at time step h, and is formally de\ufb01ned below.\nDe\ufb01nition 1. The mis\ufb01t of model M discovered by policy \u03c0 at time step h is given by:\n\nW(\u03c0, M, h) = E_{sh\u22121 \u223c P^{\u03c0,h\u22121}_{M*}, ah\u22121 \u223c U(A)} [\u2016PM(\u00b7|sh\u22121, ah\u22121) \u2212 PM*(\u00b7|sh\u22121, ah\u22121)\u2016_TV]\n\nThe empirical mis\ufb01t estimated using a dataset collected by following \u03c0 is denoted \u0174(\u03c0, M, h).\nSee Appendix A.1 for details on computing \u0174. We will make use of the following two assumptions:\n\nAssumption 1. M contains the true model M* and \u03a0 contains optimal policies for all models in M.\nAssumption 2. The policy optimizations in Algorithm 1 are performed exactly.\nThe \ufb01rst is a standard realizability assumption. The second assumes access to an optimal planner\nand has been used in several previous works [24, 23, 22, 5]. This does not mean that the planning\nproblem is trivial, but is meant to separate the dif\ufb01culty of planning from that of exploration.\nWe note that DREEM will not be computationally feasible for many problems since the sets Mt will\noften be large, and the algorithm requires iterating over them during both the elimination and planning\nsteps. However, it distills the key ideas of Algorithm 1 and demonstrates its sample ef\ufb01ciency when\noptimizations can be performed exactly. 
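Definition 1 and the elimination step have a small tabular analogue: weight each state by the exploration policy's visitation distribution, sample the last action uniformly, and keep only models whose total-variation error stays below the threshold. A sketch under those assumptions (note this computes the exact misfit W from the model rows rather than an empirical estimate from samples; names are ours):

```python
import numpy as np

def misfit(P_model, P_true, d_prev):
    """W(pi, M, h): expected TV distance between model and true next-state
    distributions, with s_{h-1} ~ d_prev and a uniformly sampled last action."""
    A, S = P_model.shape[0], P_model.shape[1]
    w = 0.0
    for s in range(S):
        for a in range(A):
            tv = 0.5 * np.abs(P_model[a, s] - P_true[a, s]).sum()
            w += d_prev[s] * tv / A
    return w

def update_model_set(models, P_true, d_prev, phi):
    """Algorithm 2 in miniature: eliminate models whose misfit exceeds phi."""
    return [P for P in models if misfit(P, P_true, d_prev) <= phi]
```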
We will later give a practical instantiation of Algorithm 1\nand demonstrate its sample ef\ufb01ciency empirically.\n\n3.2 Structural Complexity Measure\nSince we are considering settings where S is large or in\ufb01nite, it is not meaningful to give sample\ncomplexity results in terms of the number of states, as is often done for tabular algorithms. We instead\nuse a structural complexity measure which is independent of the number of states, and depends on\nthe maximum rank over a set of error matrices, which we de\ufb01ne next.\u00b9\nDe\ufb01nition 2. (Model Mis\ufb01t Matrices) Let M be a model class and \u03a0 a policy class. De\ufb01ne the set\nof matrices A1, ..., AH \u2208 R|\u03a0|\u00d7|M| by Ah(\u03c0, M) = W(\u03c0, M, h) for all \u03c0 \u2208 \u03a0 and M \u2208 M.\nUsing the ranks of error matrices as complexity measures of RL environments was proposed in\n[21, 49]. Although the model mis\ufb01t matrices Ah may themselves be very large, we show next that\ntheir ranks are in fact small in several natural settings.\n\n\u00b9We use a generalized notion of rank with a condition on the row norms of the factorization: for an m \u00d7 n\nmatrix B, denote rank(B, \u03b2) to be the smallest integer k such that B = UV\u22a4 with U \u2208 R^{m\u00d7k}, V \u2208 R^{n\u00d7k}\nand for every pair of rows ui, vj we have \u2016ui\u2016_2 \u00b7 \u2016vj\u2016_2 \u2264 \u03b2. \u03b2 appears in Lemma 3 and Theorem 1.\n\nProposition 1. Assume |S| is \ufb01nite and let Ah be the matrix de\ufb01ned above. Then rank(Ah) \u2264 |S|.\nProposition 2. Let \u0393 denote the true transition matrix of size |S| \u00d7 |S \u00d7 A|, with \u0393(s\u2032, (s, a)) =\nPM*(s\u2032|s, a). Assume that there exist two matrices \u03931, \u03932 of sizes |S| \u00d7 K and K \u00d7 |S \u00d7 A| such\nthat \u0393 = \u03931\u03932. 
Then rank(Ah) \u2264 K.\nThe next proposition, which is a straightforward adaptation of a result from [49]\u00b2, shows that the\nranks of the model mis\ufb01t matrices are also low in factored MDPs [23].\nProposition 3. Consider a factored MDP setting where the state space is given by S = O^d where\nd \u2208 N and O is a small \ufb01nite set, and the transition matrix has a factored structure with L parameters.\nThen rank(Ah) \u2264 L.\n\n3.3 Sample Complexity\n\nNow that we have de\ufb01ned our structural complexity measure, we prove sample complexity results for\nDREEM. We will use a slightly different de\ufb01nition of D(\u03c0, M, M\u2032) than the one in Section 2, in that\nthe last action is sampled uniformly:\nDe\ufb01nition 3. (Predicted Model Disagreement Induced by Policy)\n\nD(\u03c0, M, M\u2032, h) = \u2211_{sh\u22121} \u2211_{ah\u22121} \u2211_{sh} |PM(sh|sh\u22121, ah\u22121) P^{\u03c0,h\u22121}_M(sh\u22121) U(ah\u22121) \u2212 PM\u2032(sh|sh\u22121, ah\u22121) P^{\u03c0,h\u22121}_{M\u2032}(sh\u22121) U(ah\u22121)|\n\nWe begin by proving a lemma which, intuitively, states that if a policy induces disagreement between\ntwo models of the environment, then it will also induce disagreement between at least one of\nthese models and the true model. This means that by searching for and then executing a policy\nwhich induces disagreement between at least two models, the agent will collect experience from the\nenvironment which will enable it to invalidate at least one of them.\nLemma 1. Let M be a set of models and \u03a0 a set of policies. 
If there exist M, M\u2032 \u2208 M, \u03c0 \u2208 \u03a0 and\nh \u2264 H such that D(\u03c0, M, M\u2032, h) > \u03b1, then there exists h\u2032 \u2264 h such that W(\u03c0, M, h\u2032) > \u03b1/(4|A|\u00b7H) or\nW(\u03c0, M\u2032, h\u2032) > \u03b1/(4|A|\u00b7H) (or both).\n\nNext, we give a lemma which states that at any time step, the agent has either found an exploration\npolicy which will lead it to collect experience which allows it to reduce its set of candidate models, or\nhas found an exploitation policy which is close to optimal. Here v\u03c0 is the value of \u03c0 in the true MDP.\nLemma 2. (Explore or Exploit) Suppose the true model M* is never eliminated. At iteration t,\none of the following two conditions must hold: either there exists M \u2208 Mt, ht \u2264 H such that\nW(\u03c0^t_explore, M, ht) > \u03b5/(4H^2|A|^2), or the algorithm returns \u03c0exploit such that v\u03c0exploit > v\u03c0* \u2212 \u03b5.\n\nThe above two lemmas state that at any time step, the agent either reduces its set of candidate models\nor \ufb01nds an exploitation policy which is close to optimal. However, since the initial set of candidate\nmodels may be very large, we need to ensure that many models are discarded at each exploration\nstep. Our next lemma bounds the number of iterations of the algorithm by showing that the set of\ncandidate models is reduced by a constant factor at every step.\nLemma 3. (Iteration Complexity) Let d = max_{1\u2264h\u2264H} rank(Ah) and \u03c6 = \u03b5/(24H^2|A|^2\u221ad). Suppose\nthat |\u0174(\u03c0^t_explore, M, h) \u2212 W(\u03c0^t_explore, M, h)| \u2264 \u03c6 holds for all t, h \u2264 H and M \u2208 M. Then the\nnumber of rounds of Algorithm 1 with the UpdateModelSet routine given by Algorithm 2 is at most\nHd log(\u03b2\u221ad/(2\u03c6))/log(5/3).\n\nThe proof operates by representing each matrix Ah in factored form, which induces an embedding\nof each model in Mt in a d-dimensional space. Minimum volume ellipsoids are then constructed\naround these embeddings. A geometric argument shows that the volume of these ellipsoids shrinks\nby a constant factor from one iteration of the algorithm to the next, leading to a number of updates\nlinear in d. Combining the previous lemmas with a concentration argument, we get our main result:\nTheorem 1. Assuming that M* \u2208 M, for any \u03b5, \u03b4 \u2208 (0, 1] set \u03c6 = \u03b5/(24H^2|A|^2\u221ad) and\ndenote T = Hd log(\u03b2\u221ad/(2\u03c6))/log(5/3). Run Algorithm 1 with inputs (M, n, \u03c6) where n =\n\u0398(H^4|A|^4 d log(T|M||\u03a0|/\u03b4)/\u03b5^2), and the UpdateModelSet routine is given by Algorithm 2. Then\nwith probability at least 1 \u2212 \u03b4, Algorithm 1 outputs a policy \u03c0exploit such that v\u03c0exploit \u2265 v\u03c0* \u2212 \u03b5.\nThe number of trajectories collected is at most \u00d5((H^5 d^2 |A|^4/\u03b5^2) log(T|M||\u03a0|/\u03b4)).\n\n\u00b2Appendix E.2, Proposition 2\n\nNote that the above result requires knowledge of d to set the \u03c6 and n parameters. If this quantity is\nunknown, it can be estimated using a doubling trick which does not affect the algorithm\u2019s asymptotic\nsample complexity. Details can be found in Appendix A.3.\n\n4 Neural-E3: A Practical Instantiation\n\nThe above analysis shows that Algorithm 1 is sample ef\ufb01cient given an idealized instantiation, which\nmay not be computationally practical for large model classes. 
Here we give a computationally\nef\ufb01cient instantiation called Neural-E3, which requires implementing the UpdateModelSet routine\nand the planning routines.\n\n4.1 Model Updates\nWe represent Mt as an ensemble of action-conditional dynamics models {M1, ..., ME}, parame-\nterized by neural networks, which are trained to model the next-state distribution PM (cid:63) (sh+1|sh, a)\nusing the data from the replay buffer R. The models are trained to minimize the following loss:\n\nL(M,R) = E(sh+1,ah,sh)\u223cR[\u2212 log PM (sh+1|sh, ah)]\n\nThe models in M1 are initialized with random weights and the subroutine UpdateModelSet in\nAlgorithm 1 takes as input Mt, performs Nupdate gradient updates to each of the models using\ndifferent minibatches sampled from R, and returns the updated set of models Mt+1. The dynamics\nmodels can be deterministic or stochastic (for example, Mixture Density Networks [4] or Variational\nAutoencoders [26]).\n\n4.2 Planning\n\nThe exploration and exploitation phases require computing a policy to optimize vexplore or vexploit\nand executing it in the environment. If the environment is deterministic, policies can be represented\nas action sequences, in which case we use a generalized version of breadth-\ufb01rst search applicable in\ncontinuous state spaces. This uses a priority queue, where expanded states are assigned a priority\nbased on their minimum distance to other states in the currently expanded search tree. Details can\nbe found in Appendix B.2.1. For stochastic environments, we used implicit policies obtained using\nMonte-Carlo Tree Search (MCTS) [10], where each node in the tree consists of empirical distributions\npredicted by the different models conditioned on the action sequence leading to the node. The agent\nonly executes the \ufb01rst action of the sequence returned by the planning procedure, and replans at every\nstep to account for the stochasticity of the environment. 
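The practical UpdateModelSet step described above amounts to a few gradient updates per ensemble member on the replay buffer, with diversity coming only from random initialization and independent minibatch sampling. A minimal sketch with linear models and squared error standing in for the Gaussian negative log-likelihood (all names and the linear parameterization are illustrative assumptions, not the released code):

```python
import numpy as np

rng = np.random.default_rng(0)

def update_model_set(ensemble, replay, n_updates=500, lr=0.05, batch=16):
    """A few SGD steps per model on the replay buffer; each model draws its
    own random minibatches. Each 'model' is a linear map W predicting the
    next state from state-action features; squared error plays the role of
    the negative log-likelihood in the paper's loss."""
    X, Y = replay  # X: (N, d_in) features of (s_h, a_h); Y: (N, d_out) s_{h+1}
    for W in ensemble:
        for _ in range(n_updates):
            idx = rng.integers(0, len(X), size=batch)
            xb, yb = X[idx], Y[idx]
            grad = 2.0 * xb.T @ (xb @ W - yb) / batch
            W -= lr * grad  # in-place update of this ensemble member
    return ensemble
```

On data the agent has actually visited, the members converge to the same map, while off-distribution they typically still disagree, which is exactly the signal the exploration phase consumes.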
See Appendix B.2.2 for details.\n\n4.3 Exploitation with Off-Policy RL\n\nFor some problems with sparse rewards, it may be computationally impractical to use planning during\nthe exploitation phase, even with a perfect model. Note that much of the exploration phase, which\nuses model disagreement as a \ufb01ctitious reward, can be seen as an MDP with dense rewards, while\nthe rewards in the true MDP may be sparse. In these settings, we use an alternative approach where\na parameterized value function such as a DQN [31] is trained using the experience collected in the\nreplay buffer during exploration. This can be done of\ufb02ine without collecting additional samples from\nthe environment. We also found this useful for problems with antishaped rewards, where the MCTS\nprocedure can be biased away from the optimal actions if they temporarily lead to lower reward than\nsuboptimal ones.\n\n5\n\n\f4.4 Relationship between Idealized and Practical Algorithms\nFor both the idealized and practical algorithms, Mt represents a set of models with low error on the\ncurrent replay buffer. In the idealized algorithm, models with high error are eliminated explicitly in\nAlgorithm 2, while in the practical algorithm, models with high error are avoided by the optimization\nprocedure. The main difference between the two algorithms is that the idealized version maintains\nall models in the model class which have low error (which includes the true model), whereas the\npractical version only maintains a subset due to time and memory constraints. A potential failure\nmode of the practical algorithm would be if all the models wrongly agree in their predictions in some\nunexplored part of the state-action space which leads to high reward. 
However, in practice we found\nthat using different initializations and minibatches was suf\ufb01cient to obtain a diverse set of models,\nand that using even a relatively small ensemble (4 to 8 models) led to successful exploration.\n\n5 Related Work\n\nTheoretical guarantees for a number of model-based RL algorithms exist in the tabular setting\n[24, 5, 47, 45] and in the continuous setting when the dynamics are assumed to be linear [51, 1, 11].\nThe Metric-E3 algorithm [22] operates in general state spaces, but its sample complexity depends\non the covering number which may be exponential in dimension. The algorithm of [29] addresses\ngeneral model classes and optimizes lower bounds on the value function, and provably converges to\na locally optimal policy with a number of samples polynomial in the dimension of the state space.\nIt also admits an approximate instantiation which was shown to work well in continuous control\ntasks. The work of [49] provides an algorithm which provably recovers a globally near-optimal\npolicy with polynomial sample complexity using a structural complexity measure which we adapt\nfor our analysis, but does not investigate practical approximations. The algorithm we analyze is\nfundamentally different from both of these approaches, as it uses disagreement over predicted states\nrather than optimism to drive exploration.\nOur practical approximation is closely related to the MAX algorithm [44], which also uses disagree-\nment between different models in an ensemble to drive exploration. Our version differs in a few\nways i) we use maximal disagreement rather than variance to measure uncertainty, as this re\ufb02ects our\ntheoretical analysis ii) we de\ufb01ne the exploration MDP differently, by propagating the state predictions\nof the different models rather than sampling at each step iii) we explicitly address the exploitation step,\nwhereas they focused primarily on exploration. 
The work of [40] also used disagreement between\nsingle-step predictions to train an exploration policy.\nSeveral works have empirically demonstrated the sample ef\ufb01ciency of model-based RL in continuous\nsettings [2, 12, 46, 34, 8], including with high-dimensional images [17, 19, 16]. These have primarily\nfocused on settings with dense rewards where simple exploration was suf\ufb01cient, or where rich\nobservational data was available.\nOther approaches to exploration include augmenting rewards with exploration bonuses, such as\ninverse counts in the tabular setting [48, 27], pseudo-counts derived from density models over the\nstate space [3, 38], prediction errors of either a dynamics model [39] or a randomly initialized network\n[7], or randomizing value functions [36, 37]. These have primarily focused on model-free methods,\nwhich have been known to have high sample complexity despite yielding good \ufb01nal performance.\n\n6 Experiments\n\nWe now give empirical results for the Neural-E3 algorithm described in Section 4. See Appendix C\nfor experimental details and https://github.com/mbhenaff/neural-e3 for source code.\n\n6.1 Stochastic Combination Lock\n\nWe begin with a set of experiments on the stochastic combination lock environment described in\n[14] and shown in Figure 1(a). These environments consist of H levels with 3 states per level and 4\nactions. Two of the states lead to high reward and the third is a dead state from which it is impossible\nto recover. The effect of actions are \ufb02ipped with probability 0.1, and the one-hot state encodings\nare appended with random Bernoulli noise to increase the number of possible observations. We\n\n6\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 1: Environments tested. a) Stochastic combination lock: The agent must reach the red\nstates to collect high reward while avoiding the dead states (black) from which it cannot recover. b)\nMazes: The agent (green) must navigate through the maze to reach the goal (red). 
Different mazes\nare generated each episode, which requires generalizing across mazes (colors are changed here for\nreadability) c) Continuous Control: Classic control tasks requiring non-trivial exploration.\n\nexperimented with two task variants: a \ufb01rst where the rewards are zero everywhere except for the red\nstates, and a second where small, antishaped rewards encourage the agent to transition to the dead\nstates (see Appendix C.1 for details). This tests an algorithm\u2019s robustness to poor local optima.\nWe compare against three other methods: a double DQN [18] with prioritized experience replay [41]\nusing the OpenAI Baselines implementation [13], a Proximal Policy Optimization (PPO) agent [42],\nand a PPO agent with a Random Network Distillation (RND) exploration bonus [7]. For Neural-E3,\nwe used stochastic dynamics models outputting the parameters of a multivariate Bernoulli distribution,\nand the MCTS procedure described in Appendix B.2.2 during the exploration phase. We used the\nDQN-based method described in Section 4.3 for the exploit phase.\nFigure 2(a) shows performance across 5 random seeds for the \ufb01rst variant of the task. For all horizons,\nNeural-E3 achieves the optimal reward across most seeds. The DQN also performs well, although it\noften requires more samples than Neural-E3. For longer horizons, PPO never collects rewards, while\nPPO+RND eventually succeeds given a large number of episodes (see Appendix C.1).\nFigure 2(b) shows results for the task variant with antishaped rewards. For longer horizons, Neural-E3\nis the only method to achieve the globally optimal reward, whereas none of the other methods get\npast the poor local optimum induced by the misleading rewards. 
Note that Neural-E3 actually obtains\nless reward than the other methods during its exploration phase, but this pays off during exploitation\nsince it enables the agent to eventually discover states with much higher reward.\n\n6.2 Maze Environment\n\nWe next evaluated our approach on a maze environment, which is a modi\ufb01ed version of the Collect\ndomain [35], shown in Figure 1(b). States consist of RGB images where the three channels represent\nthe walls, the agent and the goal respectively. The agent receives a reward of 2.0 for reaching the\ngoal, \u22120.5 for hitting a wall and \u22120.2 otherwise. Mazes are generated randomly for each episode,\nthus the number of states is extremely large and the agent must learn to generalize across mazes. Our\ndynamics models are action-conditional convolutional networks taking as input an image and action\nand predicting the next image and reward. We used the deterministic search procedure described in\nSection B.2.1 for planning.\nWe compared to two other approaches. The \ufb01rst was a double DQN with prioritized experience\nreplay as before. The second was a model-based agent identical to ours, except that it uses a uniform\nexploration policy during the explore phase. This is similar to the PETS algorithm [8] applied to\ndiscrete action spaces, as it optimizes rewards over an ensemble of dynamics models. We call this\nUE2, for Uniform Explore Exploit.\nPerformance measured by reward across 3 random seeds is shown in Figure 2(c) for different maze\nsizes. The DQN agent is able to solve the smallest 5 \u00d7 5 mazes after a large number of episodes,\nbut is not able to learn meaningful behavior for larger mazes. The UE2 and Neural-E3 agents both\nperform similarly for the 5 \u00d7 5 mazes, but the relative performance of Neural-E3 improves as the\nsize of the maze becomes larger. 
Note also that the Neural-E3 agent collects more reward during\nits exploration phase, even though it is not explicitly optimizing for reward but rather for model\ndisagreement. Figure 5 in Appendix C.2 shows the model predictions for an action sequence executed\nby the Neural-E3 agent during the exploration phase. The predictions of the different models agree\nuntil the reward is reached, which is a rare event.\n\n(a) Stochastic Combination Lock\n\n(b) Stochastic Combination Lock with antishaped rewards\n\n(c) Maze\n\n(d) Classic Control\n\nFigure 2: Comparison of methods across different domains. Solid lines represent median performance\nacross seeds, shaded region represents range between best and worst seeds.\n\n6.3 Continuous Control\n\nWe then evaluated our approach on two continuous control domains, shown in Figure 1(c). MountainCar [32] is an environment with simple non-linear dynamics and continuous state space (S \u2286 R^2)\nwhere the agent must drive an underpowered car up a steep hill, which requires building momentum\nby \ufb01rst driving up the opposite end of the hill. The agent only receives reward at the top of the hill,\nhence this requires non-trivial exploration. Acrobot [50] requires swinging a simple under-actuated\nrobot above a given height, also with a continuous state space (S \u2286 R^6). Both tasks have discrete\naction spaces with |A| = 3.\nWe found that even planning with a perfect model was computationally impractical due to the sparsity\nof rewards, hence we used the method described in Section 4.3, where we trained a DQN of\ufb02ine\nusing the data collected during exploration. Results for Neural-E3, DQN and RND agents across 5\nrandom seeds are shown in Figure 2(d). For Mountain Car, the DQN is able to solve the task but\nrequires around 1200 episodes to do so. Neural-E3 is able to quickly explore, and solves the task to a\nsimilar degree of success in under 300 episodes. 
The RND agent only starts to collect reward after\n10K episodes. For the Acrobot task, Neural-E3 also explores quickly, although its increase in sample\nef\ufb01ciency is less pronounced compared to the DQN. The RND agent is also able to make quicker\nprogress on this task, which suggests that the exploration problem may not be as dif\ufb01cult.\n\n7 Conclusion\n\nThis work extends the classic E3 algorithm to operate in large or in\ufb01nite state spaces. On the\ntheoretical side, we present a model-elimination based version of the algorithm which provably\nrequires only a polynomial number of samples to learn a near-optimal policy with high probability.\nEmpirically, we show that this algorithm can be approximated using neural networks and still provide\ngood sample ef\ufb01ciency in practice. An interesting direction for future work would be combining the\nexploration and exploitation phases in a uni\ufb01ed process, which has been done in the tabular setting\n[5]. Another direction would be to explicitly encourage disagreement between different models in\nthe ensemble for unseen inputs, in order to better approximate the maximal disagreement between\nmodels in a version space which we use in our idealized algorithm. 
Such ideas have been proposed in active learning [9] and contextual bandits [15], and could potentially be adapted to multi-step RL.

Acknowledgments

I would like to thank Akshay Krishnamurthy, John Langford, Alekh Agarwal and Miro Dudik for helpful discussions and feedback.

References

[1] Y. Abbasi-Yadkori and C. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In S. M. Kakade and U. von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 1–26, Budapest, Hungary, 2011. PMLR.
[2] C. G. Atkeson and J. C. Santamaria. A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation, pages 3557–3564. IEEE Press, 1997.
[3] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. CoRR, abs/1606.01868, 2016.
[4] C. M. Bishop. Mixture density networks. Technical report, 1994.
[5] R. I. Brafman and M.
Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, Mar. 2003.
[6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
[7] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019.
[8] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS, 2018.
[9] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, May 1994.
[10] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and Games, volume 4630 of Lecture Notes in Computer Science, 2006.
[11] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. CoRR, abs/1710.01688, 2017.
[12] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, 2011.
[13] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
[14] S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Provably efficient RL with rich observations via latent state decoding. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[15] D. J. Foster, A. Agarwal, M. Dudík, H. Luo, and R. E. Schapire. Practical contextual bandits with regression oracles. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 1534–1543. PMLR, 2018.
[16] D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wallach, H. Larochelle, K.
Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2450–2462. Curran Associates, Inc., 2018.
[17] D. Hafner, T. P. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. CoRR, abs/1811.04551, 2018.
[18] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2094–2100. AAAI Press, 2016.
[19] M. Henaff, A. Canziani, and Y. LeCun. Model-predictive policy learning with uncertainty regularization for driving in dense traffic. In International Conference on Learning Representations, 2019.
[20] M. Hessel, J. Modayil, H. P. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
[21] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1704–1713, Sydney, Australia, 2017. PMLR.
[22] S. M. Kakade, M. Kearns, and J. Langford. Exploration in metric state spaces. In ICML, 2003.
[23] M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'99, pages 740–747, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[24] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49(2-3):209–232, Nov. 2002.
[25] D. P. Kingma and J. Ba.
Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. arXiv:1412.6980.
[26] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
[27] J. Z. Kolter and A. Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 513–520, New York, NY, USA, 2009. ACM.
[28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2016.
[29] Y. Luo, H. Xu, Y. Li, Y. Tian, T. Darrell, and T. Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations, 2019.
[30] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, NY, USA, 2016. PMLR.
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
[32] A. W. Moore. Efficient memory-based learning for robot control. Technical report, 1990.
[33] A. Mueller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
[34] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine.
Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566, 2018.
[35] J. Oh, S. Singh, and H. Lee. Value prediction network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6118–6128. Curran Associates, Inc., 2017.
[36] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4026–4034. Curran Associates, Inc., 2016.
[37] I. Osband, D. Russo, Z. Wen, and B. Van Roy. Deep exploration via randomized value functions. CoRR, abs/1703.07608, 2017.
[38] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 2721–2730. JMLR.org, 2017.
[39] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 488–489, 2017.
[40] D. Pathak, D. Gandhi, and A. Gupta. Self-supervised exploration via disagreement. In ICML, 2019.
[41] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2016.
[42] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[43] S. Zhang. Modularized implementation of deep RL algorithms in PyTorch. https://github.com/ShangtongZhang/DeepRL, 2018.
[44] P. Shyam, W. Jaskowski, and F. Gomez. Model-based active exploration. CoRR, abs/1810.12162, 2018.
[45] J. Sorg, S. Singh, and R. L. Lewis.
Variance-based rewards for approximate Bayesian reinforcement learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pages 564–571, Arlington, Virginia, United States, 2010. AUAI Press.
[46] A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 4739–4748, 2018.
[47] A. L. Strehl and M. L. Littman. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 856–863, New York, NY, USA, 2005. ACM.
[48] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci., 74(8):1309–1331, Dec. 2008.
[49] W. Sun, N. Jiang, A. Krishnamurthy, A. Agarwal, and J. Langford. Model-based reinforcement learning in contextual decision processes. CoRR, abs/1811.08540, 2018.
[50] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pages 1038–1044. MIT Press, 1996.
[51] R. S. Sutton, C. Szepesvári, A. Geramifard, and M. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI'08, pages 528–536, Arlington, Virginia, United States, 2008. AUAI Press.