{"title": "EX2: Exploration with Exemplar Models for Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2577, "page_last": 2587, "abstract": "Deep reinforcement learning algorithms have been shown to learn complex tasks using highly general policy classes. However, sparse reward problems remain a significant challenge. Exploration methods based on novelty detection have been particularly successful in such settings but typically require generative or predictive models of the observations, which can be difficult to train when the observations are very high-dimensional and complex, as in the case of raw images. We propose a novelty detection algorithm for exploration that is based entirely on discriminatively trained exemplar models, where classifiers are trained to discriminate each visited state against all others. Intuitively, novel states are easier to distinguish against other states seen during training. We show that this kind of discriminative modeling corresponds to implicit density estimation, and that it can be combined with count-based exploration to produce competitive results on a range of popular benchmark tasks, including state-of-the-art results on challenging egocentric observations in the vizDoom benchmark.", "full_text": "EX2: Exploration with Exemplar Models for Deep\n\nReinforcement Learning\n\nJustin Fu\u2217\n\nJohn D. Co-Reyes\u2217\n\nSergey Levine\n\nUniversity of California Berkeley\n\n{justinfu,jcoreyes,svlevine}@eecs.berkeley.edu\n\nAbstract\n\nDeep reinforcement learning algorithms have been shown to learn complex tasks\nusing highly general policy classes. However, sparse reward problems remain a\nsigni\ufb01cant challenge. 
Exploration methods based on novelty detection have been particularly successful in such settings but typically require generative or predictive models of the observations, which can be difficult to train when the observations are very high-dimensional and complex, as in the case of raw images. We propose a novelty detection algorithm for exploration that is based entirely on discriminatively trained exemplar models, where classifiers are trained to discriminate each visited state against all others. Intuitively, novel states are easier to distinguish against other states seen during training. We show that this kind of discriminative modeling corresponds to implicit density estimation, and that it can be combined with count-based exploration to produce competitive results on a range of popular benchmark tasks, including state-of-the-art results on challenging egocentric observations in the vizDoom benchmark.

1 Introduction

Recent work has shown that methods that combine reinforcement learning with rich function approximators, such as deep neural networks, can solve a range of complex tasks, from playing Atari games (Mnih et al., 2015) to controlling simulated robots (Schulman et al., 2015). Although deep reinforcement learning methods allow for complex policy representations, they do not by themselves solve the exploration problem: when the reward signals are rare and sparse, such methods can struggle to acquire meaningful policies. Standard exploration strategies, such as ε-greedy strategies (Mnih et al., 2015) or Gaussian noise (Lillicrap et al., 2015), are undirected and do not explicitly seek out interesting states. A promising avenue for more directed exploration is to explicitly estimate the novelty of a state, using predictive models that generate future states (Schmidhuber, 1990; Stadie et al., 2015; Achiam & Sastry, 2017) or model state densities (Bellemare et al., 2016; Tang et al., 2017; Abel et al., 2016).
Related concepts such as count-based bonuses have been shown to provide sub-\nstantial speedups in classic reinforcement learning (Strehl & Littman, 2009; Kolter & Ng, 2009), and\nseveral recent works have proposed information-theoretic or probabilistic approaches to exploration\nbased on this idea (Houthooft et al., 2016; Chentanez et al., 2005) by drawing on formal results in\nsimpler discrete or linear systems (Bubeck & Cesa-Bianchi, 2012). However, most novelty estimation\nmethods rely on building generative or predictive models that explicitly model the distribution over\nthe current or next observation. When the observations are complex and high-dimensional, such as in\nthe case of raw images, these models can be dif\ufb01cult to train, since generating and predicting images\nand other high-dimensional objects is still an open problem, despite recent progress (Salimans et al.,\n2016). Though successful results with generative novelty models have been reported with simple\nsynthetic images, such as in Atari games (Bellemare et al., 2016; Tang et al., 2017), we show in our\n\n\u2217equal contribution.\n\n\fexperiments that such generative methods struggle with more complex and naturalistic observations,\nsuch as the ego-centric image observations in the vizDoom benchmark.\nHow can we estimate the novelty of visited states, and thereby provide an intrinsic motivation signal\nfor reinforcement learning, without explicitly building generative or predictive models of the state or\nobservation? The key idea in our EX2 algorithm is to estimate novelty by considering how easy it is\nfor a discriminatively trained classi\ufb01er to distinguish a given state from other states seen previously.\nThe intuition is that, if a state is easy to distinguish from other states, it is likely to be novel. To\nthis end, we propose to train exemplar models for each state that distinguish that state from all other\nobserved states. 
We present two key technical contributions that make this into a practical exploration\nmethod. First, we describe how discriminatively trained exemplar models can be used for implicit\ndensity estimation, allowing us to unify this intuition with the theoretically rigorous framework of\ncount-based exploration. Our experiments illustrate that, in simple domains, the implicitly estimated\ndensities provide good estimates of the underlying state densities without any explicit generative\ntraining. Second, we show how to amortize the training of exemplar models to prevent the total\nnumber of classi\ufb01ers from growing with the number of states, making the approach practical and\nscalable. Since our method does not require any explicit generative modeling, we can use it on a\nrange of complex image-based tasks, including Atari games and the vizDoom benchmark, which\nhas complex 3D visuals and extensive camera motion due to the egocentric viewpoint. Our results\nshow that EX2 matches the performance of generative novelty-based exploration methods on simpler\ntasks, such as continuous control benchmarks and Atari, and greatly exceeds their performance on the\ncomplex vizDoom domain, indicating the value of implicit density estimation over explicit generative\nmodeling for intrinsic motivation.\n\n2 Related Work\n\nIn \ufb01nite MDPs, exploration algorithms such as E3 (Kearns & Singh, 2002) and R-max (Brafman &\nTennenholtz, 2002) offer theoretical optimality guarantees. However, these methods typically require\nmaintaining state-action visitation counts, which can make extending them to high dimensional and/or\ncontinuous states very challenging. Exploring in such state spaces has typically involved strategies\nsuch as introducing distance metrics over the state space (Pazis & Parr, 2013; Kakade et al., 2003),\nand approximating the quantities used in classical exploration methods. 
Prior works have employed\napproximations for the state-visitation count (Tang et al., 2017; Bellemare et al., 2016; Abel et al.,\n2016), information gain, or prediction error based on a learned dynamics model (Houthooft et al.,\n2016; Stadie et al., 2015; Achiam & Sastry, 2017). Bellemare et al. (2016) show that count-based\nmethods in some sense bound the bonuses produced by exploration incentives based on intrinsic\nmotivation, such as model uncertainty or information gain, making count-based or density-based\nbonuses an appealing and simple option.\nOther methods avoid tackling the exploration problem directly and use randomness over model\nparameters to encourage novel behavior (Chapelle & Li, 2011). For example, bootstrapped DQN\n(Osband et al., 2016) avoids the need to construct a generative model of the state by instead training\nmultiple, randomized value functions and performs exploration by sampling a value function, and\nexecuting the greedy policy with respect to the value function. While such methods scale to complex\nstate spaces as well as standard deep RL algorithms, they do not provide explicit novelty-seeking\nbehavior, but rather a more structured random exploration behavior.\nAnother direction explored in prior work is to examine exploration in the context of hierarchical\nmodels. An agent that can take temporally extended actions represented as action primitives or skills\ncan more easily explore the environment (Stolle & Precup, 2002). Hierarchical reinforcement learning\nhas traditionally tried to exploit temporal abstraction (Barto & Mahadevan, 2003) and relied on semi-\nMarkov decision processes. A few recent works in deep RL have used hierarchies to explore in sparse\nreward environments (Florensa et al., 2017; Heess et al., 2016). However, learning a hierarchy is\ndif\ufb01cult and has generally required curriculum learning or manually designed subgoals (Kulkarni\net al., 2016). 
In this work, we discuss a general exploration strategy that is independent of the design of the policy and applicable to any architecture, though our experiments focus specifically on deep reinforcement learning scenarios, including image-based navigation, where the state representation is not conducive to simple count-based metrics or generative models.

Concurrently with this work, Pathak et al. (2017) proposed to use discriminatively trained exploration bonuses by learning state features which are trained to predict the action from state transition pairs. Then, given a state and action, their model predicts the features of the next state, and the bonus is calculated from the prediction error. In contrast to our method, this concurrent work does not attempt to provide a probabilistic model of novelty and does not perform any sort of implicit density estimation. Since their method learns an inverse dynamics model, it does not provide for any mechanism to handle novel events that do not correlate with the agent's actions, though it does succeed in avoiding the need for generative modeling.

3 Preliminaries

In this paper, we consider a Markov decision process (MDP), defined by the tuple (S, A, T, R, γ, ρ_0). S and A are the state and action spaces, respectively. The transition distribution T(s'|a, s), initial state distribution ρ_0(s), and reward function R(s, a) are unknown in the reinforcement learning (RL) setting and can only be queried through interaction with the MDP. The goal of reinforcement learning is to find the optimal policy π* that maximizes the expected sum of discounted rewards,

  π* = arg max_π E_{τ∼π} [ Σ_{t=0}^{T} γ^t R(s_t, a_t) ] ,

where τ denotes a trajectory (s_0, a_0, ..., s_T, a_T) with distribution π(τ) = ρ_0(s_0) Π_{t=0}^{T} π(a_t|s_t) T(s_{t+1}|s_t, a_t).
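As a small illustration of this objective, the following sketch (the reward values are hypothetical, not from the paper) computes the discounted return of a single trajectory:

```python
# Discounted return of one trajectory: sum_t gamma^t * R(s_t, a_t).
# The reward sequence below is an illustrative sparse-reward example.

def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Sparse reward: only the final step pays off, so the return is gamma^2 * 1.0.
sparse_rewards = [0.0, 0.0, 1.0]
ret = discounted_return(sparse_rewards, gamma=0.99)
assert abs(ret - 0.99 ** 2) < 1e-12
```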
Our experiments evaluate episodic tasks with a policy gradient RL algorithm, though extensions to infinite horizon settings or other algorithms, such as Q-learning and actor-critic, are straightforward.

Count-based exploration algorithms maintain a state-action visitation count N(s, a) and encourage the agent to visit rarely seen states, operating on the principle of optimism under uncertainty. This is typically achieved by adding a reward bonus for visiting rare states. For example, MBIE-EB (Strehl & Littman, 2009) uses a bonus of β/√N(s, a), where β is a constant, and BEB (Kolter & Ng, 2009) uses a bonus of β/(N(s, a) + |S|). In finite state and action spaces, these methods are PAC-MDP (for MBIE-EB) or PAC-BAMDP (for BEB), roughly meaning that the agent acts suboptimally for only a polynomial number of steps. In domains where explicit counting is impractical, pseudo-counts can be used based on a density estimate p(s, a), which is typically obtained with some sort of generatively trained density estimation model (Bellemare et al., 2016). We will describe how we can estimate densities using only discriminatively trained classifiers, followed by a discussion of how this implicit estimator can be incorporated into a pseudo-count novelty bonus method.

4 Exemplar Models and Density Estimation

We begin by describing our discriminative model used to predict novelty of states visited during training. We highlight a connection between this particular form of discriminative model and density estimation, and in Section 5 describe how to use this model to generate reward bonuses.

4.1 Exemplar Models

To avoid the need for explicit generative models, our novelty estimation method uses exemplar models. Given a dataset X = {x_1, ..., x_n}, an exemplar model consists of a set of n classifiers or discriminators {D_{x_1}, ..., D_{x_n}}, one for each data point.
Each individual discriminator Dxi is trained\nto distinguish a single positive data point xi, the \u201cexemplar,\u201d from the other points in the dataset\nX. We borrow the term \u201cexemplar model\u201d from Malisiewicz et al. (2011), which coined the term\n\u201cexemplar SVM\u201d to refer to a particular linear model trained to classify each instance against all others.\nHowever, to our knowledge, our work is the \ufb01rst to apply this idea to exploration for reinforcement\nlearning. In practice, we avoid the need to train n distinct classi\ufb01ers by amortizing through a single\nexemplar-conditioned network, as discussed in Section 6.\nLet PX (x) denote the data distribution over X , and let Dx\u2217 (x) : X \u2192 [0, 1] denote the discriminator\nassociated with exemplar x\u2217. In order to obtain correct density estimates, as discussed in the next\nsection, we present each discriminator with a balanced dataset, where half of the data consists of the\nexemplar x\u2217 and half comes from the background distribution PX (x). Each discriminator is then\ntrained to model a Bernoulli distribution Dx\u2217 (x) = P (x = x\u2217|x) via maximum likelihood. Note\nthat the label x = x\u2217 is noisy because data that is extremely similar or identical to x\u2217 may also\noccur in the background distribution PX (x), so the classi\ufb01er does not always output 1. 
To obtain the maximum likelihood solution, the discriminator is trained to optimize the following cross-entropy objective:

  D_{x*} = arg max_{D ∈ D} ( E_{δ_{x*}}[log D(x)] + E_{P_X}[log(1 − D(x))] ) .   (1)

We discuss practical amortized methods that avoid the need to train n discriminators in Section 6, but to keep the derivation in this section simple, we consider independent discriminators for now.

4.2 Exemplar Models as Implicit Density Estimation

To show how the exemplar model can be used for implicit density estimation, we begin by considering an infinitely powerful, optimal discriminator, for which we can make an explicit connection between the discriminator and the underlying data distribution P_X(x):

Proposition 1. (Optimal Discriminator) For a discrete distribution P_X(x), the optimal discriminator D_{x*} for exemplar x* satisfies

  D_{x*}(x) = δ_{x*}(x) / ( δ_{x*}(x) + P_X(x) )   and   D_{x*}(x*) = 1 / ( 1 + P_X(x*) ) .

Proof. The proof is obtained by taking the derivative of the loss in Eq. (1) with respect to D(x), setting it to zero, and solving for D(x).

It follows that, if the discriminator is optimal, we can recover the probability of a data point P_X(x*) by evaluating the discriminator at its own exemplar x*, according to

  P_X(x*) = ( 1 − D_{x*}(x*) ) / D_{x*}(x*) .   (2)

For continuous domains, δ_{x*}(x*) → ∞, so D(x*) → 1. This means we are unable to recover P_X(x) via Eq. (2). However, we can smooth the delta by adding noise ε ∼ q(ε) to the exemplar x* during training, which allows us to recover exact density estimates by solving for P_X(x). For example, if we let q = N(0, σ²I), then the optimal discriminator evaluated at x* satisfies

  D_{x*}(x*) = (2πσ²)^{−d/2} / ( (2πσ²)^{−d/2} + P_X(x*) ) .

Even if we do not know the noise variance, we have

  P_X(x*) ∝ ( 1 − D_{x*}(x*) ) / D_{x*}(x*) .   (3)

This proportionality holds for any noise q as long as (δ_{x*} ∗ q)(x*) (where ∗ denotes convolution) is the same for every x*. The reward bonus we describe in Section 5 is invariant to the normalization factor, so proportional estimates are sufficient.

In practice, we can get density estimates that are better suited for exploration by introducing smoothing, which involves adding noise to the background distribution P_X, to produce the estimator

  D_{x*}(x) = (δ_{x*} ∗ q)(x) / ( (δ_{x*} ∗ q)(x) + (P_X ∗ q)(x) ) .

We then recover our density estimate as (P_X ∗ q)(x*). In the case when P_X is a collection of delta functions around data points, this is equivalent to kernel density estimation using the noise distribution as a kernel. With Gaussian noise q = N(0, σ²I), this is equivalent to using an RBF kernel.

4.3 Latent Space Smoothing with Noisy Discriminators

In the previous section, we discussed how adding noise can provide for smoothed density estimates, which is especially important in complex or continuous spaces, where all states might be distinguishable with a powerful enough discriminator. Unfortunately, for high-dimensional states, such as images, adding noise directly to the state often does not produce meaningful new states, since the distribution of states lies on a thin manifold, and any added noise will lift the noisy state off of this manifold. In this section, we discuss how we can learn a smoothing distribution by injecting the noise into a learned latent space, rather than adding it to the original states.

Formally, we introduce a latent variable z.
We wish to train an encoder distribution q(z|x) and a latent space classifier p(y|z) = D(z)^y (1 − D(z))^{1−y}, where y = 1 when x = x* and y = 0 when x ≠ x*. We additionally regularize the noise distribution against a prior distribution p(z), which in our case is a unit Gaussian. Letting p̃(x) = (1/2)δ_{x*}(x) + (1/2)p_X(x) denote the balanced training distribution from before, we can learn the latent space by maximizing the objective

  max_{p_{y|z}, q_{z|x}} E_{p̃} [ E_{q(z|x)}[log p(y|z)] − D_KL( q(z|x) || p(z) ) ] .   (4)

Intuitively, this objective optimizes the noise distribution so as to maximize classification accuracy while transmitting as little information through the latent space as possible. This causes z to capture only the factors of variation in x that are most informative for distinguishing points from the exemplar, resulting in noise that stays on the state manifold. For example, in the Atari domain, latent space noise might correspond to smoothing over the location of the player and moving objects on the screen, in contrast to performing pixel-wise Gaussian smoothing.

Letting q(z|y = 1) = ∫_x δ_{x*}(x) q(z|x) dx and q(z|y = 0) = ∫_x p_X(x) q(z|x) dx denote the marginalized positive and negative densities over the latent space, we can characterize the optimal discriminator and encoder distributions as follows. For any encoder q(z|x), the optimal discriminator D(z) satisfies:

  p(y = 1|z) = D(z) = q(z|y = 1) / ( q(z|y = 1) + q(z|y = 0) ) ,

and for any discriminator D(z), the optimal encoder distribution satisfies:

  q(z|x) ∝ D(z)^{y_soft(x)} (1 − D(z))^{1−y_soft(x)} p(z) ,

where y_soft(x) = p(y = 1|x) = δ_{x*}(x) / ( δ_{x*}(x) + p_X(x) ) is the average label of x. These can be obtained by differentiating the objective, and the full derivation is included in Appendix A.1.
Intuitively, q(z|x)\nis equal to the prior p(z) by default, which carries no information about x. It then scales up the\nprobability on latent codes z where the discriminator is con\ufb01dent and correct. To recover a density\nestimate, we estimate D(x) = Eq[D(z)] and apply Eq. (3) to obtain the density.\n\n4.4 Smoothing from Suboptimal Discriminators\n\nIn our previous derivations, we assume an optimal, in\ufb01nitely powerful discriminator which can\nemit a different value D(x) for every input x. However, this is typically not possible except for\nsmall, countable domains. A secondary but important source of density smoothing occurs when the\ndiscriminator has dif\ufb01culty distinguishing two states x and x(cid:48). In this case, the discriminator will\naverage over the outputs of the in\ufb01nitely powerful discriminator. This form of smoothing comes from\nthe inductive bias of the discriminator, which is dif\ufb01cult to quantify. In practice, we typically found\nthis effect to be bene\ufb01cial for our model rather than harmful. An example of such smoothed density\nestimates is shown in Figure 2. Due to this effect, adding noise is not strictly necessary to bene\ufb01t\nfrom smoothing, though it provides for signi\ufb01cantly better control over the degree of smoothing.\n\n5 EX2: Exploration with Exemplar Models\n\nWe can now describe our exploration algorithm based on implicit density models. Pseudocode for a\nbatch policy search variant using the single exemplar model is shown in Algorithm 1. Online variants\nfor other RL algorithms, such as Q-learning, are also possible. In order to apply the ideas from\ncount-based exploration described in Section 3, we must approximate the state visitation counts\nN (s) = nP (s), where P (s) is the distribution over states visited during training. Note that we can\neasily use state-action counts N (s, a), but we omit the action for simplicity of notation. 
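As a concrete check of the implicit estimator, the sketch below (with an illustrative discrete state space and visit counts, not data from the paper) builds the optimal discriminators of Proposition 1 analytically, recovers the state distribution via Eq. (2), and converts it into the pseudo-count style bonuses discussed in Section 3:

```python
import math
from collections import Counter

# Replay-buffer contents over a discrete state space (illustrative data).
buffer = ["A"] * 90 + ["B"] * 9 + ["C"] * 1
n = len(buffer)
P = {s: c / n for s, c in Counter(buffer).items()}

# Proposition 1: the optimal discriminator for exemplar x* satisfies
# D_{x*}(x*) = 1 / (1 + P_X(x*)); Eq. (2) inverts it to recover the density.
D = {s: 1.0 / (1.0 + P[s]) for s in P}
P_hat = {s: (1.0 - D[s]) / D[s] for s in P}
for s in P:
    assert abs(P_hat[s] - P[s]) < 1e-12  # implicit estimate matches P_X

# Pseudo-counts N(s) = n * P_hat(s), with two possible bonus forms: the
# count-based 1/sqrt(N(s)) bonus, or the heuristic -log p_hat(s) bonus.
N = {s: n * P_hat[s] for s in P}
bonus_count = {s: 1.0 / math.sqrt(N[s]) for s in P}
bonus_log = {s: -math.log(P_hat[s]) for s in P}

# The rarely visited state receives the largest exploration bonus.
assert bonus_count["C"] > bonus_count["B"] > bonus_count["A"]
assert bonus_log["C"] > bonus_log["B"] > bonus_log["A"]
```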
To generate approximate samples from P(s), we use a replay buffer B, which is a first-in first-out (FIFO) queue that holds previously visited states. Our exemplars are the states we wish to score, which are the states in the current batch of trajectories. In an online algorithm, we would instead train a discriminator after receiving each new observation, and compute the bonus in the same manner. Given the output of discriminators trained to optimize Eq. (1), we augment the reward with a function of the "novelty" of the state (where β is a hyperparameter that can be tuned to the magnitude of the task reward): R'(s, a) = R(s, a) + βf(D_s(s)).

Algorithm 1 EX2 for batch policy optimization
1: Initialize replay buffer B
2: for iteration i in {1, . . . , N} do
3:   Sample trajectories {τ_j} from policy π_i
4:   for state s in {τ} do
5:     Sample a batch of negatives {s'_k} from B.
6:     Train discriminator D_s to optimize Eq. (1) with positive s and negatives {s'_k}.
7:     Compute reward R'(s, a) = R(s, a) + βf(D_s(s))
8:   end for
9:   Improve π_i with respect to R'(s, a) using any policy optimization method.
10:  B ← B ∪ {τ_i}
11: end for

In our experiments, we use the heuristic bonus −log p(s), since normalization constants become absorbed by the baselines used in typical RL algorithms. For discrete domains, we can also use a count-based bonus 1/√N(s) (Tang et al., 2017), where N(s) = nP(s) and n is the size of the replay buffer B. A summary of EX2 for a generic batch reinforcement learner is shown in Algorithm 1.

6 Model Architecture

To process complex observations such as images, we implement our exemplar model using neural networks, with convolutional models used for image-based domains.
To reduce the computational\ncost of training such large per-exemplar classi\ufb01ers, we explore two methods for amortizing the\ncomputation across multiple exemplars.\n\n6.1 Amortized Multi-Exemplar Model\n\nInstead of training a separate classi\ufb01er for each exemplar, we can instead train a single model that is\nconditioned on the exemplar x\u2217. When using the latent space formulation, we condition the latent\nspace discriminator p(y|z) on an encoded version of x\u2217 given by q(z\u2217|x\u2217), resulting in a classi\ufb01er\nfor the form p(y|z, z\u2217) = D(z, z\u2217)y(1 \u2212 D(z, z\u2217))1\u2212y. The advantage of this amortized model is\nthat it does not require us to train new discriminators from scratch at each iteration, and provides\nsome degree of generalization for density estimation at new states. A diagram of this architecture is\nshown in Figure 1. The amortized architecture has the appearance of a comparison operator: it is\ntrained to output 0 when x\u2217 (cid:54)= x, and the optimal discriminator values covered in Section 4 when\nx\u2217 = x, subject to the smoothing imposed by the latent space noise.\n\n6.2 K-Exemplar Model\n\nAs long as the distribution of positive examples is known, we can recover density estimates via Eq. (3).\nThus, we can also consider a batch of exemplars x1, ..., xK, and sample from this batch uniformly\nduring training. We refer to this model as the \"K-Exemplar\" model, which allows us to interpolate\nsmoothly between a more powerful model with one discriminator per state (K = 1) with a weaker\nmodel that uses a single discriminator for all states (K = # states). A more detailed discussion of\nthis method is included in Appendix A.2. In our experiments, we batch adjacent states in a trajectory\ninto the same discriminator which corresponds to a form of temporal regularization that assumes that\nadjacent states in time are similar. 
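A minimal structural sketch of the amortized, exemplar-conditioned discriminator follows. The layer sizes are hypothetical and the weights are untrained random values; it illustrates only the interface D(z, z*) with noise injected in the latent space, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes are illustrative placeholders, not the paper's actual layer sizes.
obs_dim, latent_dim, hidden = 8, 4, 32

# Encoder q(z|x): a linear-Gaussian stand-in for brevity. Noise is injected
# in the latent space, as described in Section 4.3.
W_enc = rng.normal(scale=0.1, size=(latent_dim, obs_dim))

def encode(x, sigma=0.5):
    mean = W_enc @ x
    return mean + sigma * rng.normal(size=latent_dim)

# Amortized discriminator D(z, z*): one network conditioned on the encoded
# exemplar, instead of one classifier per exemplar.
W1 = rng.normal(scale=0.1, size=(hidden, 2 * latent_dim))
w2 = rng.normal(scale=0.1, size=hidden)

def discriminate(z, z_star):
    h = np.tanh(W1 @ np.concatenate([z, z_star]))
    return 1.0 / (1.0 + np.exp(-(w2 @ h)))  # sigmoid output in (0, 1)

x, x_star = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
d = discriminate(encode(x), encode(x_star))
assert 0.0 < d < 1.0
```

In the trained model, d would be plugged into Eq. (3) to obtain the (unnormalized) density estimate for the exemplar.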
We also share the majority of layers between discriminators in the neural networks, similarly to Osband et al. (2016), and only allow the final linear layer to vary amongst discriminators, which forces the shared layers to learn a joint feature representation, as in the amortized model. An example architecture is shown in Figure 1.

6.3 Relationship to Generative Adversarial Networks (GANs)

Figure 1: A diagram of our a) amortized model architecture and b) the K-exemplar model architecture. Noise is injected after the encoder module (a) or after the shared layers (b). Although possible, we do not tie the encoders of (a) in our experiments.

Our exploration algorithm has an interesting interpretation related to GANs (Goodfellow et al., 2014). The policy can be viewed as the generator of a GAN, and the exemplar model serves as the discriminator, which is trying to classify states from the current batch of trajectories against previous states. Using the K-exemplar version of our algorithm, we can train a single discriminator for all states in the current batch (rather than one for each state), which mirrors the GAN setup.

In GANs, the generator plays an adversarial game with the discriminator by attempting to produce indistinguishable samples in order to fool the discriminator. However, in our algorithm, the generator is rewarded for helping the discriminator rather than fooling it, so our algorithm plays a cooperative game instead of an adversarial one. Instead, the two compete with the progression of time: as a novel state becomes visited frequently, the replay buffer will become saturated with that state and it will lose its novelty.
This property is desirable in that it forces the policy to continually seek new states from which to receive exploration bonuses.

7 Experimental Evaluation

The goal of our experimental evaluation is to compare the EX2 method both to a naïve exploration strategy and to recently proposed exploration schemes for deep reinforcement learning based on explicit density modeling. We present results on both low-dimensional benchmark tasks used in prior work, and on more complex vision-based tasks, where prior density-based exploration bonus methods are difficult to apply. We use TRPO (Schulman et al., 2015) for policy optimization, because it operates on both continuous and discrete action spaces, and due to its relative robustness to hyperparameter choices (Duan et al., 2016). Our code and additional supplementary material, including videos, will be available at https://sites.google.com/view/ex2exploration.

Experimental Tasks Our experiments include three low-dimensional tasks intended to assess whether EX2 can successfully perform implicit density estimation and compute exploration bonuses, and four high-dimensional image-based tasks of varying difficulty intended to evaluate whether implicit density estimation provides improvement in domains where generative modeling is difficult. The first low-dimensional task is a continuous 2D maze with a sparse reward function that only provides a reward when the agent is within a small radius of the goal. Because this task is 2D, we can use it to directly visualize the state visitation densities and compare to an upper-bound histogram method for density estimation.
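For intuition, the histogram upper bound can be sketched as follows: discretize visited 2D positions into bins, count visits per bin, and assign a 1/√N(s) bonus. The bin size and trajectory data below are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2D positions visited by an agent, clustered near the start,
# with a handful of visits to a far corner of the maze.
states = np.concatenate([rng.normal(0.0, 0.1, size=(200, 2)),   # well explored
                         rng.normal(0.9, 0.05, size=(5, 2))])   # rarely visited

def histogram_counts(states, bin_size=0.25):
    bins = {}
    for s in states:
        key = tuple(np.floor(s / bin_size).astype(int))
        bins[key] = bins.get(key, 0) + 1
    return bins

def count_bonus(s, bins, bin_size=0.25):
    key = tuple(np.floor(np.asarray(s) / bin_size).astype(int))
    return 1.0 / np.sqrt(bins.get(key, 0) + 1)  # 1/sqrt(N(s)); +1 for unseen bins

bins = histogram_counts(states)
# The rarely visited region earns a larger bonus than the start region.
assert count_bonus([0.9, 0.9], bins) > count_bonus([0.0, 0.0], bins)
```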
The other two low-dimensional tasks are benchmark tasks from the OpenAI gym benchmark suite, SparseHalfCheetah and SwimmerGather, which provide for a comparison against prior work on generative exploration bonuses in the presence of sparse rewards. For the vision-based tasks, we include three Atari games, as well as a much more difficult ego-centric navigation task based on vizDoom (DoomMyWayHome+). The Atari games are included for easy comparison with prior methods based on generative models, but do not provide especially challenging visual observations, since the clean 2D visuals and relatively low visual diversity of these tasks make generative modeling easy. In fact, prior work on video prediction for Atari games easily achieves accurate predictions hundreds of frames into the future (Oh et al., 2015), while video prediction on natural images is challenging even a couple of frames into the future (Mathieu et al., 2015). The vizDoom maze navigation task is intended to provide a comparison against prior methods with substantially more challenging observations: the game features a first-person viewpoint, 3D visuals, and partial observability, as well as the usual challenges associated with sparse rewards. We make the task particularly difficult by initializing the agent in the furthest room from the goal location, requiring it to navigate through 8 rooms before reaching the goal. Sample images taken from several of these tasks are shown in Figure 3, and detailed task descriptions are given in Appendix A.3.

Figure 2 (panels: a) Exemplar, b) Empirical, c) Varying Smoothing): a, b) Illustration of estimated densities on the 2D maze task produced by our model (a), compared to the empirical discretized distribution (b). Our method provides reasonable, somewhat smoothed density estimates. c) Density estimates produced with our implicit density estimator on a toy dataset (top left), with increasing amounts of noise regularization.

Figure 3: Example task images. From top to bottom, left to right: Doom, map of the MyWayHome task (goal is green, start is blue), Venture, HalfCheetah.

We compare the two variants of our method (K-exemplar and amortized) to standard random exploration, kernel density estimation (KDE) with RBF kernels, a method based on Bayesian neural network generative models called VIME (Houthooft et al., 2016), and exploration bonuses based on hashing of latent spaces learned via an autoencoder (Tang et al., 2017).

2D Maze On the 2D maze task, we can visually compare the estimated state density from our exemplar model and the empirical state-visitation distribution sampled from the replay buffer, as shown in Figure 2. Our model generates sensible density estimates that smooth out the true empirical distribution. For exploration performance, shown in Table 1, TRPO with Gaussian exploration cannot find the sparse reward goal, while both variants of our method perform similarly to VIME and KDE. Since the dimensionality of the task is low, we also use a histogram-based method to estimate the density, which provides an upper bound on the performance of count-based exploration on this task.

Continuous Control: SwimmerGather and SparseHalfCheetah SwimmerGather and SparseHalfCheetah are two challenging continuous control tasks proposed by Houthooft et al. (2016). Both environments feature sparse rewards and medium-dimensional observations (33 and 20 dimensions, respectively). SwimmerGather is a hierarchical task in which no previous algorithms using naïve exploration have made any progress.
Our results demonstrate that, even on medium-dimensional tasks where explicit generative models should perform well, our implicit density estimation approach achieves competitive results. EX2, VIME, and Hashing significantly outperform the naïve TRPO algorithm and KDE on SwimmerGather, and amortized EX2 outperforms all other methods on SparseHalfCheetah by a significant margin. This indicates that the implicit density estimates obtained by our method provide for exploration bonuses that are competitive with a variety of explicit density estimation techniques.

Image-Based Control: Atari and Doom In our final set of experiments, we test the ability of our algorithm to scale to rich sensory inputs and high-dimensional image-based state spaces. We chose several Atari games that have sparse rewards and present an exploration challenge, as well as a maze navigation benchmark based on vizDoom. Each domain presents a unique set of challenges. The vizDoom domain contains the most realistic images, and the environment is viewed from an egocentric perspective, which makes building dynamics models difficult and increases the importance of intelligent smoothing and generalization. The Atari games (Freeway, Frostbite, Venture) contain simpler images from a third-person viewpoint, but often contain many moving distractor objects that a density model must generalize to. 
Freeway and Venture contain sparse reward, and Frostbite contains a small amount of dense reward, but attaining higher scores typically requires exploration. Our results demonstrate that EX2 is able to generate coherent exploration behavior even in high-dimensional visual environments, matching the best-performing prior methods on the Atari games. On the most challenging task, DoomMyWayHome+, our method greatly exceeds all of the prior exploration techniques, and is able to guide the agent through multiple rooms to the goal. 

Task               | K-Ex. (ours) | Amor. (ours) | VIME1  | TRPO2  | Hashing3 | KDE    | Histogram
2D Maze            | -104.2       | -132.2       | -135.5 | -175.6 | -        | -117.5 | -69.6
SparseHalfCheetah  | 3.56         | 173.2        | 98.0   | 0      | 0.5      | 0.098  | -
SwimmerGather      | 0.228        | 0.240        | 0.196  | 0      | 0.258    | 0      | -
Freeway (Atari)    | -            | 33.3         | -      | 16.5   | 33.5     | -      | -
Frostbite (Atari)  | -            | 4901         | -      | 2869   | 5214     | -      | -
Venture (Atari)    | -            | 900          | -      | 121    | 445      | -      | -
DoomMyWayHome      | 0.740        | 0.788        | 0.443  | 0.250  | 0.331    | 0.195  | -

1 Houthooft et al. (2016)  2 Schulman et al. (2015)  3 Tang et al. (2017)

Table 1: Mean scores (higher is better) of our algorithm (both K-exemplar and amortized) versus VIME (Houthooft et al., 2016), baseline TRPO, Hashing, and kernel density estimation (KDE). Our approach generally matches the performance of previous explicit density estimation methods, and greatly exceeds their performance on the challenging DoomMyWayHome+ task, which features camera motion, partial observability, and extremely sparse rewards. We did not run VIME or K-Exemplar on Atari games due to computational cost. Atari games are trained for 50M time steps. Learning curves are included in Appendix A.5.
This result indicates the benefit of implicit density estimation: while explicit density estimators can achieve good results on simple, clean images in the Atari games, they begin to struggle with the more complex egocentric observations in vizDoom, while our EX2 is able to provide reasonable density estimates and achieves good results.

8 Conclusion and Future Work

We presented EX2, a scalable exploration strategy based on training discriminative exemplar models to assign novelty bonuses. We also demonstrate a novel connection between exemplar models and density estimation, which motivates our algorithm as approximating pseudo-count exploration. This density estimation technique also does not require reconstructing samples to train, unlike most methods for training generative or energy-based models. Our empirical results show that EX2 tends to achieve comparable results to the previous state of the art for continuous control tasks on low-dimensional environments, and can scale gracefully to handle rich sensory inputs such as images. Since our method avoids the need for generative modeling of complex image-based observations, it exceeds the performance of prior generative methods on domains with more complex observation functions, such as the egocentric Doom navigation task.

To understand the tradeoffs between discriminatively trained exemplar models and generative modeling, it helps to consider the behavior of the two methods when overfitting or underfitting. Both methods will assign flat bonuses when underfitting and high bonuses to all new states when overfitting. However, in the case of exemplar models, overfitting is easy with high-dimensional observations, especially in the amortized model, where the network simply acts as a comparator. 
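As a toy illustration of the implicit density estimation principle discussed here, the sketch below uses the standard density-ratio identity: for a discriminator D trained to separate samples of p(x) (label 1) from samples of a known noise density q(x) (label 0) with equal class sizes, the optimum satisfies p(x)/q(x) = D(x)/(1 − D(x)). This is not the paper's EX2 architecture (no exemplars, no amortization, no latent noise injection); we use plain logistic regression against uniform noise, and all names and hyperparameters are our own assumptions.

```python
import numpy as np


def fit_implicit_density(data, low, high, iters=5000, lr=0.05, seed=0):
    """Implicit 1D density estimation via the density-ratio trick.

    Trains a logistic-regression discriminator to separate `data` (label 1)
    from uniform noise on [low, high] (label 0), then recovers
    p(x) ~= q(x) * D(x) / (1 - D(x)) using the known noise density q.
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(low, high, size=len(data))  # equal class sizes
    q = 1.0 / (high - low)                          # known uniform density

    x = np.concatenate([data, noise])
    y = np.concatenate([np.ones(len(data)), np.zeros(len(noise))])

    def feats(v):
        # Features [1, x, x^2] can represent a Gaussian-vs-uniform log-ratio.
        v = np.asarray(v, dtype=float)
        return np.stack([np.ones_like(v), v, v * v], axis=1)

    X = feats(x)
    mu, sd = X[:, 1:].mean(axis=0), X[:, 1:].std(axis=0)
    X[:, 1:] = (X[:, 1:] - mu) / sd                 # standardize for stable GD

    w = np.zeros(3)
    for _ in range(iters):                          # plain logistic regression
        d = 1.0 / (1.0 + np.exp(-X @ w))            # discriminator output D
        w -= lr * X.T @ (d - y) / len(y)            # logistic-loss gradient

    def density(v):
        F = feats(v)
        F[:, 1:] = (F[:, 1:] - mu) / sd
        d = 1.0 / (1.0 + np.exp(-F @ w))
        return q * d / (1.0 - d)                    # implicit density estimate

    return density
```

On samples from a standard Gaussian, the recovered estimate falls off away from the mode without the model ever reconstructing or scoring samples generatively, which is the property the conclusion above appeals to.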
Underfitting is also easy to achieve, simply by increasing the magnitude of the noise injected into the latent space. Therefore, although both approaches can suffer from overfitting and underfitting, the exemplar method provides a single hyperparameter that interpolates between these extremes without changing the model. An exciting avenue for future work would be to adjust this smoothing factor automatically, based on the amount of available data. More generally, implicit density estimation with exemplar models is likely to be of use in other density estimation applications, and exploring such applications would be another exciting direction for future work.

Acknowledgement We would like to thank Adam Stooke, Sandy Huang, and Haoran Tang for providing efficient and parallelizable policy search code. We thank Joshua Achiam for help with setting up benchmark tasks. This research was supported by NSF IIS-1614653, NSF IIS-1700696, an ONR Young Investigator Program award, and Berkeley DeepDrive.

References

Abel, David, Agarwal, Alekh, Diaz, Fernando, Krishnamurthy, Akshay, and Schapire, Robert E. Exploratory gradient boosting for reinforcement learning in complex domains. In Advances in Neural Information Processing Systems (NIPS), 2016.

Achiam, Joshua and Sastry, Shankar. Surprise-based intrinsic motivation for deep reinforcement learning. CoRR, abs/1703.01732, 2017.

Barto, Andrew G. and Mahadevan, Sridhar. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2), 2003.

Bellemare, Marc G., Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), 2016.

Brafman, Ronen I. and Tennenholtz, Moshe. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. 
Journal of Machine Learning Research (JMLR), 2002.

Bubeck, Sébastien and Cesa-Bianchi, Nicolò. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 2012.

Chapelle, O. and Li, Lihong. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), 2011.

Chentanez, Nuttapong, Barto, Andrew G, and Singh, Satinder P. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems (NIPS). MIT Press, 2005.

Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.

Florensa, Carlos Campo, Duan, Yan, and Abbeel, Pieter. Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

Heess, Nicolas, Wayne, Gregory, Tassa, Yuval, Lillicrap, Timothy P., Riedmiller, Martin A., and Silver, David. Learning and transfer of modulated locomotor controllers. CoRR, abs/1610.05182, 2016.

Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John, Turck, Filip De, and Abbeel, Pieter. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems (NIPS), 2016.

Kakade, Sham, Kearns, Michael, and Langford, John. Exploration in metric state spaces. In International Conference on Machine Learning (ICML), 2003.

Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.

Kolter, J. Zico and Ng, Andrew Y. 
Near-Bayesian exploration in polynomial time. In International Conference on Machine Learning (ICML), 2009.

Kulkarni, Tejas D, Narasimhan, Karthik, Saeedi, Ardavan, and Tenenbaum, Josh. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), 2016.

Lillicrap, Timothy P., Hunt, Jonathan J., Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2015.

Malisiewicz, Tomasz, Gupta, Abhinav, and Efros, Alexei A. Ensemble of exemplar-SVMs for object detection and beyond. In International Conference on Computer Vision (ICCV), 2011.

Mathieu, Michaël, Couprie, Camille, and LeCun, Yann. Deep multi-scale video prediction beyond mean square error. CoRR, abs/1511.05440, 2015. URL http://arxiv.org/abs/1511.05440.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Oh, Junhyuk, Guo, Xiaoxiao, Lee, Honglak, Lewis, Richard, and Singh, Satinder. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems (NIPS), 2015.

Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems (NIPS), 2016.

Pathak, Deepak, Agrawal, Pulkit, Efros, Alexei A., and Darrell, Trevor. Curiosity-driven exploration by self-supervised prediction. 
In International Conference on Machine Learning (ICML), 2017.

Pazis, Jason and Parr, Ronald. PAC optimal exploration in continuous space Markov decision processes. In AAAI Conference on Artificial Intelligence (AAAI), 2013.

Salimans, Tim, Goodfellow, Ian J., Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NIPS), 2016.

Schmidhuber, Jürgen. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the First International Conference on Simulation of Adaptive Behavior on From Animals to Animats, Cambridge, MA, USA, 1990. MIT Press. ISBN 0-262-63138-5.

Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I., and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.

Stadie, Bradly C., Levine, Sergey, and Abbeel, Pieter. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015.

Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002. ISBN 978-3-540-45622-3. doi: 10.1007/3-540-45622-8_16.

Strehl, Alexander L. and Littman, Michael L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 2009.

Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, Turck, Filip De, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. 
In Advances in Neural Information Processing Systems (NIPS), 2017.