{"title": "Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 11784, "page_last": 11794, "abstract": "Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify \\emph{bootstrapping error} as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random data and suboptimal demonstrations, on a range of continuous control tasks.", "full_text": "Stabilizing Off-Policy Q-Learning via Bootstrapping\n\nError Reduction\n\nAviral Kumar\u2217\nUC Berkeley\n\naviralk@berkeley.edu\n\nJustin Fu\u2217\nUC Berkeley\n\njustinjfu@eecs.berkeley.edu\n\nGeorge Tucker\nGoogle Brain\n\ngjt@google.com\n\nSergey Levine\n\nUC Berkeley, Google Brain\n\nsvlevine@eecs.berkeley.edu\n\nAbstract\n\nOff-policy reinforcement learning aims to leverage experience collected from\nprior policies for sample-ef\ufb01cient learning. 
However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random data and suboptimal demonstrations, on a range of continuous control tasks.

1 Introduction

One of the primary drivers of the success of machine learning methods in open-world perception settings, such as computer vision [19] and NLP [8], has been the ability of high-capacity function approximators, such as deep neural networks, to learn generalizable models from large amounts of data. Reinforcement learning (RL) has proven comparatively difficult to scale to unstructured real-world settings because most RL algorithms require active data collection. As a result, RL algorithms can learn complex behaviors in simulation, where data collection is straightforward, but real-world performance is limited by the expense of active data collection. In some domains, such as autonomous driving [38] and recommender systems [3], previously collected datasets are plentiful.
Algorithms that can utilize such datasets effectively would not only make real-world RL more practical, but would also enable substantially better generalization by incorporating diverse prior experience.

In principle, off-policy RL algorithms can leverage this data; however, in practice, off-policy algorithms are limited in their ability to learn entirely from off-policy data. Recent off-policy RL methods (e.g., [18, 29, 23, 9]) have demonstrated sample-efficient performance on complex tasks in robotics [23] and simulated environments [36]. However, these methods can still fail to learn when presented with arbitrary off-policy data, without the opportunity to collect more experience from the environment. This issue persists even when the off-policy data comes from effective expert policies, which in principle should address any exploration challenge [6, 12, 11]. This sensitivity to the training data distribution is a limitation of practical off-policy RL algorithms, and one would hope that an off-policy algorithm should be able to learn reasonable policies through training on static datasets before being deployed in the real world.

In this paper, we aim to develop off-policy, value-based RL methods that can learn from large, static datasets. As we show, a crucial challenge in applying value-based methods to off-policy scenarios arises in the bootstrapping process: when training from off-policy data, Q-functions are evaluated on out-of-distribution action inputs for computing the backup. This may introduce errors in the Q-function that the algorithm is unable to remedy by collecting new data, making training unstable and potentially causing it to diverge.

*Equal Contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Our primary contribution is an analysis of error accumulation in the bootstrapping process due to out-of-distribution inputs, and a practical way of addressing this error. First, we formalize and analyze the reasons for instability and poor performance when learning from off-policy data. We show that, through careful action selection, error propagation through the Q-function can be mitigated. We then propose a principled algorithm called bootstrapping error accumulation reduction (BEAR) to control bootstrapping error in practice, which uses the notion of support-set matching to prevent error accumulation. Through systematic experiments, we show the effectiveness of our method on continuous-control MuJoCo tasks with a variety of off-policy datasets, generated by random, suboptimal, or optimal policies. BEAR is consistently robust to the training dataset, matching or exceeding the state-of-the-art in all cases, whereas existing algorithms only perform well for specific datasets.

2 Related Work

In this work, we study off-policy reinforcement learning with static datasets. Errors arising from inadequate sampling, distributional shift, and function approximation have been rigorously studied as "error propagation" in approximate dynamic programming (ADP) [4, 27, 10, 33]. These works often study how Bellman errors accumulate and propagate to nearby states via bootstrapping. In this work, we build upon tools from this analysis to show that performing Bellman backups on static datasets leads to error accumulation due to out-of-distribution values. Our approach is motivated as reducing the rate of error propagation between states.

Our approach constrains actor updates so that the actions remain in the support of the training dataset distribution. Several works have explored similar ideas in the context of off-policy learning in online settings.
Kakade and Langford [22] show that large policy updates can be destructive, and propose a conservative policy iteration scheme which constrains actor updates to be small for provably convergent learning. Grau-Moya et al. [16] use a learned prior over actions in the maximum entropy RL framework [25] and justify it as a regularizer based on mutual information. However, none of these methods use static datasets. Importance-sampling-based distribution re-weighting [29, 15, 30, 26] has also been explored, primarily in the context of off-policy policy evaluation.

Most closely related to our work are batch-constrained Q-learning (BCQ) [12] and SPIBB [24], which also discuss instability arising from previously unseen actions. Fujimoto et al. [12] show convergence properties of an action-constrained Bellman backup operator in tabular, error-free settings. We prove stronger results under approximation errors and provide a bound on the suboptimality of the solution. This is crucial, as it drives the design choices for a practical algorithm. As a consequence, although we experimentally find that [12] outperforms standard Q-learning methods when the off-policy data is collected by an expert, BEAR outperforms [12] when the off-policy data is collected by a suboptimal policy, as is common in real-life applications. Empirically, we find BEAR achieves stronger and more consistent results than BCQ across a wide variety of datasets and environments. As we explain below, the BCQ constraint is too aggressive; BCQ generally fails to substantially improve over the behavior policy, while our method actually improves when the data collection policy is suboptimal or random. SPIBB [24], like BEAR, is an algorithm based on constraining the learned policy to the support of a behavior policy.
However, the authors do not extend safe performance guarantees from the batch-constrained case to the relaxed support-constrained case, and do not evaluate on high-dimensional control tasks.

3 Background

We represent the environment as a Markov decision process (MDP) defined by a tuple (S, A, P, R, ρ0, γ), where S is the state space, A is the action space, P(s′|s, a) is the transition distribution, ρ0(s) is the initial state distribution, R(s, a) is the reward function, and γ ∈ (0, 1) is the discount factor. The goal in RL is to find a policy π(a|s) that maximizes the expected cumulative discounted reward, also known as the return. The notation μπ(s) denotes the discounted state marginal of a policy π, defined as the average state visited by the policy, Σ_{t=0}^∞ γ^t p_t^π(s). P^π is shorthand for the transition matrix from s to s′ following a certain policy π, P^π(s′|s) = E_π[P(s′|s, a)].

Q-learning learns the optimal state-action value function Q*(s, a), which represents the expected cumulative discounted reward starting in s, taking action a, and then acting optimally thereafter. The optimal policy can be recovered from Q* by choosing the maximizing action. Q-learning algorithms are based on iterating the Bellman optimality operator T, defined as

(T Q̂)(s, a) := R(s, a) + γ E_{P(s′|s,a)}[max_{a′} Q̂(s′, a′)].

When the state space is large, we represent Q̂ as a hypothesis from the set of function approximators Q (e.g., neural networks). In theory, the estimate of the Q-function is updated by projecting T Q̂ into Q (i.e., minimizing the mean squared Bellman error E_ν[(Q − T Q̂)²], where ν is the state occupancy measure under the behaviour policy). This is also referred to as Q-iteration.
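For intuition, the exact operator T can be iterated directly in the tabular case. The following is a minimal sketch (our own invented two-state, two-action MDP, not an example from the paper); because T is a γ-contraction, iterating it converges to the fixed point Q*:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (invented for illustration).
# P[s, a, s'] = transition probability, R[s, a] = reward, gamma = discount.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def bellman_optimality_backup(Q):
    """(T Q)(s, a) = R(s, a) + gamma * E_{s'~P(.|s,a)}[max_a' Q(s', a')]."""
    return R + gamma * P @ Q.max(axis=1)

# Iterating T converges to the unique fixed point Q*.
Q = np.zeros_like(R)
for _ in range(500):
    Q = bellman_optimality_backup(Q)
```

This is the idealized, full-knowledge operator; the paper's setting replaces it with a sampled regression target, as described next.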
In practice, an empirical estimate of T Q̂ is formed with samples and treated as a supervised ℓ2 regression target to form the next approximate Q-function iterate. In large (e.g., continuous) action spaces, the maximization max_{a′} Q(s′, a′) is generally intractable. Actor-critic methods [35, 13, 18] address this by additionally learning a policy π_θ that maximizes the Q-function. In this work, we study off-policy learning from a static dataset of transitions D = {(s, a, s′, R(s, a))}, collected under an unknown behavior policy β(·|s). We denote the distribution over states and actions induced by β as μ(s, a).

4 Out-of-Distribution Actions in Q-Learning

Figure 1: Performance of SAC on HalfCheetah-v2 (return (left) and log Q-values (right)) with off-policy expert data w.r.t. the number of training samples (n). Note the large discrepancy between returns (which are negative) and log Q-values (which have large positive values), which is not resolved with additional samples.

Q-learning methods often fail to learn on static, off-policy data, as shown in Figure 1. At first glance, this resembles overfitting, but increasing the size of the static dataset does not rectify the problem, suggesting the issue is more complex. We can understand the source of this instability by examining the form of the Bellman backup. Although minimizing the mean squared Bellman error corresponds to a supervised regression problem, the targets for this regression are themselves derived from the current Q-function estimate. The targets are calculated by maximizing the learned Q-values with respect to the action at the next state. However, the Q-function estimator is only reliable on inputs from the same distribution as its training set.
As a result, naïvely maximizing the value may evaluate the Q̂ estimator on actions that lie far outside of the training distribution, resulting in pathological values that incur large error. We refer to these actions as out-of-distribution (OOD) actions.

Formally, let ζ_k(s, a) = |Q_k(s, a) − Q*(s, a)| denote the total error at iteration k of Q-learning, and let δ_k(s, a) = |Q_k(s, a) − T Q_{k−1}(s, a)| denote the current Bellman error. Then, we have ζ_k(s, a) ≤ δ_k(s, a) + γ max_{a′} E_{s′}[ζ_{k−1}(s′, a′)]. In other words, errors from (s′, a′) are discounted, then accumulated with new errors δ_k(s, a) from the current iteration. We expect δ_k(s, a) to be high on OOD states and actions, as errors at these state-action pairs are never directly minimized during training.

To mitigate bootstrapping error, we can restrict the policy to ensure that it outputs actions that lie in the support of the training distribution. This is distinct from previous work (e.g., BCQ [12]), which implicitly constrains the distribution of the learned policy to be close to the behavior policy, similarly to behavioral cloning [31]. While this is sufficient to ensure that actions lie in the training set with high probability, it is overly restrictive. For example, if the behavior policy is close to uniform, the learned policy will behave randomly, resulting in poor performance, even when the data is sufficient to learn a strong policy (see Figure 2 for an illustration).
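This accumulation is easy to reproduce in a toy tabular setting. The sketch below is our own illustration (invented rewards and an artificially optimistic initialization on a never-observed action, not an experiment from the paper): a naïve max-backup keeps bootstrapping from the OOD action, while a backup restricted to in-data actions recovers the true value:

```python
import numpy as np

gamma = 0.99
n_states = 5
next_state = (np.arange(n_states) + 1) % n_states  # deterministic cycle MDP
reward = 1.0  # reward for the single action actually present in the data

def run(constrained, iters=2000):
    # Q[:, 0] is the in-data action; Q[:, 1] is never observed (OOD) and
    # keeps whatever pathological value it was initialized with.
    Q = np.zeros((n_states, 2))
    Q[:, 1] = 1000.0
    for _ in range(iters):
        if constrained:
            backup = Q[next_state, 0]           # back up in-support actions only
        else:
            backup = Q[next_state].max(axis=1)  # naive max includes the OOD action
        Q[:, 0] = reward + gamma * backup       # only in-data values are regressed
    return Q[:, 0]

q_true = reward / (1 - gamma)  # = 100, the true value of the data-supported policy
```

Here the naïve backup settles near 1 + γ·1000 instead of the true value 100, mirroring the ζ_k recursion above: the error on the OOD action is never reduced, so it re-enters every backup.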
Formally, this means that a learned policy π(a|s) has positive density only where the density of the behaviour policy β(a|s) is above a threshold (i.e., ∀a, β(a|s) ≤ ϵ ⟹ π(a|s) = 0), instead of a closeness constraint on the values of the densities π(a|s) and β(a|s). Our analysis instead reveals a tradeoff between staying within the data distribution and finding a suboptimal solution when the constraint is too restrictive. It motivates us to restrict the support of the learned policy, but not the probabilities of the actions lying within the support. This avoids evaluating the Q-function estimator on OOD actions, but remains flexible enough to find a performant policy. Our proposed algorithm leverages this insight.

4.1 Distribution-Constrained Backups

In this section, we define and analyze a backup operator that restricts the set of policies used in the maximization of the Q-function, and we derive performance bounds which depend on the restricted set. This provides motivation for constraining policy support to the data distribution. We begin with the definition of a distribution-constrained operator:

Definition 4.1 (Distribution-constrained operators). Given a set of policies Π, the distribution-constrained backup operator is defined as:

T^Π Q(s, a) := E[R(s, a) + γ E_{P(s′|s,a)}[V(s′)]],    V(s) := max_{π∈Π} E_π[Q(s, a)].

This backup operator satisfies properties of the standard Bellman backup, such as convergence to a fixed point, as discussed in Appendix A.
To analyze the (sub)optimality of performing this backup under approximation error, we first quantify two sources of error. The first is a suboptimality bias: the optimal policy may lie outside the policy constraint set, and thus a suboptimal solution will be found. The second arises from distribution shift between the training distribution and the policies used for backups; this formalizes the notion of OOD actions. To capture suboptimality in the final solution, we define a suboptimality constant, which measures how far π* is from Π.

Definition 4.2 (Suboptimality constant). The suboptimality constant is defined as:

α(Π) = max_{s,a} |T^Π Q*(s, a) − T Q*(s, a)|.

Next, we define a concentrability coefficient [28], which quantifies how far the visitation distribution generated by policies from Π is from the training data distribution. This constant captures the degree to which states and actions are out of distribution.

Assumption 4.1 (Concentrability). Let ρ0 denote the initial state distribution, and μ(s, a) denote the distribution of the training data over S × A, with marginal μ(s) over S. Suppose there exist coefficients c(k) such that for any π1, ..., πk ∈ Π and s ∈ S:

ρ0 P^{π1} P^{π2} ... P^{πk}(s) ≤ c(k) μ(s),

where P^{πi} is the transition operator on states induced by πi. Then, define the concentrability coefficient C(Π) as

C(Π) := (1 − γ)² Σ_{k=1}^∞ k γ^{k−1} c(k).

To provide some intuition for C(Π): if μ was generated by a single policy π, and Π = {π} was a singleton set, then we would have C(Π) = 1, which is the smallest possible value. However, if Π contained policies far from π, the value could be large, potentially infinite if the support of Π is not contained in π. Now, we bound the performance of approximate distribution-constrained Q-iteration:

Theorem 4.1. Suppose we run approximate distribution-constrained value iteration with a set-constrained backup T^Π. Assume that δ(s, a) ≥ max_k |Q_k(s, a) − T^Π Q_{k−1}(s, a)| bounds the Bellman error. Then,

lim_{k→∞} E_{ρ0}[|V^{πk}(s) − V*(s)|] ≤ (γ / (1 − γ)²) [ C(Π) E_μ[max_{π∈Π} E_π[δ(s, a)]] + ((1 − γ)/γ) α(Π) ].

Proof. See Appendix B, Theorem B.1.

This bound formalizes the tradeoff between keeping policies chosen during backups close to the data (captured by C(Π)) and keeping the set Π large enough to capture well-performing policies (captured by α(Π)). When we expand the set of policies Π, we are increasing C(Π) but decreasing α(Π). An example of this tradeoff, and how a careful choice of Π can yield superior results, is given in a tabular gridworld example in Fig. 2, where we visualize errors accumulated during distribution-constrained Q-iteration for different choices of Π.

Figure 2: Visualized error propagation in Q-learning for various choices of the constraint set Π: unconstrained (top row), distribution-constrained (middle), and constrained to the behaviour policy (policy evaluation, bottom). Triangles represent Q-values for actions that move in different directions. The task (left) is to reach the bottom-left corner (G) from the top-left (S), but the behaviour policy (visualized as arrows in the task image; supported state-action pairs are shown in black on the support-set image) travels to the bottom-right with a small amount of ϵ-greedy exploration. Dark values indicate high error, and light values indicate low error. Standard backups propagate large errors from the low-support regions into the high-support regions, leading to high error. Policy evaluation reduces error propagation from low-support regions, but introduces significant suboptimality bias, as the data policy is not optimal. A carefully chosen distribution-constrained backup strikes a balance between these two extremes, by confining error propagation to the low-support region while introducing minimal suboptimality bias.

Finally, we motivate the use of support sets to construct Π. We are interested in the case where Π_ϵ = {π | π(a|s) = 0 whenever β(a|s) < ϵ}, where β is the behavior policy (i.e., Π_ϵ is the set of policies that have support in the probable regions of the behavior policy). Defining Π_ϵ in this way allows us to bound the concentrability coefficient:

Theorem 4.2. Assume the data distribution μ is generated by a behavior policy β. Let μ(s) be the marginal state distribution under the data distribution. Define Π_ϵ = {π | π(a|s) = 0 whenever β(a|s) < ϵ}, and let μ_{Π_ϵ} be the highest discounted marginal state distribution starting from the initial state distribution ρ and following policies π ∈ Π_ϵ at each time step thereafter. Then, there exists a concentrability coefficient C(Π_ϵ) which is bounded:

C(Π_ϵ) ≤ C(β) · (1 + (γ / ((1 − γ) f(ϵ))) (1 − ϵ)),

where f(ϵ) := min_{s∈S, μ_{Π_ϵ}(s)>0} [μ(s)] > 0.

Proof. See Appendix B, Theorem B.2.

Qualitatively, f(ϵ) is the minimum discounted visitation marginal of a state under the behaviour policy if only actions which are at least ϵ likely are executed in the environment.
Thus, using support sets gives us a single lever, ϵ, which simultaneously trades off the values of C(Π) and α(Π). Not only can we provide theoretical guarantees; we will also see in our experiments (Sec. 6) that constructing Π in this way provides a simple and effective method for implementing distribution-constrained algorithms. Intuitively, this means we can prevent an increase in overall error in the Q-estimate by selecting policies supported on the support of the training action distribution, which ensures a roughly bounded projection error δ_k(s, a) while reducing the suboptimality bias, potentially by a large amount. Bounded error δ_k(s, a) on the support set of the training distribution is a reasonable assumption when using highly expressive function approximators, such as deep networks, especially if we are willing to reweight the transition set [32, 11]. We further elaborate on this point in Appendix C.

5 Bootstrapping Error Accumulation Reduction (BEAR)

We now propose a practical actor-critic algorithm (built on the framework of TD3 [13] or SAC [18]) that uses distribution-constrained backups to reduce accumulation of bootstrapping error. The key insight is that we can search for a policy with the same support as the training distribution, while preventing accidental error accumulation. Our algorithm has two main components. Analogous to BCQ [12], we use K Q-functions and use the minimum Q-value for policy improvement, and we design a constraint which will be used for searching over the set of policies Π_ϵ, which share the same support as the behaviour policy. Both of these components appear as modifications of the policy improvement step in actor-critic style algorithms.
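As a sketch of the first component, the conservative ensemble estimate (a minimum over K critics, optionally mixed with the maximum as in the λ-weighted target of Algorithm 1) can be written as follows; the array shapes and names here are our own conventions, not the paper's implementation:

```python
import numpy as np

def conservative_value(Qs, lam=1.0):
    """Combine K critic estimates for each candidate action.

    Qs: array of shape (K, num_actions) holding each critic's Q-values.
    lam=1 recovers the pure minimum over the ensemble; lam<1 mixes in the
    maximum, as in the lambda-weighted target of Algorithm 1.
    """
    return lam * Qs.min(axis=0) + (1.0 - lam) * Qs.max(axis=0)

# Policy improvement pushes toward actions with a high conservative value.
Qs = np.array([[1.0, 3.0, 2.0],
               [0.5, 2.5, 4.0]])  # K=2 critics, 3 candidate actions
best_action = int(np.argmax(conservative_value(Qs)))  # min over critics, then argmax
```

Taking the minimum penalizes actions on which the critics disagree, which tend to be the poorly-supported ones.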
We also note that policy improvement can be performed with the mean of the K Q-functions, and we found that this scheme works as well in our experiments.

We denote the set of Q-functions as Q̂_1, ..., Q̂_K. Then, the policy is updated to maximize the conservative estimate of the Q-values within Π_ϵ:

π_φ(s) := max_{π∈Π_ϵ} E_{a∼π(·|s)} [ min_{j=1,...,K} Q̂_j(s, a) ].

In practice, the behaviour policy β is unknown, so we need an approximate way to constrain π to Π. We define a differentiable constraint that approximately constrains π to Π, and then approximately solve the constrained optimization problem via dual gradient descent. We use the sampled version of maximum mean discrepancy (MMD) [17] between the unknown behaviour policy β and the actor π, because it can be estimated based solely on samples from the distributions. Given samples x_1, ..., x_n ∼ P and y_1, ..., y_m ∼ Q, the sampled MMD between P and Q is given by:

MMD²({x_1, ..., x_n}, {y_1, ..., y_m}) = (1/n²) Σ_{i,i′} k(x_i, x_{i′}) − (2/nm) Σ_{i,j} k(x_i, y_j) + (1/m²) Σ_{j,j′} k(y_j, y_{j′}).

Here, k(·, ·) is any universal kernel. In our experiments, we find both Laplacian and Gaussian kernels work well. The expression for MMD does not involve the density of either distribution, and it can be optimized directly through samples. Empirically, we find that, in the low-to-intermediate sample regime, the sampled MMD between P and Q is similar to the MMD between a uniform distribution over P's support and Q, which makes MMD roughly suited for constraining distributions to a given support set.
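A minimal numpy version of this sampled estimator with a Gaussian kernel (the bandwidth σ is an illustrative choice of ours, not a value prescribed by the paper) is:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a universal kernel.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_squared(X, Y, sigma=1.0):
    """Sampled MMD^2 between samples X ~ P, shape (n, d), and Y ~ Q, shape (m, d)."""
    n, m = len(X), len(Y)
    k_xx = gaussian_kernel(X, X, sigma).sum() / n**2
    k_xy = gaussian_kernel(X, Y, sigma).sum() / (n * m)
    k_yy = gaussian_kernel(Y, Y, sigma).sum() / m**2
    return k_xx - 2.0 * k_xy + k_yy
```

This is the biased V-statistic form matching the expression above (the i = i′ terms are included). In BEAR, this quantity is computed from a handful of sampled actions and penalized via a Lagrange multiplier, as in Equation 1 below.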
(See Appendix C.3 for numerical simulations justifying this approach.)

Putting everything together, the optimization problem in the policy improvement step is

π_φ := max_{π∈Δ_{|S|}} E_{s∼D} E_{a∼π(·|s)} [ min_{j=1,...,K} Q̂_j(s, a) ]   s.t.   E_{s∼D}[MMD(D(s), π(·|s))] ≤ ε,    (1)

where ε is an appropriately chosen threshold. We choose a threshold of ε = 0.05 in our experiments. The algorithm is summarized in Algorithm 1.

How does BEAR connect with the distribution-constrained backups described in Section 4.1? Step 5 of the algorithm restricts π_φ to lie in the support of β. This insight is formally justified in Theorems 4.1 & 4.2 (C(Π_ε) is bounded). Computing the distribution-constrained backup exactly by maximizing over π ∈ Π_ε is intractable in practice. As an approximation, we sample Dirac policies in the support of β (Alg. 1, Line 5) and perform empirical maximization to compute the backup. As the maximization is performed over a narrower set of Dirac policies ({δ_{a_i}} ⊆ Π_ε), the bound in Theorem 4.1 still holds. Empirically, we show in Section 6 that this approximation is sufficient to outperform previous methods. This connection is briefly discussed in Appendix C.2.

Algorithm 1 BEAR Q-Learning (BEAR-QL)
input: Dataset D, target network update rate τ, mini-batch size N, number of sampled actions for MMD n (preferably an intermediate integer, 1-10), minimum-weighting coefficient λ
1: Initialize Q-ensemble {Q_θi}_{i=1}^K, actor π_φ, Lagrange multiplier α, target networks {Q_θ′i}_{i=1}^K, and a target actor π_φ′, with φ′ ← φ, θ′_i ← θ_i
2: for t in {1, ..., N} do
3:   Sample mini-batch of transitions (s, a, r, s′) ∼ D
     Q-update:
4:   Sample p action samples, {a_i ∼ π_φ′(·|s′)}_{i=1}^p
5:   Define y(s, a) := max_{a_i} [λ min_{j=1,...,K} Q_θ′j(s′, a_i) + (1 − λ) max_{j=1,...,K} Q_θ′j(s′, a_i)]
6:   ∀i, θ_i ← arg min_{θ_i} (Q_θi(s, a) − (r + γ y(s, a)))²
     Policy-update:
7:   Sample actions {â_i ∼ π_φ(·|s)}_{i=1}^m and {a_j ∼ D(s)}_{j=1}^n
8:   Update φ, α by minimizing Equation 1 using dual gradient descent with Lagrange multiplier α
9:   Update target networks: θ′_i ← τ θ_i + (1 − τ) θ′_i; φ′ ← τ φ + (1 − τ) φ′
10: end for

Figure 3: Average performance of BEAR-QL, BCQ, Naïve RL and BC on medium-quality data, averaged over 5 seeds. BEAR-QL outperforms both BCQ and Naïve RL. Average return over the training data is indicated by the magenta line. One step on the x-axis corresponds to 1000 gradient steps.

In summary, the actor is updated towards maximizing the Q-function while being constrained to remain in the valid search space defined by Π_ε. The Q-function uses actions sampled from the actor to perform distribution-constrained Q-learning over a reduced set of policies. At test time, we sample p actions from π_φ(s) and execute the Q-value-maximizing action among them in the environment. Implementation and other details are presented in Appendix D.

6 Experiments

In our experiments, we study how BEAR performs when learning from static off-policy data on a variety of continuous control benchmark tasks.
We evaluate our algorithm in three settings: when the dataset D is generated by (1) a completely random behaviour policy, (2) a partially trained, medium-scoring policy, and (3) an optimal policy. Condition (2) is of particular interest, as it captures many common use-cases in practice, such as learning from imperfect demonstration data (e.g., of the sort commonly available for autonomous driving [14]), or reusing previously collected experience during off-policy RL. We compare our method to several prior methods: a baseline actor-critic algorithm (TD3); the BCQ algorithm [12], which aims to address a similar problem, as discussed in Section 4; KL-control [21] (which solves a KL-penalized RL problem, similarly to maximum entropy RL); a static version of DQfD [20] (where a constraint to upweight Q-values of state-action pairs observed in the dataset is added as an auxiliary loss on top of a regular actor-critic algorithm); and a behaviour cloning (BC) baseline, which simply imitates the data distribution. This serves to measure whether each method actually performs effective RL, or simply copies the data. We report the average evaluation return over 5 seeds of the policy learned by each algorithm, in the form of a learning curve as a function of the number of gradient steps taken by the algorithm. These samples are only collected for evaluation, and are not used for training.

6.1 Performance on Medium-Quality Data

We first discuss the evaluation on condition (2), the "mediocre" data setting, as this condition resembles the settings where we expect training on offline data to be most useful. We collected one million transitions from a partially trained policy, so as to simulate imperfect demonstration data or data from a mediocre prior policy. In this scenario, we found that BEAR-QL consistently outperforms both BCQ [12] and a naïve off-policy RL baseline (TD3) by large margins, as shown in Figure 3.
This\nscenario is the most relevant from an application point of view, as access to optimal data may not\nbe feasible, and random data might have inadequate exploration to ef\ufb01cient learn a good policy. We\nalso evaluate the accuracy with which the learned Q-functions predict actual policy returns. These\ntrends are provided in Appendix E. Note that the performance of BCQ often tracks the performance\nof the BC baseline, suggesting that BCQ primarily imitates the data. Our KL-control baseline uses\nautomatic temperature tuning [18]. We \ufb01nd that KL-control usually performs similar or worse to BC,\nwhereas DQfD tends to diverge often due to cumulative error due to OOD actions and often exhibits\na huge variance across different runs (for example, HalfCheetah-v2 environment).\n6.2 Performance on Random and Optimal Datasets\n\nIn Figure 5, we show the performance of each method when trained on data from a random policy\n(top) and a near-optimal policy (bottom). In both cases, our method BEAR achieves good results,\nconsistently exceeding the average dataset return on random data, and matching the optimal policy\nreturn on optimal data. Na\u00efve RL also often does well on random data. For a random data policy, all\nactions are in-distribution, since they all have equal probability. This is consistent with our hypothesis\n\n7\n\n0.0K0.2K0.4K0.6K0.8K1.0KTrainSteps0100020003000400050006000HalfCheetah-v2BCQBEAR-QLNaive-RLBCDQfDKL-c0.0K0.2K0.4K0.6K0.8K1.0KTrainSteps0500100015002000250030003500Walker2d-v2BCQBEAR-QLNaive-RLBCDQfDKL-c0.0K0.1K0.2K0.3K0.4KTrainSteps050010001500200025003000Hopper-v2BCQBEAR-QLNaive-RLBCDQfDKL-c0.0K0.2K0.4K0.6K0.8K1.0KTrainSteps\u2212500\u221225002505007501000Ant-v2BCQBEAR-QLNaive-RLBCDQfDKL-c\fFigure 5: Average performance of BEAR-QL, BCQ, Na\u00efve RL and BC on random data (top row) and optimal\ndata (bottom row) over 5 seeds. BEAR-QL is the only algorithm capable of learning in both scenarios. 
Na\u00efve RL\ncannot handle optimal data, since it does not illustrate mistakes, and BCQ favors a behavioral cloning strategy\n(performs quite close to behaviour cloning in most cases), causing it to fail on random data. Average return over\nthe training dataset is indicated by the dashed magenta line.\n\nthat OOD actions are one of the main sources of error in off-policy learning on static datasets. The\nprior BCQ method [12] performs well on optimal data but performs poorly on random data, where\nthe constraint is too strict. These results show that BEAR-QL is robust to the dataset composition,\nand can learn consistently in a variety of settings. We \ufb01nd that KL-control and DQfD can be unstable\nin these settings.\nFinally, in Figure 4, we show that BEAR outperforms other considered prior methods in the challeng-\ning Humanoid-v2 environment as well, in two cases \u2013 Medium-quality data and random data.\n\n6.3 Analysis of BEAR-QL\n\nIn this section, we aim to analyze different com-\nponents of our method via an ablation study.\nOur \ufb01rst ablation studies the support constraint\ndiscussed in Section 5, which uses MMD to mea-\nsure support. We replace it with a more standard\nKL-divergence distribution constraint, which\nmeasures similarity in density. Our hypothesis\nis that this should provide a more conservative\nconstraint, since matching distributions is not\nnecessary for matching support. KL-divergence\nperforms well in some cases, such as with opti-\nmal data, but as shown in Figure 6, it performs\nworse than MMD on medium-quality data. Even\nwhen KL-divergence is hand tuned fully, so as to prevent instability issues it still performs worse\nthan a not-well tuned MMD constraint. We provide the results for this setting in the Appendix. We\nalso vary the number of samples n that are used to compute the MMD constraint. We \ufb01nd that\nsmaller n (\u2248 4 or 5) gives better performance. 
Although the difference is not large, the consistently better performance with 4 samples supports our hypothesis that an intermediate number of samples works well for support matching, and hence is less restrictive.

Figure 4: Performance of BEAR-QL, BCQ, Naïve RL and BC on medium-quality (left) and random (right) data in the Humanoid-v2 environment. Note that BEAR-QL outperforms prior methods.

7 Discussion and Future Work

The goal of our work was to study off-policy reinforcement learning with static datasets. We theoretically and empirically analyze how error propagates in off-policy RL due to the use of out-of-distribution actions for computing the target values in the Bellman backup. Our experiments suggest that this source of error is one of the primary issues afflicting off-policy RL: increasing the number of samples does not appear to mitigate the degradation issue (Figure 1), and training with naïve RL on data from a random policy, where
there are no out-of-distribution actions, shows much less degradation than training on data from more focused policies (Figure 5). Armed with this insight, we develop a method for mitigating the effect of out-of-distribution actions, which we call BEAR-QL. BEAR-QL constrains the backup to use actions that have non-negligible support under the data distribution, without being overly conservative in constraining the learned policy. We observe experimentally that BEAR-QL achieves good performance across a range of tasks and dataset compositions, learning well on random, medium-quality, and expert data.
While BEAR-QL substantially stabilizes off-policy RL, we believe that this problem merits further study. One limitation of our current method is that, although the learned policies are more performant than those acquired with naïve RL, performance sometimes still degrades over long training runs. An exciting direction for future work would be to develop an early stopping condition for RL, perhaps by generalizing the notion of validation error to reinforcement learning. A limitation of approaches that perform constrained action selection is that they can be overly conservative compared to methods that constrain state distributions directly, especially with datasets collected from mixtures of policies. We leave it to future work to design algorithms that can directly constrain state distributions. Developing a theoretically robust method for efficient support matching in high-dimensional continuous action spaces is another question for future research. Perhaps methods from outside RL, predominantly used in domain adaptation, such as asymmetric f-divergences [37], can be used for support restriction.
Another promising future direction is to examine how well BEAR-QL can work on large-scale off-policy learning problems, of the sort that are likely to arise in domains such as robotics, autonomous driving, operations research, and commerce. If RL algorithms can learn effectively from large-scale off-policy datasets, reinforcement learning can become a truly data-driven discipline, benefiting from the same advantage in generalization that has been seen in recent years in supervised learning fields, where large datasets have enabled rapid progress in terms of accuracy and generalization [7].

Figure 6: Average return (averaged over Hopper-v2 and Walker2d-v2) as a function of train steps for the ablation studies from Section 6.3. (a) MMD-constrained optimization is more stable and leads to better returns; (b) 4-sample MMD is more performant than 10-sample MMD.

Acknowledgements

We thank Kristian Hartikainen for sharing implementations of RL algorithms and for help in debugging certain issues. We thank Matthew Soh for help in setting up environments. We thank Aurick Zhou, Chelsea Finn, Abhishek Gupta and Kelvin Xu for informative discussions. We thank Ofir Nachum for comments on an earlier draft of this paper. We thank Google, NVIDIA, and Amazon for providing computational resources. This research was supported by Berkeley DeepDrive, JPMorgan Chase & Co., NSF IIS-1651843 and IIS-1614653, the DARPA Assured Autonomy program, and ARL DCIST CRA W911NF-17-2-0181.

References

[1] András Antos, Csaba Szepesvári, and Rémi Munos. Value-iteration based fitted policy iteration: Learning with a single trajectory. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 330–337, April 2007. doi: 10.1109/ADPRL.2007.368207.

[2] András Antos, Csaba Szepesvári, and Rémi Munos. Fitted Q-iteration in continuous action-space MDPs.
In Advances in Neural Information Processing Systems 20, pages 9–16. Curran Associates, Inc., 2008.

[3] James Bennett, Stan Lanning, et al. The Netflix prize. 2007.

[4] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.

[5] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In ICML, 2019.

[6] Tim de Bruin, Jens Kober, Karl Tuyls, and Robert Babuska. The importance of experience replay database composition in deep reinforcement learning. 2015.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[9] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[10] Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pages 568–576, 2010.

[11] Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing bottlenecks in deep Q-learning algorithms. arXiv preprint arXiv:1902.10250, 2019.

[12] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration.
arXiv preprint arXiv:1812.02900, 2018.

[13] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596. PMLR, 2018.

[14] Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from imperfect demonstrations. In ICLR (Workshop). OpenReview.net, 2018.

[15] Carles Gelada and Marc G. Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. CoRR, abs/1901.09455, 2019.

[16] Jordi Grau-Moya, Felix Leibfried, and Peter Vrancx. Soft Q-learning with mutual-information regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyEtjoCqFX.

[17] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, March 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2188385.2188410.

[18] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[20] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations.
In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[21] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Àgata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind W. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. CoRR, abs/1907.00456, 2019. URL http://arxiv.org/abs/1907.00456.

[22] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267–274. Morgan Kaufmann Publishers Inc., 2002.

[23] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 651–673. PMLR, 2018.

[24] Romain Laroche, Paul Trichelair, and Remi Tachet des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning (ICML), 2019.

[25] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909, 2018. URL http://arxiv.org/abs/1805.00909.

[26] A. Rupam Mahmood, Huizhen Yu, Martha White, and Richard S. Sutton. Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569, 2015.

[27] Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the Twentieth International Conference on Machine Learning, pages 560–567. AAAI Press, 2003.

[28] Rémi Munos. Error bounds for approximate value iteration.
In Proceedings of the National Conference on Artificial Intelligence, 2005.

[29] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.

[30] Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning (ICML), 2001.

[31] Stefan Schaal. Is imitation learning the route to humanoid robots? 1999.

[32] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2016.

[33] Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 16:1629–1676, 2015. URL http://jmlr.org/papers/v16/scherrer15a.html.

[34] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889–1897, Lille, France, 07–09 Jul 2015. PMLR.

[35] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. Second edition, 2018.

[36] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, pages 5026–5033, 2012.

[37] Yifan Wu, Ezra Winston, Divyansh Kaushik, and Zachary Lipton. Domain adaptation with asymmetrically-relaxed distribution alignment. In ICML, 2019.

[38] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. CoRR, abs/1805.04687, 2018.
URL http://arxiv.org/abs/1805.04687.