{"title": "Real-Time Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3073, "page_last": 3082, "abstract": "Markov Decision Processes (MDPs), the mathematical framework underlying most algorithms in Reinforcement Learning (RL), are often used in a way that wrongfully assumes that the state of an agent's environment does not change during action selection. As RL systems based on MDPs begin to find application in real-world safety critical situations, this mismatch between the assumptions underlying classical MDPs and the reality of real-time computation may lead to undesirable outcomes. \nIn this paper, we introduce a new framework, in which states and actions evolve simultaneously and show how it is related to the classical MDP formulation. We analyze existing algorithms under the new real-time formulation and show why they are suboptimal when used in real-time. We then use those insights to create a new algorithm Real-Time Actor Critic (RTAC) that outperforms the existing state-of-the-art continuous control algorithm Soft Actor Critic both in real-time and non-real-time settings.", "full_text": "arXiv:1911.04448v4 [cs.LG] 12 Dec 2019\n\nReal-Time Reinforcement Learning\n\nSimon Ramstedt\nMila, Element AI, Universit\u00e9 de Montr\u00e9al\nsimonramstedt@gmail.com\n\nChristopher Pal\nMila, Element AI, Polytechnique Montr\u00e9al\nchristopher.pal@polymtl.ca\n\nAbstract\n\nMarkov Decision Processes (MDPs), the mathematical framework underlying\nmost algorithms in Reinforcement Learning (RL), are often used in a way that\nwrongfully assumes that the state of an agent\u2019s environment does not change during\naction selection. 
As RL systems based on MDPs begin to find application in real-world,\nsafety-critical situations, this mismatch between the assumptions underlying\nclassical MDPs and the reality of real-time computation may lead to undesirable\noutcomes. In this paper, we introduce a new framework, in which states and actions\nevolve simultaneously and show how it is related to the classical MDP formulation.\nWe analyze existing algorithms under the new real-time formulation and show why\nthey are suboptimal when used in real time. We then use those insights to create a\nnew algorithm, Real-Time Actor-Critic (RTAC), that outperforms the existing\nstate-of-the-art continuous control algorithm Soft Actor-Critic both in real-time and\nnon-real-time settings. Code and videos can be found at github.com/rmst/rtrl.\n\nReinforcement Learning has led to great successes in games (Tesauro, 1994; Mnih et al., 2015; Silver\net al., 2017) and is starting to be applied successfully to real-world robotic control (Schulman et al.,\n2015; Hwangbo et al., 2019).\nThe theoretical underpinning for most methods in Reinforcement Learning\nis the Markov Decision Process (MDP) framework (Bellman, 1957).\nWhile it is well suited to describe turn-based decision problems such\nas board games, this framework is ill suited for real-time applications\nin which the environment\u2019s state continues to evolve while the agent\nselects an action (Travnik et al., 2018). Nevertheless, this framework\nhas been used for real-time problems using what are essentially tricks,\ne.g. 
pausing a simulated environment during action selection or ensuring\nthat the time required for action selection is negligible (Hwangbo et al.,\n2017).\nInstead of relying on such tricks, we propose an augmented decision-making\nframework - Real-Time Reinforcement Learning (RTRL) - in\nwhich the agent is allowed exactly one timestep to select an action.\nRTRL is conceptually simple and opens up new algorithmic possibilities\nbecause of its special structure.\nWe leverage RTRL to create Real-Time Actor-Critic (RTAC), a new actor-critic\nalgorithm, better suited for real-time interaction, that is based on\nSoft Actor-Critic (Haarnoja et al., 2018a). We then show experimentally\nthat RTAC outperforms SAC in both real-time and non-real-time settings.\n\nFigure 1: Turn-based interaction\n\nFigure 2: Real-time interaction\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Background\n\nIn Reinforcement Learning the world is split up into agent and environment. The agent is represented\nby a policy \u2013 a state-conditioned action distribution, while the environment is represented by a Markov\nDecision Process (Def. 1). Traditionally, the agent-environment interaction has been governed by\nthe MDP framework. Here, however, we strictly use MDPs to represent the environment. The\nagent-environment interaction is instead described by different types of Markov Reward Processes\n(MRP), with the TBMRP (Def. 2) behaving like the traditional interaction scheme.\nDefinition 1. 
A Markov Decision Process (MDP) is characterized by a tuple with\n(1) state space $S$,\n(2) action space $A$,\n(3) initial state distribution $\\mu : S \\to \\mathbb{R}$,\n(4) transition distribution $p : S \\times S \\times A \\to \\mathbb{R}$,\n(5) reward function $r : S \\times A \\to \\mathbb{R}$.\n\nAn agent-environment system can be condensed into a Markov Reward Process $(S, \\mu, \\kappa, \\bar{r})$ consisting\nof a Markov process $(S, \\mu, \\kappa)$ and a state-reward function $\\bar{r}$. The Markov process induces a sequence\nof states $(s_t)_{t \\in \\mathbb{N}}$ and, together with $\\bar{r}$, a sequence of rewards $(r_t)_{t \\in \\mathbb{N}} = (\\bar{r}(s_t))_{t \\in \\mathbb{N}}$.\nAs usual, the objective is to find a policy that maximizes the expected sum of rewards. In practice,\nrewards can be discounted and augmented to guarantee convergence, reduce variance and encourage\nexploration. However, when evaluating the performance of an agent, we will always use the\nundiscounted sum of rewards.\n\n1.1 Turn-Based Reinforcement Learning\n\nUsually considered part of the standard Reinforcement Learning framework is the\nturn-based scheme in which agent and environment interact. We call this interaction\nscheme Turn-Based Markov Reward Process.\nDefinition 2. A Turn-Based Markov Reward Process $(S, \\mu, \\kappa, \\bar{r}) = \\mathrm{TBMRP}(E, \\pi)$\ncombines a Markov Decision Process $E = (S, A, \\mu, p, r)$ with a policy $\\pi$, such that\n\n$$\\kappa(s_{t+1} \\mid s_t) = \\int_A p(s_{t+1} \\mid s_t, a)\\, \\pi(a \\mid s_t)\\, da \\quad \\text{and} \\quad \\bar{r}(s_t) = \\int_A r(s_t, a)\\, \\pi(a \\mid s_t)\\, da. \\qquad (1)$$\n\nWe say the interaction is turn-based, because the environment pauses while the agent\nselects an action and the agent pauses until it receives a new observation from the\nenvironment. This is illustrated in Figure 1. An action selected in a certain state is\npaired up again with that same state to induce the next. 
The state does not change\nduring the action selection process.\n\nFigure 3: TBMRP\n\n2 Real-Time Reinforcement Learning\n\nIn contrast to the conventional, turn-based interaction scheme, we propose an alternative,\nreal-time interaction framework in which states and actions evolve simultaneously.\nHere, agent and environment step in unison to produce new state-action\npairs $\\mathbf{x}_{t+1} = (s_{t+1}, a_{t+1})$ from old state-action pairs $\\mathbf{x}_t = (s_t, a_t)$, as illustrated in\nFigures 2 and 4.\nDefinition 3. A Real-Time Markov Reward Process $(\\mathbf{X}, \\boldsymbol{\\mu}, \\boldsymbol{\\kappa}, \\bar{\\mathbf{r}}) = \\mathrm{RTMRP}(E, \\boldsymbol{\\pi})$\ncombines a Markov Decision Process $E = (S, A, \\mu, p, r)$ with a policy $\\boldsymbol{\\pi}$, such that\n\n$$\\boldsymbol{\\kappa}(s_{t+1}, a_{t+1} \\mid s_t, a_t) = p(s_{t+1} \\mid s_t, a_t)\\, \\boldsymbol{\\pi}(a_{t+1} \\mid s_t, a_t) \\quad \\text{and} \\quad \\bar{\\mathbf{r}}(s_t, a_t) = r(s_t, a_t). \\qquad (2)$$\n\nThe system state space is $\\mathbf{X} = S \\times A$. The initial action $a_0$ can be set to some fixed\nvalue, i.e. $\\boldsymbol{\\mu}(s_0, a_0) = \\mu(s_0)\\, \\delta(a_0 - c)$.\u00b9\nNote that we introduced a new policy $\\boldsymbol{\\pi}$ that takes state-action pairs instead of just states. That is\nbecause the system state $\\mathbf{x} = (s, a)$ is now a state-action pair and $s$ alone is not a sufficient statistic\nof the future of the stochastic process anymore.\n\nFigure 4: RTMRP\n\n\u00b9 $\\delta$ is the Dirac delta distribution. If $y \\sim \\delta(\\cdot - x)$ then $y = x$ with probability one.\n\n2.1 The real-time framework is made for back-to-back action selection\n\nIn the real-time framework, the agent has exactly one timestep to select an action. If an agent takes\nlonger than that, its policy would have to be broken up into stages that each take less than one timestep to\nevaluate. On the other hand, if an agent takes less than one timestep to select an action, the real-time\nframework will delay applying the action until the next observation is made. 
The optimal case is\nwhen an agent, immediately upon finishing selecting an action, observes the next state and starts\ncomputing the next action. This continuous, back-to-back action selection is ideal in that it allows the\nagent to update its actions the quickest and no delay is introduced through the real-time framework.\nTo achieve back-to-back action selection, it might be necessary to match timestep size to the policy\nevaluation time. With current algorithms, reducing timestep size might lead to worse performance.\nRecently, however, progress has been made towards timestep agnostic methods (Tallec et al., 2019).\nWe believe back-to-back action selection is an achievable goal and we demonstrate here that the\nreal-time framework is effective even if we are not able to tune timestep size (Section 5).\n\n2.2 Real-time interaction can be expressed within the turn-based framework\n\nIt is possible to express real-time interaction within the standard, turn-based framework, which allows\nus to reconnect the real-time framework to the vast body of work in RL. Specifically, we are trying to\nfind an augmented environment $\\mathrm{RTMDP}(E)$ that behaves the same with turn-based interaction as\nwould $E$ with real-time interaction.\nIn the real-time framework the agent communicates its action to the environment via the state.\nHowever, in the traditional, turn-based framework, only the environment can directly influence the\nstate. We therefore need to deterministically \"pass through\" the action to the next state by augmenting\nthe transition function. The RTMDP has two types of actions, (1) the actions $\\mathbf{a}_t$ emitted by the policy\nand (2) the action component $a_t$ of the state $\\mathbf{x}_t = (s_t, a_t)$, where $a_t = \\mathbf{a}_{t-1}$ with probability one.\nDefinition 4. 
A Real-Time Markov Decision Process $(\\mathbf{X}, A, \\boldsymbol{\\mu}, \\mathbf{p}, \\mathbf{r}) = \\mathrm{RTMDP}(E)$ augments\nanother Markov Decision Process $E = (S, A, \\mu, p, r)$, such that\n(1) state space $\\mathbf{X} = S \\times A$,\n(2) action space is $A$,\n(3) initial state distribution $\\boldsymbol{\\mu}(\\mathbf{x}_0) = \\boldsymbol{\\mu}(s_0, a_0) = \\mu(s_0)\\, \\delta(a_0 - c)$,\n(4) transition distribution $\\mathbf{p}(\\mathbf{x}_{t+1} \\mid \\mathbf{x}_t, \\mathbf{a}_t) = \\mathbf{p}(s_{t+1}, a_{t+1} \\mid s_t, a_t, \\mathbf{a}_t) = p(s_{t+1} \\mid s_t, a_t)\\, \\delta(a_{t+1} - \\mathbf{a}_t)$,\n(5) reward function $\\mathbf{r}(\\mathbf{x}_t, \\mathbf{a}_t) = \\mathbf{r}(s_t, a_t, \\mathbf{a}_t) = r(s_t, a_t)$.\nTheorem 1.\u00b2 A policy $\\boldsymbol{\\pi} : A \\times \\mathbf{X} \\to \\mathbb{R}$ interacting with $\\mathrm{RTMDP}(E)$ in the conventional, turn-based\nmanner gives rise to the same Markov Reward Process as $\\boldsymbol{\\pi}$ interacting with $E$ in real-time, i.e.\n\n$$\\mathrm{RTMRP}(E, \\boldsymbol{\\pi}) = \\mathrm{TBMRP}(\\mathrm{RTMDP}(E), \\boldsymbol{\\pi}). \\qquad (3)$$\n\nInterestingly, the RTMDP is equivalent to a 1-step constant delay MDP (Walsh et al., 2008). However,\nwe believe the different intuitions behind both of them warrant the different names: The constant delay\nMDP is trying to model external action and observation delays whereas the RTMDP is modelling\nthe time it takes to select an action. The connection makes sense, though: In a framework where the\naction selection is assumed to be instantaneous, we can apply a delay to account for the fact that the\naction selection was not instantaneous after all.\n\n2.3 Turn-based interaction can be expressed within the real-time framework\n\nIt is also possible to define an augmentation $\\mathrm{TBMDP}(E)$ that allows us to express turn-based\nenvironments (e.g. Chess, Go) within the real-time framework (Def. 7 in the Appendix). By assigning\nseparate timesteps to agent and environment, we can allow the agent to act while the environment\npauses. More specifically, we add a binary variable $b$ to the state to keep track of whether it is the\nenvironment\u2019s or the agent\u2019s turn. 
While $b$ inverts at every timestep, the underlying environment only\nadvances every other timestep.\nTheorem 2. A policy $\\boldsymbol{\\pi}(\\mathbf{a} \\mid s, b, a) = \\pi(\\mathbf{a} \\mid s)$ interacting with $\\mathrm{TBMDP}(E)$ in real time gives rise to\na Markov Reward Process that contains (Def. 10) the MRP resulting from $\\pi$ interacting with $E$ in the\nconventional, turn-based manner, i.e.\n\n$$\\mathrm{TBMRP}(E, \\pi) \\propto \\mathrm{RTMRP}(\\mathrm{TBMDP}(E), \\boldsymbol{\\pi}). \\qquad (4)$$\n\nAs a result, not only can we use conventional algorithms in the real-time framework but we can use\nalgorithms built on the real-time framework for all turn-based problems.\n\n2 All proofs are in Appendix C.\n\n3 Reinforcement Learning in Real-Time Markov Decision Processes\n\nHaving established the RTMDP as a compatibility layer between conventional RL and RTRL, we can\nnow look at how existing theory changes when moving from an environment $E$ to $\\mathrm{RTMDP}(E)$.\nSince most RL methods assume that the environment\u2019s dynamics are completely unknown, they will\nnot be able to make use of the fact that we precisely know part of the dynamics of RTMDP. Specifically,\nthey will have to learn the effects of the \"feed-through\" mechanism from data, which could lead to\nmuch slower learning and worse performance when applied to an environment $\\mathrm{RTMDP}(E)$ instead\nof $E$. This could especially hurt the performance of off-policy algorithms which have been among\nthe most successful RL methods to date (Mnih et al., 2015; Haarnoja et al., 2018a). Most off-policy\nmethods make use of the action-value function.\nDefinition 5. 
The action-value function $q^{\\pi}_{E}$ for an environment $E = (S, A, \\mu, p, r)$ and a policy $\\pi$\ncan be recursively defined as\n\n$$q^{\\pi}_{E}(s_t, a_t) = r(s_t, a_t) + \\mathbb{E}_{s_{t+1} \\sim p(\\cdot \\mid s_t, a_t)}\\big[\\mathbb{E}_{a_{t+1} \\sim \\pi(\\cdot \\mid s_{t+1})}[q^{\\pi}_{E}(s_{t+1}, a_{t+1})]\\big]. \\qquad (5)$$\n\nWhen this identity is used to train an action-value estimator, the transition $s_t, a_t, s_{t+1}$ can be sampled\nfrom a replay memory containing off-policy experience while the next action $a_{t+1}$ is sampled from\nthe policy $\\pi$.\nLemma 1. In a Real-Time Markov Decision Process for the action-value function we have\n\n$$q^{\\boldsymbol{\\pi}}_{\\mathrm{RTMDP}(E)}(s_t, a_t, \\mathbf{a}_t) = r(s_t, a_t) + \\mathbb{E}_{s_{t+1} \\sim p(\\cdot \\mid s_t, a_t)}\\big[\\mathbb{E}_{\\mathbf{a}_{t+1} \\sim \\boldsymbol{\\pi}(\\cdot \\mid s_{t+1}, \\mathbf{a}_t)}[q^{\\boldsymbol{\\pi}}_{\\mathrm{RTMDP}(E)}(s_{t+1}, \\mathbf{a}_t, \\mathbf{a}_{t+1})]\\big]. \\qquad (6)$$\n\nNote that the action $\\mathbf{a}_t$ affects neither the reward nor the next state. The only thing that $\\mathbf{a}_t$ does\naffect is $a_{t+1}$ which, in turn, only in the next timestep will affect $r(s_{t+1}, a_{t+1})$ and $s_{t+2}$. To learn\nthe effect of an action on $E$ (specifically the future rewards), we now have to perform two updates\nwhere previously we only had to perform one. We will investigate experimentally the effect of this on\nthe off-policy Soft Actor-Critic algorithm (Haarnoja et al., 2018a) in Section 5.1.\n\n3.1 Learning the state-value function off-policy\n\nThe state-value function can usually not be used in the same way as the action-value function for\noff-policy learning.\nDefinition 6. The state-value function $v^{\\pi}_{E}$ for an environment $E = (S, A, \\mu, p, r)$ and a policy $\\pi$ is\n\n$$v^{\\pi}_{E}(s_t) = \\mathbb{E}_{a_t \\sim \\pi(\\cdot \\mid s_t)}\\big[r(s_t, a_t) + \\mathbb{E}_{s_{t+1} \\sim p(\\cdot \\mid s_t, a_t)}[v^{\\pi}_{E}(s_{t+1})]\\big]. \\qquad (7)$$\n\nThe definition shows that the expectation over the action is taken before the expectation over the next\nstate. 
When using this identity to train a state-value estimator, we cannot simply change the action\ndistribution to allow for off-policy learning since we have no way of resampling the next state.\nLemma 2. In a Real-Time Markov Decision Process for the state-value function we have\n\n$$v^{\\boldsymbol{\\pi}}_{\\mathrm{RTMDP}(E)}(s_t, a_t) = r(s_t, a_t) + \\mathbb{E}_{s_{t+1} \\sim p(\\cdot \\mid s_t, a_t)}\\big[\\mathbb{E}_{\\mathbf{a}_t \\sim \\boldsymbol{\\pi}(\\cdot \\mid s_t, a_t)}[v^{\\boldsymbol{\\pi}}_{\\mathrm{RTMDP}(E)}(s_{t+1}, \\mathbf{a}_t)]\\big]. \\qquad (8)$$\n\nHere, $s_t, a_t, s_{t+1}$ is always a valid transition no matter what action $\\mathbf{a}_t$ is selected. Therefore, when\nusing the real-time framework, we can use the value function for off-policy learning. Since Equation 8\nis the same as Equation 5 (except for the policy inputs), we can use the state-value function where\npreviously the action-value function was used without having to learn the dynamics of the RTMDP\nfrom data since they have already been applied to Equation 8.\n\n3.2 Partial simulation\n\nThe off-policy learning procedure described in the previous section can be applied more generally.\nWhenever parts of the agent-environment system are known and (temporarily) independent of\nthe remaining system, they can be used to generate synthetic experience. More precisely, transitions\nwith a start state $s = (w, z)$ can be generated according to the true transition kernel\n$\\kappa(s' \\mid s)$ by simulating the known part of the transition ($w \\to w'$) and using a stored sample for\nthe unknown part of the transition ($z \\to z'$). This is only possible if the transition kernel factorizes\nas $\\kappa(w', z' \\mid s) = \\kappa_{\\text{known}}(w' \\mid s)\\, \\kappa_{\\text{unknown}}(z' \\mid s)$. Hindsight Experience Replay (Andrychowicz\net al., 2017) can be seen as another example of partial simulation. There, the goal part of the state\nevolves independently of the rest which allows for changing the goal in hindsight. 
In the next section,\nwe use the same partial simulation principle to compute the gradient of the policy loss.\n\n4 Real-Time Actor-Critic (RTAC)\n\nActor-Critic algorithms (Konda & Tsitsiklis, 2000) formulate the RL problem as bi-level optimization\nwhere the critic evaluates the actor as accurately as possible while the actor tries to improve its\nevaluation by the critic. Silver et al. (2014) showed that it is possible to reparameterize the actor\nevaluation and directly compute the pathwise derivative from the critic with respect to the actor\nparameters, thus telling the actor how to improve. Heess et al. (2015) extended that to stochastic\npolicies and Haarnoja et al. (2018a) further extended it to the maximum entropy objective to create\nSoft Actor-Critic (SAC), on which RTAC is based and against which it is compared.\nIn SAC a parameterized policy $\\pi$ (the actor) is optimized to minimize the KL-divergence between\nitself and the exponential of an approximation of the action-value function $q$ (the critic) normalized\nby $Z$ (where $Z$ is unknown but irrelevant to the gradient) giving rise to the policy loss\u00b3\n\n$$L^{\\mathrm{SAC}}_{E,\\pi} = \\mathbb{E}_{s_t \\sim D}\\, D_{\\mathrm{KL}}\\big(\\pi(\\cdot \\mid s_t) \\,\\|\\, \\exp(\\tfrac{1}{\\alpha} q(s_t, \\cdot)) / Z(s_t)\\big) \\qquad (9)$$\n\nwhere $D$ is a uniform distribution over the replay memory containing past states, actions and rewards.\nThe action-value function itself is optimized to fit Equation 5 presented in the previous section\n(augmented with an entropy term). We can thus expect SAC to perform worse in RTMDPs.\nIn order to create an algorithm better suited for the real-time setting we propose to use a state-value\nfunction approximator $\\mathbf{v}$ as the critic instead, which will give rise to the same policy gradient.\nProposition 1. 
The following policy loss based on the state-value function\n\n$$L^{\\mathrm{RTAC}}_{\\mathrm{RTMDP}(E),\\boldsymbol{\\pi}} = \\mathbb{E}_{(s_t, a_t) \\sim D}\\, \\mathbb{E}_{s_{t+1} \\sim p(\\cdot \\mid s_t, a_t)}\\, D_{\\mathrm{KL}}\\big(\\boldsymbol{\\pi}(\\cdot \\mid s_t, a_t) \\,\\|\\, \\exp(\\tfrac{1}{\\alpha} \\gamma \\mathbf{v}(s_{t+1}, \\cdot)) / Z(s_{t+1})\\big) \\qquad (10)$$\n\nhas the same policy gradient as $L^{\\mathrm{SAC}}_{\\mathrm{RTMDP}(E),\\boldsymbol{\\pi}}$, i.e.\n\n$$\\nabla_{\\boldsymbol{\\pi}} L^{\\mathrm{RTAC}}_{\\mathrm{RTMDP}(E),\\boldsymbol{\\pi}} = \\nabla_{\\boldsymbol{\\pi}} L^{\\mathrm{SAC}}_{\\mathrm{RTMDP}(E),\\boldsymbol{\\pi}}. \\qquad (11)$$\n\nThe value function itself is trained off-policy according to the procedure described in Section 3.1 to\nfit an augmented version of Equation 8, specifically\n\n$$\\mathbf{v}_{\\text{target}} = r(s_t, a_t) + \\mathbb{E}_{s_{t+1} \\sim p(\\cdot \\mid s_t, a_t)}\\big[\\mathbb{E}_{\\mathbf{a}_t \\sim \\boldsymbol{\\pi}(\\cdot \\mid s_t, a_t)}[\\bar{\\mathbf{v}}_{\\bar{\\theta}}((s_{t+1}, \\mathbf{a}_t)) - \\alpha \\log(\\boldsymbol{\\pi}(\\mathbf{a}_t \\mid s_t, a_t))]\\big]. \\qquad (12)$$\n\nTherefore, for the value loss, we have\n\n$$L^{\\mathrm{RTAC}}_{\\mathrm{RTMDP}(E),\\mathbf{v}} = \\mathbb{E}_{(\\mathbf{x}_t, \\mathbf{r}_t, s_{t+1}) \\sim D}\\big[(\\mathbf{v}(\\mathbf{x}_t) - \\mathbf{v}_{\\text{target}})^2\\big] \\qquad (13)$$\n\n4.1 Merging actor and critic\n\nUsing the state-value function as the critic has another advantage: When evaluated at the same\ntimestep, the critic does not depend on the actor\u2019s output anymore and we are therefore able to use\na single neural network to represent both the actor and the critic. Merging actor and critic makes\nit necessary to trade off between the value function and policy loss. Therefore, we introduce an\nadditional hyperparameter $\\beta$.\n\n$$L(\\theta) = \\beta L^{\\mathrm{RTAC}}_{\\mathrm{RTMDP}(E),\\boldsymbol{\\pi}_\\theta} + (1 - \\beta) L^{\\mathrm{RTAC}}_{\\mathrm{RTMDP}(E),\\mathbf{v}_\\theta} \\qquad (14)$$\n\nMerging actor and critic could speed up learning and even improve generalization, but could also lead\nto greater instability. We compare RTAC with both merged and separate actor and critic networks in\nSection 5.\n\n4.2 Stabilizing learning\n\nActor-Critic algorithms are known to be unstable during training. We use a number of techniques that\nhelp make training more stable. 
Most notably we use Pop-Art output normalization (van Hasselt et al.,\n2016) to normalize the value targets. This is necessary if $v$ and $\\pi$ are represented using an overlapping\nset of parameters. Since the scale of the error gradients of the value loss is highly non-stationary, it\nis hard to find a good trade-off between policy and value loss ($\\beta$). If $v$ and $\\pi$ are separate, Pop-Art\nmatters less, but still improves performance in both SAC and RTAC.\n\n\u00b3 $\\alpha$ is a temperature hyperparameter. For $\\alpha \\to 0$, the maximum entropy objective reduces to the traditional\nobjective. To compare with the hyperparameters table, we have $\\alpha = \\frac{\\text{entropy scale}}{\\text{reward scale}}$.\n\nAnother difficulty is the recursive value function targets. Since we try to maximize the value\nfunction, overestimation errors in the value function approximator are amplified and recursively used\nas target values in the following optimization steps. As introduced by Fujimoto et al. (2018) and like\nSAC, we will use two value function approximators and take their minimum when computing the\ntarget values to reduce value overestimation, i.e. $\\bar{\\mathbf{v}}_{\\bar{\\theta}}(\\cdot) = \\min_{i \\in \\{1,2\\}} \\mathbf{v}_{\\bar{\\theta}, i}(\\cdot)$.\nLastly, to further stabilize the recursive value function estimation, we use target networks that slowly\ntrack the weights of the network (Mnih et al., 2015; Lillicrap et al., 2015), i.e. 
$\\bar{\\theta} \\leftarrow \\tau \\theta + (1 - \\tau) \\bar{\\theta}$.\nThe tracking weights $\\bar{\\theta}$ are then used to compute $\\mathbf{v}_{\\text{target}}$ in Equation 12.\n\n5 Experiments\n\nWe compare Real-Time Actor-Critic to Soft Actor-Critic (Haarnoja et al., 2018a) on several\nOpenAI-Gym/MuJoCo benchmark environments (Brockman et al., 2016; Todorov et al., 2012) as well as on\ntwo Avenue autonomous driving environments with visual observations (Ibrahim et al., 2019).\nThe SAC agents used for the results here include both an action-value and a state-value function\napproximator and use a fixed entropy scale $\\alpha$ (as in Haarnoja et al. (2018a)). In the code accompanying\nthis paper we dropped the state-value function approximator since it had no impact on the results (as\ndone and observed in Haarnoja et al. (2018b)). For a comparison to other algorithms such as DDPG,\nPPO and TD3 also see Haarnoja et al. (2018a,b).\nTo make the comparison between the two algorithms as fair as possible, we also use output\nnormalization in SAC which improves performance on all tasks (see Figure 9 in Appendix A for a\ncomparison between normalized and unnormalized SAC). Both SAC and RTAC perform a\nsingle optimization step at every timestep in the environment starting after the first 10000 timesteps\nof collecting experience based on the initial random policy. The hyperparameters used can be found\nin Table 1.\n\n5.1 SAC in Real-Time Markov Decision Processes\n\nWhen comparing the return trends of SAC in turn-based environments $E$ against SAC in real-time\nenvironments $\\mathrm{RTMDP}(E)$, the performance of SAC deteriorates. This seems to confirm our\nhypothesis that having to learn the dynamics of the augmented environment from data impedes\naction-value function approximation (as hypothesized in Section 3).\n\nFigure 5: Return trends for SAC in turn-based environments $E$ and real-time environments\n$\\mathrm{RTMDP}(E)$. 
Mean and 95% confidence interval are computed over eight training runs per environment.\n\n5.2 RTAC and SAC on MuJoCo in real time\n\nFigure 6 shows a comparison between RTAC and SAC in real-time versions of the benchmark\nenvironments. We can see that RTAC learns much faster and achieves higher returns than SAC in\n$\\mathrm{RTMDP}(E)$. This makes sense as it does not have to learn from data the \"pass-through\" behavior\nof the RTMDP. We also show RTAC with separate neural networks for the policy and value components,\nwhich shows that a big part of RTAC\u2019s advantage over SAC is its value function update. However, the fact\nthat policy and value function networks can be merged further improves RTAC\u2019s performance as the\nplots suggest. Note that RTAC is always in $\\mathrm{RTMDP}(E)$, therefore we do not explicitly state it again.\nRTAC even outperforms SAC in $E$ (when SAC is allowed to act without real-time constraints) in\nfour out of six environments including the two hardest ones - Ant and Humanoid - with the largest state\nand action spaces (Figure 11). We theorize this is possible due to the merged actor and critic networks\nused in RTAC. It is important to note, however, that for RTAC with merged actor and critic networks\noutput normalization is critical (Figure 12).\n\nFigure 6: Comparison between RTAC and SAC in RTMDP versions of the benchmark environments.\nMean and 95% confidence interval are computed over eight training runs per environment.\n\n5.3 Autonomous driving task\n\nIn addition to the MuJoCo environments, we have also tested RTAC and SAC on an autonomous\ndriving task using the Avenue simulator (Ibrahim et al., 2019). Avenue is a game-engine-based\nsimulator where the agent controls a car. In the task shown here, the agent has to stay on the road and\npossibly steer around pedestrians. Each observation consists of a single image (256x64 grayscale pixels) and\nthe car\u2019s velocity. 
The actions are continuous and two dimensional, representing steering angle and\ngas-brake. The agent is rewarded proportionally to the car\u2019s velocity in the direction of the road and\nnegatively rewarded when making contact with a pedestrian or another car. In addition, episodes are\nterminated when leaving the road or colliding with any objects or pedestrians.\n\nFigure 7: Left: Agent\u2019s view in RaceSolo. Right: Passenger view in CityPedestrians.\n\nFigure 8: Comparison between RTAC and SAC in RTMDP versions of the autonomous driving tasks.\nWe can see that RTAC under real-time constraints outperforms SAC even without real-time\nconstraints. Mean and 95% confidence interval are computed over four training runs per environment.\n\nThe hyperparameters used for the autonomous driving task are largely the same as for the MuJoCo\ntasks; however, we used a lower entropy reward scale (0.05) and lower learning rate (0.0002). We\nused convolutional neural networks with four layers of convolutions with filter sizes (8, 4, 4, 4),\nstrides (2, 2, 2, 1) and (64, 64, 128, 128) channels. The convolutional layers are followed by two\nfully connected layers with 512 units each.\n\n6 Related work\n\nTravnik et al. (2018) noticed that the traditional MDP framework is ill suited for real-time problems.\nUnlike our paper, however, they propose no rigorous framework as an alternative, nor do they provide any\ntheoretical analysis.\nFiroiu et al. 
(2018) apply a multi-step action delay to level the playing field between humans and\nartificial agents on the ALE (Atari) benchmark. However, they do not address the problems arising\nfrom the turn-based MDP framework or recognize the significance and consequences of the one-step\naction delay.\nSimilar to RTAC, NAF (Gu et al., 2016) is able to do continuous control with a single neural network.\nHowever, it requires the action-value function to be quadratic in the action (and thus possible to\noptimize in closed form). This assumption is quite restrictive, and NAF could not outperform more general\nmethods such as DDPG.\nIn SVG(1) (Heess et al., 2015) a differentiable transition model is used to compute the path-wise\nderivative of the value function one timestep after the action selection. This is similar to what RTAC\nis doing when using the value function to compute the policy gradient. However, in RTAC, we use\nthe actual differentiable dynamics of the RTMDP, i.e. \"passing through\" the action to the next state,\nand therefore we do not need to approximate the transition dynamics. At the same time, transitions\nfor the underlying environment are not modelled at all and instead sampled which is only possible\nbecause the actions $\\mathbf{a}_t$ in an RTMDP only start to influence the underlying environment at the next\ntimestep.\n\n7 Discussion\n\nWe have introduced a new framework for Reinforcement Learning, RTRL, in which agent and\nenvironment step in unison to create a sequence of state-action pairs. We connected RTRL to the\nconventional Reinforcement Learning framework through the RTMDP and investigated its effects\nin theory and practice. 
We predicted and con\ufb01rmed experimentally that conventional off-policy\nalgorithms would perform worse in real-time environments and then proposed a new actor-critic\nalgorithm, RTAC, that not only avoids the problems of conventional off-policy methods with real-\ntime interaction but also allows us to merge actor and critic which comes with an additional gain\nin performance. We showed that RTAC outperforms SAC on both a standard, low dimensional\ncontinuous control benchmark, as well as a high dimensional autonomous driving task.\n\n8\n\n\fAcknowledgments\n\nWe would like to thank Cyril Ibrahim for building and helping us with the Avenue simulator; Craig\nQuiter and Sherjil Ozair for insightful discussions about agent-environment interaction; Alex Pich\u00e9,\nScott Fujimoto, Bhairav Metha and Jhelum Chakravorty, for reading drafts of this paper and \ufb01nally\nJose Gallego, Olexa Bilaniuk and many others at Mila that helped us on countless occasions online\nand of\ufb02ine.\nThis work was completed during a part-time internship at Element AI and was supported by the Open\nPhilanthropy Project.\n\nReferences\nAndrychowicz, Marcin, Wolski, Filip, Ray, Alex, Schneider, Jonas, Fong, Rachel, Welinder, Peter,\nMcGrew, Bob, Tobin, Josh, Abbeel, OpenAI Pieter, and Zaremba, Wojciech. Hindsight experience\nreplay. In Advances in Neural Information Processing Systems, pp. 5048\u20135058, 2017.\n\nBellman, Richard. A markovian decision process. Journal of mathematics and mechanics, pp.\n\n679\u2013684, 1957.\n\nBrockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie,\n\nand Zaremba, Wojciech. Openai gym, 2016.\n\nFiroiu, Vlad, Ju, Tina, and Tenenbaum, Joshua B. At human speed: Deep reinforcement learning\n\nwith action delay. CoRR, abs/1810.07286, 2018.\n\nFujimoto, Scott, van Hoof, Herke, and Meger, David. Addressing function approximation error in\n\nactor-critic methods. 
arXiv preprint arXiv:1802.09477, 2018.\n\nGu, Shixiang, Lillicrap, Timothy, Sutskever, Ilya, and Levine, Sergey. Continuous deep q-learning\nwith model-based acceleration. In International Conference on Machine Learning, pp. 2829\u20132838,\n2016.\n\nHaarnoja, Tuomas, Zhou, Aurick, Abbeel, Pieter, and Levine, Sergey. Soft actor-critic: Off-\npolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint\narXiv:1801.01290, 2018a.\n\nHaarnoja, Tuomas, Zhou, Aurick, Hartikainen, Kristian, Tucker, George, Ha, Sehoon, Tan, Jie,\nKumar, Vikash, Zhu, Henry, Gupta, Abhishek, Abbeel, Pieter, et al. Soft actor-critic algorithms\nand applications. arXiv preprint arXiv:1812.05905, 2018b.\n\nHeess, Nicolas, Wayne, Gregory, Silver, David, Lillicrap, Tim, Erez, Tom, and Tassa, Yuval. Learning\ncontinuous control policies by stochastic value gradients. In Advances in Neural Information\nProcessing Systems, pp. 2944\u20132952, 2015.\n\nHwangbo, Jemin, Sa, Inkyu, Siegwart, Roland, and Hutter, Marco. Control of a quadrotor with\n\nreinforcement learning. IEEE Robotics and Automation Letters, 2(4):2096\u20132103, 2017.\n\nHwangbo, Jemin, Lee, Joonho, Dosovitskiy, Alexey, Bellicoso, Dario, Tsounis, Vassilios, Koltun,\nVladlen, and Hutter, Marco. Learning agile and dynamic motor skills for legged robots. Science\nRobotics, 4(26):eaau5872, 2019.\n\nIbrahim, Cyril, Ramstedt, Simon, and Pal, Christopher. Avenue.\n\nelementai/avenue, 2019.\n\nhttps://github.com/\n\nKingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\nKonda, Vijay R and Tsitsiklis, John N. Actor-critic algorithms. In Advances in neural information\n\nprocessing systems, pp. 1008\u20131014, 2000.\n\nLillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval,\nSilver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:1509.02971, 2015.\n\nMnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\nSchulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael, and Moritz, Philipp. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889\u20131897, 2015.\n\nSilver, David, Lever, Guy, Heess, Nicolas, Degris, Thomas, Wierstra, Daan, and Riedmiller, Martin. Deterministic policy gradient algorithms. In ICML, 2014.\n\nSilver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.\n\nTallec, Corentin, Blier, L\u00e9onard, and Ollivier, Yann. Making deep q-learning methods robust to time discretization. arXiv preprint arXiv:1901.09732, 2019.\n\nTesauro, Gerald. Td-gammon, a self-teaching backgammon program, achieves master-level play. Neural computation, 6(2):215\u2013219, 1994.\n\nTodorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In IROS, pp. 5026\u20135033. IEEE, 2012. ISBN 978-1-4673-1737-5. URL http://dblp.uni-trier.de/db/conf/iros/iros2012.html#TodorovET12.\n\nTravnik, Jaden B., Mathewson, Kory W., Sutton, Richard S., and Pilarski, Patrick M. Reactive reinforcement learning in asynchronous environments. Frontiers in Robotics and AI, 5:79, 2018. ISSN 2296-9144. doi: 10.3389/frobt.2018.00079. URL https://www.frontiersin.org/article/10.3389/frobt.2018.00079.\n\nvan Hasselt, Hado P, Guez, Arthur, Hessel, Matteo, Mnih, Volodymyr, and Silver, David. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 
4287\u20134295, 2016.\n\nWalsh, Thomas J., Nouri, Ali, Li, Lihong, and Littman, Michael L. Learning and planning in environments with delayed feedback. Autonomous Agents and Multi-Agent Systems, 18:83\u2013105, 2008.", "award": [], "sourceid": 1748, "authors": [{"given_name": "Simon", "family_name": "Ramstedt", "institution": "Mila"}, {"given_name": "Chris", "family_name": "Pal", "institution": "Montreal Institute for Learning Algorithms, \u00c9cole Polytechnique, Universit\u00e9 de Montr\u00e9al"}]}