{"title": "Temporal Regularization for Markov Decision Process", "book": "Advances in Neural Information Processing Systems", "page_first": 1779, "page_last": 1789, "abstract": "Several applications of Reinforcement Learning suffer from instability due to high\nvariance. This is especially prevalent in high dimensional domains. Regularization\nis a commonly used technique in machine learning to reduce variance, at the cost\nof introducing some bias. Most existing regularization techniques focus on spatial\n(perceptual) regularization. Yet in reinforcement learning, due to the nature of the\nBellman equation, there is an opportunity to also exploit temporal regularization\nbased on smoothness in value estimates over trajectories. This paper explores a\nclass of methods for temporal regularization. We formally characterize the bias\ninduced by this technique using Markov chain concepts. We illustrate the various\ncharacteristics of temporal regularization via a sequence of simple discrete and\ncontinuous MDPs, and show that the technique provides improvement even in\nhigh-dimensional Atari games.", "full_text": "Temporal Regularization in Markov Decision Process\n\nPierre Thodoroff\nMcGill University\n\npierre.thodoroff@mail.mcgill.ca\n\nAudrey Durand\nMcGill University\n\naudrey.durand@mcgill.ca\n\nJoelle Pineau\n\nMcGill University & Facebook AI Research\n\njpineau@cs.mcgill.ca\n\nDoina Precup\n\nMcGill University\n\ndprecup@cs.mcgill.ca\n\nAbstract\n\nSeveral applications of Reinforcement Learning suffer from instability due to high\nvariance. This is especially prevalent in high dimensional domains. Regularization\nis a commonly used technique in machine learning to reduce variance, at the cost\nof introducing some bias. Most existing regularization techniques focus on spatial\n(perceptual) regularization. 
Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.

1 Introduction

There has been much progress in Reinforcement Learning (RL) techniques, with some impressive successes in games [30], and several interesting applications on the horizon [17, 29, 26, 9]. However, RL methods are too often hampered by high variance, whether due to randomness in data collection, effects of initial conditions, complexity of the learner's function class, hyper-parameter configuration, or sparsity of the reward signal [15]. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some (smaller) bias. Regularization typically takes the form of smoothing over the observation space to reduce the complexity of the learner's hypothesis class.

In the RL setting, we have an interesting opportunity to consider an alternative form of regularization, namely temporal regularization. Effectively, temporal regularization considers smoothing over the trajectory, whereby the estimate of the value function at one state is assumed to be related to the value function at the state(s) that typically occur before it in the trajectory. This structure arises naturally out of the fact that the value at each state is estimated using the Bellman equation. The standard Bellman equation clearly defines the dependency between value estimates. 
In temporal regularization, we amplify this dependency by making each state depend more strongly on estimates of previous states, as opposed to multi-step methods, which consider future states.

This paper proposes a class of temporally regularized value function estimates. We discuss properties of these estimates, based on notions from Markov chains, under the policy evaluation setting, and extend the notion to the control case. Our experiments show that temporal regularization effectively reduces variance and estimation error in discrete and continuous MDPs. The experiments also highlight that regularizing in the time domain rather than in the spatial domain allows more robustness to cases where state features are misspecified or noisy, as is the case in some Atari games.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Related work

Regularization in RL has been considered via several different perspectives. One line of investigation focuses on regularizing the features learned on the state space [11, 25, 24, 10, 21, 14]. In particular, backward bootstrapping methods can be seen as regularizing in feature space based on temporal proximity [34, 20, 1]. These approaches assume that nearby states in the state space have similar values. Other works focus on regularizing changes in the policy directly. Those approaches are often based on entropy methods [23, 28, 2]. Explicit regularization in the temporal space has received much less attention. Temporal regularization in some sense may be seen as a "backward" multi-step method [32]. The closest work to ours is possibly [36], which defines the natural value approximator by projecting previous state estimates, adjusting for the reward and γ. Their formulation, while sharing a similar motivation, leads to a different theory and algorithm. 
Convergence properties and the bias induced by this class of methods were also not analyzed in Xu et al. [36].

3 Technical Background

3.1 Markov chains

We begin by introducing discrete Markov chain concepts that will be used to study the properties of temporally regularized MDPs. A discrete-time Markov chain [19] is defined by a discrete set of states S and a transition function P : S × S → [0, 1], which can also be written in matrix form as P_ij = P(i|j). Throughout the paper, we make the following mild assumption on the Markov chain:

Assumption 1. The Markov chain P is ergodic: P has a unique stationary distribution μ.

In Markov chain theory, one of the main challenges is to study the mixing time of the chain [19]. Several results have been obtained for chains that are reversible, that is, chains that satisfy detailed balance.

Definition 1 (Detailed balance [16]). Let P be an irreducible Markov chain with invariant stationary distribution μ.¹ The chain is said to satisfy detailed balance if and only if

    μ_i P_ij = μ_j P_ji    ∀i, j ∈ S.    (1)

Intuitively this means that if we start the chain in its stationary distribution, the amount of probability that flows from i to j is equal to the amount that flows from j to i. In other words, the system must be at equilibrium. An intuitive example of a physical system not satisfying detailed balance is a snowflake melting in a coffee. Indeed, many chains do not satisfy this detailed balance property. In this case it is possible to use a different, but related, chain called the reversal Markov chain to infer mixing time bounds [7].

Definition 2 (Reversal Markov chain [16]). 
Let P̃, the reversal Markov chain of P, be defined as:

    P̃_ij = μ_j P_ji / μ_i    ∀i, j ∈ S.    (2)

If P is irreducible with invariant distribution μ, then P̃ is also irreducible with invariant distribution μ. The reversal Markov chain P̃ can be interpreted as the Markov chain P with time running backwards. If the chain is reversible, then P = P̃.

¹ μ_i denotes the ith element of μ.

3.2 Markov Decision Process

A Markov Decision Process (MDP), as defined in [27], consists of a discrete set of states S, a transition function P : S × A × S → [0, 1], and a reward function r : S × A → R. On each round t, the learner observes the current state s_t ∈ S and selects an action a_t ∈ A, after which it receives reward r_t = r(s_t, a_t) and moves to a new state s_{t+1} ∼ P(·|s_t, a_t). We define a stationary policy π as a probability distribution over actions conditioned on states, π : S × A → [0, 1].

3.2.1 Discounted Markov Decision Process

When performing policy evaluation in the discounted case, the goal is to estimate the discounted expected return of policy π at a state s ∈ S, v^π(s) = E_π[Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s], with discount factor γ ∈ [0, 1). This v^π can be obtained as the fixed point of the Bellman operator T^π such that:

    T^π v^π = r^π + γ P^π v^π,    (3)

where P^π denotes the |S|×|S| transition matrix under policy π, v^π is the state-value column vector, and r^π is the reward column vector. The matrix P^π also defines a Markov chain.

In the control case, the goal is to find the optimal policy π* that maximizes the discounted expected return. 
Under the optimal policy, the optimal value function v* is the fixed point of the non-linear optimal Bellman operator:

    T* v* = max_{a∈A} [r(a) + γ P(a) v*].    (4)

4 Temporal regularization

Regularization in the feature/state space, or spatial regularization as we call it, exploits the regularities that exist in the observation (or state). In contrast, temporal regularization considers the temporal structure of the value estimates through a trajectory. Practically this is done by smoothing the value estimate of a state using estimates of states that occurred earlier in the trajectory. In this section we first introduce the concept of temporal regularization and discuss its properties in the policy evaluation setting. We then show how this concept can be extended to exploit information from the entire trajectory by casting temporal regularization as a time series prediction problem.

Let us focus on the simplest case, where the value estimate at the current state is regularized using only the value estimate at the previous state in the trajectory, yielding updates of the form:

    v_β(s_t) = E_{s_{t+1}, s_{t−1} ∼ π} [r(s_t) + γ((1 − β)v_β(s_{t+1}) + βv_β(s_{t−1}))]
             = r(s_t) + γ(1 − β) Σ_{s_{t+1}∈S} p(s_{t+1}|s_t) v_β(s_{t+1}) + γβ Σ_{s_{t−1}∈S} [p(s_t|s_{t−1}) p(s_{t−1}) / p(s_t)] v_β(s_{t−1}),    (5)

for a parameter β ∈ [0, 1] and p(s_{t+1}|s_t) the transition probability induced by the policy π. This can be rewritten in matrix form as v_β = r + γ((1 − β)P^π + βP̃^π)v_β, where P̃^π corresponds to the reversal Markov chain of the MDP. To alleviate the notation, we denote P^π as P and P̃^π as P̃. 
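As a minimal NumPy sketch (function names are ours, not from the paper), the reversal chain of Definition 2 and the matrix-form regularized values above can be computed directly; for small state spaces the fixed point v_β = r + γ((1 − β)P + βP̃)v_β is just a linear solve:

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of a row-stochastic P for eigenvalue 1, normalized to sum to 1."""
    w, V = np.linalg.eig(P.T)
    mu = np.real(V[:, np.argmax(np.real(w))])
    return mu / mu.sum()

def reversal_chain(P, mu):
    """Reversal Markov chain of Definition 2: P̃[i, j] = mu[j] * P[j, i] / mu[i]."""
    return (mu[None, :] * P.T) / mu[:, None]

def regularized_values(P, r, gamma=0.9, beta=0.5):
    """Fixed point of v_beta = r + gamma * ((1 - beta) * P + beta * P̃) @ v_beta."""
    mu = stationary_distribution(P)
    M = (1 - beta) * P + beta * reversal_chain(P, mu)
    return np.linalg.solve(np.eye(len(r)) - gamma * M, r)

# Random ergodic chain and rewards, as in the synthetic experiments.
rng = np.random.default_rng(0)
P = rng.random((10, 10)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(10)
v_beta = regularized_values(P, r, gamma=0.9, beta=0.5)
v = np.linalg.solve(np.eye(10) - 0.9 * P, r)  # beta = 0 recovers the usual v^pi
```

Note the code uses the row-stochastic convention (rows index the current state), consistent with the row-stochasticity argument in the contraction proof below.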
We define the temporally regularized Bellman operator as:

    T^π_β v_β = r + γ((1 − β)P v_β + βP̃ v_β).    (6)

Remark. For β = 0, Eq. 6 corresponds to the original Bellman operator.

We can prove that this operator has the following property.

Theorem 1. The operator T^π_β has a unique fixed point v^π_β, and T^π_β is a contraction mapping.

Proof. We first prove that T^π_β is a contraction mapping in the L∞ norm. We have that

    ‖T^π_β u − T^π_β v‖_∞ = ‖r + γ((1 − β)Pu + βP̃u) − (r + γ((1 − β)Pv + βP̃v))‖_∞
                          = γ ‖((1 − β)P + βP̃)(u − v)‖_∞
                          ≤ γ ‖u − v‖_∞,    (7)

where the last inequality uses the fact that a convex combination of two row-stochastic matrices is also row stochastic (the proof can be found in the appendix). Then, using the Banach fixed point theorem, we obtain that v^π_β is the unique fixed point.

Furthermore, the new induced Markov chain (1 − β)P + βP̃ has the same stationary distribution as the original P (the proof can be found in the appendix).

Lemma 1. P and (1 − β)P + βP̃ have the same stationary distribution μ, ∀β ∈ [0, 1].

In the policy evaluation setting, the bias between the original value function v^π and the regularized one v^π_β can be characterized as a function of the difference between P and its Markov reversal P̃, weighted by β and the reward distribution.

Proposition 1. Let v^π = Σ_{i=0}^∞ γ^i P^i r and v^π_β = Σ_{i=0}^∞ γ^i ((1 − β)P + βP̃)^i r. 
We have that

    ‖v^π − v^π_β‖_∞ = ‖Σ_{i=0}^∞ γ^i (P^i − ((1 − β)P + βP̃)^i) r‖_∞ ≤ Σ_{i=0}^∞ γ^i ‖(P^i − ((1 − β)P + βP̃)^i) r‖_∞.    (8)

This quantity is naturally bounded for γ < 1.

Remark. Let P^∞ denote a matrix whose columns consist of the stationary distribution μ. By the property of reversal Markov chains and Lemma 1, we have that lim_{i→∞} ‖P^i r − P^∞ r‖ → 0 and lim_{i→∞} ‖((1 − β)P + βP̃)^i r − P^∞ r‖ → 0, such that the Markov chain P and the mixture (1 − β)P + βP̃ converge to the same limit. Therefore, the norm ‖(P^i − ((1 − β)P + βP̃)^i) r‖_p also converges to 0 in the limit.

Remark. It can be interesting to note that if the chain is reversible, meaning that P = P̃, then the fixed point of both operators is the same, that is, v^π = v^π_β.

Discounted average reward case: The temporally regularized MDP has the same discounted average reward as the original one, as it is possible to define the discounted average reward [35] as a function of the stationary distribution μ, the reward vector, and γ. This leads to the following property (the proof can be found in the appendix).

Proposition 2. 
For a reward vector r, the MDPs defined by the transition matrices P and (1 − β)P + βP̃ have the same average reward ρ.

Intuitively, this means that temporal regularization only reweighs the reward on each state based on the Markov reversal, while preserving the average reward.

Temporal regularization as a time series prediction problem: It is possible to cast this problem of temporal regularization as a time series prediction problem, and use richer models of temporal dependencies, such as exponential smoothing [12], ARMA models [5], etc. We can write the update in a general form using n different regularizers (ṽ_0, ṽ_1, ..., ṽ_{n−1}):

    v(s_t) = r(s_t) + γ Σ_{i=0}^{n−1} β(i) ṽ_i(s_{t+1}),    (9)

where ṽ_0(s_{t+1}) = v(s_{t+1}) and Σ_{i=0}^{n−1} β(i) = 1. For example, using exponential smoothing, where ṽ(s_{t+1}) = (1 − λ)v(s_{t−1}) + (1 − λ)λ v(s_{t−2}) + ..., the update can be written in operator form as:

    T^π_β v = r + γ( (1 − β) P v + β (1 − λ) Σ_{i=1}^∞ λ^{i−1} P̃^i v ),    (10)

and a similar argument as in Theorem 1 can be used to show the contraction property. The bias of exponential smoothing in policy evaluation can be characterized as:

    ‖v^π − v^π_β‖_∞ ≤ Σ_{i=0}^∞ γ^i ‖(P^i − ((1 − β)P + β(1 − λ) Σ_{j=1}^∞ λ^{j−1} P̃^j)^i) r‖_∞.    (11)

Using more powerful regularizers could be beneficial, for example to reduce variance by smoothing over more values (exponential smoothing) or to model the trend of the value function through the trajectory using a trend-adjusted model [13]. An example of policy evaluation with temporal regularization using exponential smoothing is provided in Algorithm 1.

Algorithm 1 Policy evaluation with exponential smoothing
1: Input: π, α, γ, β, λ
2: p = v(s)
3: for all steps do
4:   Choose a ∼ π(s)
5:   Take action a, observe reward r(s) and next state s′
6:   v(s) = v(s) + α(r(s) + γ((1 − β)v(s′) + βp) − v(s))
7:   p = (1 − λ)v(s) + λp
8: end for

Control case: Temporal regularization can be extended to MDPs with actions by modifying the target of the value function (or the Q values) using temporal regularization. The experiments (Sec. 5.6) present an example of how temporal regularization can be applied within an actor-critic framework. The theoretical analysis of the control case is outside the scope of this paper.

Temporal difference with function approximation: It is also possible to extend temporal regularization to function approximation, such as semi-gradient TD [33]. Assuming a function v^β_θ parameterized by θ, we can consider r(s) + γ((1 − β)v^β_θ(s_{t+1}) + βv^β_θ(s_{t−1})) − v^β_θ(s_t) as the target and differentiate with respect to v^β_θ(s_t). 
An example of a temporally regularized semi-gradient TD algorithm can be found in the appendix.

5 Experiments

We now present empirical results illustrating potential advantages of temporal regularization, and characterizing its bias and variance effects on value estimation and control.

5.1 Mixing time

This first experiment showcases that the underlying Markov chain of an MDP can have a smaller mixing time when temporally regularized. The mixing time can be seen as the number of time steps required for the Markov chain to get close enough to its stationary distribution. Therefore, the mixing time also determines the rate at which policy evaluation will converge to the optimal value function [3]. We consider a synthetic MDP with 10 states where transition probabilities are sampled from the uniform distribution. Let P^∞ denote a matrix whose columns consist of the stationary distribution μ. To compare mixing times, we evaluate the error corresponding to the distance of P^i and ((1 − β)P + βP̃)^i to the convergence point P^∞ after i iterations. Figure 1 displays the error curve when varying the regularization parameter β. We observe a U-shaped error curve, showing that intermediate values of β in this example yield a faster mixing time. One explanation is that transition matrices with extreme probabilities (low or high) yield poorly conditioned transition matrices. Regularizing with the reversal Markov chain often leads to a better conditioned matrix at the cost of injecting bias.

5.2 Bias

It is well known that reducing variance comes at the expense of inducing (smaller) bias. This has been characterized previously (Sec. 4) in terms of the difference between the original Markov chain and the reversal, weighted by the reward. In this experiment, we attempt to give an intuitive idea of what this means. 
More specifically, we would expect the bias to be small if states along the trajectory have similar values. To this end, we consider a synthetic MDP with 10 states where both transition functions and rewards are sampled randomly from a uniform distribution. In order to create temporal dependencies in the trajectory, we smooth the rewards of N states that are temporally close (in terms of trajectory) using the following formula: r(s_t) = (r(s_t) + r(s_{t+1})) / 2. Figure 2 shows the difference between the regularized and un-regularized MDPs as N changes, for different values of the regularization parameter β. We observe that increasing N, meaning more states get rewards close to one another, results in less bias. This is due to rewards putting emphasis on states where the original and reversal Markov chains are similar.

Figure 1: Distance between the stationary transition probabilities and the estimated transition probability for different values of regularization parameter β.

Figure 2: Mean difference between v^π_β and v^π given the regularization parameter β, for different amounts of smoothed states N.

Figure 3: Synthetic MDP where state S1 has high variance.

Figure 4: Left plot shows the absolute difference between the original (v^π(S1)) and regularized (v^π_β(S1)) state value estimates and the optimal value v*(S1). Right plot shows the variance of the estimates v.

5.3 Variance

The primary motivation of this work is to reduce variance, therefore we now consider an experiment targeting this aspect. Figure 3 shows an example of a synthetic, 3-state MDP, where the variance of S1 is (relatively) high. We consider an agent that is evolving in this world, changing states following the stochastic policy indicated. 
We are interested in the error when estimating the optimal state value of S1, v*(S1), with and without temporal regularization, denoted v^π_β(S1) and v^π(S1), respectively. Figure 4 shows these errors at each iteration, averaged over 100 runs. We observe that temporal regularization indeed reduces the variance and thus helps the learning process by making the value function easier to learn.

5.4 Propagation of the information

We now illustrate with a simple experiment how temporal regularization allows the information to spread faster among the different states of the MDP. For this purpose, we consider a simple MDP, where an agent walks randomly in two rooms (18 states) using four actions (up, down, left, right), and a discount factor γ = 0.9. The reward is r_t = 1 everywhere and passing the door between rooms (shown in red on Figure 5) only happens 50% of the time (on attempt). The episode starts at the top left and terminates when the agent reaches the bottom right corner. The sole goal is to learn the optimal value function by walking along this MDP (this is not a race toward the end).

Figure 5: Proximity of the estimated state value to the optimal value after N trajectories. Top row is the original room environment and bottom row is the regularized one (β = 0.5). Darker is better.

Figure 5 shows the proximity of the estimated state value to the optimal value with and without temporal regularization. The darker the state, the closer it is to its optimal value. The heatmap scale has been adjusted at each trajectory to observe the difference between both methods. We first notice that the overall propagation of the information in the regularized MDP is faster than in the original one. 
We also observe that, when first entering the second room, bootstrapping on values coming from the first room allows the agent to learn the optimal value faster. This suggests that temporal regularization could help agents explore faster, by using the prior from previously visited states to learn the corresponding optimal values faster. It is also possible to consider more complex and powerful regularizers. Let us study a different time series prediction model, namely exponential averaging, as defined in (10). The complexity of such models is usually articulated by hyper-parameters, allowing complex models to improve performance by better adapting to problems. We illustrate this by comparing the performance of regularization using the previous state and an exponential averaging of all previous states. Fig. 6 shows the average error on the value estimate using past-state smoothing, exponential smoothing, and no smoothing. In this setting, exponential smoothing transfers information faster, thus enabling faster convergence to the true value.

Figure 6: Benefits of complex regularizers on the room domain.

5.5 Noisy state representation

The next experiment illustrates a major strength of temporal regularization, that is, its robustness to noise in the state representation. This situation can naturally arise when the state sensors are noisy or insufficient to avoid aliasing. For this task, we consider a synthetic, one-dimensional, continuous setting. A learner evolving in this environment walks randomly along a line with a discount factor γ = 0.95. Let x_t ∈ [0, 1] denote the position of the agent along the line at time t. The next position is x_{t+1} = x_t + a_t, where the action a_t ∼ N(0, 0.05). 
The state of the agent corresponds to the position perturbed by a zero-centered Gaussian noise ε_t, such that s_t = x_t + ε_t, where the ε_t ∼ N(0, σ²) are i.i.d. When the agent moves to a new position x_{t+1}, it receives a reward r_t = x_{t+1}. The episode ends after 1000 steps. In this experiment we model the value function using a linear model with a single parameter θ. We are interested in the error when estimating the optimal parameter θ* with and without temporal regularization, that is θ^π_β and θ^π, respectively. In this case we use the TD version of temporal regularization presented at the end of Sec. 4. Figure 7 shows these errors, averaged over 1000 repetitions, for different values of the noise variance σ². We observe that as the noise variance increases, the un-regularized estimate becomes less accurate, while temporal regularization is more robust. Using a more complex regularizer can improve performance, as shown in the previous section, but this potential gain comes at the price of a potential loss in case of model misfit. Fig. 8 shows the absolute distance from the regularized state estimate (using exponential smoothing) to the optimal value while varying λ (higher λ = more smoothing). Increasing smoothing improves performance up to some point, but when λ is not well fit the bias becomes too strong and performance declines. This is a classic bias-variance tradeoff.

Figure 7: Absolute distance from the original (θ^π) and the regularized (θ^π_β) state value estimates to the optimal parameter θ* given the noise variance σ² in state sensors.

Figure 8: Impact of complex regularizer parameterization (λ) on the noisy walk using exponential smoothing. 
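The one-step TD variant from the end of Sec. 4 applied to this noisy walk can be sketched as follows. This is our own illustrative reconstruction, not the paper's code: the boundary handling, step sizes, and clipping of the walk to [0, 1] are assumptions.

```python
import numpy as np

def noisy_walk_td(beta=0.5, gamma=0.95, alpha=0.01, sigma=0.1, steps=1000, seed=0):
    """Semi-gradient TD(0) with one-step temporal regularization on the noisy walk.
    Value model: v_theta(s) = theta * s (a single linear parameter).
    The target mixes the next-state and previous-state estimates with weight beta."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    x = rng.random()                      # true (hidden) position on [0, 1]
    s_prev = x + rng.normal(0, sigma)     # previous noisy observation (init choice is ours)
    s = s_prev
    for _ in range(steps):
        x_next = min(max(x + rng.normal(0, 0.05), 0.0), 1.0)
        s_next = x_next + rng.normal(0, sigma)
        r = x_next                        # reward is the new (true) position
        # temporally regularized target: r + gamma*((1-beta)*v(s') + beta*v(s_prev))
        target = r + gamma * ((1 - beta) * theta * s_next + beta * theta * s_prev)
        theta += alpha * (target - theta * s) * s   # semi-gradient update
        x, s_prev, s = x_next, s, s_next
    return theta

theta_reg = noisy_walk_td(beta=0.5)   # temporally regularized estimate
theta_td = noisy_walk_td(beta=0.0)    # plain TD(0) baseline
```

Comparing `theta_reg` and `theta_td` to the noise-free solution over many seeds and noise levels would reproduce the qualitative shape of Figure 7.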
This experiment highlights a case where temporal regularization is effective even in the absence of smoothness in the state space (which other regularization methods would target). This is further highlighted in the next experiments.

5.6 Deep reinforcement learning

To showcase the potential of temporal regularization in high dimensional settings, we adapt an actor-critic based method (PPO [28]) using temporal regularization. More specifically, we incorporate temporal regularization as exponential smoothing in the target of the critic. PPO uses the generalized advantage estimator Â_t = δ_t + γλδ_{t+1} + ... + (γλ)^{T−t+1}δ_T, where δ_t = r_t + γv(s_{t+1}) − v(s_t). We regularize δ_t such that δ^β_t = r_t + γ((1 − β)v(s_{t+1}) + βṽ(s_{t−1})) − v(s_t), using exponential smoothing ṽ(s_t) = (1 − λ)v(s_t) + λṽ(s_{t−1}) as described in Eq. (10). ṽ is an exponentially decaying sum over all previous state values encountered in the trajectory. We evaluate the performance in the Arcade Learning Environment [4], where we consider the following performance measure:

    (regularized − baseline) / (baseline − random).    (12)

The hyper-parameters for the temporal regularization are β = λ = 0.2, with a decay of 1e−5. These were selected using 7 games and 3 training seeds. All other hyper-parameters correspond to the ones used in the PPO paper. Our implementation² is based on the publicly available OpenAI codebase [8]. The previous four frames are considered as the state representation [22]. For each game, 10 independent runs (10 random seeds) are performed.

The results reported in Figure 9 show that adding temporal regularization improves the performance on multiple games. 
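The regularized TD errors δ^β_t used in the critic target above can be sketched as follows (a minimal reconstruction with our own function names; the initialization of the smoothed value at the first step of a trajectory is our assumption, as the paper does not specify it):

```python
import numpy as np

def regularized_deltas(rewards, values, gamma=0.99, beta=0.2, lam_smooth=0.2):
    """TD errors with temporally regularized targets:
       delta_t^beta = r_t + gamma*((1-beta)*v(s_{t+1}) + beta*vtilde(s_{t-1})) - v(s_t),
       where vtilde is the exponential average of past value estimates.
       `values` has length len(rewards) + 1 (it includes the bootstrap value)."""
    T = len(rewards)
    deltas = np.empty(T)
    vtilde = values[0]  # smoothed estimate; initialized at the first state (our choice)
    for t in range(T):
        deltas[t] = rewards[t] + gamma * ((1 - beta) * values[t + 1]
                                          + beta * vtilde) - values[t]
        vtilde = (1 - lam_smooth) * values[t] + lam_smooth * vtilde
    return deltas

def gae(deltas, gamma=0.99, lam=0.95):
    """Generalized advantage estimate accumulated backwards from (regularized) TD errors."""
    adv = np.zeros(len(deltas))
    acc = 0.0
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc
        adv[t] = acc
    return adv
```

With β = 0 this reduces to the standard PPO advantage computation, which is a convenient sanity check when wiring it into an existing codebase.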
This suggests that the regularized optimal value function may be smoother and thus easier to learn, even when using function approximation with deep learning. Also, as shown in previous experiments (Sec. 5.5), temporal regularization being independent of the spatial representation makes it more robust to mis-specification of the state features, which is a challenge in some of these games (e.g. when assuming a full state representation using some previous frames).

² The code can be found at https://github.com/pierthodo/temporal_regularization.

Figure 9: Performance (Eq. 12) of a temporally regularized PPO on a suite of Atari games.

6 Discussion

Noisy states: It is often assumed that the full state can be determined, while in practice the Markov property rarely holds. This is the case, for example, when taking the four last frames to represent the state in Atari games [22]. A problem that arises when treating a partially observable MDP (POMDP) as fully observable is that it may no longer be possible to assume that the value function is smooth over the state space [31]. For example, the observed features may be similar for two states that are intrinsically different, leading to highly different values for states that are nearby in the state space. Previous experiments on noisy state representations (Sec. 5.5) and on the Atari games (Sec. 5.6) show that temporal regularization provides robustness in those cases. This makes it an appealing technique in real-world environments, where it is harder to provide the agent with the full state.

Choice of the regularization parameter: The bias induced by the regularization parameter β can be detrimental for the learning in the long run. 
A first attempt to mitigate this bias is simply to decay the regularization as learning advances, as is done in the deep learning experiment (Sec. 5.6). Among the different avenues that could be explored, an interesting one would be a state-dependent regularization. For example, in the tabular case, one could consider β as a function of the number of visits to a particular state.

Smoother objective: Previous work [18] looked at how the smoothness of the objective function relates to the convergence speed of RL algorithms. An analogy can be drawn with convex optimization, where the rate of convergence depends on the Lipschitz (smoothness) constant [6]. By smoothing the value temporally, we argue that the optimal value function can be made smoother. This would be beneficial in high-dimensional state spaces where the use of deep neural networks is required. This could explain the performance displayed using temporal regularization on Atari games (Sec. 5.6). The notion of temporal regularization is also behind multi-step methods [32]; it may be worthwhile to further explore how these methods are related.

Conclusion: This paper tackles the problem of regularization in RL from a new angle, that is, from a temporal perspective. In contrast with typical spatial regularization, where one assumes that rewards are close for nearby states in the state space, temporal regularization rather assumes that rewards are close for states visited closely in time. This approach allows information to propagate faster into states that are hard to reach, which could prove useful for exploration. 
The robustness of the proposed approach to noisy state representations and its interesting properties should motivate further work exploring novel ways of exploiting temporal information.

Acknowledgments

The authors wish to thank Pierre-Luc Bacon, Harsh Satija and Joshua Romoff for helpful discussions. Financial support was provided by NSERC and Facebook. This research was enabled by support provided by Compute Canada. We thank the reviewers for insightful comments and suggestions.

References

[1] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.

[2] P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.

[3] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

[4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[5] G. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control (3rd ed.). Prentice-Hall, 1994.

[6] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[7] K.-M. Chung, H. Lam, Z. Liu, and M. Mitzenmacher. Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified. arXiv preprint arXiv:1201.0559, 2012.

[8] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. OpenAI baselines. https://github.com/openai/baselines, 2017.

[9] B. Dhingra, L. Li, X. Li, J. Gao, Y.-N. Chen, F. Ahmed, and L. Deng.
Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 484–495, 2017.

[10] A.-m. Farahmand. Regularization in reinforcement learning. PhD thesis, University of Alberta, 2011.

[11] A.-m. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In American Control Conference, pages 725–730. IEEE, 2009.

[12] E. S. Gardner. Exponential smoothing: The state of the art – Part II. International Journal of Forecasting, 22(4):637–666, 2006.

[13] E. S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985.

[14] C. Harrigan. Deep reinforcement learning with regularized convolutional neural fitted Q iteration. 2016.

[15] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In AAAI, 2018.

[16] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Undergraduate Texts in Mathematics. Springer, 1976.

[17] K. Koedinger, E. Brunskill, R. Baker, and E. McLaughlin. New potentials for data-driven intelligent tutoring system development and optimization. AI Magazine, 2018.

[18] R. Laroche and H. van Seijen. In reinforcement learning, all objective functions are not equal. ICLR Workshop, 2018.

[19] D. A. Levin and Y. Peres. Markov chains and mixing times, volume 107. American Mathematical Society, 2008.

[20] L. Li. A worst-case comparison between temporal difference and residual gradient with linear function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 560–567. ACM, 2008.

[21] B. Liu, S. Mahadevan, and J. Liu. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems, pages 836–844, 2012.

[22] V.
Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[23] G. Neu, A. Jonsson, and V. Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

[24] J. Pazis and R. Parr. Non-parametric approximate linear programming for MDPs. In AAAI, 2011.

[25] M. Petrik, G. Taylor, R. Parr, and S. Zilberstein. Feature selection using regularization in approximate linear programs for Markov decision processes. arXiv preprint arXiv:1005.1860, 2010.

[26] N. Prasad, L. Cheng, C. Chivers, M. Draugelis, and B. Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. In UAI, 2017.

[27] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 1994.

[28] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[29] S. Shortreed, E. Laber, D. Lizotte, S. Stroup, J. Pineau, and S. Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 2011.

[30] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[31] S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning Proceedings 1994, pages 284–292. Elsevier, 1994.

[32] R. S. Sutton and A. G. Barto.
Reinforcement learning: An introduction. MIT Press, 1st edition, 1998.

[33] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2nd edition (in progress), 2017.

[34] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.

[35] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2-3):179–191, 2002.

[36] Z. Xu, J. Modayil, H. P. van Hasselt, A. Barreto, D. Silver, and T. Schaul. Natural value approximators: Learning when to trust past estimates. In Advances in Neural Information Processing Systems, pages 2117–2125, 2017.