{"title": "Natural Value Approximators: Learning when to Trust Past Estimates", "book": "Advances in Neural Information Processing Systems", "page_first": 2120, "page_last": 2128, "abstract": "Neural networks have a smooth initial inductive bias, such that small changes in input do not lead to large changes in output. However, in reinforcement learning domains with sparse rewards, value functions have non-smooth structure with a characteristic asymmetric discontinuity whenever rewards arrive. We propose a mechanism that learns an interpolation between a direct value estimate and a projected value estimate computed from the encountered reward and the previous estimate. This reduces the need to learn about discontinuities, and thus improves the value function approximation. Furthermore, as the interpolation is learned and state-dependent, our method can deal with heterogeneous observability. We demonstrate that this one change leads to significant improvements on multiple Atari games, when applied to the state-of-the-art A3C algorithm.", "full_text": "Natural Value Approximators:\n\nLearning when to Trust Past Estimates\n\nZhongwen Xu\n\nDeepMind\n\nzhongwen@google.com\n\nJoseph Modayil\n\nDeepMind\n\nmodayil@google.com\n\nHado van Hasselt\n\nDeepMind\n\nhado@google.com\n\nAndre Barreto\n\nDeepMind\n\nandrebarreto@google.com\n\nDavid Silver\nDeepMind\n\ndavidsilver@google.com\n\nTom Schaul\nDeepMind\n\nschaul@google.com\n\nAbstract\n\nNeural networks have a smooth initial inductive bias, such that small changes in\ninput do not lead to large changes in output. However, in reinforcement learning\ndomains with sparse rewards, value functions have non-smooth structure with\na characteristic asymmetric discontinuity whenever rewards arrive. We propose\na mechanism that learns an interpolation between a direct value estimate and a\nprojected value estimate computed from the encountered reward and the previous\nestimate. 
This reduces the need to learn about discontinuities, and thus improves\nthe value function approximation. Furthermore, as the interpolation is learned\nand state-dependent, our method can deal with heterogeneous observability. We\ndemonstrate that this one change leads to signi\ufb01cant improvements on multiple\nAtari games, when applied to the state-of-the-art A3C algorithm.\n\n1 Motivation\n\nThe central problem of reinforcement learning is value function approximation: how to accurately\nestimate the total future reward from a given state. Recent successes have used deep neural networks\nto approximate the value function, resulting in state-of-the-art performance in a variety of challenging\ndomains [9]. Neural networks are most effective when the desired target function is smooth. However,\nvalue functions are, by their very nature, discontinuous functions with sharp variations over time. In\nthis paper we introduce a representation of value that matches the natural temporal structure of value\nfunctions.\nA value function represents the expected sum of future discounted rewards. If non-zero rewards occur\ninfrequently but reliably, then an accurate prediction of the cumulative discounted reward rises as\nsuch rewarding moments approach and drops immediately after. This is depicted schematically with\nthe dashed black line in Figure 1. The true value function is quite smooth, except immediately after\nreceiving a reward when there is a sharp drop. This is a pervasive scenario because many domains\nassociate positive or negative reinforcements to salient events (like picking up an object, hitting a\nwall, or reaching a goal position). 
The problem is that the agent\u2019s observations tend to be smooth\nin time, so learning an accurate value estimate near those sharp drops puts strain on the function\napproximator \u2013 especially when employing differentiable function approximators such as neural\nnetworks that naturally make smooth maps from observations to outputs.\nTo address this problem, we incorporate the temporal structure of cumulative discounted rewards into\nthe value function itself. The main idea is that, by default, the value function can respect the reward\nsequence. If no reward is observed, then the next value smoothly matches the previous value, but\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: After the same amount of training, our proposed method (red) produces much more accurate\nestimates of the true value function (dashed black), compared to the baseline (blue). The main plot\nshows discounted future returns as a function of the step in a sequence of states; the inset plot shows\nthe RMSE when training on this data, as a function of network updates. See section 4 for details.\n\nbecomes a little larger due to the discount. If a reward is observed, it should be subtracted out from\nthe previous value: in other words a reward that was expected has now been consumed. The natural\nvalue approximator (NVA) combines the previous value with the observed rewards and discounts,\nwhich makes this sequence of values easy to represent by a smooth function approximator such as a\nneural network.\nNatural value approximators may also be helpful in partially observed environments. Consider a\nsituation in which an agent stands on a hill top. The goal is to predict, at each step, how many steps it\nwill take until the agent has crossed a valley to another hill top in the distance. 
There is fog in the valley, which means that if the agent's state is a single observation from the valley it will not be able to accurately predict how many steps remain. In contrast, the value estimate from the initial hill top may be much better, because the observation is richer. This case is depicted schematically in Figure 2. Natural value approximators may be effective in these situations, since they represent the current value in terms of previous value estimates.\n\n2 Problem definition\n\nWe consider the typical scenario studied in reinforcement learning, in which an agent interacts with an environment at discrete time intervals: at each time step t the agent selects an action as a function of the current state, which results in a transition to the next state and a reward. The goal of the agent is to maximize the discounted sum of rewards collected in the long run from a set of initial states [12]. The interaction between the agent and the environment is modelled as a Markov Decision Process (MDP). An MDP is a tuple (S, A, R, γ, P) where S is a state space, A is an action space, R : S × A × S → D(ℝ) is a reward function that defines a distribution over the reals for each combination of state, action, and subsequent state, P : S × A → D(S) defines a distribution over subsequent states for each state and action, and γ_t ∈ [0, 1] is a scalar, possibly time-dependent, discount factor. One common goal is to make accurate predictions under a behaviour policy π : S → D(A) of the value\n\nv_π(s) ≡ E [R_1 + γ_1 R_2 + γ_1 γ_2 R_3 + . . . | S_0 = s] .  (1)\n\nThe expectation is over the random variables A_t ∼ π(S_t), S_{t+1} ∼ P(S_t, A_t), and R_{t+1} ∼ R(S_t, A_t, S_{t+1}), ∀t ∈ ℕ+. For instance, the agent can repeatedly use these predictions to improve its policy. 
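For concreteness, the discounted return inside the expectation in equation (1), with its possibly time-dependent discount, can be computed from a sampled trajectory by a backward recursion. The following sketch is illustrative and not part of the paper; it assumes `rewards[i]` holds R_{i+1} and `discounts[i]` holds γ_{i+1}:

```python
def discounted_return(rewards, discounts):
    """Backward recursion for R_1 + g_1*R_2 + g_1*g_2*R_3 + ...
    rewards[i] is R_{i+1} and discounts[i] is gamma_{i+1} (illustrative layout)."""
    g = 0.0
    for r, gamma in zip(reversed(rewards), reversed(discounts)):
        g = r + gamma * g  # value-to-go: current reward plus discounted future
    return g
```

For a constant discount γ this reduces to the familiar sum Σ_i γ^{i−1} R_i.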
The values satisfy the recursive Bellman equation [2]\n\nv_π(s) = E [R_{t+1} + γ_{t+1} v_π(S_{t+1}) | S_t = s] .\n\nWe consider the common setting where the MDP is not known, and so the predictions must be learned from samples. The predictions are made by an approximate value function v(s; θ), where θ are the parameters that are learned. The approximation of the true value function can be formed by temporal difference (TD) learning [10], where the estimate at time t is updated towards\n\nZ^1_t ≡ R_{t+1} + γ_{t+1} v(S_{t+1}; θ)   or   Z^n_t ≡ Σ_{i=1}^{n} (Π_{k=1}^{i−1} γ_{t+k}) R_{t+i} + (Π_{k=1}^{n} γ_{t+k}) v(S_{t+n}; θ) ,  (2)\n\nwhere Z^n_t is the n-step bootstrap target, and the TD-error is δ^n_t ≡ Z^n_t − v(S_t; θ).\n\n3 Proposed solution: Natural value approximators\n\nThe conventional approach to value function approximation produces a value estimate from features associated with the current state. In states where the value approximation is poor, it can be better to rely more on a combination of the observed sequence of rewards and older but more reliable value estimates that are projected forward in time. Combining these estimates can potentially be more accurate than using one alone.\nThese ideas lead to an algorithm that produces three estimates of the value at time t. The first estimate, V_t ≡ v(S_t; θ), is a conventional value function estimate at time t. The second estimate,\n\nG^p_t ≡ (G^β_{t−1} − R_t) / γ_t   if γ_t > 0 and t > 0 ,  (3)\n\nis a projected value estimate computed from the previous value estimate, the observed reward, and the observed discount for time t. 
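As an aside, the n-step bootstrap target of equation (2) accumulates discounted rewards and then bootstraps on a value estimate; a minimal sketch (assuming the illustrative layout `rewards[i]` = R_{t+i+1} and `discounts[i]` = γ_{t+i+1}, which is our convention rather than the paper's):

```python
def n_step_target(rewards, discounts, v_bootstrap):
    """Z^n_t from equation (2), with n = len(rewards)."""
    z, c = 0.0, 1.0
    for r, gamma in zip(rewards, discounts):
        z += c * r        # reward weighted by the product of discounts so far
        c *= gamma
    return z + c * v_bootstrap  # bootstrap on v(S_{t+n}; theta)
```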
The third estimate,\n\nG^β_t ≡ β_t G^p_t + (1 − β_t) V_t = (1 − β_t) V_t + β_t (G^β_{t−1} − R_t) / γ_t ,  (4)\n\nis a convex combination of the first two estimates¹ formed by a time-dependent blending coefficient β_t. This coefficient is a learned function of state β(·; θ) : S → [0, 1], over the same parameters θ, and we denote β_t ≡ β(S_t; θ). We call G^β_t the natural value estimate at time t and we call the overall approach natural value approximators (NVA). Ideally, the natural value estimate will become more accurate than either of its constituents through training.\nThe value is learned by minimizing the sum of two losses. The first loss captures the difference between the conventional value estimate V_t and the target Z_t, weighted by how much it is used in the natural value estimate,\n\nJ_V ≡ E[ [[1 − β_t]] ([[Z_t]] − V_t)² ] ,  (5)\n\nwhere we introduce the stop-gradient identity function [[x]] = x that is defined to have a zero gradient everywhere, that is, gradients are not back-propagated through this function. The second loss captures the difference between the natural value estimate and the target, but it provides gradients only through the coefficient β_t,\n\nJ_β ≡ E[ ([[Z_t]] − (β_t [[G^p_t]] + (1 − β_t) [[V_t]]))² ] .  (6)\n\nThese two losses are summed into a joint loss,\n\nJ = J_V + c_β J_β ,  (7)\n\nwhere c_β is a scalar trade-off parameter. 
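The recursion in equations (3)-(4) and the per-sample losses (5)-(6) can be sketched as plain functions. This is an illustration under our own naming, not the paper's implementation, and it omits the gradient machinery: in an actual implementation the [[·]] terms are wrapped in stop-gradients, so that J_V trains only v and J_β trains only β.

```python
def natural_value_step(v_t, beta_t, g_prev, r_t, gamma_t):
    """One step of the NVA recursion, eqs (3)-(4)."""
    if gamma_t > 0.0 and g_prev is not None:
        g_proj = (g_prev - r_t) / gamma_t  # eq (3): consume reward, undo discount
    else:
        g_proj, beta_t = v_t, 0.0          # episode start: fall back on direct estimate
    g_beta = beta_t * g_proj + (1.0 - beta_t) * v_t  # eq (4): learned convex blend
    return g_beta, g_proj

def per_step_losses(z_t, v_t, beta_t, g_proj):
    """Sample versions of eqs (5)-(6); stop-gradients are implicit here."""
    j_v = (1.0 - beta_t) * (z_t - v_t) ** 2
    j_beta = (z_t - (beta_t * g_proj + (1.0 - beta_t) * v_t)) ** 2
    return j_v, j_beta
```

Unrolling `natural_value_step` along a trajectory yields the sequence of natural value estimates G^β_t.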
When conventional stochastic gradient descent is applied to minimize this loss, the parameters of V_t are adapted with the first loss and the parameters of β_t are adapted with the second loss.\nWhen bootstrapping on future values, the most accurate value estimate is best, so using G^β_t instead of V_t leads to refined prediction targets\n\nZ^{β,1}_t ≡ R_{t+1} + γ_{t+1} G^β_{t+1}   or   Z^{β,n}_t ≡ Σ_{i=1}^{n} (Π_{k=1}^{i−1} γ_{t+k}) R_{t+i} + (Π_{k=1}^{n} γ_{t+k}) G^β_{t+n} .  (8)\n\n4 Illustrative Examples\n\nWe now provide some examples of situations where natural value approximations are useful. In both examples, the value function is difficult to estimate well uniformly in all states we might care about, and the accuracy can be improved by using the natural value estimate G^β_t instead of the direct value estimate V_t.\n\n¹Note the mixed recursion in the definition: G^p depends on G^β, and vice versa.\n\nSparse rewards  Figure 1 shows an example of value function approximation. To separate concerns, this is a supervised learning setup (regression) with the true value targets provided (dashed black line). Each point 0 ≤ t ≤ 100 on the horizontal axis corresponds to one state S_t in a single sequence. The shape of the target values stems from a handful of reward events, and discounting with γ = 0.9. We mimic observations that smoothly vary across time by 4 equally spaced radial basis functions, so S_t ∈ ℝ⁴. The approximators v(s) and β(s) are two small neural networks with one hidden layer of 32 ReLU units each, and a single linear or sigmoid output unit, respectively. The input to β is augmented with the last k = 16 rewards. For the baseline experiment, we fix β_t = 0. The networks are trained for 5000 steps using Adam [5] with minibatch size 32. 
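The regression setup just described can be reproduced in outline as follows. The reward positions are hypothetical (the paper only states "a handful of reward events"), and we take the target at step t to be the discounted sum of rewards from t onward; the RBF width is likewise an assumption.

```python
import numpy as np

# Hypothetical reward positions; gamma = 0.9 as in the paper's toy example.
T, gamma = 101, 0.9
rewards = np.zeros(T)
rewards[[20, 45, 70, 90]] = 1.0

# Regression targets: discounted sum of rewards from step t onward (backward pass).
targets = np.zeros(T)
g = 0.0
for t in reversed(range(T)):
    g = rewards[t] + gamma * g
    targets[t] = g

# Smooth observations: 4 equally spaced radial basis functions over the step index.
centers = np.linspace(0.0, T - 1, 4)
steps = np.arange(T, dtype=float)[:, None]
features = np.exp(-0.5 * ((steps - centers[None, :]) / 25.0) ** 2)  # S_t in R^4
```

The targets rise as each reward event approaches and drop sharply right after it, while the features vary smoothly, which is exactly the mismatch the paper highlights.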
Because of the small capacity of the v-network, the baseline struggles to make accurate predictions and instead makes systematic errors that smooth over the characteristic peaks and drops in the value function. Natural value estimation obtains a ten times lower root mean squared error (RMSE), and it also closely matches the qualitative shape of the target function.\n\nHeterogeneous observability  Our approach is not limited to the sparse-reward setting. Imagine an agent that stands on the top of a hill. By looking into the distance, the agent may be able to predict how many steps it will take to reach the next hill top. When the agent starts descending the hill, it walks into fog in the valley between the hills. There, it can no longer see where it is. However, it could still determine how many steps remain until the next hill top by using the estimate from the first hill and then simply counting steps. This is exactly what the natural value estimate G^β_t will give us, assuming β_t = 1 on all steps in the fog. Figure 2 illustrates this example, where we assumed each step has a reward of −1 and the discount is one. The best observation-dependent value v(S_t) is shown in dashed blue. In the fog, the agent can do no better than to estimate the average number of steps from a foggy state until the next hill top. In contrast, the true value, shown in red, can be achieved exactly with natural value estimates. Note that in contrast to Figure 1, rewards are dense rather than sparse.\nIn both examples, we can sometimes trust past value estimates more than current ones, either because of function approximation error, as in the first example, or because of partial observability, as in the second.\n\nFigure 2: The value is the negative number of steps until reaching the destination at t = 100. In some parts of the state space, all states are aliased (in the fog). 
For these aliased states, the best estimate based only on immediate observations is a constant value (dashed blue line). Instead, if the agent relies on the value just before the fog and then decrements it by the encountered rewards, while ignoring observations, then the agent can match the true value (solid red line).\n\n5 Deep RL experiments\n\nIn this section, we integrate our method within A3C (Asynchronous advantage actor-critic [9]), a widely used deep RL agent architecture that uses a shared deep neural network to estimate both the policy π (actor) and a baseline value estimate v (critic). We modify it to use G^β_t estimates instead of the regular value baseline V_t. In the simplest, feed-forward variant, the network architecture is composed of three layers of convolutions, followed by a fully connected layer with output h, which feeds into the two separate heads (π with an additional softmax, and a scalar v; see the black components in the diagram below). The updates are done online with a buffer of the past 20-state transitions. The value targets are n-step targets Z^n_t (equation 2), where each n is chosen such that it bootstraps on the state at the end of the 20-state buffer. In addition, there is a loss contribution from the actor's policy gradient update on π. We refer the reader to [9] for details.\n\nTable 1: Mean and median human-normalized scores on 57 Atari games, for the A3C baseline and our method, using both evaluation metrics. 
N75 indicates the number of games that achieve at least 75% human performance.\n\nAgent | human starts: N75, median, mean | no-op starts: N75, median, mean\nA3C baseline | 28/57, 68.5%, 310.4% | 31/57, 91.6%, 334.0%\nA3C + NVA | 30/57, 93.5%, 373.3% | 32/57, 117.0%, 408.4%\n\nOur method differs from the baseline A3C setup in the form of the value estimator in the critic (G^β_t instead of V_t), the bootstrap targets (Z^{β,n}_t instead of Z^n_t), and the value loss (J instead of J_V), as discussed in section 3. The diagram on the right shows those new components in green; thick arrows denote functions with learnable parameters, thin ones without. In terms of the network architecture, we parametrize the blending coefficient β as a linear function of the hidden representation h concatenated with a window of past rewards R_{t−k:t}, followed by a sigmoid:\n\nβ(S_t; θ) ≡ γ_t / (1 + exp(θ_β^⊤ [h(S_t); R_{t−k:t}])) ,  (9)\n\nwhere θ_β are the parameters of the β head of the network, and we set k to 50. The extra factor of γ_t handles the otherwise undefined beginnings of episodes (when γ_0 = 0), and it ensures that the time-scale across which estimates can be projected forward cannot exceed the time-scale induced by the discounting².\nWe investigate the performance of natural value estimates on a collection of 57 video games from the Arcade Learning Environment [1], which has become a standard benchmark for Deep RL methods because of the rich diversity of challenges present in the various games. We train agents for 80 Million agent steps (320 Million Atari game frames) on a single machine with 16 cores, which corresponds to the number of frames denoted as '1 day on CPU' in the original A3C paper. All agents are run with one seed and a single, fixed set of hyper-parameters. 
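Equation (9) amounts to scaling a logistic gate by the current discount; a minimal sketch, where the shapes of θ_β, h, and the reward window are illustrative assumptions:

```python
import numpy as np

def blend_gate(h, reward_window, theta_beta, gamma_t):
    """beta(S_t; theta) = gamma_t / (1 + exp(theta_beta^T [h; R_{t-k:t}])), eq (9)."""
    x = np.concatenate([h, reward_window])  # hidden features plus recent rewards
    return gamma_t / (1.0 + np.exp(theta_beta @ x))
```

At γ_t = 0 (an episode start) the gate is forced to zero, so the natural value estimate falls back on the direct estimate V_t.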
Following [8], the performance of the final policy is evaluated under two modes, with a random number of no-ops at the start of each episode, and from randomized starting points taken from human trajectories.\n\n5.1 Results\n\nTable 1 summarizes the aggregate performance results across all 57 games, normalized by human performance. The evaluation results are presented under two different conditions: the human starts condition evaluates generalization to a different starting state distribution than the one used in training, and the no-op starts condition evaluates performance on the same starting state distribution that was used in training. We summarize normalized performance improvements in Figure 3. In the appendix, we provide full results for each game in Table 2 and Table 3. Across the board, we find that adding NVA improves the performance on a number of games, and improves the median normalized score by 25% or 25.4% for the respective evaluation metrics.\nThe second measure of interest is the change in value error when using natural value estimates; this is shown in Figure 4. The summary across all games is that the natural value estimates are more accurate, sometimes substantially so. Figure 4 also shows detailed plots from a few representative games, showing that large accuracy gaps between V_t and G^β_t lead to the learning of larger blending proportions β.\nThe fact that more accurate value estimates improve final performance on only some games should not be surprising, as they only directly affect the critic and affect the actor indirectly. It is also\n\n²This design choice may not be ideal in all circumstances; sometimes projecting old estimates further can perform better. Our variant, however, has the useful side-effect that the weight for the V_t update (Equation 5) is now greater than zero independently of β. 
This prevents one type of vicious cycle, where an initially inaccurate V_t leads to a large β, which in turn reduces the learning of V_t, and leads to an unrecoverable situation.\n\nFigure 3: The performance gains of the proposed architecture over the baseline system, with the performance normalized for each game with the formula (proposed − baseline) / (max(human, baseline) − random) used previously in the literature [15].\n\nunclear for how many games the bottleneck is value accuracy instead of exploration, memory, local optima, or sample efficiency.\n\n6 Variants\n\nWe explored a number of related variants on the subset of tuning games, with mostly negative results, and report our findings here, with the aim of adding some additional insight into what makes NVA work, and to prevent follow-up efforts from blindly repeating our mistakes.\n\nβ-capacity  We experimented with adding additional capacity to the β-network in Equation 9, namely inserting a hidden ReLU layer with n_h ∈ {16, 32, 64}; this neither helped nor hurt performance, so we opted for the simplest architecture (no hidden layer). We hypothesize that learning a binary gate is much easier than learning the value estimate, so no additional capacity is required.\n\nWeighted v-updates  We also validated the design choice of weighting the update to v by its usage (1 − β) (see Equation 5). On the 6 tuning games, weighting by usage obtains slightly higher performance than an unweighted loss on v. 
One hypothesis is that the weighting permits the direct estimates to be more accurate in some states than in others, freeing up function approximation capacity for where it is most needed.\n\nSemantic versus aggregate losses  Our proposed method separates the semantically different updates on β and v, but of course a simpler alternative would be to directly regress the natural value estimate G^β_t toward its target, and back-propagate the aggregate loss into both β and v jointly. This alternative performs substantially worse, empirically. We hypothesize one reason for this: in a state where G^p_t structurally over-estimates the target value, an aggregate loss will encourage v to compensate by under-estimating it. In contrast, the semantic losses encourage v to simply be more accurate, and then reduce β.\n\nTraining by back-propagation through time  The recursive form of Equation 4 lends itself to an implementation as a specific form of recurrent neural network, where the recurrent connection transmits a single scalar G^β_t. In this form, the system can be trained by back-propagation through time (BPTT [17]). 
This is semantically subtly different from our proposed method, as the gates β no longer make a local choice between V_t and G^p_t, but instead the entire sequence of β_{t−k} to β_t is trained to provide the best estimate G^β_t at time t (where k is the truncation horizon of BPTT).\n\nFigure 4: Reduction in value estimation error compared to the baseline. The proxies we use are the average squared TD-errors encountered during training, comparing Δ_v = ½ (Z_t − v(S_t; θ))² and Δ_β = ½ (Z_t − G^β_t)². Top: Summary graph for all games, showing the relative change in error (Δ_β − Δ_v)/Δ_v, averaged over the full training run. As expected, the natural value estimate consistently has equal or lower error, validating our core hypothesis. Bottom: Detailed plots on a handful of games (seaquest, time_pilot, up_n_down, surround), showing the direct estimate error Δ_v (blue) and the natural value estimate error Δ_β (red). In addition, the blending proportion β (cyan) adapts over time to use more of the prospective value estimate if that is more accurate.\n\n
We experimented with this variant as well: it also led to a clear improvement over the baseline, but its performance was substantially below the simpler feed-forward setup with the reward buffer in Equation 9 (median normalized scores of 78% and 103% for the human and no-op starts, respectively).\n\n7 Discussion\n\nRelation to eligibility traces  In TD(λ) [11], a well-known and successful variant of TD, the value function (1) is not learned by a one-step update, but instead relies on multiple value estimates from further in the future. Concretely, the target for the update of the estimate V_t is then G^λ_t, which can be defined recursively by G^λ_t = R_{t+1} + γ_{t+1} (1 − λ) V_{t+1} + γ_{t+1} λ G^λ_{t+1}, or as a mixture of several n-step targets [12]. The trace parameter λ is similar to our β parameter, but faces backwards in time rather than forwards.\nA quantity very similar to G^β_t was discussed by van Hasselt and Sutton [13], where this quantity was then used to update values prior to time t. The inspiration was similar, in the sense that it was acknowledged that G^β_t may be a more accurate target to use than either the Monte Carlo return or any single estimated state value. The use of G^β_t itself for online predictions, apart from using it as a target to update towards, was not yet investigated.\n\nExtension to action-values  There is no obstacle to extending our approach to estimators of action-values q(S_t, A_t; θ). One generalization from TD to SARSA is almost trivial. The quantity G^β_t then has the semantics of the value of action A_t in state S_t.\nIt is also possible to consider off-policy learning. Consider the Bellman optimality equation Q*(s, a) = E [R_{t+1} + γ_{t+1} max_{a′} Q*(S_{t+1}, a′)]. 
This implies that for the optimal value function Q*,\n\nE [ max_a Q*(S_t, a) ] = E [ (Q*(S_{t−1}, A_{t−1}) − R_t) / γ_t ] .\n\nThis implies that we may be able to use the quantity (Q(S_{t−1}, A_{t−1}) − R_t)/γ_t as an estimate for the greedy value max_a Q(S_t, a). For instance, we could blend the value as in SARSA, and define\n\nG^β_t = (1 − β_t) Q(S_t, A_t) + β_t (G^β_{t−1} − R_t) / γ_t .\n\nPerhaps we could require β_t = 0 whenever A_t ≠ argmax_a Q(S_t, a), in a similar vein to Watkins' Q(λ) [16], which zeros the eligibility trace for non-greedy actions. We leave this and other potential variants for more detailed consideration in future work.\n\nMemory  NVA adds a small amount of memory to the system (a single scalar), which raises the question of whether other forms of memory, such as the LSTM [4], provide a similar benefit. 
We do not have a conclusive answer, but the existing empirical evidence indicates that the benefit of natural value estimation goes beyond just memory. This can be seen by comparing to the A3C+LSTM baseline (also proposed in [9]), which has vastly larger memory and number of parameters, yet did not achieve equivalent performance (median normalized score of 81% for the human starts). To some extent this may be caused by the fact that recurrent neural networks are more difficult to optimize.\n\nRegularity and structure  Results from the supervised learning literature indicate that computing a reasonable approximation of a given target function is feasible when the learning algorithm exploits some kind of regularity in the latter [3]. For example, one may assume that the target function is bounded, smooth, or lies in a low-dimensional manifold. These assumptions are usually materialised in the choice of approximator. Making structural assumptions about the function to approximate is both a blessing and a curse. While a structural assumption makes it possible to compute an approximation with a reasonable amount of data, or using a smaller number of parameters, it can also compromise the quality of the solution from the outset. We believe that, while our method may not be the ideal structural assumption for the problem of approximating value functions, it is at least better than the smooth default.\n\nOnline learning  By construction, the natural value estimates are an online quantity that can only be computed from a trajectory. This means that the extension to experience replay [6] is not immediately obvious. It may be possible to replay trajectories, rather than individual transitions, or perhaps it suffices to use stale value estimates at previous states, which might still be of better quality than the current value estimate at the sampled state. 
We leave a full investigation of the combination of these\nmethods to future work.\n\nPredictions as state\nIn our proposed method the value is estimated in part as a function of a single\npast prediction, and this has some similarity to past work in predictive state representations [7].\nPredictive state representations are quite different in practice: their state consists of only predictions,\nthe predictions are of future observations and actions (not rewards), and their objective is to provide a\nsuf\ufb01cient representation of the full environmental dynamics. The similarities are not too strong with\nthe work proposed here, as we use a single prediction of the actual value, this prediction is used as a\nsmall but important part of the state, and the objective is to estimate only the value function.\n\n8 Conclusion\n\nThis paper argues that there is one speci\ufb01c structural regularity that underlies the value function\nof many reinforcement learning problems, which arises from the temporal nature of the problem.\nWe proposed natural value approximation, a method that learns how to combine a direct value\nestimate with ones projected from past estimates. It is effective and simple to implement, which\nwe demonstrated by augmenting the value critic in A3C, and which signi\ufb01cantly improved median\nperformance across 57 Atari games.\n\nAcknowledgements\n\nThe authors would like to thank Volodymyr Mnih for his suggestions and comments on the early\nversion of the paper, the anonymous reviewers for constructive suggestions to improve the paper. The\nauthors also thank the DeepMind team for setting up the environments and building helpful tools\nused in the paper.\n\n8\n\n\fReferences\n[1] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning\nenvironment: An evaluation platform for general agents. Journal of Arti\ufb01cial Intelligence\nResearch, 47:253\u2013279, 2013.\n\n[2] Richard Bellman. A Markovian decision process. 
Technical report, DTIC Document, 1957.\n\n[3] L\u00e1szl\u00f3 Gy\u00f6r\ufb01. A Distribution-Free Theory of Nonparametric Regression. Springer Science &\n\nBusiness Media, 2002.\n\n[4] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):\n\n1735\u20131780, 1997.\n\n[5] Diederik Kingma and Jimmy Ba. ADAM: A method for stochastic optimization. In ICLR, 2014.\n\n[6] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and\n\nteaching. Machine learning, 8(3-4):293\u2013321, 1992.\n\n[7] Michael L Littman, Richard S Sutton, and Satinder Singh. Predictive representations of state.\n\nIn NIPS, pages 1555\u20131562, 2002.\n\n[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G\nBellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Pe-\ntersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan\nWierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement\nlearning. Nature, 518(7540):529\u2013533, 2015.\n\n[9] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lilli-\ncrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep\nreinforcement learning. In ICML, pages 1928\u20131937, 2016.\n\n[10] Richard S Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, University\n\nof Massachusetts Amherst, 1984.\n\n[11] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning,\n\n3(1):9\u201344, 1988.\n\n[12] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1.\n\n1998.\n\n[13] Hado van Hasselt and Richard S. Sutton. Learning to predict independent of span. CoRR,\n\nabs/1508.04582, 2015.\n\n[14] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double\n\nQ-learning. 
In AAAI, pages 2094\u20132100, 2016.\n\n[15] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas.\nDueling network architectures for deep reinforcement learning. In ICML, pages 1995\u20132003,\n2016.\n\n[16] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis,\n\nUniversity of Cambridge England, 1989.\n\n[17] Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of\n\nthe IEEE, 78(10):1550\u20131560, 1990.\n\n9\n\n\f", "award": [], "sourceid": 1276, "authors": [{"given_name": "Zhongwen", "family_name": "Xu", "institution": "DeepMind"}, {"given_name": "Joseph", "family_name": "Modayil", "institution": "Deepmind"}, {"given_name": "Hado", "family_name": "van Hasselt", "institution": "DeepMind"}, {"given_name": "Andre", "family_name": "Barreto", "institution": "DeepMind"}, {"given_name": "David", "family_name": "Silver", "institution": "DeepMind"}, {"given_name": "Tom", "family_name": "Schaul", "institution": "DeepMind"}]}