{"title": "How fast to work: Response vigor, motivation and tonic dopamine", "book": "Advances in Neural Information Processing Systems", "page_first": 1019, "page_last": 1026, "abstract": "", "full_text": "How fast to work: Response vigor, motivation\n\nand tonic dopamine\n\nYael Niv1;2 Nathaniel D. Daw2\n\nPeter Dayan2\n\n1ICNC, Hebrew University, Jerusalem 2Gatsby Computational Neuroscience Unit, UCL\n\nyaelniv@alice.nc.huji.ac.il\n\nfdaw,dayang@gatsby.ucl.ac.uk\n\nAbstract\n\nReinforcement learning models have long promised to unify computa-\ntional, psychological and neural accounts of appetitively conditioned be-\nhavior. However, the bulk of data on animal conditioning comes from\nfree-operant experiments measuring how fast animals will work for rein-\nforcement. Existing reinforcement learning (RL) models are silent about\nthese tasks, because they lack any notion of vigor. They thus fail to ad-\ndress the simple observation that hungrier animals will work harder for\nfood, as well as stranger facts such as their sometimes greater produc-\ntivity even when working for irrelevant outcomes such as water. Here,\nwe develop an RL framework for free-operant behavior, suggesting that\nsubjects choose how vigorously to perform selected actions by optimally\nbalancing the costs and bene\ufb01ts of quick responding. Motivational states\nsuch as hunger shift these factors, skewing the tradeoff. This accounts\nnormatively for the effects of motivation on response rates, as well as\nmany other classic \ufb01ndings. Finally, we suggest that tonic levels of\ndopamine may be involved in the computation linking motivational state\nto optimal responding, thereby explaining the complex vigor-related ef-\nfects of pharmacological manipulation of dopamine.\n\n1\n\nIntroduction\n\nA banal, but nonetheless valid, behaviorist observation is that hungry animals work harder\nto get food [1]. However, associated with this observation are two stranger experimental\nfacts and a large theoretical failing. The \ufb01rst weird fact is that hungry animals will in some\ncircumstances work more vigorously even for motivationally irrelevant outcomes such as\nwater [2, 3], which seems highly counterintuitive. Second, contrary to the emphasis theo-\nretical accounts have placed on the effects of dopamine (DA) on learning to choose between\nactions, the most overt behavioral effects of DA interventions are similar swings in undi-\nrected vigor [4], at least part of which appear immediately, without learning [5]. Finally,\ncomputational theories fail to deliver on the close link they trumpet between DA, behavior,\nand reinforcement learning (RL; eg [6]), as they do not address the whole experimental\nparadigm of free-operant tasks [7], whence hail those and many other results.\n\nRather than the standard RL problem of discrete choices between alternatives at prespeci-\n\ufb01ed timesteps [8], free-operant experiments investigate tasks in which subjects pace their\nown responding (typically on a lever or other manipulandum). The primary choice in these\ntasks is of how rapidly/vigorously to behave, rather than what behavior to choose (as typi-\ncally only one relevant action is available). 
[Figure 1: three panels; (a) response rate per minute vs. seconds since reinforcement, hungry vs. sated rats; (b) responses/min vs. reinforcements per hour; (c) leverpresses in 30 minutes vs. FR schedule.]
Figure 1: (a) Leverpress (blue, right) and consummatory nose poke (red, left) response rates of rats leverpressing for food on a modified RI30 schedule. Hungry rats (open circles) clearly press the lever at a higher rate than sated rats (filled circles). Data from [11], averaged over 19 rats in each group. (b) The relationship between rate of responding and rate of reinforcement (reciprocal of the interval) on an RI schedule, is hyperbolic (of the form y = B·x/(x + x0)). This is an instantiation of Herrnstein's matching law for one response (adapted from [9]). (c) Total number of leverpresses per session averaged over five 30 minute sessions by rats pressing for food on different FR schedules. Rats with nucleus accumbens 6-OHDA dopamine lesions (gray) press significantly less than control rats (black), with the difference larger for higher ratio requirements. Adapted from [12].

Here, we address these issues by constructing an RL account of behavior rates in free-operant settings (Sections 2, 3). We consider optimal control in a continuous-time Markov Decision Process (MDP), in which agents must choose both an action and the latency with which to emit it (ie how vigorously, or at what instantaneous rate to perform it). Our model treats response vigor as being determined normatively, as the outcome of a battle between the cost of behaving more expeditiously and the benefit of achieving desirable outcomes more quickly. We show that this simple, normative framework captures many classic features of animal behavior that are obscure in our and others' earlier treatments (Section 4). These include the characteristic time-dependent profiles of response rates on tasks with different payoff scheduling [7], the hyperbolic relationship between response rate and payoff [9], and the difference in response rates between tasks in which reinforcements are allocated based on the number of responses emitted and those allocating reinforcements based on the passage of time [10].

A key feature of this model is that response rates are strongly dependent on the expected average reward rate, because this determines the opportunity cost of sloth. By influencing the value of reinforcers (and through this, the average reward rate), motivational states such as hunger influence the output response latencies (and not only response choice). Thus, in our model, hungry animals should optimally also work harder for water, since in typical circumstances, this should allow them to return more quickly to working for food. Further, we identify tonic levels of dopamine with the representation of average reward rate, and thereby suggest an account of a wealth of experiments showing that DA influences response vigor [4, 5], thus complementing existing ideas about the role of phasic DA signals in learned action selection (Section 5).
2 Free-operant behavior

We consider the free-operant scenario common in experimental psychology, in which an animal is placed in an experimental chamber, and can choose freely which actions to emit and when. Most actions have no programmed consequences; however, one action (eg leverpressing; LP) is rewarded with food (which falls into a food magazine) according to an experimenter-determined schedule of reinforcement. Food delivery makes a characteristic sound, signalling its availability for harvesting via a nose poke (NP) into the magazine.

The schedule of reinforcement defines the (possibly stochastic) relationship between the delivery of a reward and one or both of (a) the number of LPs, and (b) the time since the last reward was delivered. In common use are fixed-ratio (FR) schedules, in which a fixed number of LPs is required to obtain a reinforcer; random-ratio (RR) schedules, in which each LP has a constant probability of being reinforced; and random interval (RI) schedules, in which the first LP after an (exponentially distributed) interval of time has elapsed, is reinforced. Schedules are often labelled by their type and a parameter, so RI30 is a random interval schedule with the exponential waiting time having a mean of 30 seconds [7].
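To make these schedule rules concrete, the sketch below (our illustration, not code from the paper) expresses each contingency as a reward-probability rule for a single leverpress; the function names and arguments are our own choices.

```python
import math
import random

def fr_rewarded(presses_since_last_reward, n):
    # Fixed ratio (FR n): the n-th leverpress since the last reward is reinforced.
    return presses_since_last_reward + 1 >= n

def rr_rewarded(p):
    # Random ratio: each leverpress is reinforced with probability p, independently
    # of all other presses (RR10 corresponds to p = 0.1).
    return random.random() < p

def ri_rewarded(pause, mean_interval):
    # Random interval (RI T): a reward is armed after an exponentially distributed
    # wait with mean T seconds, and the first press after arming collects it.
    # Because the exponential is memoryless, a press emitted after a pause of
    # `pause` seconds is reinforced with probability 1 - exp(-pause / T).
    return random.random() < 1.0 - math.exp(-pause / mean_interval)
```

The memoryless form in the RI case, 1 - exp(-pause/T), is the same press-by-press reward probability that reappears in the analysis around eq. (2) in Section 4.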
Different schedules induce different patterns of responding [7]. Fig 1a shows response metrics from rats leverpressing on an RI30 schedule. Leverpressing builds up to a relatively constant rate following a rather long pause after gaining each reward, during which the food is consumed. Hungry rats leverpress more vigorously than sated ones. A similar overall pattern is also characteristic of responding on RR schedules. Figure 1b shows the total number of LP responses in a 30 minute session for different interval schedules. The hyperbolic relationship between the reward rate (the inverse of the interval) and the response rate is a classic hallmark of free operant behavior [9].

3 The model

We model a free-operant task as a continuous MDP. Based on its state, the agent chooses both an action (a), and a latency (τ) at which to emit it. After time τ has elapsed, the action is completed, the agent receives rewards and incurs costs associated with its choice, and then selects a new (a, τ) pair based on its new state. We define three possible actions a ∈ {LP, NP, other}, where we take a = other to include the various miscellaneous behaviors such as grooming, rearing, and sniffing which animals typically perform during the experiment. For simplicity we consider unit actions, with the latency τ related to the vigor with which this unit is performed. To account for consumption time (which is non-negligible [11, 13]), if the agent nose-pokes and food is available, a predefined time t_eat passes before the next decision point (and the next state) is reached.

Crucially, performing actions incurs costs as well as potentially gains rewards. Following Staddon [14], we assume one part of the cost of an action to be proportional to the vigor of its execution, ie inversely proportional to τ. The constant of proportionality K_v depends on both the previous and the current action, since switching between different action types can require travel between different parts of the experimental chamber (say, the magazine to the lever), and can thus be more costly. Each action also incurs a fixed 'internal' reward or cost of ρ(a) per unit, typically with other being rewarding. The reinforcement schedule defines the probability of reward delivery for each state-action-latency triplet. An available reward can be harvested by a = NP into the magazine, and we assume that the thereby obtained subjective utility U(r) of the food reward is motivation-dependent, such that food is worth more to a hungry animal than to a sated one.

We consider the simplified case of a state space comprised of all the parameters relevant to the task. Specifically, the state space includes the identity of the previous action, an indicator as to whether a reward is available in the food magazine, and, as necessary, the number of LPs since the previous reinforcement (for FR) or the elapsed time since the previous LP (for RI). The transitions between the states P(S'|S, a, τ) and the reward function P_r(S, a, τ) are defined by the dynamics of the schedule of reinforcement, and all rewards and costs are harvested at state transitions and considered as point events. In the following we treat the problem of optimising a policy (which action to take and with what latency, given the state) in order to maximize the average rate of return (rewards minus costs per time). An exponentially discounted model gives the same qualitative results.

[Figure 2: three panels; (a, b) leverpress and nose poke rates per minute vs. seconds since reinforcement; (c) LPs in 5 min and LP latency (sec) vs. reinforcement rate/min.]
Figure 2: Data generated by the model captures the essence of the behavioral data: Leverpress (solid blue; circles) and nose poke (dashed red; stars) response rates on (a) an RR10 schedule and (b) a matched (yoked) RI schedule show constant LP rates which are higher for the ratio schedule. (c) The relationship between the total number of responses (circles) and rate of reinforcement is hyperbolic (solid line: hyperbolic curve fit). The mean latency to leverpress (dashed line) decreases as the rate of reinforcement increases.

In the average reward case [15, 16], the Bellman equation for the long-term differential (or average-adjusted) value of state S is:

$$V^*(S) = \max_{a,\tau}\Big\{\rho(a) - \frac{K_v(a_{prev},a)}{\tau} + U(r)\,P_r(S,a,\tau) - \tau\cdot r + \int dS'\,P(S'|S,a,\tau)\,V^*(S')\Big\} \qquad (1)$$

where r is the long term average reward rate (whose subtraction from the value quantifies the opportunity cost of delay). Building on ideas from [16], we suggest that the average reward rate is reported by tonic (baseline) levels of dopamine (and not serotonin [16]) in basal ganglia structures relevant for action selection, and that changes in tonic DA (eg as a result of pharmacological interventions) would thus alter the assumed average reward rate.

In this paper, we eschew learning, and examine the steady state behavior that arises when actions are chosen stochastically (via the so-called softmax or Boltzmann distribution) from the optimal one-step look-ahead model-based Q(S, a, τ) state-action-latency values. For ratio schedules, the simple transition structure of the task allows the Bellman equation to be solved analytically to determine the Q values. For interval schedules, we use average-reward value iteration [15] with time discretized at a resolution of 100ms. For simulations (eg of dopaminergic manipulations) where r was assumed to change independent of any change in the task contingencies, we used value iteration to find values approximately satisfying the Bellman equation (which is no longer exactly solvable). Our overriding aim is to replicate basic aspects of free operant behavior qualitatively, in order to understand the normative foundations of response vigor. We do not fit the parameters of the model to experimental data in a quantitative way, and the results we describe below are general, robust, characteristics of the model.
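As a minimal sketch of this action-selection step, the following computes one-step look-ahead Q values for repeated leverpressing on an interval schedule over a 100 ms latency grid and draws a latency from a Boltzmann (softmax) distribution. The numerical parameters, the assumed differential values V_r and V_nr, and the variable names are illustrative placeholders rather than the paper's settings; the Q expression anticipates eq. (2) below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters; these are not the values used in the paper's simulations.
Kv = 1.0               # vigor-cost coefficient K_v(LP, LP)
rho = -0.1             # fixed internal cost rho(LP) per press
r_bar = 0.5            # assumed average reward rate r (utils per second)
T = 25.0               # RI schedule parameter (seconds)
V_r, V_nr = 8.0, 2.0   # assumed differential values of the reward / no-reward states

taus = np.arange(0.1, 20.0, 0.1)   # candidate latencies, 100 ms resolution as in the text

# One-step look-ahead value of pressing again after a pause of tau seconds
# (the form written out as eq. (2) in Section 4, for an interval schedule).
p_r = 1.0 - np.exp(-taus / T)      # P(S_r | tau)
Q = rho - Kv / taus - taus * r_bar + p_r * V_r + (1.0 - p_r) * V_nr

# Boltzmann (softmax) choice over latencies
beta = 2.0                         # inverse temperature, illustrative
p = np.exp(beta * (Q - Q.max()))
p /= p.sum()
tau_sampled = rng.choice(taus, p=p)
print(f'greedy latency {taus[Q.argmax()]:.1f} s, sampled latency {tau_sampled:.1f} s')
```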
4 Results

Fig 2a depicts the behavior of our model on an RR10 schedule. In rough accordance with the behavior displayed by animals (which is similar to that shown in Fig 1a), the LP rate is constant over time, bar a pause for consumption. Fig 2b depicts the model's behavior in a yoked random interval schedule, in which the intervals between rewards were set to match exactly the intervals obtained by the agent trained on the ratio schedule in Fig 2a. The response rate is again constant over time, but it is also considerably lower than that in the corresponding RR schedule, although the external reward density is similar. This phenomenon has also been observed experimentally, and although the apparent anomaly has been much discussed in the associative learning literature, its explanation is not fully resolved [10]. Our model suggests that it is the result of an optimal cost/benefit tradeoff. We can analyse this difference by considering the Q values for leverpressing at different latencies in random schedules:

$$Q(S_{nr},LP,\tau) = \rho(LP) - \frac{K_v(LP,LP)}{\tau} - \tau\cdot r + P(S_r|\tau)\,V^*(S_r) + [1 - P(S_r|\tau)]\,V^*(S_{nr}) \qquad (2)$$

where we are looking at consecutive leverpresses in the absence of available reward, and S_r and S_nr designate the states in which a reward is or is not available in the magazine, respectively. In ratio schedules, since P(S_r|τ) is independent of τ, the optimizing latency is τ*_LP = √(K_v(LP,LP)/r), its inverse defining the optimal rate of leverpressing. In interval schedules, however, P(S_r|τ) = 1 − exp{−τ/T} where T is the schedule interval. Taking the derivative of eq. (2) we find that the optimal latency to leverpress τ*_LP satisfies

$$\frac{K_v(LP,LP)}{\tau^{*2}_{LP}} - r + \frac{1}{T}\,[V^*(S_r) - V^*(S_{nr})]\,e^{-\tau^*_{LP}/T} = 0.$$

Although no longer analytically solvable, it is easily seen that this latency will always be longer than that found above for ratio schedules. Intuitively, since longer inter-response intervals increase the probability of reward per press in interval schedules but not in ratio schedules, the optimal leverpressing rate is lower in the former than in the latter.
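A small numerical sketch of this comparison, with made-up parameter values: the ratio-schedule optimum has the closed form given above, while the interval-schedule first-order condition is solved by bisection.

```python
import math

# Illustrative parameter values, not the paper's; dV stands for V*(S_r) - V*(S_nr).
Kv, r_bar, T, dV = 1.0, 0.5, 25.0, 6.0

# Ratio schedule: P(S_r|tau) does not depend on tau, so dQ/dtau = Kv/tau^2 - r = 0.
tau_ratio = math.sqrt(Kv / r_bar)

# Interval schedule: solve Kv/tau^2 - r + (dV/T) * exp(-tau/T) = 0 by bisection
# (the left-hand side is strictly decreasing in tau, so the root is unique).
def foc(tau):
    return Kv / tau**2 - r_bar + (dV / T) * math.exp(-tau / T)

lo, hi = 1e-3, 1e3          # foc(lo) > 0, foc(hi) < 0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if foc(mid) > 0 else (lo, mid)
tau_interval = 0.5 * (lo + hi)

print(f'optimal inter-press latency: ratio {tau_ratio:.2f} s, interval {tau_interval:.2f} s')
assert tau_interval > tau_ratio   # interval responding is slower, as argued above
```

With any positive value difference V*(S_r) − V*(S_nr), the first-order condition is still positive at the ratio-schedule optimum, so the interval-schedule root lies at a longer latency; that is the sense in which the latency is 'always longer'.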
Fig 2c shows the average number of LPs in a 5 minute session for different interval schedules. This 'molar' measure of rate shows the well documented hyperbolic relationship (cf Fig 1b). On the 'molecular' level of single action choices, the mean latency ⟨τ_LP⟩ between consecutive LPs decreases as the probability of reinforcement increases. This measure of response vigor is actually more accurate than the overall response measure, as it is not contaminated by competition with other actions, or confounded with the number of reinforcers per session for different schedules (and the time forgone when consuming them). For this reason, although we (correctly; see [13]) predict that inter-response latency should slow for higher ratio requirements, raw LP counts can actually increase, as in Fig. 1c, probably due to fewer rewards and less time spent eating [13].
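For completeness, here is one way the hyperbolic matching-law form y = B·x/(x + x0) from the Fig 1b caption can be fit to molar rate data. The sketch fabricates its own synthetic points from the hyperbola (they are not the rats' data, nor the model's output) and simply recovers the generating parameters with scipy.

```python
import numpy as np
from scipy.optimize import curve_fit

def matching_law(x, B, x0):
    # Herrnstein's single-response matching law: response rate as a hyperbolic
    # function of reinforcement rate, y = B * x / (x + x0).
    return B * x / (x + x0)

# Synthetic data generated from the hyperbola itself, purely to demonstrate the fit;
# these are not the experimental values behind Fig 1b or Fig 2c.
rng = np.random.default_rng(1)
reinf_rate = np.array([2.0, 5.0, 10.0, 20.0, 40.0, 60.0])   # reinforcements per hour
resp_rate = matching_law(reinf_rate, B=35.0, x0=8.0) + rng.normal(0.0, 1.0, reinf_rate.size)

(B_hat, x0_hat), _ = curve_fit(matching_law, reinf_rate, resp_rate, p0=(30.0, 10.0))
print(f'fitted B = {B_hat:.1f} responses/min, x0 = {x0_hat:.1f} reinforcements/hour')
```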
5 Drive and dopamine

Having provided a qualitative account of the basic patterns of free operant rates of behavior, we turn to the main theoretical conundrum: the effects of drive and DA manipulations on response vigor. The key to understanding these is the role that the average reward r plays in the tradeoffs determining optimal response vigor. In effect, the average expected reward per unit time quantifies the opportunity cost for doing nothing (and receiving no reward) for that time; its increase thus produces general pressure for faster work. A direct consequence of making the agent hungrier is that the subjective utility of food is enhanced. This will have interrelated effects on the optimal average reward r, the optimal values V*, and the resultant optimal action choices and vigors. Notably, so long as the policy obtains food, its average reward rate will increase.

Consider a fixed or random ratio schedule. The increase in r will increase the optimal LP rate 1/τ*_LP = √(r/K_v(LP,LP)), as the higher reward utility offsets higher procurement costs. Importantly, because the optimal τ* has a similar dependence on r even for actions irrelevant to obtaining food, they also become more vigorous. The explanation of this effect is presented graphically in Fig 3e. The higher r increases the cost of sloth, since every τ time without reward forgoes an expected (τ·r) mean reward. Higher average rewards penalize late actions more than they do early ones, thus tilting action selection toward faster behavior, for all pre-potent actions. Essentially, hunger encourages the agent to complete irrelevant actions faster, in order to be able to resume leverpressing more quickly.
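To spell out this 'driving' effect quantitatively: for any action whose payoff does not depend on its own latency (the ratio-schedule case just discussed), the derivative step behind the expression quoted above gives

$$\frac{\partial}{\partial\tau}\Big[\rho(a)-\frac{K_v(a_{prev},a)}{\tau}-\tau\, r\Big]=\frac{K_v(a_{prev},a)}{\tau^{2}}-r=0\;\Rightarrow\;\tau^{*}_{a}=\sqrt{K_v(a_{prev},a)/r},\qquad \frac{\partial\tau^{*}_{a}}{\partial r}=-\frac{1}{2}\sqrt{K_v(a_{prev},a)}\;r^{-3/2}<0,$$

so any manipulation that raises the expected average reward rate r (hunger, or, on the suggestion below, elevated tonic dopamine) shortens the optimal latency of every pre-potent action, not only the one that produces food.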
For other schedules, the same effects generally hold (although the analytical reasoning is complicated by the fact that the optimal latencies may in these cases depend not only on the new average reward but also on the new values V*). Fig 3a shows simulated responding on an RI25 schedule in which the internal reward for the food-irrelevant action other has been set high enough to warrant non-negligible base responding. Fig 3b shows that when the utility of food is increased by 50%, the agent chooses to leverpress more, at the expense of other actions. This illustrates the 'directing' effect of motivation, by which the agent is directed more forcefully toward the motivationally relevant action [17]. Furthermore, the second, 'driving' effect, by which motivation increases vigor globally [17], is illustrated in Fig 3d which shows that, in fact, the latency to both actions has decreased. Thus, although selected less often, when other is selected, it is performed more vigorously than it was when the agent was sated.

[Figure 3: six panels; (a-c) response rates per minute vs. seconds from reinforcement for LP, NP and other; (d) mean latency (sec) for LP and other; (e) Q values and softmaxed choice probabilities vs. latency; (f) LPs in 30 minutes for FR 1, 3, 9, 27, control vs. 60% DA depleted.]
Figure 3: The effects of drive on response rates. (a) Responding on a RI25 schedule, with high internal rewards (0.35) for a = other (open circles). (b) The effects of hunger: U(r) was changed from 10 to 15. (c) The effect of an irrelevant drive (hungry animals leverpressing for water rewards): r was increased by 4% compared to (a). (d) Mean latencies to responding ⟨τ⟩ for LP and other in baseline (a; black), increased hunger (b; white) and irrelevant drive (c; gray). (e) Q values for leverpressing at different latencies τ. In black (top) are the unadjusted Q values, before subtracting (τ·r). In red (middle, solid) and green (bottom, solid) are the values adjusted for two different average reward rates. The higher reward rate penalizes late actions more, thereby causing faster responding, as shown by the corresponding softmaxed action probability curves (dashed). (f) Simulation of DA depletion: overall leverpress count over 30 minute sessions (each bar averaging 15 sessions), for different FR requirements (bottom). In black is the control condition, and in gray is simulated DA depletion, attained by lowering r by 60%. The effects of the depletion seem more pronounced in higher schedules (compare to Fig 1c), but this actually results from the interaction with the number of rewards attained (see text).

This general drive effect can be better isolated if we examine hungry agents leverpressing for water (rather than food), without competition from actions for food. We can view our leverpressing MDP as a portion of a larger one, which also includes (for instance) occasional opportunities for visits to a home cage where food is available. Without explicitly specifying all this extra structure, a good approximation is to take hunger as again causing an increase in the global rate of reinforcement r, reflecting the increase in the utility of food received elsewhere. Fig 3c shows the effects on responding on an interval schedule, of estimating the average reward rate to be 4% higher than in Fig 3a, and deriving new Q values from the previous V* with this new r as illustrated in Fig 3e. As above, the adjusted vigors of all behaviors are faster (Fig 3d, gray bars), as a result of the higher 'drive'.

How do these drive effects relate to dopamine? Pharmacological and lesion studies show that enhancing DA levels (through agonists such as amphetamine) increases general activity [5, 18, 19], while depleting or antagonising DA causes a general slowing of responding (eg [4]). Fig. 1c is representative of a host of results from the lab of Salamone [4, 12] which show that lower levels of DA in the nucleus accumbens (a structure in the basal ganglia implicated in action selection) result in lower response rates. This effect seems more pronounced in higher fixed-ratio schedules, those requiring more work per reinforcer. As a result of this apparent dependence on the response requirement, Salamone and his colleagues have hypothesized that DA enables animals to overcome higher work demands.

We suggest that tonic levels of DA represent the average reward rate (a role tentatively proposed for serotonin in [16]). Thus a higher tonic level of DA represents a situation akin to higher drive, in which behavior is more vigorous, and lower tonic levels of DA cause a general slowing of behavior. Fig. 3f shows the simulated response counts for different FR schedules in two conditions. The control condition is the standard model described above; DA depletion was modeled by decreasing tonic DA levels (and therefore r) to 40% of their original levels. The results match the data in Fig. 1c. Here, the apparently small effect on the number of LPs for low ratio schedules actually arises because of the large amount of time spent eating. Thus, according to the model DA is not really allowing animals to cope with higher work requirements, but rather is important for optimal choice of vigor at any work requirement, with the slowing effect of DA depletion more prominent (in the crude measure of LPs per session) when more time is spent leverpressing.
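A back-of-the-envelope sketch of the depletion manipulation just described, using the same simplified ratio-schedule latency rule as earlier; all numbers (vigor cost, eating time, baseline r) are invented for illustration, and the press counts are simple bookkeeping rather than output of the full model.

```python
import math

# Illustrative numbers, not the paper's fitted parameters.
Kv, t_eat, session = 1.0, 5.0, 30 * 60   # vigor cost, eating time (s), session length (s)

def lp_count(r_bar, fr_requirement):
    # Approximate leverpresses per session on an FR schedule, assuming every press is
    # emitted at the ratio-schedule optimum tau* = sqrt(Kv / r_bar), and holding r_bar
    # fixed across schedules (a simplification relative to the full model).
    tau = math.sqrt(Kv / r_bar)
    time_per_reward = fr_requirement * tau + t_eat   # pressing time plus consumption time
    rewards = session / time_per_reward
    return rewards * fr_requirement

for fr in (1, 3, 9, 27):
    control = lp_count(r_bar=0.5, fr_requirement=fr)
    depleted = lp_count(r_bar=0.5 * 0.4, fr_requirement=fr)   # tonic DA, and hence r, at 40%
    print(f'FR{fr:>2}: control {control:6.0f} presses, DA-depleted {depleted:6.0f} presses')
```

With these made-up numbers, the depletion-induced drop in press counts grows with the ratio requirement even though the latency change is identical at every requirement, illustrating the point that the apparent interaction with workload largely reflects time spent eating.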
6 Discussion

The present model brings the computational machinery and neural grounding of RL models fully into contact with the vast reservoir of data from free-operant tasks. Classic quantitative accounts of operant behavior (such as Herrnstein's matching law [9], and variations such as melioration) lack RL's normative grounding in sound control theory, and tend instead toward descriptive curve-fitting. Most of these theories do not address the fine-scale (molecular) structure of behavior, and instead concentrate on fairly crude molar measures such as total number of leverpresses over long durations. In addition to the normative starting point it offers for investigations of response vigor, our theory provides a relatively fine scalpel for dissecting the temporal details of behavior, such as the distributions of inter-response intervals at particular state transitions. There is thus great scope for revealing re-analyses of many existing data sets. In particular, the effects of generalized drive have proved mixed and complex [17]. Our theory suggests that studies of inter-response intervals (eg Fig 3d) may reveal more robust changes in vigor, uncontaminated by shifts in overall action propensity.

Response vigor and dopamine's role in controlling it have appeared in previous RL models of behavior [20, 21], but only as fairly ad-hoc bolt-ons, for instance using repeated choices between doing nothing versus something to capture response latency. Here, these aspects are wholly integrated into the explanatory framework: optimizing response vigor is treated as itself an RL problem, with a natural dopaminergic substrate. To account for immediate (unlearned) effects of motivational or dopaminergic manipulations, the main assumption we make is that tonic levels of DA can be sensitive to predicted changes in the average reward occasioned by changes in the motivational state, and that behavioral policies are in turn immediately affected. This sensitivity would be easy to embed in a temporal-difference RL system, producing flexible adaptation of response vigor. By contrast, due to the way they cache outcome values, the action choices of such RL systems are characteristically insensitive to the 'directing' effects of motivational manipulations [22]. In animal behavior, 'habitual actions' (the ones associated with the DA system) are indeed motivationally insensitive for action choice, but show a direct effect of drive on vigor [23].

Our model is easy to accommodate within a framework of temporal difference (TD) learning. Thus, it naturally preserves the link between phasic DA signals and online learning of optimal values [24]. We further elaborate this link by suggesting an additional role for tonic levels of DA in online vigor selection. A major question remains as to whether phasic responses (which are known to correlate with response latency [25]) play an additional role in determining response vigor. Further, it is pressing to reconcile the present account with our previous suggestion (based on microdialysis findings) [16] that tonic levels of DA might track average punishment.

The most critical avenues to develop this work will be an account of learning, and neurally and psychologically more plausible state and temporal representations. On-line value learning should be a straightforward adaptation of existing TD models of phasic DA based on the continuous-time semi-Markov setting [26]. The representation of state is more challenging: the assumption of a fully observable state space automatically appropriate for the schedule of reinforcement is not realistic. Indeed, apparently sub-optimal actions emitted by animals, eg engaging in excessive nose-poking even when a reward has not audibly dropped into the food magazine [11], may provide clues to this issue. Finally, it will be crucial to consider the fact that animals' decisions about vigor may translate only noisily into response times, due, for instance, to the variability of internal timing [27].

Acknowledgments

This work was funded by the Gatsby Charitable Foundation, a Dan David fellowship (YN), the Royal Society (ND) and the EU BIBA project (ND and PD). We are grateful to Jonathan Williams for discussions on free operant behavior.

References

[1] Dickinson A. and Balleine B.W. The role of learning in the operation of motivational systems. In Steven's Handbook of Experimental Psychology, Volume 3, pages 497–533. John Wiley & Sons, New York, 2002.
[2] Hull C.L. Principles of behavior: An introduction to behavior theory. Appleton-Century-Crofts, New York, 1943.

[3] Bélanger D. and Tétreau B. L'influence d'une motivation inappropriate sur le comportement du rat et sa fréquence cardiaque. Can. J. of Psych., 15:6–14, 1961.

[4] Salamone J.D. and Correa M. Motivational views of reinforcement: implications for understanding the behavioral functions of nucleus accumbens dopamine. Behavioural Brain Research, 137:3–25, 2002.

[5] Ikemoto S. and Panksepp J. The role of nucleus accumbens dopamine in motivated behavior: a unifying interpretation with special reference to reward-seeking. Brain Res. Rev., 31:6–41, 1999.

[6] Schultz W. Predictive reward signal of dopamine neurons. J. Neurophys., 80:1–27, 1998.

[7] Domjan M. The principles of learning and behavior. Brooks/Cole, Pacific Grove, California, 3rd edition, 1993.

[8] Sutton R.S. and Barto A.G. Reinforcement learning: An introduction. MIT Press, 1998.

[9] Herrnstein R.J. On the law of effect. J. of the Exp. Anal. of Behav., 13(2):243–266, 1970.

[10] Dawson G.R. and Dickinson A. Performance on ratio and interval schedules with matched reinforcement rates. Q. J. of Exp. Psych. B, 42:225–239, 1990.

[11] Niv Y., Daw N.D., Joel D., and Dayan P. Motivational effects on behavior: Towards a reinforcement learning model of rates of responding. In CoSyNe, Salt Lake City, Utah, 2005.

[12] Aberman J.E. and Salamone J.D. Nucleus accumbens dopamine depletions make rats more sensitive to high ratio requirements but do not impair primary food reinforcement. Neuroscience, 92(2):545–552, 1999.

[13] Foster T.M., Blackman K.A., and Temple W. Open versus closed economies: performance of domestic hens under fixed-ratio schedules. J. of the Exp. Anal. of Behav., 67:67–89, 1997.

[14] Staddon J.E.R. Adaptive dynamics. MIT Press, Cambridge, Mass., 2001.

[15] Mahadevan S. Average reward reinforcement learning: Foundations, algorithms and empirical results. Machine Learning, 22:1–38, 1996.

[16] Daw N.D., Kakade S., and Dayan P. Opponent interactions between serotonin and dopamine. Neural Networks, 15(4-6):603–616, 2002.

[17] Bolles R.C. Theory of Motivation. Harper & Row, 1967.

[18] Carr G.D. and White N.M. Effects of systemic and intracranial amphetamine injections on behavior in the open field: a detailed analysis. Pharmacol. Biochem. Behav., 27:113–122, 1987.

[19] Jackson D.M., Anden N., and Dahlstrom A. A functional effect of dopamine in the nucleus accumbens and in some other dopamine-rich parts of the rat brain. Psychopharmacologia, 45:139–149, 1975.

[20] Dayan P. and Balleine B.W. Reward, motivation and reinforcement learning. Neuron, 36:285–298, 2002.

[21] McClure S.M., Daw N.D., and Montague P.R. A computational substrate for incentive salience. Trends in Neurosc., 26(8):423–428, 2003.

[22] Daw N.D., Niv Y., and Dayan P. Uncertainty based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12):1704–1711, 2005.

[23] Dickinson A., Balleine B., Watt A., Gonzalez F., and Boakes R.A. Motivational control after extended instrumental training. Anim. Learn. and Behav., 23(2):197–206, 1995.

[24] Montague P.R., Dayan P., and Sejnowski T.J. A framework for mesencephalic dopamine systems based on predictive hebbian learning. J. of Neurosci., 16(5):1936–1947, 1996.
[25] Satoh T., Nakai S., Sato T., and Kimura M. Correlated coding of motivation and outcome of decision by dopamine neurons. J. of Neurosci., 23(30):9913–9923, 2003.

[26] Daw N.D., Courville A.C., and Touretzky D.S. Timing and partial observability in the dopamine system. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, volume 14, Cambridge, MA, 2002. MIT Press.

[27] Gallistel C.R. and Gibbon J. Time, rate and conditioning. Psych. Rev., 107:289–344, 2000.
", "award": [], "sourceid": 2842, "authors": [{"given_name": "Yael", "family_name": "Niv", "institution": null}, {"given_name": "Nathaniel", "family_name": "Daw", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}