{"title": "Environmental statistics and the trade-off between model-based and TD learning in humans", "book": "Advances in Neural Information Processing Systems", "page_first": 127, "page_last": 135, "abstract": "There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence -- especially in humans -- as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rule-based vs. incremental learning.", "full_text": "Environmental statistics and the trade-off between\n\nmodel-based and TD learning in humans\n\nDylan A. Simon\n\nDepartment of Psychology\n\nNew York University\nNew York, NY 10003\ndylex@nyu.edu\n\nCenter for Neural Science and Department of Psychology\n\nNathaniel D. Daw\n\nNew York University\nNew York, NY 10003\n\nnathaniel.daw@nyu.edu\n\nAbstract\n\nThere is much evidence that humans and other animals utilize a combination of\nmodel-based and model-free RL methods. 
Although it has been proposed that\nthese systems may dominate according to their relative statistical ef\ufb01ciency in\ndifferent circumstances, there is little speci\ufb01c evidence \u2014 especially in humans\n\u2014 as to the details of this trade-off. Accordingly, we examine the relative perfor-\nmance of different RL approaches under situations in which the statistics of reward\nare differentially noisy and volatile. Using theory and simulation, we show that\nmodel-free TD learning is relatively most disadvantaged in cases of high volatility\nand low noise. We present data from a decision-making experiment manipulating\nthese parameters, showing that humans shift learning strategies in accord with\nthese predictions. The statistical circumstances favoring model-based RL are also\nthose that promote a high learning rate, which helps explain why, in psychology,\nthe distinction between these strategies is traditionally conceived in terms of rule-\nbased vs. incremental learning.\n\n1\n\nIntroduction\n\nThere are many suggestions that humans and other animals employ multiple approaches to learned\ndecision making [1]. Precisely delineating these approaches is key to understanding human deci-\nsion systems, especially since many problems of behavioral control such as addiction have been at-\ntributed to partial failures of one component [2]. In particular, understanding the trade-offs between\napproaches in order to bring them under experimental control is critical for isolating their unique\ncontributions and ultimately correcting maladaptive behavior. Psychologists primarily distinguish\nbetween declarative rule learning and more incremental learning of stimulus-response (S\u2013R) habits\nacross a broad range of tasks [3, 4]. 
They have shown that large problem spaces, probabilistic feed-\nback (as in the weather prediction task), and dif\ufb01cult to verbalize rules (as in information integration\ntasks from category learning) all seem to promote the use of a habit learning system [5, 6, 7, 8, 9].\nThe alternative strategies, which these same manipulations disfavor, are often described as imput-\ning (inherently deterministic) \u2018rules\u2019 or \u2018maps\u2019, and are potentially supported by dissociable neural\nsystems also involved in memory for one-shot episodes [10].\nNeuroscientists studying rats have focused on more speci\ufb01c tasks that test whether animals are sen-\nsitive to changes in the outcome contingency or value of actions. For instance, under different task\ncircumstances or following different brain lesions, rats are more or less willing to continue working\nfor a devalued food reward [11]. In terms of reinforcement learning (RL) theories, such evidence\nhas been proposed to re\ufb02ect a distinction between parallel systems for model-based vs. model-free\nRL [12, 13]: a world model permits updating a policy following a change in food value, while\nmodel-free methods preclude this.\n\n1\n\n\fIntuitively, S\u2013R habits correspond well to the policies learned by TD methods such as actor/critic\n[14, 15], and rule-based cognitive planning strategies seem to mirror model-based algorithms. How-\never, the implication that this distinction fundamentally concerns the use or non-use of a world model\nin representation and algorithm seems somewhat at odds with the conception in psychology. Specif-\nically, neither the gradation of update (i.e., incremental vs. abrupt) nor the nature of representation\n(i.e., verbalizable rules) posited in the declarative system seem obviously related to the model-use\ndistinction. 
Although there have been some suggestions about how episodic memory may support\nTD learning [16], a world model as conceived in RL is typically inherently probabilistic, so as to\nsupport computing expected action values in stochastic environments, and thus must be learned by\nincrementally composing multiple experiences. It has also been suggested that episodic memory\nsupports yet a third decision strategy distinct from both model-based and model-free [17], although\nthere is no experimental evidence for such a triple dissociation or in particular for a separation be-\ntween the putative episodic and model-based controllers.\nHere we suggest that an explanation for this mismatch may follow from the circumstances under\nwhich each RL approach dominates. It has previously been proposed that model-free and model-\nbased reasoning should be traded off according to their relative statistical ef\ufb01ciency (proxied by\nuncertainty) in different circumstances [13]. In fact, what ultimately matters to a decision-maker is\nrelative advantage in terms of reward [18]. Focusing speci\ufb01cally on task statistics, we extend the\nuncertainty framework to investigate under what circumstances the performance of a model-based\nsystem excels suf\ufb01ciently to make it worthwhile.\nWhen the environment is completely static, TD is well known to converge to the optimal policy\nalmost as quickly as model-based approaches [19], and so environmental change must be key to\nunderstanding its computational disadvantages. Primarily, model-free Monte Carlo (MC) methods\nsuch as TD are unable to propagate learned information around the state space ef\ufb01ciently, and in\nparticular to generalize to states not observed in the current trajectory. This is not the only way in\nwhich MC methods learn slowly, however: they must also take samples of outcomes and average\nover them. 
This process introduces additional noise to the sampling process which must be averaged\nover, as observational deviations resulting from the learner\u2019s own choice variability or transition\nstochasticity in the environment are confounded with variability in immediate rewards. In effect, this\naveraging imposes an upper bound on the learning rate needed to achieve reasonable performance,\nand, correspondingly, on how well it can keep up with task volatility.\nConversely, the key bene\ufb01t of model-based reasoning lies in its ability to react quickly to change,\napplying single-trial experience \ufb02exibly in order to construct values. We provide a more formal\nargument of this observation in MDPs with dynamic rewards and static transitions, and \ufb01nd that\nthe environments in which TD is most impaired are those with frequent changes and little noise.\nThis suggests a strategy by which these two approaches should optimally trade-off, which we test\nempirically using a decision task in humans while manipulating reward statistics. The high-volatility\nenvironments in which model-based learning dominates are also those in which a learning rate near\none optimally applies. This may explain why a model-based system is associated with or perhaps\nspecialized for rapid, declarative rule learning.\n\n2 Theory\n\nModel-free and model-based methods differ in their strategies for estimating action values from\nsamples. One key disadvantage of Monte Carlo sampling of long-run values in an MDP, relative to\nmodel-based RL (in which immediate rewards are sampled and aggregated according to the sampled\ntransition dynamics), is the need to average samples over both reward and state transition stochas-\nticity. 
This impairs its ability to track changes in the underlying MDP, with the disadvantage most\npronounced in situations of high volatility and low noise.\nBelow, we develop the intuition for this disadvantage by applying Kalman filter analysis [20] to\nexamine uncertainties in the simplest possible MDP that exhibits the issue. Specifically, consider a\nstate with two actions, each associated with a pair of terminal states. Each action leads to one of the\ntwo states with equal probability, and each of the four terminal states is associated with a reward. The\nrewards are stochastic and diffusing, according to a Gaussian process, and the transitions are fixed.\nWe consider the uncertainty and reward achievable as a function of the volatility and observation\nnoise. We have here made some simplifications in order to make the intuition as clear as possible:\nthat each trajectory has only a single state transition and reward; that in the steady state the static\ntransition matrix has been fully learned; and that all analyzed distributions are Gaussian. We test\nsome of these assumptions empirically in section 3 by showing that the same pattern holds in more\ncomplex tasks.\n\n2.1 Model\n\nIn general $X_t(i)$ or just $X$ will refer to an actual sample of the $i$th variable (e.g., reward or value) at\ntime $t$, $\\bar{X}$ refers to the (latent) true mean of $X$, and $\\hat{X}$ refers to estimates of $\\bar{X}$ made by the learning\nprocess. Given i.i.d. Gaussian diffusion processes on each value, $X_t(i)$, described by:\n\n$$\\sigma^2 = \\left\\langle \\left(\\bar{X}_{t+1}(i) - \\bar{X}_t(i)\\right)^2 \\right\\rangle \\quad \\text{(diffusion or volatility)} \\qquad (1)$$\n$$\\varepsilon^2 = \\left\\langle \\left(X_t(i) - \\bar{X}_t(i)\\right)^2 \\right\\rangle \\quad \\text{(observation noise)} \\qquad (2)$$\n\nthe optimal learning rate that achieves the minimal uncertainty (from the Kalman gain) is:\n\n$$\\alpha^* = \\frac{\\sigma\\sqrt{\\sigma^2 + 4\\varepsilon^2} - \\sigma^2}{2\\varepsilon^2} \\qquad (3)$$\n\nNote that this function is monotonically increasing with $\\sigma$ and decreasing with $\\varepsilon$ (and in particular,\n$\\alpha^* \\to 1$ as $\\varepsilon \\to 0$). 
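The closed-form learning rate can be checked numerically. The following is a minimal sketch, assuming a scalar random walk with step standard deviation sigma observed with noise standard deviation eps; the function names are hypothetical and not taken from the paper. It compares the closed-form expression with the gain obtained by iterating the standard scalar Kalman recursion.

```python
import math

def optimal_learning_rate(sigma, eps):
    # Closed-form steady-state Kalman gain for a random walk with
    # step s.d. sigma observed with noise s.d. eps.
    if eps == 0:
        return 1.0  # limiting value as eps -> 0
    return (sigma * math.sqrt(sigma**2 + 4 * eps**2) - sigma**2) / (2 * eps**2)

def iterate_gain(sigma, eps, steps=10000):
    # Run the scalar Kalman recursion until the gain settles.
    post = 1.0  # arbitrary starting posterior variance
    gain = 0.0
    for _ in range(steps):
        prior = post + sigma**2          # diffusion inflates the variance
        gain = prior / (prior + eps**2)  # Kalman gain for this step
        post = (1 - gain) * prior        # posterior variance after the update
    return gain

closed = optimal_learning_rate(0.1, 0.3)
numeric = iterate_gain(0.1, 0.3)
```

In this sketch the two values agree to numerical precision, and raising sigma or lowering eps pushes the learning rate toward one.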
When using this learning rate the resulting asymptotic uncertainty (variance of\nestimates) will be:\n\n$$U_X(\\alpha^*) = \\left\\langle (\\hat{X} - \\bar{X})^2 \\right\\rangle = \\frac{\\sigma\\sqrt{\\sigma^2 + 4\\varepsilon^2} + \\sigma^2}{2} \\qquad (4)$$\n\nThis, as expected, increases monotonically in both parameters.\nWhat often matters, however, is identifying the highest of multiple values, e.g., $\\bar{X}(i)$ and $\\bar{X}(j)$. If\n$\\bar{X}(i) - \\bar{X}(j) = d$, the marginal value of the choice will be $\\pm d$. Given some uncertainty, $U$, the\nprobability of this choice, i.e., $\\hat{X}(i) > \\hat{X}(j)$, compared to chance is:\n\n$$c(U) = 2\\int_{-\\infty}^{\\infty} \\phi\\left(x - \\frac{d}{\\sqrt{U}}\\right) \\Phi(x)\\,dx - 1 \\qquad (5)$$\n\n(Where $\\phi$ and $\\Phi$ are the density and distribution functions for the standard normal.) The resulting\nvalue of the choice is thus $c(U)d$. While $c$ is flat at 1 as $U \\to 0$, it shrinks as $\\Theta(1/\\sqrt{U})$ (since\n$\\phi'(0) = 0$). Our goal is now to determine $c(U_Q)$ for each algorithm.\n\n2.2 Value estimation\n\nConsider the value of one of the actions in our two-action MDP which leads to state A or B. Here,\nthe true expected value of the choice is $\\bar{Q} = \\frac{\\bar{R}(A) + \\bar{R}(B)}{2}$. If each reward is changing according to\nthe Gaussian diffusion process described above, this will induce a change process on $Q$. A model-\nbased system that has fully learned the transition dynamics will be able to estimate $\\hat{R}(A)$ and $\\hat{R}(B)$\nseparately, and thus take the expectation to produce $\\hat{Q}$. By assuming each reward is sampled equally\noften and adopting the appropriate effective $\\sigma$, the resulting uncertainty of this expectation, $U_{MB}$,\nfollows Equation 4, with $X = Q$.\nOn the other hand, a Monte Carlo system that must take samples over transitions will observe\n$Q = R(A)$ or $Q = R(B)$. If $\\bar{R}(A) - \\bar{R}(B) = d$, it will observe an additional variance of $d^2/4$ from the\nmixture of the two reward distributions. Treating this noise as Gaussian and adding it to the noise of\nthe rewards, this decreases the optimal learning rate and increases the minimal uncertainty to:\n\n$$U_{MC} = \\left\\langle (\\hat{Q} - \\bar{Q})^2 \\right\\rangle = \\frac{\\sigma\\sqrt{\\sigma^2 + d^2 + 4\\varepsilon^2} + \\sigma^2}{2} \\qquad (6)$$\n\nOther forms of stochasticity, whether from changing policies or more complex MDPs, will similarly\ninflate the effective noise term, albeit with a different form.\n\nFigure 1: Difference in theoretical success rate between MB and MC (y-axis: MB\u2013TD advantage in probability; x-axes: $\\sigma$ and $\\varepsilon$, each from 0 to 0.5)\n\nClearly $U_{MC} \\geq U_{MB}$. However, the more relevant measure is how these uncertainties translate into\nvalues [18]. For this we want to compare their relative success rates, $c(U)$ from Equation 5, which\nscale directly to outcome. The relative advantage of the model-based (MB) approach,\n$c(U_{MB}) - c(U_{MC})$, is plotted in Figure 1 for an arbitrary reward deviation $d = 1$. As expected, as either\nthe volatility or noise parameter gets very large and the task gets harder, the uncertainty increases,\nperformance approaches chance, and the relative advantage vanishes. However, for reasonable sizes\nof $\\sigma$, the model-based advantage first increases to a peak as $\\sigma$ increases, which is largest for small\nvalues of $\\varepsilon$. 
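The comparison of success rates can be illustrated numerically. The sketch below rests on two assumptions that are not in the original text: the success-rate integral is evaluated through the identity that the integral of a shifted normal density against the normal distribution function equals the distribution function of the shift divided by the square root of two, and the model-based uncertainty uses the same diffusion sigma as the Monte Carlo one, since the text does not spell out its effective value. All function names are hypothetical.

```python
import math

def uncertainty(sigma, noise_var):
    # Asymptotic estimate variance for diffusion s.d. sigma and a total
    # observation noise variance noise_var.
    return (sigma * math.sqrt(sigma**2 + 4 * noise_var) + sigma**2) / 2

def success_rate(u, d=1.0):
    # Above-chance probability of ranking two values correctly when they
    # differ by d and the estimates have variance u.
    # Closed form: 2 * Phi(d / sqrt(2u)) - 1 = erf(d / (2 * sqrt(u))).
    if u == 0:
        return 1.0
    return math.erf(d / (2 * math.sqrt(u)))

def mb_advantage(sigma, eps, d=1.0):
    # Assumption: the model-based learner sees the same effective sigma.
    u_mb = uncertainty(sigma, eps**2)
    # Monte Carlo sampling adds d^2 / 4 of transition-induced noise.
    u_mc = uncertainty(sigma, eps**2 + d**2 / 4)
    return success_rate(u_mb, d) - success_rate(u_mc, d)

advantage = mb_advantage(0.2, 0.1)
```

Under these assumptions the advantage is never negative, because the extra noise term can only raise the Monte Carlo uncertainty. The exact shape of the curve depends on the effective sigma, so this sketch need not reproduce the plotted values.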
No comparable increasing advantage is seen for model-based valuation for increasing\n$\\varepsilon$.\nWhile these techniques may also be extended more generally to other MDPs (see Supplemental\nMaterials), the core observation presented above should illuminate the remainder of our discussion.\n\n3 Simulation\n\nTo examine our qualitative predictions in a more realistic setting, we simulated randomly generated\nMDPs with 8 states, 2 actions, and transition and reward functions following the assumptions given\nin the previous section, with the addition of a contractive factor on rewards, $\\varphi$, to prevent divergence:\n\n$$\\bar{R}_0(s, a) \\sim \\mathcal{N}(0, 1) \\quad \\text{(stationary distribution)}$$\n$$\\varphi = \\sqrt{1 - \\sigma^2}, \\qquad \\operatorname{var} \\bar{R} = 1$$\n$$\\bar{R}_t(s, a) = \\varphi \\bar{R}_{t-1}(s, a) + w_t(s, a), \\qquad w_t(s, a) \\sim \\mathcal{N}(0, \\sigma^2)$$\n$$R_t(s, a) = \\bar{R}_t(s, a) + v_t, \\qquad v_t \\sim \\mathcal{N}(0, \\varepsilon^2)$$\n\nEach transition had (at most) three possible outcomes, with probabilities 0.6, 0.3, and 0.1, assigned\nrandomly with replacement from the 8 states. In order to avoid bias related to the exploration policy,\neach learning algorithm observed the same set of 1000 choices (chosen according to the objectively\noptimal policy, plus softmax decision noise), and the greedy policy resulting from its learned values\nwas assessed according to the true $\\bar{R}$ values at that point. The entire process was repeated 5000\ntimes for each different setting of the $\\sigma$ and $\\varepsilon$ parameters.\nWe compared the performance of a model-based approach using value iteration with a fixed, optimal\nreward learning rate and transition counting (MB) against various model-free algorithms including\nQ(0), SARSA(0), and SARSA(1) (with fixed optimal learning rates), all using a discount factor of\n$\\gamma = 0.9$. As expected, all learners showed a decrement in reward as $\\sigma$ increased. Figure 2 shows the\ndifference in mean reward obtained between MB and SARSA(0). 
Q(0) and SARSA(1) showed the\nsame pattern of results.\nThe correspondence between the theoretical results and the simulation confirms that the theoretical\nfindings do hold more generally, and we claim that the same underlying effects drive these results.\n\nFigure 2: Difference in reward obtained between MB and SARSA(0) (y-axis: MB\u2013TD advantage in reward; x-axes: $\\sigma$ and $\\varepsilon$)\n\n4 Human behavior\n\nHuman subjects performed a decision task that represented an MDP with 4 states and 2 actions.\nThe rewards followed the same contractive Gaussian diffusion process used in section 3, with $\\sigma$\nand $\\varepsilon$ parameters varied across subjects. We sought changes in the reliance on model-based and\nmodel-free strategies via regressions of past events onto current choices [21]. We hypothesized that\nmodel-based RL would be uniquely favored for large $\\sigma$ and small $\\varepsilon$.\n\n4.1 Methods\n\n4.1.1 Participants\n55 individuals from the undergraduate subject pool and the surrounding community participated in\nthe experiment. Twelve received monetary compensation based on performance, and the remainder\nreceived credit fulfilling course requirements. All participants gave informed consent and the study\nwas approved by the human subjects ethics board of the institute.\n\n4.1.2 Task\nSubjects viewed a graphical representation of a rotating disc with four pairs of colored squares\nequally spaced around the edge. Each pair of squares constituted a state ($s \\in S = \\{N, E, S, W\\}$)\nand had a unique distinguishable color and icon indicating direction (an arrow of some type). Each\nof the two squares in a state represented an action ($a \\in A = \\{L, R\\}$), and had a left- or right-directed\nicon. 
During the task, only the top quadrant of the disc was visible at any time, and at decision time\nsubjects could select the left or right action by pressing the left or right arrow button on a keyboard.\nImmediately after selecting an action, between zero and five coins (including a pie-fraction of a\ncoin) appeared under the selected action square, representing a reward ($R \\in [0, 5]$). After 600 ms,\nthe disc began rotating and the reward became slowly obscured over the next 1150 ms until a new\npair of squares was at the top of the disc and the next decision could be entered, as seen in Figure 3.\nThe state dynamics were determined by a fixed transition function ($T : S \\times A \\to S$) such that\neach action was most likely to lead to the next adjacent state along the edge of the disc (e.g.,\n$T(N, L) = W$). To this, additional uniform outcome noise was added with probability 0.4. The re-\nward distribution followed the same Gaussian process given in the previous sections, except shifted\nand trimmed. The parameters $\\sigma$ and $\\varepsilon$ were varied by condition.\n\n$$T : S \\times A \\times S \\to [0, 1], \\qquad T(s, a, s') = \\begin{cases} 0.7 & \\text{if } s' = T(s, a) \\\\ 0.1 & \\text{otherwise} \\end{cases}$$\n$$R_t : S \\times A \\to [0, 5], \\qquad R_t(s, a) = \\min(\\max(\\bar{R}_t(s, a) + v_t + 2.5, 0), 5)$$\n\nFigure 3: Abstract task layout and screen shot shortly after a choice is made (yellow box indicates\nvisible display): Each state has two actions, right (red) and left (blue), which lead to the indicated\nstate with 70% probability, and otherwise to another state at random. Each action also results in a\nreward of 0\u20135 coins.\n\nEach subject was first trained on the transition and reward dynamics of the task, including 16 ob-\nservations of reward samples where the latent value $\\bar{R}$ was shown so as to get a feeling for both\nthe change and noise processes. They then performed 500 choice trials in a single condition. 
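The task dynamics described above can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not the authors' experiment code: the state order and the direction of the right action are hypothetical (the text only gives the left action from N leading to W), and all function names are invented for the example.

```python
import math
import random

STATES = ['N', 'E', 'S', 'W']  # assumed clockwise order around the disc
ACTIONS = ['L', 'R']

def expected_next(state, action):
    # Assumed layout: L moves one step counter-clockwise (N -> W),
    # R moves one step clockwise (N -> E).
    step = -1 if action == 'L' else 1
    return STATES[(STATES.index(state) + step) % 4]

def sample_next(state, action, rng):
    # With probability 0.6 follow the expected transition; otherwise jump
    # uniformly over all four states, so the expected successor has total
    # probability 0.6 + 0.4 / 4 = 0.7 and every other state has 0.1.
    if rng.random() < 0.6:
        return expected_next(state, action)
    return rng.choice(STATES)

def step_rewards(mean, sigma, rng):
    # Contractive Gaussian diffusion that keeps the stationary variance at 1.
    phi = math.sqrt(1 - sigma**2)
    return {key: phi * value + rng.gauss(0, sigma) for key, value in mean.items()}

def observe_reward(mean_value, eps, rng):
    # Noisy sample, shifted by 2.5 and clipped to the 0 to 5 coin range.
    return min(max(mean_value + rng.gauss(0, eps) + 2.5, 0), 5)
```

A seeded generator such as random.Random(0) can be passed as rng to make a simulated session repeatable.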
Each\nsubject was randomly assigned to one of 12 conditions, made up of $\\sigma \\in \\{0.03, 0.0462, 0.0635, 0.0882, 0.1225, 0.1452\\}$\npartially crossed with $\\varepsilon \\in \\{0, 0.126, 0.158, 0.316, 0.474, 0.506\\}$.\n\n4.1.3 Analysis\nBecause they use different sampling strategies to estimate action values, TD and model-based RL\ndiffer in their predictions of how experience with states and rewards should affect subsequent\nchoices. Here, we use a regression analysis to measure the extent to which choices at a state are\ninfluenced by recent previous events characteristic of either approach [21]. This approach has\nthe advantage of making only very coarse assumptions about the learning process, as opposed to\nlikelihood-based model-fits which may be biased by the specific learning equations assumed. By\nconfining our analyses to the most recent samples we remain agnostic about free parameters with\nnon-linear effects such as learning rates and discount factors, but rather measure the relative strength\nof reliance on either sort of evidence directly using a general linear model. Regardless of the actual\nlearning process, the most recent sample should have the strongest effect [22]. Accordingly, below\nwe define explanatory variables that capture the most recently experienced reward sample that would\nbe relevant to a choice under either Q(1) TD or model-based planning.\nThe data for each subject were considered to be the sequence of states visited, $S_t$, actions taken,\n$A_t$, and rewards received, $R_t$. We define additional vector time sequences $a$, $j$, $r$, $q$, and $p$, each\nindexed by time and state and referred to generally as $x_t(s)$, with all $x_0$ initially undefined. For each\nobservation we perform the following updates:\n\n$$w_t = [A_t = a_t(S_t)] \\quad \\text{(\u2018stay\u2019 vs. \u2018switch\u2019, boolean indicator)}$$\n$$a_{t+1}(S_t) = A_t \\quad \\text{(last action)}$$\n$$j_{t+1}(S_t) = [S_{t+1} \\neq T(S_t, A_t)] \\quad \\text{(\u2018jump\u2019, unexpected transition)}$$\n$$r_{t+1}(S_t) = R_t \\quad \\text{(immediate reward)}$$\n$$q_{t+1}(S_{t-1}) = R_t \\quad \\text{(subsequent reward)}$$\n$$p_{t+1}(S_t) = r_{t+1}(T(S_t, A_t)) \\quad \\text{(expected reward)}$$\n$$x_{t+1}(s) = x_t(s) \\;\\; \\forall s \\neq S_t \\quad \\text{(for } x = a, j, r, q, \\text{ and } p\\text{)}$$\n$$d_{t+1} = |R_t - r_t| \\quad \\text{(change)}$$\n\nFor convenience, we use $x_t$ to mean $x_t(S_t)$. Note that these vectors are step functions, such that\neach value is updated ($x_t \\neq x_{t-1}$) only when a relevant observation is made. They thus always\nrepresent the most recent relevant sample.\nGiven the task dynamics, we can consider how a TD-based Q-learning system and a model-based\nplanning system would compute values. Both take into account the last sample of the immediate\nreward, $r_t$. They differ in how they account for the reward from the \u201cnext state\u201d: either, for Q(1), as\n$q_t$ (the last reward received from the state visited after the last visit to $S_t$) or, for model-based RL, as\n$p_t$ (the last sample of the reward at the true successor state). That is, while TD(1) will incorporate the\nreward observed following $R_t$, regardless of the state, a model-based system will instead consider\nthe expected successor state [21]. 
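The bookkeeping behind these explanatory variables can be sketched as a single pass over a subject's trials. This is a hedged illustration rather than the authors' analysis code: it follows one literal reading of the update rules, the function names and the disc layout in expected_next are hypothetical, and undefined values are represented as None.

```python
def expected_next(state, action):
    # Hypothetical layout: L steps counter-clockwise, R steps clockwise.
    order = ['N', 'E', 'S', 'W']
    step = -1 if action == 'L' else 1
    return order[(order.index(state) + step) % 4]

def track_regressors(states, actions, rewards):
    a, j, r, q, p = {}, {}, {}, {}, {}
    last_change = None
    rows = []
    for t, (s, act, rew) in enumerate(zip(states, actions, rewards)):
        # Values available at choice time t (None where still undefined).
        rows.append({
            'stay': None if s not in a else act == a[s],
            'jump': j.get(s),
            'r': r.get(s),
            'q': q.get(s),
            'p': p.get(s),
            'd': last_change,
        })
        # Updates made after the outcome of trial t is observed.
        if s in r:
            last_change = abs(rew - r[s])  # change relative to the last sample
        a[s] = act
        r[s] = rew
        if t > 0:
            q[states[t - 1]] = rew  # reward that followed the previous state
        if t + 1 < len(states):
            j[s] = states[t + 1] != expected_next(s, act)
        p[s] = r.get(expected_next(s, act))  # last sample at expected successor
    return rows

rows = track_regressors(['N', 'W', 'N', 'W'], ['L', 'L', 'L', 'R'], [1.0, 2.0, 3.0, 4.0])
```

Each row holds the most recent relevant samples for the state being visited, so trials with a None entry correspond to the first visits that would be dropped from a regression.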
While the latter two reward observations will be the same in some\ncases, they can disagree either after a jump trial ($j$, where the model-based and sample successor\nstates differ), or when the successor state has more recently been visited from a different predecessor\nstate (providing a reward sample known to model-based but not to TD).\nGiven this, we can separate the effects of model-based and model-free learning by defining addi-\ntional explanatory variables:\n\n$$r'_t = \\begin{cases} q_t & \\text{if } q_t = p_t \\\\ 0 & \\text{otherwise (after mean correction)} \\end{cases} \\quad \\text{(common)}$$\n$$q^*_t = q_t - r'_t, \\qquad p^*_t = p_t - r'_t \\quad \\text{(unique)}$$\n\nWhile $r'$ represents the cases where the two systems use the same reward observation, $q^*$ and $p^*$ are\nthe residual rewards unique to each learning system.\nWe applied a mixed-effects logistic regression model using glmer [23] to predict \u2018stay\u2019 ($w_t = 0$)\ntrials. Any regressors of interest were mean-corrected before being entered into the design. Any\ntrial in which one of the variables was undefined (e.g., the first visit to a state) was omitted. Also,\nwe required that subjects have at least 50 (10%) switch trials to be included.\nFirst we examined the main effects with a regression including fixed effects of interest for $r$, $r'$, $q^*$,\n$p^*$, and random effects of no interest for $r$, $q$, and $p$ (without covariances). Next, we ran a regression\nadding all the interactions between the condition variables ($\\sigma$, $\\varepsilon$) and the specific reward effects ($q^*$,\n$p^*$). Finally, we additionally included the interaction between change in reward on the previous trial\n($d$) and the specific reward effects.\n\n4.2 Results\n\nA total of 5 subjects failed to meet the inclusion criterion of 50 switch trials (in each case because\nthey pressed the same button on almost all trials), leaving 500 decision trials from each of 50 sub-\njects. Subjects were observed to switch on 143 \u00b1 55 trials (mean \u00b1 1 SD). 
As designed, there were\nan average of 151 \u00b1 17 \u2018jump\u2019 trials per subject. The number of trials in which TD and model-based\ndisagreed as to the most recent relevant sample of the next-state reward ($r' = 0$) was 243 \u00b1 26, and\nfor 181 \u00b1 19 of these, it was due to a more recent visit to the next state. The results of the regressions\nare shown in Table 1.\nBeyond the trivial effects of perseveration and reward, subjects showed a substantial amount of TD-\ntype learning ($q^* > 0$), and a smaller but significant amount of model-based lookahead ($p^* > 0$).\nThe interactions of these effects by condition demonstrated that subjects in higher drift conditions\nshowed significantly less TD ($\\sigma \\times q^* < 0$) but unreduced model-based learning ($\\sigma \\times p^*$), possibly due\nto the relative disadvantage of TD with increased drift. Similarly, higher noise conditions showed\ndecreased model-based effects ($\\varepsilon \\times p^* < 0$) and no change in TD, which may be driven by the\ndecreasing advantage of MB. Note that, since the (nonsignificant) trend on the unaffected variable is\npositive, it is unlikely that either interaction effect results from a nonspecific change in performance\nor the \u201cnoisiness\u201d of choices. Both of these effects are consistent with the pattern of differential\nreliance predicted by the theoretical analysis. 
The effect of change on the previous trial ($d$) provides\none hint as to how subjects may adjust their reliance on either system dynamically: higher changes\nare indicative of noisier environments which are thus expected to promote TD learning.\n\nTable 1: Behavioral effects from nested regressions (each including preceding groups)\n\nvariable  effects  z  p  description\nconstant  mixed  11.61  \u21d1 $10^{-29}$  perseveration\n$r$  mixed  14.99  \u21d1 $10^{-49}$  last immediate r\n$r'$  mixed  5.55  \u21d1 $10^{-7}$  common next r\n$q^*$  mixed  6.40  \u21d1 $10^{-9}$  TD(1) next-step r\n$p^*$  mixed  2.51  \u2191 0.012  model predicted r\n$\\sigma \\times q^*$  fixed  -4.07  \u21d3 0.00005  TD with change\n$\\sigma \\times p^*$  fixed  0.67  0.50  model with change\n$\\varepsilon \\times q^*$  fixed  0.99  0.32  TD with noise\n$\\varepsilon \\times p^*$  fixed  -2.11  \u2193 0.035  model with noise\n$d \\times q^*$  mixed  1.63  0.10  TD after change\n$d \\times p^*$  mixed  -3.06  \u2193 0.0022  model after change\n\n5 Discussion\n\nWe have shown that humans systematically adjust their reliance on learning approaches according\nto the statistics of the task, in a way qualitatively consistent with the theoretical considerations\npresented. Model-based methods, while always superior to TD in terms of performance, have the\nlargest advantage in the presence of change paired with low environmental noise, because the Monte\nCarlo sampling strategy of TD interferes with tracking fast change. If the additional costs of model-\nbased computation are fixed, this would motivate employing the system only in the regime where\nits advantage was most pronounced [18]. 
Consistent with this, human behavior exhibited relatively\nlarger use of model-based RL with increased reward volatility and lesser use of it with increased\nobservation noise.\nOf course, increasing either the volatility or noise parameters makes the task harder, and a decline in\nthe marker for either sort of learning, as we observed, implies an overall decrement in performance.\nHowever, as the decrement was speci\ufb01c to one or the other explanatory variable, this may also be\ninterpreted as a relative increase in use of the unaffected strategy. It is also worth noting that the\nlinearized regression analysis examines only the effect of the most recent rewards, and the weighting\nof those relative to earlier samples will depend on the learning rate [22]. Thus a decrease in learning\nrate for either system may be confounded with a decrease in the strength of its effect in our analysis.\nHowever, while the optimal learning rates are also predicted to differ between conditions, these\npredictions are common to both systems, and it seems unlikely that each would differentially adjust\nits learning rate in response to a different manipulation.\nThe characteristics associated with these learning systems in psychology can be seen as conse-\nquences of the relative strengths of model-based and model-free learning. If the model-based system\nis most useful in conditions of low noise and high volatility, then the appropriate learning rates for\nsuch a system are large: there is less need and utility to take multiple samples for the purpose of\naveraging. In this case of a high learning rate, model-based learning is closely aligned with single-\nshot episodic encoding, possibly subsuming such a system [17], as well as with learning categorical,\nverbalizable rules in the psychological sense, rather than averages. 
This may also explain the selec-\ntive engagement of putatively model-based brain regions such as the dorsolateral prefrontal cortex\nin tasks with less stochastic outcomes [24]. Finally, this relates indirectly to the well known phe-\nnomenon whereby dominance shifts from the model-based to the model-free controller with over-\ntraining: a model-based system dominates early not simply because it learns faster, but because it is\ncapable of better learning with fewer trials.\nThe speci\ufb01c advantage of high learning rates may well motivate the brain to use a restricted model-\nbased system, such as one with learning rate \ufb01xed to 1. Indeed (see Supplemental materials), this\nrestriction has little detriment on the system\u2019s advantage over TD in the circumstances where it\nwould be expected to be used, but causes drastic performance problems as observation noise in-\ncreases, since averaging over samples is then required. Such a limitation might have useful compu-\ntational advantages. Transition matrices learned this way, for instance, will be sparse: just records\nof trajectories. Such matrices admit both compressed representations and more ef\ufb01cient planning al-\ngorithms (e.g., tree search) as, in the fully deterministic case, only one trajectory must be examined.\nConversely, evaluations in a model based system are extremely costly when transitions are highly\nstochastic, since averages must be computed over exponentially many paths, while they add no cost\nto model-free learning.\nAcknowledgments This work was supported by Award Number R01MH087882 from NIMH as part of the\nNSF/NIH CRCNS Program, and by a Scholar Award from the McKnight Foundation.\n\n8\n\n\fReferences\n[1] Bernard W. Balleine, Nathaniel D. Daw, and John P. O\u2019Doherty. Multiple forms of value learning and\nthe function of dopamine. In Paul W. Glimcher, Colin F. Camerer, Ernst Fehr, and Russell A. 
Poldrack,\neditors, Neuroeconomics: Decision Making and the Brain, chapter 24, pages 367\u2013387. Academic Press,\nLondon, 2008.\n\n[2] Antoine Bechara. Decision making, impulse control and loss of willpower to resist drugs: a neurocogni-\n\ntive perspective. Nat Neurosci, 8(11):1458\u201363, 2005.\n\n[3] Frederick Toates. The interaction of cognitive and stimulus-response processes in the control of behaviour.\n\nNeuroscience & Biobehavioral Reviews, 22(1):59\u201383, 1997.\n\n[4] Peter Dayan. Goal-directed control and its antipodes. Neural Netw, 22:213\u2013219, 2009.\n[5] Neal Schmitt, Bryan W. Coyle, and Larry King. Feedback and task predictability as determinants of\nperformance in multiple cue probability learning tasks. Organ Behav Hum Perform, 16(2):388\u2013402,\n1976.\n\n[6] Berndt Brehmer and Jan Kuylenstierna. Task information and performance in probabilistic inference\n\ntasks. Organ Behav Hum Perform, 22:445\u2013464, 1978.\n\n[7] B J Knowlton, L R Squire, and M A Gluck. Probabilistic classi\ufb01cation learning in amnesia. Learn Mem,\n\n1(2):106\u2013120, 1994.\n\n[8] W. Todd Maddox and F. Gregory Ashby. Dissociating explicit and procedural-learning based systems of\n\nperceptual category learning. Behavioural Processes, 66(3):309\u2013332, 2004.\n\n[9] W. Todd Maddox, J. Vincent Filoteo, Kelli D. Hejl, and A. David Ing. Category number impacts\nrule-based but not information-integration category learning: Further evidence for dissociable category-\nlearning systems. J Exp Psychol Learn Mem Cogn, 30(1):227\u2013245, 2004.\n\n[10] R. A. Poldrack, J. Clark, E. J. Par\u00b4e-Blagoev, D. Shohamy, J. Creso Moyano, C. Myers, and M. A. Gluck.\n\nInteractive memory systems in the human brain. Nature, 414(6863):546\u2013550, 2001.\n\n[11] Bernard W. Balleine and Anthony Dickinson. Goal-directed instrumental action: contingency and incen-\n\ntive learning and their cortical substrates. Neuropharmacology, 37(4\u20135):407\u2013419, 1998.\n\n[12] Kenji Doya. 
What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Netw, 12(7\u20138):961\u2013974, 1999.\n\n[13] Nathaniel D. Daw, Yael Niv, and Peter Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci, 8(12):1704\u20131711, 2005.\n\n[14] Ben Seymour, John P. O\u2019Doherty, Peter Dayan, Martin Koltzenburg, Anthony K. Jones, Raymond J. Dolan, Karl J. Friston, and Richard S. Frackowiak. Temporal difference models describe higher-order learning in humans. Nature, 429(6992):664\u2013667, 2004.\n\n[15] John P. O\u2019Doherty, Peter Dayan, Johannes Schultz, Ralf Deichmann, Karl Friston, and Raymond J. Dolan. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304(5669):452\u2013454, 2004.\n\n[16] Adam Johnson and A. David Redish. Hippocampal replay contributes to within session learning in a temporal difference reinforcement learning model. Neural Netw, 18(9):1163\u20131171, 2005.\n\n[17] M\u00e1t\u00e9 Lengyel and Peter Dayan. Hippocampal contributions to control: The third way. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 889\u2013896. MIT Press, Cambridge, MA, 2008.\n\n[18] Mehdi Keramati, Amir Dezfouli, and Payam Piray. Speed/accuracy trade-off between the habitual and the goal-directed processes. PLoS Comput Biol, 7(5):e1002055, 2011.\n\n[19] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Michael S. Kearns, Sara A. Solla, and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, volume 11, pages 996\u20131002. MIT Press, Cambridge, MA, 1999.\n\n[20] R. E. Kalman. A new approach to linear filtering and prediction problems. J Basic Eng, 82(1):35\u201345, 1960.\n\n[21] Nathaniel D. Daw, S. J. Gershman, B. Seymour, P. Dayan, and R. J. 
Dolan. Model-based influences on humans\u2019 choices and striatal prediction errors. Neuron, 69(6):1204\u20131215, 2011.\n\n[22] Brian Lau and Paul W. Glimcher. Dynamic response-by-response models of matching behavior in rhesus monkeys. J Exp Anal Behav, 84(3):555\u2013579, 2005.\n\n[23] Douglas Bates, Martin Maechler, and Ben Bolker. lme4: Linear mixed-effects models using S4 classes, 2011. R package version 0.999375-39.\n\n[24] Saori C. Tanaka, Kazuyuki Samejima, Go Okada, Kazutaka Ueda, Yasumasa Okamoto, Shigeto Yamawaki, and Kenji Doya. Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics. Neural Netw, 19(8):1233\u20131241, 2006.\n", "award": [], "sourceid": 104, "authors": [{"given_name": "Dylan", "family_name": "Simon", "institution": null}, {"given_name": "Nathaniel", "family_name": "Daw", "institution": null}]}