{"title": "Structure Learning in Human Sequential Decision-Making", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 8, "abstract": "We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure.", "full_text": "Structure Learning in Human Sequential\n\nDecision-Making\n\nDaniel Acu\u02dcna\n\nDept. of Computer Science and Eng.\nUniversity of Minnesota\u2013Twin Cities\n\nacuna002@umn.edu\n\nPaul Schrater\n\nDept. of Psychology and Computer Science and Eng.\n\nUniversity of Minnesota\u2013Twin Cities\n\nschrater@umn.edu\n\nAbstract\n\nWe use graphical models and structure learning to explore how people learn poli-\ncies in sequential decision making tasks. Studies of sequential decision-making\nin humans frequently \ufb01nd suboptimal performance relative to an ideal actor that\nknows the graph model that generates reward in the environment. We argue that\nthe learning problem humans face also involves learning the graph structure for re-\nward generation in the environment. 
We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure.\n\n1 Introduction\n\nHumans daily perform sequential decision-making under uncertainty to choose products, services, careers, and jobs; and to mate and survive as a species. One of the central problems in sequential decision making under uncertainty is balancing exploration and exploitation in the search for good policies. Using model-based (Bayesian) Reinforcement Learning [1], it is possible to solve this problem optimally by finding policies that maximize the expected discounted future reward [2]. However, solutions are notoriously hard to compute, and it is unclear whether optimal models are appropriate for human decision-making. For tasks simple enough to allow comparison between human behavior and normative theory, like the multi-armed bandit problem, human choices appear suboptimal. In particular, earlier studies suggested human choices reflect inaccurate Bayesian updating with suboptimalities in exploration [3, 4, 5, 6]. Moreover, in one-armed bandit tasks where exploration is not necessary, people frequently converge to probability matching [7, 8, 9, 10], rather than the better option, even when subjects are aware which option is best [11]. However, failures against normative prediction may reflect optimal decision-making, but for a task that differs from the experimenter\u2019s intention. 
For example, people may assume the environment is potentially dynamically varying. When this assumption is built into normative predictions, these models account much better for human choices in one-armed bandit problems [12], and potentially multi-armed problems [13]. In this paper, we investigate another possibility, that humans may be learning the structure of the task by forming beliefs over a space of canonical causal models of reward-action contingencies.\nMost human performance assessments view the subject\u2019s task as parameter estimation (e.g., reward probabilities) within a known model (a fixed causal graph structure) that encodes the relations between environmental states, rewards, and actions created by the experimenter. However, despite instruction, it is reasonable that subjects may be uncertain about the model, and instead try to learn it. To illustrate structure learning in a simple task, suppose you are alone in a casino with many rooms. In one room you find two slot machines. It is typically assumed you know the machines are independent and give rewards either 0 (failure) or 1 (success) with unknown probabilities that must be estimated. The structure learning viewpoint allows for more possibilities: Are they independent, or are they rigged to covary? Do they have the same probability? Does reward accrue when the machine is not played for a while? We believe answers to these questions form a natural set of causal hypotheses about how reward/action contingencies may occur in natural environments.\nIn this work, we assess the effect of uncertainty between two critical reward structures in terms of the need to explore. The first structure is a one-arm bandit problem in which exploration is not necessary (reward generation is coupled across arms); greedy action is optimal [14]. 
The other structure is a two-arm bandit problem in which exploration is necessary (reward generation is independent at each arm); each action needs to balance the exploration/exploitation tradeoff [15]. We illustrate how structure learning affects action selection and the value of information gathering in a simple sequential choice task resembling a Multi-armed Bandit (MAB), but with uncertainty between the two previous models of reward coupling. We develop a normative model of learning and action for this class of problems, illustrate the effect of model uncertainty on action selection, and show evidence that people perform structure learning.\n\n2 Bayesian Reinforcement Learning: Structure Learning\n\nThe language of graphical models provides a useful framework for describing the possible structure of rewards in the environment. Consider an environment with several distinct reward sites that can be sampled, but where the model that generates these rewards is unknown. In particular, rewards at each site may be independent, or there may be a latent cause which accounts for the presence of rewards at both sites. Even if independent, if the reward sites are homogeneous, then they may have the same probability.\nUncertainty about which reward model is correct naturally produces a mixture as the appropriate learning model. This structure learning model is a special case of Bayesian Reinforcement Learning (BRL), where the states of the environment are the reward sites and the transitions between states are determined by the action of sampling a reward site. Uncertainty about reward dynamics and contingencies can be modeled by including within the belief state not only reward probabilities, but also the possibility of independent or coupled rewards. 
Then, the optimal balance of exploration and exploitation in BRL results in action selection that seeks to maximize (1) expected rewards, (2) information about reward dynamics, and (3) information about task structure.\nGiven that the tasks tested in this research involve mixtures of Multi-Armed Bandit (MAB) problems, we borrow MAB language to call a reward site an arm, and a sample a choice or pull. However, the mixture models we describe are not MAB problems. MAB problems require that the dynamics of one site (arm) remain frozen until visited again, which is not true in general for our mixture model.\nLet \u03b3 (0 < \u03b3 < 1) be a discounting factor such that a possibly stochastic reward x obtained t time steps in the future means \u03b3^t x today. Optimality requires an action selection policy that maximizes the expectation over the total discounted future reward E_b[x + \u03b3x + \u03b3^2 x + . . .], where b is the belief over environment dynamics. Let x_a be a reward acquired from arm a. After observing reward x_a, we compute a belief state posterior b^{x_a} \u2261 p(b|x_a) \u221d p(x_a|b)p(b). Let f(x_a|b) \u2261 \u222b db p(x_a|b)p(b) be the predicted probability of reward x_a given belief b. Let r(b,a) \u2261 \u2211_{x_a} x_a f(x_a|b) be the expected reward of sampling arm a at state b. The value of a state can be found using the Bellman equation [2],\n\nV(b) = max_a { r(b,a) + \u03b3 \u2211_{x_a} f(x_a|b) V(b^{x_a}) }    (1)\n\nThe optimal action can be recovered by choosing arm\n\na = argmax_{a'} { r(b,a') + \u03b3 \u2211_{x_{a'}} f(x_{a'}|b) V(b^{x_{a'}}) }    (2)\n\nThe belief over dynamics is effectively a probability distribution over possible Markov Decision Processes that would explain observables. As such, the optimal policy can be described as a mapping from belief states to actions. 
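As a concrete illustration, the recursion in (1)-(2) can be specialized to two independent Bernoulli arms with Beta beliefs. The following is a minimal sketch, not the implementation used in the paper; the truncation horizon and the memoization scheme are assumptions of the sketch:

```python
from functools import lru_cache

GAMMA = 0.98   # discount factor, matching the value used later in the paper
HORIZON = 25   # truncation horizon; an assumption of this sketch

@lru_cache(maxsize=None)
def q_value(a1, b1, a2, b2, arm, steps):
    # One branch of Eq. (1): r(b, a) + gamma * sum_x f(x|b) V(b^x),
    # for Bernoulli arms with independent Beta(alpha, beta) beliefs.
    a, b = (a1, b1) if arm == 1 else (a2, b2)
    p = a / (a + b)   # predicted probability of reward, f(x = 1 | b)
    if steps == 0:
        return p
    if arm == 1:
        v_succ = value(a1 + 1, b1, a2, b2, steps - 1)  # success: alpha_1 + 1
        v_fail = value(a1, b1 + 1, a2, b2, steps - 1)  # failure: beta_1 + 1
    else:
        v_succ = value(a1, b1, a2 + 1, b2, steps - 1)
        v_fail = value(a1, b1, a2, b2 + 1, steps - 1)
    return p + GAMMA * (p * v_succ + (1 - p) * v_fail)

def value(a1, b1, a2, b2, steps=HORIZON):
    # V(b) of Eq. (1): the best q-value over the two arms.
    return max(q_value(a1, b1, a2, b2, 1, steps),
               q_value(a1, b1, a2, b2, 2, steps))

def best_arm(a1, b1, a2, b2):
    # Eq. (2): the arm achieving the maximum.
    q1 = q_value(a1, b1, a2, b2, 1, HORIZON)
    q2 = q_value(a1, b1, a2, b2, 2, HORIZON)
    return 1 if q1 >= q2 else 2
```

Truncating at a finite horizon is what makes the recursion computable; the horizon only needs to be large enough that \u03b3^H is negligible.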
In principle, the optimal solution can be found by solving the Bellman optimality equations, but generally there are countably or uncountably infinitely many states and solutions need approximations.\n\n(a) 2-arm bandit with no coupling (b) 1-arm, reward coupling (c) Mixture of generative models\n\nFigure 1: Different graphical models for generation of rewards at two known sites in the environment. The agent faces M bandit tasks each comprising a random number N of choices. (a) Reward sites are independent. (b) Rewards are dependent within a bandit task. (c) Mixture of generative models used by the learning model. The causes of reward may be independent or coupled. The node c acts as an \u201cXOR\u201d switch between coupled and independent reward.\n\nIn Figure 1, we show the two reward structures considered in this paper. Figure 1(a) illustrates a structure where arms are independent and (b) coupled. When independent, rewards x_a at arm a are samples from an unknown distribution p(x_a|\u03b8_a). When coupled, reward x_a depends on a \u201chidden\u201d state of reward x_3 sampled from p(x_3|\u03b8_3). In this case, the rewards x_1 and x_2 are coupled and depend on x_3.\nIf we were certain which of the two models was right, the action selection problem would have a known solution for both cases, presented below.\nIndependent Rewards. Learning and acting in an environment like the one described in Figure 1(a) is known as the Multi-Armed Bandit (MAB) problem. The MAB problem is a special case of BRL because we can partition the belief b into a disjoint set of beliefs about each arm {b_a}. Because beliefs about non-sampled arms remain frozen until sampled again and sampling one arm doesn\u2019t affect the belief about any other, independent learning and action selection for each arm is possible. 
Let \u03bb_a be the reward of a deterministic arm in V(b_a) = max{\u03bb_a/(1 \u2212 \u03b3), r(b_a,a) + \u03b3 \u2211 f(x_a|b_a)V(b^{x_a})} such that both terms inside the maximization are equal. Gittins [16] proved that it is optimal to choose the arm a with the highest such reward \u03bb_a (called the Gittins Index). This allows speedup of computation by transforming a many-arm bandit problem to many 2-arm bandit problems.\nIn our task, the belief about a binary reward may be represented by a Beta distribution with sufficient statistics parameters \u03b1, \u03b2 (both > 0) such that x_a \u223c p(x_a|\u03b8_a) = \u03b8_a^{x_a}(1 \u2212 \u03b8_a)^{1\u2212x_a}, where \u03b8_a \u223c p(\u03b8_a; \u03b1_a, \u03b2_a) \u221d \u03b8_a^{\u03b1_a\u22121}(1 \u2212 \u03b8_a)^{\u03b2_a\u22121}. Thus, the expected reward r(\u03b1_a, \u03b2_a, a) and the predicted probability of reward f(x_a = 1|\u03b1_a, \u03b2_a) are \u03b1_a(\u03b1_a + \u03b2_a)^{\u22121}. The belief state transition is b^{x_a} = \u27e8\u03b1_a + x_a, \u03b2_a + 1 \u2212 x_a\u27e9. Therefore, the Gittins index may be found by solving the Bellman equations using dynamic programming, V(\u03b1_a, \u03b2_a) = max{\u03bb_a(1 \u2212 \u03b3)^{\u22121}, (\u03b1_a + \u03b2_a)^{\u22121}[\u03b1_a + \u03b3(\u03b1_a V(\u03b1_a + 1, \u03b2_a) + \u03b2_a V(\u03b1_a, \u03b2_a + 1))]}, to a sufficiently large horizon. In experiments, we use \u03b3 = 0.98, for which a horizon of H = 1000 suffices.\nCoupled Rewards. Learning and acting in coupled environments (Figure 1b) is trivial because there is no need to maximize information in acting [14]. The belief state is represented by a Beta distribution with sufficient statistics \u03b1_3, \u03b2_3 (> 0). 
Therefore, the optimal action is to choose the arm a with the highest expected reward\n\nr(\u03b1_3, \u03b2_3, a) =\n\u03b1_3/(\u03b1_3 + \u03b2_3)    a = 1\n\u03b2_3/(\u03b1_3 + \u03b2_3)    a = 2\n\nThe belief state transitions are b^1 = \u27e8\u03b1_3 + x_1, \u03b2_3 + 1 \u2212 x_1\u27e9 and b^2 = \u27e8\u03b1_3 + 1 \u2212 x_2, \u03b2_3 + x_2\u27e9.\n\n3 Learning and acting with model uncertainty\n\nIn this section, we consider the case where there is uncertainty about the reward model. The agent\u2019s belief is captured by a graphical model for a family of reward structures that may or may not be coupled. We show that learning can be accurate and that action selection is relatively efficient.\nWe restrict ourselves to the following scenario. The agent is presented with a block of M bandit tasks, each with initially unknown Bernoulli reward probabilities and coupling. Each task involves N discrete choices, where N is sampled from a Geometric distribution (1 \u2212 \u03b3)\u03b3^N.\nFigure 1(c) shows the mixture of the two possible reward models shown in Figure 1(a) and (b). Node c switches the mixture between the two possible reward models and encodes part of the belief state of the process. Notice that c acts as an \u201cXOR\u201d gate between the two generative models. Given that it is unknown, the probability p(c = 0) is the mixing proportion for the independent reward structure and p(c = 1) is the mixing proportion for the coupled reward structure. We put a prior on the state c using the distribution p(c; \u03c6) = \u03c6^c(1 \u2212 \u03c6)^{1\u2212c}, with parameter \u03c6. 
The posterior is\n\np(\u03b8_1, \u03b8_2, \u03b8_3, c | s_1, f_1, s_2, f_2) \u221d\n(1 \u2212 \u03c6) \u00d7 \u03b8_1^{\u03b1_1\u22121+s_1}(1 \u2212 \u03b8_1)^{\u03b2_1\u22121+f_1} \u03b8_2^{\u03b1_2\u22121+s_2}(1 \u2212 \u03b8_2)^{\u03b2_2\u22121+f_2} \u03b8_3^{\u03b1_3\u22121}(1 \u2212 \u03b8_3)^{\u03b2_3\u22121}    c = 0\n\u03c6 \u00d7 \u03b8_1^{\u03b1_1\u22121}(1 \u2212 \u03b8_1)^{\u03b2_1\u22121} \u03b8_2^{\u03b1_2\u22121}(1 \u2212 \u03b8_2)^{\u03b2_2\u22121} \u03b8_3^{\u03b1_3\u22121+s_1+f_2}(1 \u2212 \u03b8_3)^{\u03b2_3\u22121+s_2+f_1}    c = 1    (3)\n\nwhere s_a and f_a are the numbers of successes and failures observed at arm a. It is clear that the posterior (3) is a mixture of the beliefs on parameters \u03b8_j, for 1 \u2264 j \u2264 3. With mixing proportion \u03c6, successes of arm 1 and failures of arm 2 are attributed to successes on the shared \u201chidden\u201d arm 3, whereas failures of arm 1 and successes of arm 2 are attributed to failures of arm 3. On the other hand, the usual Beta-Bernoulli learning of independent arms happens with mixing proportion 1 \u2212 \u03c6.\nAt the beginning of each bandit task, we assume the agent \u201cresets\u201d its belief about arms (s_i = f_i = 0), but the posterior over p(c) is carried over and used as the prior on the next bandit task. Let Beta(\u03b1, \u03b2) be the Beta function. The marginal posterior on c is as follows\n\np(c | s_1, f_1, s_2, f_2) \u221d\n(1 \u2212 \u03c6) Beta(\u03b1_1+s_1, \u03b2_1+f_1) Beta(\u03b1_2+s_2, \u03b2_2+f_2) / [Beta(\u03b1_1, \u03b2_1) Beta(\u03b1_2, \u03b2_2)]    c = 0\n\u03c6 Beta(\u03b1_3+s_1+f_2, \u03b2_3+f_1+s_2) / Beta(\u03b1_3, \u03b2_3)    c = 1    (4)\n\nThe belief state of this process may be completely represented by b = \u27e8s_1, f_1, s_2, f_2; \u03c6, \u03b1_1, \u03b2_1, \u03b1_2, \u03b2_2, \u03b1_3, \u03b2_3\u27e9. The predicted probabilities of reward x_1 and x_2 are:\n\nf(x_1 | s_1, f_1, s_2, f_2) =\n(1 \u2212 \u03c6)(\u03b1_1+s_1)/(\u03b1_1+s_1+\u03b2_1+f_1) + \u03c6(\u03b1_3+s_1+f_2)/(\u03b1_3+s_1+f_2+\u03b2_3+s_2+f_1)    x_1 = 1\n(1 \u2212 \u03c6)(\u03b2_1+f_1)/(\u03b1_1+s_1+\u03b2_1+f_1) + \u03c6(\u03b2_3+s_2+f_1)/(\u03b1_3+s_1+f_2+\u03b2_3+s_2+f_1)    x_1 = 0    (5)\n\nand similarly\n\nf(x_2 | s_1, f_1, s_2, f_2) =\n(1 \u2212 \u03c6)(\u03b1_2+s_2)/(\u03b1_2+s_2+\u03b2_2+f_2) + \u03c6(\u03b2_3+s_2+f_1)/(\u03b1_3+s_1+f_2+\u03b2_3+s_2+f_1)    x_2 = 1\n(1 \u2212 \u03c6)(\u03b2_2+f_2)/(\u03b1_2+s_2+\u03b2_2+f_2) + \u03c6(\u03b1_3+s_1+f_2)/(\u03b1_3+s_1+f_2+\u03b2_3+s_2+f_1)    x_2 = 0\n\nLet us drop the prior parameters \u03b1_j, \u03b2_j, 1 \u2264 j \u2264 3, and \u03c6 from b. Action selection involves solving the following Bellman equations\n\nV(s_1, f_1, s_2, f_2) = max_{a=1,2}\nr(b,1) + \u03b3[f(x_1 = 0|b)V(s_1, f_1+1, s_2, f_2) + f(x_1 = 1|b)V(s_1+1, f_1, s_2, f_2)]    a = 1\nr(b,2) + \u03b3[f(x_2 = 0|b)V(s_1, f_1, s_2, f_2+1) + f(x_2 = 1|b)V(s_1, f_1, s_2+1, f_2)]    a = 2    (6)\n\n(a) Learning in coupled environment (b) Learning in independent environment\n\nFigure 2: Learning example. A block of four bandit tasks of 50 trials each for each environment. Marginal beliefs on reward probabilities and coupling are shown as functions of time. The brightness indicates the relative probability mass. The coupling belief distribution starts uniform with \u03c6 = 0.5 and is not reset within a block. The priors p(\u03b8_i; \u03b1_i, \u03b2_i) are reset at the beginning of each task with \u03b1_i, \u03b2_i = 1 (1 \u2264 i \u2264 3). 
Note that how well the reward probabilities sum to one forms critical evidence for or against coupling.\n\nTo obtain (6) using dynamic programming for a horizon H, there will be a total of (1/24)(1 + H)(2 + H)(3 + H)(4 + H) computations, which represent the different occurrences of s_i, f_i out of the 4^H possible histories of rewards. This dramatic reduction allows us to be relatively accurate in our approximation to the optimal value of an action.\n\n4 Simulation Results\n\nIn Figure 2, we perform simulations of learning on blocks of four bandit tasks, each comprising 50 trials. In one simulation (a), rewards are coupled, and in the other (b), independent. Note that the model learns quickly in both cases, but it is slower when the task is truly coupled because fewer cases support this hypothesis (when compared to the independent hypothesis).\nThe importance of the belief on the coupling parameter is that it has a decisive influence on exploratory behavior. Coupling between the two arms corresponds to the case where one arm is a winner and the other is a loser by experimenter design. When playing coupled arms, evidence that an arm is \u201cgood\u201d (e.g. > 0.5) necessarily entails the other is \u201cbad\u201d, and hence eliminates the need for exploratory behavior; the optimal choice is to \u201cstick with the winner\u201d, and switch when the probability estimate dips below 0.5. An agent learning a coupling parameter while sampling arms can manifest a range of exploratory behaviors that depend critically on both the recent reward history and the current state of the belief about c, illustrated in Figure 3. The top row shows the value of both arms as a function of the coupling belief p(c) after different amounts of evidence for the success of arm 2. The plots show that optimal actions stick with the winner when belief in coupling is high, even for small amounts of data. 
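The behavior just described is driven by Eqs. (5) and (6). The following is a minimal sketch of the mixture predictive and the finite-horizon dynamic program; holding the mixing weight \u03c6 fixed inside the recursion and truncating the horizon are simplifications of this sketch, not the paper\u2019s method:

```python
from functools import lru_cache

GAMMA = 0.98   # discount factor (the paper's value)
PHI = 0.5      # mixing weight p(c = 1); held fixed in this sketch
# Beta prior parameters, reset to 1 at the start of each task in the paper.
A1 = B1 = A2 = B2 = A3 = B3 = 1.0

def predictive(arm, s1, f1, s2, f2):
    # f(x_a = 1 | b), Eq. (5): mixture of independent and coupled predictions.
    n3 = A3 + s1 + f2 + B3 + s2 + f1
    if arm == 1:
        ind = (A1 + s1) / (A1 + s1 + B1 + f1)
        cpl = (A3 + s1 + f2) / n3   # arm-1 success = hidden-arm success
    else:
        ind = (A2 + s2) / (A2 + s2 + B2 + f2)
        cpl = (B3 + s2 + f1) / n3   # arm-2 success = hidden-arm failure
    return (1 - PHI) * ind + PHI * cpl

@lru_cache(maxsize=None)
def value(s1, f1, s2, f2, steps):
    # V(s1, f1, s2, f2), Eq. (6), by finite-horizon dynamic programming.
    if steps == 0:
        return 0.0
    best = 0.0
    for arm in (1, 2):
        p = predictive(arm, s1, f1, s2, f2)
        if arm == 1:
            succ = value(s1 + 1, f1, s2, f2, steps - 1)
            fail = value(s1, f1 + 1, s2, f2, steps - 1)
        else:
            succ = value(s1, f1, s2 + 1, f2, steps - 1)
            fail = value(s1, f1, s2, f2 + 1, steps - 1)
        best = max(best, p + GAMMA * (p * succ + (1 - p) * fail))
    return best
```

For example, with coupled-looking evidence (s_1, f_1, s_2, f_2) = (5, 0, 0, 5), the arm-1 predictive mixes the independent estimate 6/7 with the coupled estimate 11/12.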
Thus belief in coupling produces underexploration compared to a model assuming independence, and generates behavior similar to a \u201cwin stay, lose switch\u201d heuristic early in learning. However, overexploration can also occur when the expected values of both arms are similar. Figure 3 (lower left) shows that uncertainty about c provides an exploratory bonus to the lower-probability arm which incentivizes switching, and hence overexploration. In fact, when the difference in probability between arms is small, action selection can fail to converge to the better option. Figure 3 (right) shows that p(c) together with the probability of the better arm determines the transition between exploration and exploitation. These results show that optimal action selection with model uncertainty can generate several kinds of behavior typically labeled suboptimal in multi-armed bandit experiments. Next we provide evidence that people are capable of learning and exploiting coupling\u2013evidence that structure learning may play a role in apparent failures of humans to behave optimally in multi-armed bandit tasks.\n\nFigure 3: Value of arms as a function of coupling. The priors are uniform (\u03b1_j = \u03b2_j = 1, 1 \u2264 j \u2264 3), the evidence for arm 1 remains fixed for all cases (s_1 = 1, f_1 = 0), and successes of arm 2 remain fixed as well (s_2 = 5). Failures for arm 2 (f_2) vary from 1 to 6. Upper left: Belief that arms are coupled (p(c)) versus reward per unit time (V(1 \u2212 \u03b3), where V is the value) of arm 1 (dashed line) and arm 2 (solid line). In all cases, an independent model would choose arm 1 to pull. Vertical line shows the critical coupling belief value where the structure learning model switches to exploitative behavior. 
Lower left: Exploratory bonus (V(1 \u2212 \u03b3) \u2212 r, where r is the expected reward) for each arm. Right panel: Critical coupling belief values for exploitative behavior vs. the expected probability of reward of arm 2. Individual points correspond to different information states (successes and failures on both arms).\n\n5 Human Experiments\n\nEach of 16 subjects ran on 32 bandit tasks: a block of 16 in an independent environment and a block of 16 coupled. Within blocks, the presentation order was randomized, and the order of the coupled environment was randomized across subjects. On average each task required 48 pulls. For the independent environment, the subjects made 1194 choices across the 16 tasks, and 925 for the coupled environment.\nEach arm is shown on the screen as a slot machine. Subjects pull a machine by pressing a key on the keyboard. When pulled, an animation of the lever is shown, 200 msec later the reward appears on the machine\u2019s screen, and a sound mimicking dropping coins lasts for a time proportional to the amount gathered. We provide several cues, some redundant, to help subjects keep track of previous rewards. At the top, the machine shows the number of pulls, total reward, and average reward per pull so far. Instead of binary rewards 0 and 1, the task presented 0 and 100. The machine\u2019s screen changes color according to the average reward, from red (zero points), through yellow (fifty points), to green (one hundred points). The machine\u2019s total reward is shown as a pile of coins underneath it. The total score, total pulls, and rankings within a game were presented.\n\n6 Results\n\nWe analyze how task uncertainty affects decisions by comparing human behavior to that of the optimal model and models that assume a fixed structure. For each agent, be it human or not, we compute the (empirical) probability that it selects the oracle-best action versus the optimal belief that a block of tasks is coupled. 
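The \u201coptimal belief that a block of tasks is coupled\u201d is the marginal posterior of Eq. (4). A minimal sketch, assuming the uniform Beta priors and \u03c6 = 0.5 used in the simulations:

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function B(a, b).
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def p_coupled(s1, f1, s2, f2, phi=0.5,
              a1=1.0, b1=1.0, a2=1.0, b2=1.0, a3=1.0, b3=1.0):
    # Marginal posterior p(c = 1 | s1, f1, s2, f2), Eq. (4),
    # computed in log space for numerical stability.
    log_ind = (log_beta(a1 + s1, b1 + f1) - log_beta(a1, b1)
               + log_beta(a2 + s2, b2 + f2) - log_beta(a2, b2))
    log_cpl = log_beta(a3 + s1 + f2, b3 + f1 + s2) - log_beta(a3, b3)
    w_ind = (1.0 - phi) * exp(log_ind)
    w_cpl = phi * exp(log_cpl)
    return w_cpl / (w_ind + w_cpl)
```

Starting from no data this returns 0.5; anticorrelated counts such as (s_1, f_1, s_2, f_2) = (10, 0, 0, 10) push it toward coupling, while matched counts such as (10, 0, 10, 0) push it toward independence.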
The idea behind this measure is to show how the belief on task structure changes behavior and which of the models better captures human behavior.\nWe run 1000 agents for each of the models: with task uncertainty (optimal model), assuming a coupled reward task (coupled model), and assuming an independent reward task (independent model), under the same conditions that subjects faced on both the blocks of coupled and independent tasks. For each of the decisions of these models and the 33904 decisions performed by the 16 subjects, we compute the optimal belief on coupling according to our model and bin the proportion of times the agent chooses the (oracle) best arm according to this belief. The results are summarized in Figure 4. The independent model tends to perform equally well on both coupled and independent reward tasks. The coupled model tends to perform well only in the coupled task and worse in the independent tasks. As expected, the optimal model has better overall performance, but does not perform better than the models with fixed task structure\u2014in their respective tasks\u2014because it pays the price of learning early in the block. The optimal model behaves like a mixture between the coupled and independent model. Human behavior is much better captured by the optimal model (Figure 4). This is evidence that human behavior shares the characteristics of the optimal model, namely, it contains task uncertainty and exploits knowledge of the task structure to maximize its gains. The gap in performance that exists between the optimal model and humans may be explained by memory limitations or more complicated task structures being entertained by subjects. Because the subjects are not told the coupling state of the environment and the arms appear as separate options, we conclude that people are capable of learning and exploiting task structure. Together these results suggest that structure learning may play a significant role in explaining differences between human behavior and previous normative predictions.\n\nFigure 4: Effect of coupling on behavior. For each of the decisions of subjects and simulated models under the same conditions, we compute the optimal belief on coupling according to the model proposed in this paper and bin the proportion of times an agent chooses the (oracle) best arm according to this belief. This plot represents the empirical probability that an agent would pick the best arm at a given belief on coupling.\n\n7 Conclusions and future directions\n\nWe have provided evidence that structure learning may be an important missing piece in evaluating human sequential decision making. The idea of modeling sequential decision making under uncertainty as a structure learning problem is a natural extension of previous work on structure learning in Bayesian models of cognition [17, 18] and animal learning [19] to sequential decision making problems under uncertainty. It also extends previous work on Bayesian approaches to modeling sequential decision making in the multi-armed bandit [20] by adding structure learning. It is important to note that we have intentionally focused on reward structure, ignoring issues involving dependencies across trials. Clearly reward structure learning must be integrated with learning about temporal dependencies [21].\nAlthough we focused on learning coupling between arms, there are other kinds of reward structure learning that may account for a broad variety of human decision making performance. 
In particular, allowing dependence between the probability of reward at a site and previous actions can produce large changes in decision making behavior. For instance, in a \u201cforaging\u201d model where reward is collected from a site and probabilistically replenished, optimal strategies will produce choice sequences that alternate between reward sites. Thus uncertainty about the dependence of reward on previous actions can produce a continuum of behavior, from maximization to probability matching. Note that structure learning explanations for probability matching are significantly different from explanations based on reinforcing previously successful actions (the \u201claw of effect\u201d) [22]. Instead of explaining behavior in terms of the idiosyncrasies of a learning rule, structure learning constitutes a fully rational response to uncertainty about the causal structure of rewards in the environment. We intend to test the predictive power of a range of structure learning ideas on experimental data we are currently collecting. Our hope is that, by expanding the range of normative hypotheses for human decision-making, we can begin to develop more principled accounts of human sequential decision-making behavior.\n\nAcknowledgements\n\nThe work was supported by NIH NPCS 1R90 DK71500-04, NIPS 2008 Travel Award, CONICYT-FIC-World Bank 05-DOCFIC-BANCO-01, ONR MURI N 00014-07-1-0937, and NIH EY02857.\n\nReferences\n\n[1] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In 23rd International Conference on Machine Learning, Pittsburgh, Penn., 2006.\n[2] Richard Ernest Bellman. Dynamic programming. Princeton University Press, Princeton, 1957.\n[3] Noah Gans, George Knox, and Rachel Croson. 
Simple models of discrete choice and their performance in bandit experiments. Manufacturing and Service Operations Management, 9(4):383\u2013408, 2007.\n[4] C. M. Anderson. Behavioral Models of Strategies in Multi-Armed Bandit Problems. PhD thesis, Pasadena, CA, 2001.\n[5] Jeffrey Banks, David Porter, and Mark Olson. An experimental analysis of the bandit problem. Economic Theory, 10(1):55\u201377, 1997.\n[6] R. J. Meyer and Y. Shi. Sequential choice under ambiguity: Intuitive solutions to the armed-bandit problem. Management Science, 41:817\u201383, 1995.\n[7] N. Vulkan. An economist\u2019s perspective on probability matching. Journal of Economic Surveys, 14:101\u2013118, 2000.\n[8] Yvonne Brackbill and Anthony Bravos. Supplementary report: The utility of correctly predicting infrequent events. Journal of Experimental Psychology, 64(6):648\u2013649, 1962.\n[9] W. Edwards. Probability learning in 1000 trials. Journal of Experimental Psychology, 62:385\u2013394, 1961.\n[10] W. Edwards. Reward probability, amount, and information as determiners of sequential two-alternative decisions. Journal of Experimental Psychology, 52(3):177\u201388, 1956.\n[11] E. Fantino and A. Esfandiari. Probability matching: Encouraging optimal responding in humans. Canadian Journal of Experimental Psychology, 56:58\u201363, 2002.\n[12] Timothy E. J. Behrens, Mark W. Woolrich, Mark E. Walton, and Matthew F. S. Rushworth. Learning the value of information in an uncertain world. Nature Neuroscience, 10(9):1214\u20131221, 2007.\n[13] N. D. Daw, J. P. O\u2019Doherty, P. Dayan, B. Seymour, and R. J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876\u2013879, 2006.\n[14] J. S. Banks and R. K. Sundaram. A class of bandit problems yielding myopic optimal strategies. Journal of Applied Probability, 29(3):625\u2013632, 1992.\n[15] John Gittins and You-Gan Wang. The learning component of dynamic allocation indices. 
The Annals of Statistics, 20(2):1626\u20131636, 1992.\n[16] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. Progress in Statistics, pages 241\u2013266, 1974.\n[17] Joshua B. Tenenbaum, Thomas L. Griffiths, and Charles Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7):309\u2013318, 2006.\n[18] Joshua B. Tenenbaum and Thomas L. Griffiths. Structure learning in human causal induction. NIPS 13, pages 59\u201365, 2000.\n[19] A. C. Courville, N. D. Daw, G. J. Gordon, and D. S. Touretzky. Model uncertainty in classical conditioning. Advances in Neural Information Processing Systems, (16):977\u2013986, 2004.\n[20] Daniel Acuna and Paul Schrater. Bayesian modeling of human sequential decision-making on the multi-armed bandit problem. In CogSci, 2008.\n[21] Michael D. Lee. A hierarchical Bayesian model of human decision-making on an optimal stopping problem. Cognitive Science: A Multidisciplinary Journal, 30:1\u201326, 2006.\n[22] Ido Erev and Alvin E. Roth. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. The American Economic Review, 88(4):848\u2013881, 1998.\n", "award": [], "sourceid": 758, "authors": [{"given_name": "Daniel", "family_name": "Acuna", "institution": null}, {"given_name": "Paul", "family_name": "Schrater", "institution": null}]}