{"title": "A Probabilistic Model of Social Decision Making based on Reward Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 2901, "page_last": 2909, "abstract": "A fundamental problem in cognitive neuroscience is how humans make decisions, act, and behave in relation to other humans. Here we adopt the hypothesis that when we are in an interactive social setting, our brains perform Bayesian inference of the intentions and cooperativeness of others using probabilistic representations. We employ the framework of partially observable Markov decision processes (POMDPs) to model human decision making in a social context, focusing specifically on the volunteer's dilemma in a version of the classic Public Goods Game. We show that the POMDP model explains both the behavior of subjects as well as neural activity recorded using fMRI during the game. The decisions of subjects can be modeled across all trials using two interpretable parameters. Furthermore, the expected reward predicted by the model for each subject was correlated with the activation of brain areas related to reward expectation in social interactions. Our results suggest a probabilistic basis for human social decision making within the framework of expected reward maximization.", "full_text": "A Probabilistic Model of Social Decision Making\n\nbased on Reward Maximization\n\nKoosha Khalvati\nDepartment of Computer Science\nUniversity of Washington\nSeattle, WA 98105\nkoosha@cs.washington.edu\n\nSeongmin A. Park\nInstitut des Sciences Cognitives Marc Jeannerod\nCNRS UMR 5229\nLyon, France\npark@isc.cnrs.fr\n\nJean-Claude Dreher\nInstitut des Sciences Cognitives Marc Jeannerod\nCNRS UMR 5229\nLyon, France\ndreher@isc.cnrs.fr\n\nRajesh P. N. 
Rao\n\nDepartment of Computer Science\n\nUniversity of Washington\n\nSeattle, WA 98195\n\nrao@cs.washington.edu\n\nAbstract\n\nA fundamental problem in cognitive neuroscience is how humans make decisions,\nact, and behave in relation to other humans. Here we adopt the hypothesis that when\nwe are in an interactive social setting, our brains perform Bayesian inference of the\nintentions and cooperativeness of others using probabilistic representations. We em-\nploy the framework of partially observable Markov decision processes (POMDPs)\nto model human decision making in a social context, focusing speci\ufb01cally on the\nvolunteer\u2019s dilemma in a version of the classic Public Goods Game. We show that\nthe POMDP model explains both the behavior of subjects as well as neural activity\nrecorded using fMRI during the game. The decisions of subjects can be modeled\nacross all trials using two interpretable parameters. Furthermore, the expected\nreward predicted by the model for each subject was correlated with the activation of\nbrain areas related to reward expectation in social interactions. Our results suggest\na probabilistic basis for human social decision making within the framework of\nexpected reward maximization.\n\n1\n\nIntroduction\n\nA long tradition of research in social psychology recognizes volunteering as the hallmark of human\naltruistic action, aimed at improving the survival of a group of individuals living together [15].\nVolunteering entails a dilemma wherein the optimal decision maximizing an individual\u2019s utility\ndiffers from the strategy which maximizes bene\ufb01ts to the group to which the individual belongs. The\n\"volunteer\u2019s dilemma\" characterizes everyday group decision-making whereby one or few volunteers\nare enough to bring common goods to the group [1, 6]. Examples of such volunteering include\nvigilance duty, serving on school boards or town councils, and donating blood. 
What makes the volunteer's dilemma challenging is not only that a lack of enough volunteers leads to no common goods being produced, but also that resources are wasted if more than the required number of group members volunteer. As a result, to achieve maximum utility in the volunteer's dilemma, each member must have a very good sense of others' intentions, in the absence of any\n\nThis work was supported by LABEX ANR-11-LABEX-0042, ANR-11-IDEX-0007, NSF-ANR 'Social_POMDP' and ANR BrainCHOICE n°14-CE13-0006 to J.-C. D., NSF grants EEC-1028725 and 1318733, ONR grant N000141310817, and CRCNS/NIMH grant 1R01MH112166-01.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: The computer screen that players see during one round of PGG.\n\ncommunication between them, before choosing their actions. To model such social decision making, one therefore needs a framework that can capture the uncertainty associated with the \"theory of mind\" of each player. In this paper, we tackle this problem by combining a probabilistic model with behavioral measures and fMRI recordings to provide an account of the decisions made under the volunteer's dilemma and of the underlying neural computations.\nThe Public Goods Game (PGG) is a classic paradigm in behavioral economics. It has previously been employed as a useful tool to study the neural mechanisms underlying group cooperation [2, 3, 4, 16]. Here we recast the PGG to investigate the volunteer's dilemma and conducted an experiment in which 30 subjects played a version of the PGG while their brain activity was simultaneously recorded using fMRI. 
We show how a probabilistic model based on partially observable Markov decision processes (POMDPs) with a simple and intuitive state model can explain our subjects' behavior. The normative model explains the behavior of the subjects in all trials of the game, including the first trial of each game, using only two parameters. The validity of the model is demonstrated by the correlation between the reward predicted by the model and the activation of brain areas implicated in reward expectation in social interactions. Moreover, the values of the parameters of our model are interpretable, and the differences among players and games can be explained by these parameters.\n\n2 Public Goods Game and Experimental Paradigm\n\nIn a Public Goods Game (PGG), N strangers make collective decisions together as a group. In the current study, we keep the number of members in a group constant at N = 5. No communication among members is allowed. The game is composed of 15 rounds of interactions with the same partners. At the beginning of each round, 1 monetary unit (MU) is endowed (E) to each of the N = 5 individuals. Each individual can choose between two decisions: contribution (c) or free-riding (f). When the participant makes a decision, the selected option is highlighted on the screen. The participant must make a decision within three seconds; otherwise a warning message appears and the trial is repeated. After all members of the group have made their decisions, a feedback screen is shown to all. Depending on the decisions of the group members, the public good is produced as the group reward (R = 2 MU) only if at least k individuals have contributed their resources (k = 2 or k = 4). The value of k is conveyed to group members before decision-making and is kept fixed within any single PGG. From the feedback screen, participants only know the number of other contributors, not individual members' decisions, which are represented by white icons. 
A yellow icon stands for the individual playing in the scanner and serves to track their decisions. Each PGG consists of a finite number of rounds of interaction (T = 15), and this is known to all participants. The computer screen in front of each player during one round of a PGG is shown in Figure 1. Each contribution has a cost (C = 1 MU). Therefore, the resulting MUs after one round are E − C + R = 2 MU for a contributor and E + R = 3 MU for a free-rider when the public good is produced (SUCCESS). On the other hand, a contributor has E − C = 0 MU and a free-rider has E = 1 MU when no public good is produced (FAILURE). Each participant plays 14 games. During the first 2 PGGs, they receive no feedback, while the following 12 PGGs provide social and monetary feedback as shown in Figure 1. Our analyses are from the 12 PGGs with feedback. Importantly, we inform participants before the experiment that their final monetary reward equals the result of one PGG randomly selected by the computer at the end of the study [23].\n\nWe recruited 30 right-handed subjects to participate in the Public Goods Game and make decisions in an fMRI scanner. Data from 29 participants (fourteen women; mean age 22.97 years, S.D. 1.99) were analyzed (one participant aborted the experiment due to anxiety). Based on self-reported questionnaires, none of our subjects had a history of neurological or psychiatric disorders. Each participant was told that they would play with 19 other participants located in another room; in actuality, a computer selected the actions instead of 19 others. Each action selected by our computer algorithm in any round is a probabilistic function of the participant's action in the previous round (a_i^{t−1}) and the simulated members' own previous actions (Σ_{j∈−i} a_j^{t−1}). Given the average contribution rate of the others, ā_{−i}^{t−1} = (Σ_{j∈−i} a_j^{t−1})/(N − 1), we have logit(ā_{−i}^t) = e0·a_i^{t−1} + e1(((1 − K^{T−t+1})/(1 − K))·e2^{ā_{−i}^{t−1}} − K), where K = k/N. This model has 3 free parameters: e0, e1, and e2. These were obtained by fitting the above function to the actual behavior of individuals in another PGG study [16]. This function is therefore a simulation of real individuals' behavior in a PGG. For the first round, we use the mean contribution rate of each subject as their fellow members' decision.\n\n3 Markov Decision Processes and POMDPs\n\nThe family of Markov decision processes (MDPs and POMDPs) provides a mathematical framework for decision making in stochastic environments [22]. A Markov decision process (MDP) is formally a tuple (S, A, T, R, γ): S is the set of states of the environment; A is the set of actions; T is the transition function giving P(s|s′, a), the probability of going from a state s′ to a state s after performing action a; R : S × A → ℝ is a bounded function determining the reward obtained after performing action a in state s; and γ is the discount factor, which we assume here is 1. Starting from an initial state s0, the goal is to find the sequence of actions that maximizes the expected discounted reward E[Σ_{t=0}^{∞} γ^t R(s_t, a_t)]. This sequence of actions is given by an optimal policy, which is a mapping from states to actions, π* : S → A, giving the best action at each state. The optimal policy can be computed by an efficient algorithm called value iteration [22]. MDPs assume that the current state is always fully observable to the agent. When this is not the case, a more general framework, known as partially observable Markov decision processes (POMDPs), can be used. In a POMDP, the agent reasons about the current state based on observations. 
Therefore, a POMDP can be regarded as an MDP with observations Z and an observation function O : Z × A × S → [0, 1], which determines P(z|a, s), the probability of observing z after performing action a in state s. In a POMDP, instead of knowing the current state s_t, the agent computes the belief state b_t, which is the posterior probability over states given all past observations and actions. The belief state can be updated as b_{t+1}(s) ∝ O(s, a_t, z_{t+1}) Σ_{s′} T(s′, s, a_t) b_t(s′). Consequently, the optimal policy of a POMDP is a mapping from belief states to actions, π* : B → A, where B = [0, 1]^{|S|}. One can easily see that a POMDP is an MDP whose states are belief states. As the belief state space is exponentially larger than the original state space, solving POMDPs is computationally very expensive (NP-hard [19]). Therefore, heuristic methods are used to approximate the optimal policy for a POMDP [11, 20]. In the case that the belief state can be expressed in closed form, e.g., as a Gaussian, one can solve the POMDP by considering it as an MDP whose state space is the POMDP's belief state space and performing the value iteration algorithm. We use this technique in our model.\n\n4 Model of the Game\n\nIn a PGG with N players and a known minimum number of required volunteers (k), the reward of a player, say player i, in each round is determined only by their action (free-riding (f) or contribution (c)) and the total number of contributors among the other players. We use the notation −i to represent all players except player i. We denote the action of each player as a and the reward of each player as r. The occurrence of an event is given by an indicator function I (for an event x, I(x) is equal to 1 if event x happens and 0 otherwise). 
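Before specializing to the game, the generic POMDP belief update from Section 3 can be sketched for a discrete state space. The toy transition and observation tables below are illustrative assumptions of ours, not part of the paper's model:

```python
def update_belief(belief, action, obs, T, O):
    """One step of the POMDP belief update:
    b_{t+1}(s) is proportional to O(s, a_t, z_{t+1}) * sum_{s'} T(s', s, a_t) * b_t(s').

    belief: dict mapping state -> probability.
    T: dict mapping (s_prev, s_next, action) -> transition probability.
    O: dict mapping (s_next, action, obs) -> observation probability.
    """
    unnormalized = {}
    for s in belief:
        predicted = sum(T.get((sp, s, action), 0.0) * belief[sp] for sp in belief)
        unnormalized[s] = O.get((s, action, obs), 0.0) * predicted
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

# Toy example (illustrative numbers): two hidden "group" states that persist
# across rounds; the observation "many" contributors is more likely when the
# group is cooperative.
T = {(s, s, "c"): 1.0 for s in ("coop", "defect")}
O = {("coop", "c", "many"): 0.8, ("defect", "c", "many"): 0.2}
b1 = update_belief({"coop": 0.5, "defect": 0.5}, "c", "many", T, O)
```

Starting from a uniform belief, a single observation favoring cooperation shifts the posterior toward the "coop" state by exactly the likelihood ratio.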
Then, the reward expected by player i at round t is:\n\nr_i^t = E[E − I(a_i^t = c)·C + I(Σ_{j=1}^{N} I(a_j^t = c) ≥ k)·R]\n = E[E − I(a_i^t = c)·C + I(I(a_i^t = c) + Σ_{j∈−i} I(a_j^t = c) ≥ k)·R] (1)\n\nThis means that in order to choose the best action at step t (a_i^t), player i should estimate the probability of Σ_{j∈−i} I(a_j^t = c). Now if each player is a contributor with probability θc, this sum follows a binomial distribution:\n\nP(Σ_{j∈−i} I(a_j^t = c) = k′) = C(N−1, k′) θc^{k′} (1 − θc)^{N−1−k′} (2)\n\nWe can model the whole group with one parameter because players only receive the total number of contributions made by others, not individual contributions. Individuals cannot be tracked by others, and all group members can be seen together as one group. In other words, θc can be interpreted as the average cooperativeness of the group. With θc, the reward that player i expects at time step t is:\n\nr_i^t = E − I(a_i^t = c)·C + I(a_i^t = c)·(Σ_{k′=k−1}^{N−1} C(N−1, k′) θc^{k′}(1 − θc)^{N−1−k′})·R + I(a_i^t = f)·(Σ_{k′=k}^{N−1} C(N−1, k′) θc^{k′}(1 − θc)^{N−1−k′})·R (3)\n\nThis is only for one round. The game, however, contains multiple rounds (15 here), and the goal is to maximize the total expected reward, not the reward of a specific round. 
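For a fixed θc, the one-round expected reward above can be computed directly. A minimal sketch (function name is ours) using the paper's values N = 5, E = C = 1 MU, R = 2 MU:

```python
from math import comb

def expected_round_reward(action, theta_c, N=5, k=2, E=1, C=1, R=2):
    """Expected one-round payoff for player i when each of the N-1 other
    players contributes independently with probability theta_c.
    A contributor ("c") needs k-1 further contributors among the others;
    a free-rider ("f") needs k of them."""
    needed = k - 1 if action == "c" else k
    # binomial probability that at least `needed` of the N-1 others contribute
    p_success = sum(comb(N - 1, kp) * theta_c**kp * (1 - theta_c)**(N - 1 - kp)
                    for kp in range(needed, N))
    return E - (C if action == "c" else 0) + p_success * R

# With a surely cooperative group (theta_c = 1), the payoffs reduce to the
# SUCCESS cases of Section 2: 2 MU for a contributor, 3 MU for a free-rider.
```

Note that for a single round, free-riding weakly dominates whenever the marginal probability of being pivotal is small; the value of contributing in this model emerges over multiple rounds, through its effect on the belief about θc.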
In addition, θc changes after each round because players see others' actions and update the probability of cooperativeness in the group. For example, if a player sees that others are not contributing, they may reduce θc when picking an action in the next round. Also, since our subjects think they are playing with other humans, they may assume others make these updates too. As a result, each player thinks their own action will change θc as well. In fact, although they are playing with computers, our algorithm does depend on their actions, and their assumption is thus correct. In addition, because subjects think they have a correct model of the group, they assume all group members have the same θc as them. If we define each possible value of θc as a discrete state (this set is infinite, but we can discretize the space, e.g., into 100 values from 0 to 1) and model the change in θc with a transition function, our problem of maximizing total expected reward becomes equivalent to an MDP.\n\nUnfortunately, the subject does not know θc and therefore must maintain a probability distribution (belief state) over θc denoting their belief about the average cooperativeness of the group. The model therefore becomes a POMDP. The beta distribution is a conjugate prior for the binomial distribution, meaning that when the prior distribution is a beta distribution and the likelihood function is binomial, the posterior is also a beta distribution. Therefore, in our model, the subject starts with a beta distribution as their initial belief and updates it over the course of the game using the transition and observation functions, which are both binomial, implying that their belief always remains a beta distribution. The beta distribution has two parameters, α and β. 
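This conjugacy can be illustrated numerically. A small sketch (helper names are ours, not the authors' code) that updates a Beta(α, β) belief over θc after one round with a given number of observed contributions out of N binary actions:

```python
def update_beta(alpha, beta, contributions, N=5):
    """Conjugate Beta-binomial update: contributions add to alpha,
    the remaining N - contributions free-rides add to beta."""
    return alpha + contributions, beta + (N - contributions)

def beta_mean(alpha, beta):
    """Point estimate of theta_c under the belief Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# A weak prior moves a lot after one fully cooperative round,
# while a strong prior with the same mean barely moves:
weak = update_beta(2, 2, 5)        # Beta(2, 2)   -> Beta(7, 2)
strong = update_beta(100, 100, 5)  # Beta(100, 100) -> Beta(105, 100)
```

The contrast between the weak and strong priors previews the interpretation given in Section 5.3: the ratio of α to β encodes expected cooperativeness, while their absolute magnitude encodes how heavily the prior is weighted against new observations.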
By Bayesian updating, the posterior distribution after seeing k′ true events out of a total of N events with prior Beta(α, β) is Beta(α + k′, β + N − k′):\n\nPrior: Beta(α, β) → P(θ) = θ^{α−1}(1 − θ)^{β−1} / B(α, β) (4)\n\nPosterior: Beta(α + k′, β + N − k′) → P(θ) = θ^{α+k′−1}(1 − θ)^{β+N−k′−1} / B(α + k′, β + N − k′) (5)\n\nwhere B(α, β) is the normalizing constant B(α, β) = ∫_0^1 θ^{α−1}(1 − θ)^{β−1} dθ.\n\nAs mentioned before, each POMDP is an MDP whose state space is the belief space of the original POMDP. As our belief state has a closed form, we can approximate the solution of our POMDP by discretizing this belief space, e.g., considering a bounded set of integers for α and β, and solving it as an MDP. The transition function of this MDP is based on the Bayesian update shown above:\n\nP((α + k′ + 1, β + N − 1 − k′) | (α, β), c) = C(N−1, k′) B(α + k′, β + N − 1 − k′) / B(α, β)\n\nP((α + k′, β + N − k′) | (α, β), f) = C(N−1, k′) B(α + k′, β + N − 1 − k′) / B(α, β) (6)\n\nThe pair (α, β) is the state and represents the belief of the player about θc, given by Beta(α, β). 
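The transition structure of this belief-state MDP can be sketched as follows; `others_contrib_prob` is the beta-binomial factor appearing in Equation 6, and the function names are ours:

```python
from math import comb, exp, lgamma

def log_beta_fn(a, b):
    """log of the Beta function B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def others_contrib_prob(alpha, beta, k_prime, N=5):
    """Probability that k_prime of the N-1 other players contribute, with
    theta_c marginalized out under the current belief Beta(alpha, beta)
    (a beta-binomial distribution)."""
    return comb(N - 1, k_prime) * exp(
        log_beta_fn(alpha + k_prime, beta + N - 1 - k_prime) - log_beta_fn(alpha, beta))

def next_belief_state(alpha, beta, k_prime, action, N=5):
    """Deterministic successor (alpha, beta) given the observation: the
    k_prime observed contributions and the player's own action are all
    folded into the posterior, as in Equation 6."""
    if action == "c":
        return alpha + k_prime + 1, beta + N - 1 - k_prime
    return alpha + k_prime, beta + N - k_prime
```

Working with log-Beta values avoids overflow for the larger (α, β) states reached late in a game; the beta-binomial probabilities over k′ = 0, …, N−1 sum to one, as a transition function requires.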
The reward function of this belief-based MDP is:\n\nR((α, β), c) = E − C + (Σ_{k′=k−1}^{N−1} C(N−1, k′) B(α + k′, β + N − 1 − k′) / B(α, β))·R\n\nR((α, β), f) = E + (Σ_{k′=k}^{N−1} C(N−1, k′) B(α + k′, β + N − 1 − k′) / B(α, β))·R (7)\n\nThis MDP shows how the subject plays and learns the group's dynamics simultaneously by updating their belief about the group during the course of the game. Note that although we reduce the problem to an MDP for computational efficiency, conceptually the player is modeled by a POMDP, because the player maintains a belief about the environment and updates it based on observations (here, the other players' actions).\n\n5 Results\n\nThe parameters of our model are all known, so the question is how the model differs across individuals. The difference lies in the initial belief of the player about the group they are playing in, in other words, the state that our belief-based MDP starts from (b0 in POMDP parlance). This means that each individual has a pair (α0, β0) for each k that shapes their behavior throughout the game. For example, an α0 significantly larger than β0 means that the player starts the game with the belief that the group is cooperative. Also, α0 and β0 for the same individual differ for different k's, since the number of required volunteers changes the game and consequently the player's belief about the optimal strategy. 
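Solving the belief-state MDP defined by the transition and reward functions of Equations 6 and 7 amounts to finite-horizon value iteration over (α, β) pairs. A compact sketch (our own, not the authors' implementation), using memoized recursion over reachable states instead of an explicit grid:

```python
from functools import lru_cache
from math import comb, exp, lgamma

def best_first_action(alpha0, beta0, horizon=15, N=5, k=2, E=1, C=1, R=2):
    """Finite-horizon value iteration on the (alpha, beta) belief-state MDP.
    Returns (expected total reward, optimal first action) starting from the
    initial belief Beta(alpha0, beta0)."""

    def log_beta_fn(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)

    def p_others(a, b, kp):
        # beta-binomial probability of kp contributors among the N-1 others
        return comb(N - 1, kp) * exp(
            log_beta_fn(a + kp, b + N - 1 - kp) - log_beta_fn(a, b))

    @lru_cache(maxsize=None)
    def value(a, b, t):
        if t == horizon:
            return 0.0, None
        best = (float("-inf"), None)
        for action in ("c", "f"):
            q = 0.0
            for kp in range(N):  # kp contributors among the others
                success = kp + (action == "c") >= k
                reward = E - (action == "c") * C + success * R
                na, nb = ((a + kp + 1, b + N - 1 - kp) if action == "c"
                          else (a + kp, b + N - kp))
                q += p_others(a, b, kp) * (reward + value(na, nb, t + 1)[0])
            if q > best[0]:
                best = (q, action)
        return best

    return value(alpha0, beta0, 0)
```

Since α + β grows by exactly N per round, only a few hundred (α, β, t) states are reachable over 15 rounds, which is why evaluating all 100² candidate initial beliefs per player (Section 5.1) remains cheap.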
We investigate these differences in our analysis below.\n\n5.1 Modeling behavioral data\n\nTo find the α0 and β0 of each player (and for each k), we run our model with different values of α0 and β0 and, using the actions that the player saw as the other players' actions during the experiment, check whether the actions predicted by our model match the actual actions of the player. In other words, we find the α0 and β0 that minimize Σ_{t=1}^{15} |a_i^t − ã_i^t|, where a_i^t is the action of our subject at step t and ã_i^t is the action predicted by our model. Note that we only give the model the other players' data and do not correct the predicted action for the previous state if the model has made a mistake. Also, we calculate the error over all games of that player with the same k, i.e., we treat each game as an independent data point. This is justified because subjects are explicitly told that in each game they may play with different players, and they are rewarded for only one randomly chosen game; as a result, one game cannot serve as training for the next ones. For each player, we call the average of Σ_{t=1}^{15} |a_i^t − ã_i^t| over all of their games with the same k the round-by-round error.\n\nThe average round-by-round error across all players for the POMDP was 3.38 for k = 2 and 2.15 for k = 4 (Table 1). For example, only around 2 out of 15 rounds were predicted incorrectly by our model for k = 4. The possible α0 and β0 values for each player ranged over all integers between 1 and 100, yielding 100^2 pairs to evaluate as s0 for the belief-based MDP; this evaluation process was computationally efficient. We found that MDPs with horizons longer than the true number of rounds fit our data better. 
As a result, we set our horizon to a number much larger than 15, in this case 50. Such an error in humans' estimation of the dynamics of a game is consistent with previous reports [3].\nTo compare our results with other state-of-the-art methods, we fit a previously proposed descriptive model [24] to our data. This model assumes that the action of the player in each round is a function of their action in the previous round (the \"Previous Action\" model). Therefore, to fit the data we need to estimate p(a_i^t = c | a_i^{t−1} = c) and p(a_i^t = c | a_i^{t−1} = f), i.e., the model has two parameters. Note that this descriptive model is unable to predict the first action. We found that its average round-by-round error over the last 14 rounds (3.90 for k = 2 and 3.25 for k = 4) is higher than the POMDP model's error (Table 1), even though it considers one round fewer than the POMDP model.\nWe also used leave-one-out cross-validation (LOOCV) to compare the models (see Table 1). Although the LOOCV error for k = 2 is larger for the POMDP model, the POMDP model's error is for 15 rounds while the error for the descriptive model is for 14 (note that the error divided by the number of rounds is larger for the descriptive model). In addition, to examine whether another descriptive model based on previous rounds can outperform the POMDP model, we tested a model based on all previous actions, i.e., p(a_i^t | a_i^{t−1}, Σ_{j∈−i} a_j^{t−1}). The POMDP model outperforms this model as well.\n\nTable 1: Average round-by-round error for the POMDP model, the descriptive model based on the previous action, p(a_i^t | a_i^{t−1}), and the most general descriptive model, p(a_i^t | a_i^{t−1}, Σ_{j∈−i} a_j^{t−1}). Next to each error, the normalized error (divided by the number of rounds) is given in parentheses to facilitate comparison.\n\nModel | Fitting error (k = 2) | Fitting error (k = 4) | LOOCV error (k = 2) | LOOCV error (k = 4) | Total number of rounds\nPOMDP | 3.38 (0.22) | 2.15 (0.14) | 4.23 (0.28) | 2.67 (0.18) | 15\nPrevious Action | 3.90 (0.28) | 3.25 (0.23) | 4.00 (0.29) | 3.48 (0.25) | 14\nAll Actions | 3.75 (0.27) | 2.74 (0.19) | 5.52 (0.39) | 7.33 (0.52) | 14\n\n5.2 Comparing model predictions to neural data\n\nBesides modeling human behavior better than the descriptive models, the POMDP model can also predict the amount of reward the subject is expecting, since it is formulated in terms of reward maximization. We use the parameters obtained from the behavioral fit and generate the expected reward for each subject before they play the next round.\nTo validate these predictions about reward expectation, we checked whether there is any correlation between neural activity recorded by fMRI and the model's predictions. Image preprocessing was performed using the SPM8 software package. The time series of images were registered in three-dimensional space to minimize any effects of the participant's head motion. 
Functional scans were realigned to the last volume, corrected for slice timing, co-registered with structural maps, spatially normalized into the standard Montreal Neurological Institute (MNI) atlas space, and then spatially smoothed with an 8 mm isotropic full-width-at-half-maximum (FWHM) Gaussian kernel using standard procedures in SPM8. Specifically, we construct a general linear model (GLM) and run a first-level analysis modeling brain responses related to the outcome, at which point the subject is informed of the judgments of others. These are modeled as a box-car function time-locked to the onset of the outcome with a duration of 4 s. Brain responses related to decision-making with knowledge of the outcome of the previous trial are modeled separately, as a box-car function time-locked to the onset of decision-making with a duration equal to the reaction time in each trial. They are further modulated by parametric regressors accounting for the expected reward. In addition, the six motion parameters produced for head movement and the two motor parameters for button presses with the right and left hands are entered as additional regressors of no interest to account for motion-related artifacts. All these regressors are convolved with the canonical hemodynamic response function. Contrast images are calculated and entered into a second-level group analysis. In the GLM, brain regions whose blood-oxygen-level-dependent (BOLD) responses are correlated with POMDP-model-based estimates of expected reward are first identified. To correct for multiple comparisons, small volume correction (SVC) is applied to a priori anatomically defined regions of interest (ROIs). The search volume is defined by a 10 mm diameter spherical ROI centered on the dorsolateral prefrontal cortex (dlPFC) and the ventral striatum (vS), which have been identified in previous studies. 
The role of the dlPFC has been demonstrated in the control of strategic decision-making [14], and its function and gray matter volume have been implicated in individual differences in social value computation [7, 21]. Moreover, the vS has been found to mediate reward signals engaged in mutual contribution, altruism, and social approval [18]. In particular, the left vS has been found to be associated with both social and monetary reward prediction errors [13].\nWe find a strong correlation between our model's prediction of expected reward and activity in the bilateral dlPFC (peak voxel in the right dlPFC: (x, y, z) = (42, 47, 19), T = 3.45; peak voxel in the left dlPFC: (x, y, z) = (−30, 50, 25), T = 3.17) [7], and in the left vS (peak voxel in the vS: (x, y, z) = (−24, 17, −2), T = 2.98) [13] (Figure 2). No other brain area showed higher activation than these regions at the relatively liberal threshold of uncorrected p < 0.005. Large activations were found in these regions when participants received the outcome of a trial (p < 0.05, FWE corrected within small-volume clusters). This is because after seeing the outcome of one round, participants update their belief and consequently their expected reward for the next round.\n\nFigure 2: Strong correlation between brain activity in the dlPFC and the left vS after seeing the outcome of a round and the expected reward predicted by our model for the next round. The activations are reported at a significance of p < 0.05, FWE corrected across participants in a priori regions of interest. The activation maps are shown at the threshold p < 0.005 (uncorrected). The color in each cluster indicates the z-score of activation in each voxel.\n\n5.3 Modeling subjects' perception of group cooperativeness\n\nThe ratio and the sum of the best-fitting α0 and β0 that we obtain from the model are interpretable within the context of cognitive science. 
In the beta-binomial update equations 4 and 5, α is related to occurrences of the action \"contribution\" and β to \"free-riding.\" Therefore, the ratio of α0 to β0 captures the player's prior belief about the cooperativeness of the group. On the other hand, after every binomial observation (here, a round), N (here N = 5) is added to the prior. Therefore, the absolute values of α and β determine the weight that the player gives to the prior compared to their observations during the game. For example, adding 5 does not change Beta(100, 100) much, but it changes Beta(2, 2) to Beta(7, 2); the former barely alters the estimated chance of contribution versus free-riding, while the latter now indicates that the group is cooperative.\nWe estimated the best initial parameters for each player, but is there a unique pair of α0 and β0 values that minimizes the round-by-round error, or are there multiple best-fitting values? We investigated this question by examining the error for all possible parameter values for all players in our experiments. The error, as a function of α0 and β0, for one of the players is shown in Figure 3a as a heat map (darker means smaller error, i.e., a better fit). We found that the error function is continuous, and although there exist multiple best-fitting parameter values, these values define a set of lines α = aβ + c with bounds min ≤ α ≤ max. The lines and bounds are linked to the ratio and the prior weight alluded to above, suggesting that players do take into account both the prior probability and its weight, and that the best-fitting α0 and β0 values share these characteristics.\nWe also calculated the average error function over all players for both values of k. As shown in Figures 3b and 3c, α0 is larger than β0 for k = 2, while for k = 4 they are close to each other. 
Also, the absolute values of these parameters are larger for k = 2. This implies that when k = 4, players start out more cautiously, seeking to ascertain whether the group is cooperative or not. For k = 2, however, because only 2 volunteers are enough, they start by assigning cooperativeness a higher probability. The higher absolute values for k = 2 reflect the fact that the game tends towards mostly free-riders for k = 2, with the prior weighted much more than the observations: players know that only 2 volunteers are enough, so they can free-ride more frequently and still get the public good.1\n\n1We should emphasize that this is the behavior on average; a few subjects do deviate from it.\n\nFigure 3: Round-by-round error for different initial parameters α0 and β0. Darker means lower error. (a) Error function for one of the players; the functions for other players and other k's show the same linear pattern in terms of continuity, but the location of the low-error line differs among individuals and k's. (b) Average error function over all players for k = 2. (c) Average error function for k = 4.\n\n6 Related Work\n\nThe PGG has previously been analyzed using descriptive models, which assume that only the actions of players in the previous trial affect decisions in the current trial [8, 24, 25]. As a result, the first round of each game cannot be predicted by these models. Moreover, these models only predict the probability with which each player changes their action. The POMDP model, in contrast, takes all rounds of each game into account and predicts actions based on the player's prior belief about the cooperativeness of the group, within the context of maximizing expected reward. 
Most importantly, the POMDP model predicts not only actions but also the expected reward for the next round for each player, as demonstrated in our results above.

POMDPs have previously been used to model perceptual decision making [12, 17] and value-based decision making [5]. The modeled tasks, however, are all single-player tasks. A model based on iPOMCP, an interactive framework combining POMDPs [9] with Monte Carlo sampling, has been used to model a trust game involving two players [10]. The PGG task we consider involves a larger group of players (5 in our experiments). Also, the iPOMCP algorithm is complicated and its neural implementation remains unclear. By comparison, our POMDP model is relatively simple and uses only two parameters to represent the belief state.

7 Discussion

This paper presents a probabilistic model of social decision making that not only explains human behavior in the volunteer's dilemma but also predicts the expected reward in each round of the game. This prediction was validated using neural data recorded with fMRI. Unlike other existing models for this task, our model is based on the principles of reward maximization and Bayesian inference, and does not rely directly on a subject's actions; in other words, our model is normative. In addition, as discussed above, the model parameters that we fit to an individual or to a given k are interpretable.

One may argue that our model ignores empathy among group members since it assumes that players attempt to maximize their own reward. First, an extensive study with auxiliary tasks has shown that pro-social preferences such as empathy do not explain human behavior in public goods games [3]. Second, one's own reward is not easily separable from others' rewards, as maximizing expected reward requires cooperation among group members.
Third, a major advantage of a normative model is that different hypotheses can be tested by varying the components of the model. Here we presented the most general model to avoid over-fitting; testing different reward functions could be a fruitful direction for future research.

Although we have not demonstrated that our model can be neurally implemented in the brain, the model does capture the fundamental components of social decision making required to solve tasks such as the volunteer's dilemma, namely: belief about others (the belief state in our model), updating of beliefs with new observations, knowing that other group members will update their beliefs as well (modeled via the transition function), prior belief about people playing the game (the ratio of α0 to β0), the weight of the prior relative to observations (the absolute values of the initial parameters), and maximization of total expected reward (modeled via the reward function in the MDP/POMDP). Some of these components may be simplified or combined in a neural implementation, but we believe that acknowledging them explicitly in our models will help pave the way for a deeper understanding of the neural mechanisms underlying human social interactions.

References

[1] M. Archetti. A strategy to increase cooperation in the volunteer's dilemma: Reducing vigilance improves alarm calls. Evolution, 65(3):885–892, 2011.

[2] N. Bault, B. Pelloux, J. J. Fahrenfort, K. R. Ridderinkhof, and F. van Winden. Neural dynamics of social tie formation in economic decision-making. Social Cognitive and Affective Neuroscience, 10(6):877–884, 2015.

[3] M. N. Burton-Chellew and S. A. West. Prosocial preferences do not explain human cooperation in public-goods games. Proceedings of the National Academy of Sciences, 110(1):216–221, 2013.

[4] D. Chung, K. Yun, and J. Jeong.
Decoding covert motivations of free riding and cooperation from multi-feature pattern analysis of signals. Social Cognitive and Affective Neuroscience, 10(9):1210–1218, 2015.

[5] P. Dayan and N. D. Daw. Decision theory, reinforcement learning, and the brain. Cognitive, Affective, & Behavioral Neuroscience, 8(4):429–453, 2008.

[6] A. Diekmann. Cooperation in an asymmetric volunteer's dilemma game: theory and experimental evidence. International Journal of Game Theory, 22(1):75–85, 1993.

[7] A. S. R. Fermin, M. Sakagami, T. Kiyonari, Y. Li, Y. Matsumoto, and T. Yamagishi. Representation of economic preferences in the structure and function of the amygdala and prefrontal cortex. Scientific Reports, 6, 2016.

[8] U. Fischbacher, S. Gächter, and E. Fehr. Are people conditionally cooperative? Evidence from a public goods experiment. Economics Letters, 71(3):397–404, 2001.

[9] P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

[10] A. Hula, P. R. Montague, and P. Dayan. Monte Carlo planning method estimates planning horizons during interactive social exchange. PLoS Computational Biology, 11(6):e1004254, 2015.

[11] K. Khalvati and A. K. Mackworth. A fast pairwise heuristic for planning under uncertainty. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 187–193, 2013.

[12] K. Khalvati and R. P. N. Rao. A Bayesian framework for modeling confidence in perceptual decision making. In Advances in Neural Information Processing Systems (NIPS) 28, pages 2413–2421, 2015.

[13] A. Lin, R. Adolphs, and A. Rangel. Social and monetary reward learning engage overlapping neural substrates. Social Cognitive and Affective Neuroscience, 7(3):274–281, 2012.

[14] E. K. Miller and J. D. Cohen.
An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24:167–202, 2001.

[15] M. Olson. The Logic of Collective Action: Public Goods and the Theory of Groups. Harvard University Press, 1971.

[16] S. A. Park, S. Jeong, and J. Jeong. TV programs that denounce unfair advantage impact women's sensitivity to defection in the public goods game. Social Neuroscience, 8(6):568–582, 2013.

[17] R. P. N. Rao. Decision making under uncertainty: a neural model based on partially observable Markov decision processes. Frontiers in Computational Neuroscience, 4, 2010.

[18] C. C. Ruff and E. Fehr. The neurobiology of rewards and values in social decision making. Nature Reviews Neuroscience, 15:549–562, 2014.

[19] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088, 1973.

[20] T. Smith and R. G. Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI), 2004.

[21] N. Steinbeis and E. A. Crone. The link between cognitive control and decision-making across child and adolescent development. Current Opinion in Behavioral Sciences, 10:28–32, 2016.

[22] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, Cambridge, MA, 2005.

[23] S. M. Tom, C. R. Fox, C. Trepel, and R. A. Poldrack. The neural basis of loss aversion in decision-making under risk. Science, 315(5811):515–518, 2007.

[24] J. Wang, S. Suri, and D. J. Watts. Cooperation and assortativity with dynamic partner updating. Proceedings of the National Academy of Sciences, 109(36):14363–14368, 2012.

[25] M. Wunder, S. Suri, and D. J. Watts.
Empirical agent-based models of cooperation in public goods games. In Proceedings of the Fourteenth ACM Conference on Electronic Commerce (EC), pages 891–908, 2013.