{"title": "Game Design for Eliciting Distinguishable Behavior", "book": "Advances in Neural Information Processing Systems", "page_first": 4684, "page_last": 4693, "abstract": "The ability to inferring latent psychological traits from human behavior is key to developing personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually-constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing behavior diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually-designed ones by a large margin.", "full_text": "Game Design for Eliciting Distinguishable Behavior\n\nZachary C. Lipton1\u2020, Pradeep Ravikumar1\u2217, William W. Cohen1,2\u2217, Tom Mitchell1\u2020\n\nFan Yang1\u2217, Liu Leqi1\u2217, Yifan Wu1\u2217,\n\n\u2217{fanyang1,leqil,yw4,pradeepr,wcohen}@cs.cmu.edu\n\n1Carnegie Mellon University\n2 Google Inc.\n\u2020{zlipton, tom.mitchell}@cmu.edu\n\nAbstract\n\nThe ability to inferring latent psychological traits from human behavior is key to\ndeveloping personalized human-interacting machine learning systems. Approaches\nto infer such traits range from surveys to manually-constructed experiments and\ngames. However, these traditional games are limited because they are typically\ndesigned based on heuristics. 
In this paper, we formulate the task of designing behavior diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually-designed ones by a large margin.\n\n1 Introduction\n\nHuman behavior can vary widely across individuals. For instance, due to varying risk preferences, some people arrive extremely early at an airport, while others arrive at the last minute. Being able to infer these latent psychological traits, such as risk preferences or discount factors for future rewards, is of broad multi-disciplinary interest, within psychology, behavioral economics, as well as machine learning. As machine learning finds broader societal usage, understanding users\u2019 latent preferences is crucial to personalizing these data-driven systems to individual users.\n\nIn order to infer such psychological traits, which require cognitive rather than physiological assessment (e.g. blood tests), we need an interactive environment to engage users and elicit their behavior. Approaches to do so have ranged from questionnaires [7, 17, 9, 24] to games [2, 10, 21, 20] that involve planning and decision making. It is this latter, game-based approach that we consider in this paper. However, there has been some recent criticism of such manually-designed games [3, 5, 8]. In particular, a game is said to be effective, or behavior diagnostic, if the differing latent traits of players can be accurately inferred based on their game play behavior. 
However, manually-designed games are typically specified using heuristics that may not always be reliable or efficient for distinguishing human traits given game play behavior.\n\nAs a motivating example, consider a game environment where the player can choose to stay or moveRight on a Path. Each state on the Path has a reward, which the player accumulates as they move to that state. Suppose players have different preferences (e.g. some might magnify positive reward and not care too much about negative reward, while others might be the opposite), but are otherwise rational, so that they choose optimal strategies given their respective preferences. If we want to use this game to tell apart such players, how should we assign reward to each state in the Path? Heuristically, one might suggest there should be positive reward to lure gain-seeking players and negative reward to discourage the loss-averse ones, as shown in Figure 1a. However, as shown in Figure 1b and 2a, the induced behaviors (either policies or sampled trajectories) are similar for players with different loss preferences, and consequently not helpful in distinguishing them based on their game play behavior. In Figure 1c, an alternative reward design is shown, which elicits more distinguishable behavior (see Figure 1d and 2b). This example illustrates that it is nontrivial to design effective games based on intuitive heuristics, and a more systematic approach is needed.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n(a) Reward designed by heuristics\n(b) Policies by different players (induced by reward in Figure 1a)\n(c) Reward designed by optimization\n(d) Policies by different players (induced by reward in Figure 1c)\nFigure 1: Comparing reward designed by heuristics and by optimization. The game is a 6-state Markov Decision Process. 
Each state is represented as a square (see 1a or 1c) and the player can choose stay or moveRight. The goal is to design a reward for each state such that different types of players (Loss-Neutral, Gain-Seeking, Loss-Averse) have different behaviors. We show reward designed by heuristics in 1a and by optimization in 1c. Using these rewards, the policies of different players are shown on the right. For each player, its policy specifies the probability of taking an action (stay or moveRight) at each state. For example, Loss-Neutral\u2019s policy in 1d shows that it is more likely to choose stay than moveRight at the first (i.e. left-most) state, while in the second to fifth states, choosing moveRight has a higher probability.\n\n(a) Sampled trajectories using policies in Figure 1b\n(b) Sampled trajectories using policies in Figure 1d\nFigure 2: Comparing sampled trajectories using policies induced by different rewards. To further visualize how each type of player behaves given different rewards, we sample trajectories using their induced policies. Given reward designed by heuristics (Figure 1a), all players behave similarly by traversing all the states (see 2a). However, given reward designed by optimization (Figure 1c), Gain-Seeking and Loss-Averse players behave differently. In particular, Loss-Averse chooses stay most of the time (see 2b), since the first state has a relatively large reward. Hence, reward designed by optimization is more effective at eliciting distinctive behaviors.\n\nIn this work, we formalize this task of behavior diagnostic game design, introducing a general framework consisting of a player space, game space, and interaction model. We use mutual information to quantitatively capture game effectiveness, and present a practical algorithm that optimizes a variational lower bound. 
We instantiate our framework by setting the player space using prospect theory [15], and setting the game space and interaction model using Markov Decision Processes [13]. Empirically, our quantitative optimization-based approach designs games that are more effective at inferring players\u2019 latent traits, outperforming manually-designed games by a large margin. In addition, we study how the assumptions in our instantiation affect the effectiveness of the designed games, showing that they are robust to modest misspecification of the assumptions.\n\n2 Behavior Diagnostic Game Design\n\nWe consider the problem of designing interactive games that are informative in the sense that a player\u2019s type can be inferred from their play. A game-playing process contains three components: a player z, a game M and an interaction model \u03a8. Here, we assume each player (which is represented by its latent trait) lies in some player space Z. We also denote Z \u2208 Z as the random variable corresponding to a randomly selected player from Z with respect to some prior (e.g. uniform) distribution pZ over Z. Further, we assume there is a family of parameterized games {M\u03b8 : \u03b8 \u2208 \u0398}. Given a player z \u223c pZ and a game M\u03b8, the interaction model \u03a8 describes how a behavioral observation x from some observation space X is generated. Specifically, each round of game play generates behavioral observations x \u2208 X as x \u223c \u03a8(z, M\u03b8), where the interaction model \u03a8(z, M\u03b8) is some distribution over the observation space X. 
In this work, we assume pZ and \u03a8 are fixed and known. Our goal is to design a game M\u03b8 such that the generated behavioral observations x are most informative for inferring the player z \u2208 Z.\n\n2.1 Maximizing Mutual Information\n\nOur problem formulation introduces a probabilistic model over the players Z (as specified by the prior distribution pZ) and the behavioral observations X (as specified by \u03a8(Z, M\u03b8)), so that pZ,X(z, x) = pZ(z) \u00b7 \u03a8(z, M\u03b8)(x). Our goal can then be interpreted as maximizing the information on Z contained in X, which can be captured by the mutual information between Z and X:\n\nI(Z, X) = \u222b\u222b pZ,X(z, x) log [pZ,X(z, x) / (pZ(z) pX(x))] dz dx (1)\n= \u222b\u222b pZ(z) \u00b7 \u03a8(z, M\u03b8)(x) log [pZ|X(z|x) / pZ(z)] dz dx, (2)\n\nso that the mutual information is a function of the game parameters \u03b8 \u2208 \u0398.\n\nDefinition 2.1. (Behavior Diagnostic Game Design) Given a player space Z, a family of parameterized games M\u03b8, and an interaction model \u03a8(z, M\u03b8), our goal is to find:\n\n\u03b8\u2217 = arg max_\u03b8 I(Z; X). (3)\n\n2.2 Variational Mutual Information Maximization\n\nIt is difficult to directly optimize the mutual information objective in Eq (2), as it requires access to a posterior pZ|X(z|x) that does not have a closed analytical form. Following the derivations in [6] and [1], we opt to maximize a variational lower bound of the mutual information objective. 
Letting qZ|X(z|x) denote any variational distribution that approximates pZ|X(z|x), and H(Z) denote the marginal entropy of Z, we can bound the mutual information as:\n\nI(Z, X) = \u222b\u222b pZ(z) \u00b7 \u03a8(z, M\u03b8)(x) log [pZ|X(z|x) / pZ(z)] dz dx (4)\n= E_{z\u223cpZ} [ E_{x\u223c\u03a8(z,M\u03b8)} [ log (pZ|X(z|x) / qZ|X(z|x)) + log qZ|X(z|x) ] ] + H(Z) (5)\n\u2265 E_{z\u223cpZ, x\u223c\u03a8(z,M\u03b8)} [ log qZ|X(z|x) ] + H(Z), (6)\n\nso that the expression in Eq (6) forms a variational lower bound for the mutual information I(Z, X).\n\n3 Instantiation: Prospect Theory based MDP Design\n\nOur framework in Section 2 provides a systematic view of behavior diagnostic game design, and each of its components can be chosen based on contexts and applications. We present one instantiation by setting the player space Z, the game M, and the interaction model \u03a8. For the player space Z, we use prospect theory [15] to describe how players perceive or distort values. We model the game M as a Markov Decision Process [13]. Finally, a (noisy) value iteration is used to model players\u2019 planning and decision-making, which is part of the interaction model \u03a8. In the next subsection, we provide a brief background of these key ingredients.\n\n3.1 Background\n\nProspect Theory Prospect theory [15] describes the phenomenon that different people can perceive the same numerical values differently. For example, people who are averse to loss (e.g. for whom avoiding a $5 loss matters more than winning $5) magnify the negative reward or penalty. Following [23], we use the following distortion function v to describe how people shrink or magnify numerical values,\n\nv(r; \u03bepos, \u03beneg, \u03beref) = (r \u2212 \u03beref)^\u03bepos if r \u2265 \u03beref, and \u2212(\u03beref \u2212 r)^\u03beneg if r < \u03beref, (7)\n\nwhere \u03beref is the reference point that people compare numerical values against. 
\u03bepos and \u03beneg are the amounts of distortion applied to the positive and negative amounts of the reward with respect to the reference point.\n\nWe use this framework of prospect theory to specify our player space. Specifically, we represent a player z by their personalized distortion parameters, so that z = (\u03bepos, \u03beneg, \u03beref). In this work, unless we specify otherwise, we assume that \u03beref is set to zero. Given these distortion parameters, the players perceive a distorted value v(R; z) of any reward R in the game, as we detail in the discussion of the interaction model \u03a8 subsequently.\n\nMarkov Decision Process A Markov Decision Process (MDP) M is defined by (S, A, T, R, \u03b3), where S is the state space and A is the action space. For each state-action pair (s, a), T(\u00b7|s, a) is a probability distribution over the next state. R : S \u2192 R specifies a reward function. \u03b3 \u2208 (0, 1) is a discount factor. We assume that both states and actions are discrete and finite. For all s \u2208 S, a policy \u03c0(\u00b7|s) defines a distribution over actions to take at state s. A policy for an MDP M is denoted as \u03c0M.\n\nValue Iteration Given a Markov Decision Process (S, A, T, R, \u03b3), value iteration is an algorithm that can be written as a simple update operation, which combines one sweep of policy evaluation and one sweep of policy improvement [25],\n\nV(s) \u2190 max_a \u03a3_{s'} T(s'|s, a)(R(s') + \u03b3V(s')) (8)\n\nand computes a value function V : S \u2192 R for each state s \u2208 S. 
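To make Eqs (7) and (8) concrete, the following is a minimal numpy sketch (an illustration under our own assumptions, not code from the paper) of value iteration on prospect-theory-distorted rewards; the toy 3-state Path, its reward values, and the names distort and value_iteration are hypothetical:

```python
import numpy as np

def distort(r, xi_pos, xi_neg, xi_ref=0.0):
    # Prospect-theory distortion v from Eq (7): rewards above the
    # reference point are raised to xi_pos, losses to xi_neg.
    return np.where(r >= xi_ref,
                    (r - xi_ref) ** xi_pos,
                    -((xi_ref - r) ** xi_neg))

def value_iteration(T, R, gamma=0.9, iters=200):
    # T: (|S|, |A|, |S|) transition tensor; R: (|S|,) reward on the
    # next state. Repeats the update in Eq (8):
    #   V(s) <- max_a sum_{s'} T(s'|s,a) (R(s') + gamma * V(s'))
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        V = (T * (R + gamma * V)).sum(axis=2).max(axis=1)
    return V

# Hypothetical toy 3-state Path: action 0 = stay, action 1 = moveRight
# (absorbing at the right end).
T = np.zeros((3, 2, 3))
T[:, 0, :] = np.eye(3)
T[[0, 1, 2], 1, [1, 2, 2]] = 1.0
R = np.array([0.0, -1.0, 2.0])

# A Loss-Averse player plans against its own distorted reward v(R; z).
V = value_iteration(T, distort(R, xi_pos=0.7, xi_neg=1.2))
```

Because each player type distorts R differently, the same game induces a different value function, and hence a different policy, for each type.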
A probabilistic policy can be defined similar to a maximum entropy policy [28, 18] based on the value function, i.e.,\n\n\u03c0(a|s) = softmax_a \u03a3_{s'} T(s'|s, a)(R(s') + \u03b3V(s')) (9)\n\nValue iteration converges to an optimal policy for discounted finite MDPs [25].\n\n3.2 Instantiation\n\nIn this instantiation, we consider the game M\u03b8 := (S, A, T\u03b8, R\u03b8, \u03b3), and design the reward function and the transition probabilities of the game by learning the parameters \u03b8. We assume that each player z = (\u03bepos, \u03beneg) behaves according to a noisy near-optimal policy \u03c0 defined by value iteration and an MDP with distorted reward (S, A, T\u03b8, v(R\u03b8; z), \u03b3). The game play has L steps in total. The interaction model \u03a8(z, M\u03b8) is then the distribution of the trajectories x = {(st, at)}_{t=1}^{L}, where the player always starts at sinit, and at each state st, we sample an action at using the Gumbel-max trick [11] with a noise parameter \u03bb. Specifically, letting the probability over actions \u03c0(\u00b7|st) be (u1, . . . , u|A|) and gi be independent samples from Gumbel(0, 1), a sampled action at can be defined as\n\nat = arg max_{i=1,...,|A|} log(ui)/\u03bb + gi. (10)\n\nWhen \u03bb = 1, there is no extra noise and at is distributed according to \u03c0(\u00b7|st). The amount of noise increases as \u03bb increases. 
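As an illustration (a sketch under our own naming, not the paper's code), the Gumbel-max sampling step of Eq (10) is a few lines of numpy; gumbel_max_sample and the example probabilities are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(probs, lam=1.0):
    # Eq (10): a_t = argmax_i log(u_i)/lam + g_i, with g_i ~ Gumbel(0, 1)
    # and (u_1, ..., u_|A|) = pi(.|s_t). At lam = 1 this is exact
    # sampling from pi(.|s_t); larger lam flattens the log-probabilities,
    # so the Gumbel noise dominates and the choice becomes noisier.
    g = rng.gumbel(size=len(probs))
    return int(np.argmax(np.log(probs) / lam + g))

probs = np.array([0.1, 0.9])
counts = np.bincount([gumbel_max_sample(probs) for _ in range(10_000)],
                     minlength=2)
# At lam = 1 the empirical action frequencies approach (0.1, 0.9).
```

In the learned objective, the argmax is later replaced by a softmax (the Gumbel-softmax trick) so that sampling stays differentiable.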
Similarly, we sample the next state st+1 from T(\u00b7|st, at) using Gumbel-max, with \u03bb always set to one.\n\nOur goal in behavior diagnostic game design then reduces to solving the optimization in Eq (3), where the player space Z consists of distortion parameters z = (\u03bepos, \u03beneg), the game space is parameterized as M\u03b8 = (S, A, T\u03b8, R\u03b8, \u03b3), and the interaction model \u03a8(z, M\u03b8) is over trajectories x generated by noisy value iteration using the distorted reward v(R\u03b8; z).\n\nAs discussed in the previous section, for computational tractability, we optimize the variational lower bound from Eq (6) on this mutual information. As the variational distribution qZ|X, we use a factored Gaussian, with means parameterized by a Recurrent Neural Network [12]. The input at each step is the concatenation of a one-hot (or softmax-approximated) encoding of state st and action at. We optimize this variational bound via stochastic gradient descent [16]. In order for the objective to be end-to-end differentiable, during trajectory sampling, we use the Gumbel-softmax trick [14, 19], which uses the softmax function as a continuous approximation to the argmax operator in Eq (10).\n\n4 Experiments\n\n4.1 Learning to Design Games\n\nWe learn to design games M by maximizing the mutual information objective in Eq (6), with known player prior pZ and interaction model \u03a8. We study how the degrees of freedom in games M affect the mutual information. In particular, we consider environments that are Path or Grid of various sizes. The Path environment has two actions, stay and moveRight; Grid has one additional action, moveUp. Besides learning the reward R\u03b8, we also consider learning part of the transition T\u03b8. 
To be more specific, we learn the probability \u03b1s that the action moveRight actually stays in the same state. Therefore moveRight becomes a probabilistic action such that, at each state s,\n\np(s'|s, moveRight) = \u03b1s if s' = s, 1 \u2212 \u03b1s if s' = s + 1, and 0 otherwise. (11)\n\nWe experiment with Path of length 6 and Grid of size 3 by 6. The player prior pZ is uniform over [0.5, 1.5] \u00d7 [0.5, 1.5]. For the baseline, we manually design the reward for each state s to be1\n\nRPath(s) = \u22123 if s = 3, 5 if s = 6, and 0 otherwise;\nRGrid(s) = \u22123 if s = 9 (a middle state), 5 if s = 18 (a corner state), and 0 otherwise.\n\nThe intuition behind this design is that the positive reward at the end of the Path (or Grid) will encourage players to explore the environment, while the negative reward in the middle will discourage players that are averse to loss but not affect gain-seeking players, hence eliciting distinctive behavior.\n\nIn Table 1, mutual information optimization losses for different settings are listed. The baselines have higher mutual information loss than the learnable settings. When only the reward is learnable, the Grid setting achieves slightly better mutual information than the Path one. However, when both reward and transition are learnable, the Grid setting significantly outperforms the others. This shows that our optimization-based approach can effectively search through the large game space and find the games that are more informative.\n\nTable 1: Mutual Information Optimization Loss\n\nSetting | Path (1 x 6) | Grid (3 x 6)\nBaselines | 0.111 | 0.115\nLearn Reward Only | 0.108 | 0.099\nLearn Reward and Transition | 0.107 | 0.078\n\n1We have also experimented with different manually designed baseline reward functions. 
Their performances are similar to the presented baselines, both in terms of mutual information loss and player classifier accuracy. The performance of randomly generated rewards is worse than the manually designed baselines.\n\n4.2 Player Classification Using Designed Games\n\nTo further quantify the effectiveness of learned games, we create downstream classification tasks. In the classification setting, each data instance consists of a player label y and its behavior (i.e. trajectory) x. We assume that the player attribute z is sampled from a mixture of Gaussians,\n\nz \u223c (1/K) \u03a3_{k=1}^{K} N((\u03be^k_pos, \u03be^k_neg), 0.1 \u00b7 I). (12)\n\nThe label y for each player corresponds to the component k, and the trajectory is sampled from x \u223c \u03a8(z, M\u02c6\u03b8), where M\u02c6\u03b8 is a learned game. There are three types (i.e. components) of players, namely Loss-Neutral (\u03bepos = 1, \u03beneg = 1), Gain-Seeking (\u03bepos = 1.2, \u03beneg = 0.7), and Loss-Averse (\u03bepos = 0.7, \u03beneg = 1.2). We simulate 1000 data instances for train, and 100 each for validation and test. We use a model similar to the one used for qZ|X, except for the last layer, which now outputs a categorical distribution. The optimization is run for 20 epochs and five rounds with different random seeds. The validation set is used to select the test set performance. Mean test accuracy and its standard deviation are reported in Table 2.\n\nTable 2: Classification Task Accuracy\n\nSetting | Path (1 x 6) | Grid (3 x 6)\nBaselines | 0.442 (0.056) | 0.482 (0.052)\nLearn Reward Only | 0.678 (0.044) | 0.658 (0.066)\nLearn Reward and Transition | 0.686 (0.044) | 0.822 (0.027)\n\nSimilar to the trend in mutual information, baseline methods have the lowest accuracies, which are about 35% lower than those of any learned game. 
Grid with learned reward and transition outperforms the other methods by a large margin, with a 20% improvement in accuracy.\n\nTo get a more concrete understanding of learned games and the differences among them, we visualize two examples below. In Figure 3a, the learned reward for each state in a Path environment is shown via heatmap. The learned reward is similar to the manually designed one\u2014highest at the end and negative in the middle\u2014but with an important twist: there are also smaller positive rewards at the beginning of the Path. This twist successfully induces distinguishable behavior from Loss-Averse players. The induced policy (Figure 3b) and sampled trajectories (Figure 3c) are very different between Loss-Averse and Gain-Seeking. However, Loss-Neutral and Loss-Averse players still behave quite similarly in this game.\n\n(a) Learned reward for each state in a 1 x 6 Path\n(b) Induced policies by different types of players, using reward in 3a\n(c) Sampled trajectories by different types of players, using policies in 3b\nFigure 3: A 1 x 6 Path with learned reward. Gain-Seeking and Loss-Averse behave distinctively.\n\nIn Figure 4c and 4d, we show induced policies and sampled trajectories in a Grid environment where both reward and transition are learned. The learned game elicits distinctive behavior from different types of players. Loss-Averse players choose moveUp at the beginning and then always stay. Loss-Neutral players explore along the states in the bottom row, while Gain-Seeking players choose moveUp early on and explore the middle row. The learned reward and transition are visualized in Figure 4a and 4b. The middle row is particularly interesting. The states have very mixed reward\u2014some are relatively high and some are the lowest. We conjecture that the possibility of staying (when taking moveRight) at some states with high reward (e.g. 
the first and third state from left in the middle row) makes Gain-Seeking behave differently from Loss-Neutral.\n\n(a) Learned reward for each state in a 3 x 6 Grid\n(b) Learned transition p(s|s, moveRight) at each state\n(c) Induced policies by different types of players, using reward in 4a and transition in 4b\n(d) Sampled trajectories by different types of players, using policies in 4c and transition in 4b\nFigure 4: A 3 x 6 Grid with learned reward and transition (see 4a and 4b) elicits distinguishable behaviors from different types of players.\n\nWe also consider the case where the interaction model has noise, as described in Eq (10), when generating trajectories for classification data. In practice, it is unlikely that one interaction model describes all players, since players have a variety of individual differences. Hence it is important to study how effective the learned games are when the downstream task's interaction model \u03a8 is noisy and deviates from our assumptions. In Table 3, classification accuracy on the test set is shown at different noise levels. We consider three designs here, as defined above: a baseline method which uses the manually designed reward in Path, a Path environment with learned reward, and a Grid environment with both learned reward and transition. Interestingly, while adding noise decreases the classification performance of learned games, the manually designed game (i.e. baseline method) achieves higher accuracy in the presence of noise. 
Nevertheless, the learned Grid outperforms the others, though the margin decreases from 20% to 12%.\n\nTable 3: Classification Accuracy at Different Noise Levels\n\nDesign | \u03bb = 1 | \u03bb = 1.5 | \u03bb = 2.5\nPath (1 x 6, Baseline) | 0.442 (0.056) | 0.510 (0.053) | 0.482 (0.041)\nPath (1 x 6, Learn Reward Only) | 0.678 (0.044) | 0.678 (0.039) | 0.650 (0.048)\nGrid (3 x 6, Learn Reward and Transition) | 0.822 (0.027) | 0.778 (0.044) | 0.730 (0.061)\n\nIn Figure 5, we visualize the trajectories when the noise in the interaction model \u03a8 is \u03bb = 2.5. This provides intuition for why the classification performance decreases, as the boundary between the behavior of Loss-Neutral and Gain-Seeking is blurred.\n\nFigure 5: Sampled trajectories with noise \u03bb = 2.5\n\n4.3 Ablation Study\n\nLastly, we consider using different distributions for the player prior pZ on (\u03bepos, \u03beneg), which could be agnostic of the downstream tasks or not. We compare the classification performance when pZ is uniform or biased towards the distribution of player types. We consider two cases: Full and Diagonal. In Full, the player prior pZ is uniform over [0.5, 1.5] \u00d7 [0.5, 1.5]. In Diagonal, pZ is uniform over the union of [0.5, 1] \u00d7 [1, 1.5] and [1, 1.5] \u00d7 [0.5, 1], which is a strict subset of the Full case and arguably closer to the player types distribution in the classification task. 
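To make the two priors concrete, here is a small numpy sketch (with hypothetical helper names, not code from the paper) that samples z = (\u03bepos, \u03beneg) from the Full box and from the Diagonal union of two boxes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_full(n):
    # Full prior: uniform over [0.5, 1.5] x [0.5, 1.5].
    return rng.uniform(0.5, 1.5, size=(n, 2))

def sample_diagonal(n):
    # Diagonal prior: uniform over [0.5, 1] x [1, 1.5] union
    # [1, 1.5] x [0.5, 1]. The two boxes have equal area, so pick one
    # with probability 1/2, then sample uniformly inside it.
    pick = rng.integers(0, 2, size=n)
    lo = np.where(pick[:, None] == 0, [0.5, 1.0], [1.0, 0.5])
    return lo + rng.uniform(0.0, 0.5, size=(n, 2))

z_full = sample_full(1000)
z_diag = sample_diagonal(1000)
```

Every Diagonal sample pairs a below-1 value on one axis with an above-1 value on the other, mirroring the Gain-Seeking and Loss-Averse player types.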
In Table 4, we show the performance of games learned with the Full or Diagonal player prior.\n\nTable 4: Comparison of Learned Games with Different Player Prior pZ\n\nMethod | MI Loss (Full) | MI Loss (Diagonal) | Accuracy (Full) | Accuracy (Diagonal)\nReward Only, Path | 0.108 | 0.043 | 0.678 (0.044) | 0.658 (0.034)\nReward Only, Grid | 0.099 | 0.039 | 0.658 (0.066) | 0.662 (0.060)\nReward and Transition, Path | 0.107 | 0.043 | 0.686 (0.044) | 0.668 (0.048)\nReward and Transition, Grid | 0.078 | 0.036 | 0.822 (0.027) | 0.712 (0.051)\n\nAcross all methods, using the Diagonal prior achieves lower mutual information loss compared to using the Full one. However, this trend does not generalize to classification: using the Diagonal prior hurts classification accuracy. We visualize the sampled trajectories in Figure 6. As we can see, Loss-Neutral no longer has its own distinctive behavior, which is the case when using the Full prior (see Figure 4d). It seems that the learned game is more likely to overfit the Diagonal prior, which leads to poor generalization on downstream tasks. Therefore, using a player prior pZ agnostic to the downstream task might be preferred.\n\nFigure 6: Sampled trajectories using Diagonal player prior\n\n5 Conclusion and Discussion\n\nWe consider designing games for the purpose of distinguishing among different types of players. We propose a general framework and use mutual information to quantify game effectiveness. Compared with games designed by heuristics, our optimization-based designs elicit more distinctive behaviors.\n\nOur behavior-diagnostic game design framework can be applied to various applications, with the player space, game space and interaction model instantiated by domain-specific ones. 
For example, [22] studies how to generate games for the purpose of differentiating players using player performance instead of behavior trajectory as the observation space. In addition, we have considered the case when the player traits inferred from their game playing behavior are stationary. However, as pointed out by [26, 27, 4], there can be complex relationships between players\u2019 in-game and outside-game personality traits. In future work, we look forward to addressing this distribution shift.\n\nAcknowledgement W.C. acknowledges the support of Google. L.L. and P.R. acknowledge the support of ONR via N000141812861. Z.L. acknowledges the support of the AI Ethics and Governance Fund. This research was supported in part by a grant from J. P. Morgan.\n\nReferences\n[1] David Barber and Felix Agakov. The IM algorithm: A variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201, 2004.\n\n[2] Antoine Bechara, Antonio R Damasio, Hanna Damasio, and Steven W Anderson. Insensitivity to future consequences following damage to human prefrontal cortex. Cognition, 50(1-3):7\u201315, 1994.\n\n[3] Melissa T Buelow and Julie A Suhr. Construct validity of the Iowa gambling task. Neuropsychology Review, 19(1):102\u2013114, 2009.\n\n[4] Alessandro Canossa, Josep B Martinez, and Julian Togelius. Give me a reason to dig Minecraft and psychology of motivation. In Conference on Computational Intelligence in Games (CIG), 2013.\n\n[5] Gary Charness, Uri Gneezy, and Alex Imas. Experimental methods: Eliciting risk preferences. Journal of Economic Behavior & Organization, 87:43\u201351, 2013.\n\n[6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. 
In Advances in Neural Information Processing Systems, pages 2172\u20132180, 2016.\n\n[7] Sheldon Cohen, Tom Kamarck, and Robin Mermelstein. A global measure of perceived stress. Journal of Health and Social Behavior, pages 385\u2013396, 1983.\n\n[8] Paolo Crosetto and Antonio Filippin. A theoretical and experimental appraisal of four risk elicitation methods. Experimental Economics, 19(3):613\u2013641, 2016.\n\n[9] Ed Diener, Derrick Wirtz, William Tov, Chu Kim-Prieto, Dong-won Choi, Shigehiro Oishi, and Robert Biswas-Diener. New well-being measures: Short scales to assess flourishing and positive and negative feelings. Social Indicators Research, 97(2):143\u2013156, 2010.\n\n[10] Alexandre Y Dombrovski, Luke Clark, Greg J Siegle, Meryl A Butters, Naho Ichikawa, Barbara J Sahakian, and Katalin Szanto. Reward/punishment reversal learning in older suicide attempters. American Journal of Psychiatry, 167(6):699\u2013707, 2010.\n\n[11] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications: A series of lectures, volume 33. US Government Printing Office, 1954.\n\n[12] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n\n[13] Ronald A Howard. Dynamic programming and Markov processes. 1960.\n\n[14] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.\n\n[15] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263\u2013292, 1979.\n\n[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[17] Kurt Kroenke, Robert L Spitzer, and Janet BW Williams. The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9):606\u2013613, 2001.\n\n[18] Sergey Levine. 
Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.\n\n[19] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In International Conference on Learning Representations, 2017.\n\n[20] Joseph T McGuire and Joseph W Kable. Decision makers calibrate behavioral persistence on the basis of time-interval experience. Cognition, 124(2):216\u2013226, 2012.\n\n[21] Ahmed A Moustafa, Michael X Cohen, Scott J Sherman, and Michael J Frank. A role for dopamine in temporal decision making and reward maximization in parkinsonism. Journal of Neuroscience, 28(47):12294\u201312304, 2008.\n\n[22] Thorbj\u00f8rn S Nielsen, Gabriella AB Barros, Julian Togelius, and Mark J Nelson. Towards generating arcade game rules with VGDL. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pages 185\u2013192. IEEE, 2015.\n\n[23] Lillian J Ratliff, Eric Mazumdar, and T Fiez. Risk-sensitive inverse reinforcement learning via gradient methods. arXiv preprint arXiv:1703.09842, 2017.\n\n[24] Daniel W Russell. UCLA Loneliness Scale (Version 3): Reliability, validity, and factor structure. Journal of Personality Assessment, 66(1):20\u201340, 1996.\n\n[25] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.\n\n[26] Shoshannah Tekofsky, Jaap Van Den Herik, Pieter Spronck, and Aske Plaat. PsyOps: Personality assessment through gaming behavior. In International Conference on the Foundations of Digital Games, 2013.\n\n[27] Nick Yee, Nicolas Ducheneaut, Les Nelson, and Peter Likarish. Introverted elves & conscientious gnomes: The expression of personality in World of Warcraft. In Conference on Human Factors in Computing Systems (CHI), 2011.\n\n[28] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 
2010.\n", "award": [], "sourceid": 2621, "authors": [{"given_name": "Fan", "family_name": "Yang", "institution": "Carnegie Mellon University"}, {"given_name": "Liu", "family_name": "Leqi", "institution": "Carnegie Mellon University"}, {"given_name": "Yifan", "family_name": "Wu", "institution": "Carnegie Mellon University"}, {"given_name": "Zachary", "family_name": "Lipton", "institution": "Carnegie Mellon University"}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": "Carnegie Mellon University"}, {"given_name": "Tom", "family_name": "Mitchell", "institution": "Carnegie Mellon University"}, {"given_name": "William", "family_name": "Cohen", "institution": "Google AI"}]}