{"title": "Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 5634, "page_last": 5642, "abstract": "Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic and multi-agent environment. They extend POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDPs, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo sampling to learn others' intentional models, which ascribe to them beliefs, preferences and rationality in action selection. Empirical results show that our algorithm accurately learns models of the other agent and outperforms methods that use subintentional models. Our approach serves as a generalized Bayesian learning algorithm that learns other agents' beliefs, strategy levels, and transition, observation and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.", "full_text": "Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs

Yanlin Han
Piotr Gmytrasiewicz
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607
{yhan37,piotr}@uic.edu

Abstract

Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic and multi-agent environment. They extend POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure.
In order to predict other agents' actions using I-POMDPs, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo sampling to learn others' intentional models, which ascribe to them beliefs, preferences and rationality in action selection. Empirical results show that our algorithm accurately learns models of the other agent and outperforms methods that use subintentional models. Our approach serves as a generalized Bayesian learning algorithm that learns other agents' beliefs, strategy levels, and transition, observation and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.

1 Introduction

Partially observable Markov decision processes (POMDPs) [11] are a general decision-theoretic framework for planning under uncertainty in a partially observable, stochastic environment. An autonomous agent acts rationally in such settings by constantly maintaining beliefs about the physical state and sequentially choosing the optimal actions that maximize the expected value of future rewards. Thus, solutions of POMDPs map an agent's beliefs to actions. Although POMDPs can be used in multi-agent settings, they usually treat the impacts of other agents' actions as noise and embed them into the state transition function. Examples of such POMDPs are Utile Suffix Memory [14], the infinite regionalized policy representation [13], and infinite POMDPs [3]. Therefore, an agent's beliefs about other agents are not part of the solutions of POMDPs.

Interactive POMDPs (I-POMDPs) [7] generalize POMDPs to multi-agent settings by replacing POMDP belief spaces with interactive belief systems. Specifically, an I-POMDP augments the plain beliefs about the physical states in a POMDP by including models of other agents.
The models of other agents included in the new augmented belief space consist of two types: intentional models and subintentional models. An intentional model ascribes beliefs, preferences, and rationality to other agents [7], while a simpler subintentional model, such as a finite state controller [15], does not. The augmentation with intentional models forms a hierarchical belief structure that represents an agent's belief about the physical state, belief about the other agents and their beliefs about others' beliefs, and so on. Solutions of I-POMDPs map an agent's belief about the environment and other agents' models to actions. It has been shown [7] that the added sophistication of modeling others as rational agents results in a higher value function compared to one obtained by treating others as noise, which implies the modeling superiority of I-POMDPs over other approaches.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

However, the interactive belief augmentation of I-POMDPs results in a drastic increase in belief space complexity, because the agent models grow exponentially as the belief nesting level increases. The complexity of the belief representation is therefore proportional to the number of belief dimensions, which is known as the curse of dimensionality. Moreover, because exact solutions to POMDPs are PSPACE-complete for finite time horizons and undecidable for infinite ones [16], the time complexity of the more general I-POMDPs is at least PSPACE-complete for finite horizons and undecidable for infinite horizons, since an I-POMDP may contain multiple POMDPs or I-POMDPs of other agents. Due to these complexities, a solution which accounts for an agent's belief over an entire intentional model has not been implemented to date.
There are partial solutions that depend on what is known about other agents' beliefs about the physical states [2], but they do not include the state of an agent's knowledge about others' reward, transition, and observation functions. Indirect approaches such as subintentional finite state controllers [15] do not include any of these elements either. To unleash the full modeling power of intentional models and mitigate the aforementioned complexities, a robust approximation algorithm is needed. The purpose of this algorithm is to compute the nested interactive belief over elements of the intentional models and predict other agents' actions. It is crucial to the trade-off between solution quality and computational complexity.

To address these issues, we propose an approach that uses Bayesian inference and customized sequential Monte Carlo sampling [4] to obtain approximate solutions to I-POMDPs. We assume that the modeling agent maintains beliefs over intentional models of other agents and makes sequential Bayesian updates using observations from the environment. While in multi-agent settings other agents' models, apart from their beliefs, are usually assumed to be known, we assume the modeling agent has no information about others' beliefs, strategy levels, and transition, observation, and reward functions. It relies only on learning indirectly from observations of the environment, which is influenced by other agents' actions. Since this Bayesian inference task is analytically intractable due to the requirement of computing high-dimensional integrations, we have devised a customized sequential Monte Carlo method, extending the interactive particle filter (I-PF) [2] to the entire intentional model space.
The main idea of this method is to descend the nested belief hierarchy, parametrize other agents' model functions, and sample all model parameters at each nesting level according to observations.

Our approach successfully recovers other agents' models over the intentional model space, which contains their beliefs, strategy levels, and transition, observation and reward functions. It extends I-POMDP's belief update to a larger model space, and therefore it serves as a generalized Bayesian learning method for multi-agent systems in which other agents' beliefs, transition, observation and reward functions are unknown. By approximating Bayesian inference using a customized sequential Monte Carlo sampling method, we significantly mitigate the belief space complexity of I-POMDPs.

2 The Model

2.1 I-POMDP framework

I-POMDPs [7] generalize POMDPs [11] to multi-agent settings by including models of other agents in the belief state space. The resulting hierarchical belief structure represents an agent's belief about the physical state, belief about the other agents and their beliefs about others' beliefs, and can be nested infinitely in this recursive manner. Here we focus on the computable counterparts of infinitely nested I-POMDPs: finitely nested I-POMDPs. For simplicity of presentation, we consider two interacting agents i and j. The formalism generalizes to a larger number of agents in a straightforward manner.

A finitely nested interactive POMDP of agent i, I-POMDP_{i,l}, is defined as:

I-POMDP_{i,l} = ⟨IS_{i,l}, A, Ω_i, T_i, O_i, R_i⟩   (1)

where IS_{i,l} is a set of interactive states, defined as IS_{i,l} = S × M_{j,l−1}, l ≥ 1, S is the set of physical states, M_{j,l−1} is the set of possible models of agent j, and l is the strategy (nesting) level.
The set of models, M_{j,l−1}, can be divided into two classes: the intentional models, IM_{j,l−1}, and the subintentional models, SM_j. Thus, M_{j,l−1} = IM_{j,l−1} ∪ SM_j.

The intentional models, IM_{j,l−1}, ascribe beliefs, preferences, and rationality in action selection to other agents; thus they are analogous to the types, θ_j, used in Bayesian games [10]. An intentional model of agent j at level l−1 is defined as θ_{j,l−1} = ⟨b_{j,l−1}, A, Ω_j, T_j, O_j, R_j, OC_j⟩, where b_{j,l−1} is agent j's belief nested to level l−1, b_{j,l−1} ∈ Δ(IS_{j,l−1}), and OC_j is j's optimality criterion. It can be rewritten as θ_{j,l−1} = ⟨b_{j,l−1}, θ̂_j⟩, where θ̂_j includes all elements of the intentional model other than the belief and is called agent j's frame.

The subintentional models, SM_j, constitute the remaining models in M_{j,l−1}. Examples of subintentional models are finite state controllers [15], no-information models [8] and fictitious play models [5].

The set IS_{i,l} can be defined in an inductive manner:

IS_{i,0} = S,            Θ_{j,0} = {⟨b_{j,0}, θ̂_j⟩ : b_{j,0} ∈ Δ(IS_{j,0})},   M_{j,0} = Θ_{j,0} ∪ SM_j
IS_{i,1} = S × M_{j,0},  Θ_{j,1} = {⟨b_{j,1}, θ̂_j⟩ : b_{j,1} ∈ Δ(IS_{j,1})},   M_{j,1} = Θ_{j,1} ∪ M_{j,0}
......
IS_{i,l} = S × M_{j,l−1},  Θ_{j,l} = {⟨b_{j,l}, θ̂_j⟩ : b_{j,l} ∈ Δ(IS_{j,l})},   M_{j,l} = Θ_{j,l} ∪ M_{j,l−1}   (2)

All remaining components of an I-POMDP are similar to those of a POMDP. A = A_i × A_j is the set of joint actions of all agents. Ω_i is the set of agent i's possible observations. T_i : S × A × S → [0, 1] is the transition function. O_i : S × A × Ω_i → [0, 1] is the observation function.
R_i : IS_i × A → ℝ is the reward function.

2.2 Interactive belief update

Given the definitions above, the interactive belief update can be performed as follows, by considering others' actions and anticipated observations:

b^t_{i,l}(is^t) = Pr(is^t | b^{t−1}_{i,l}, a^{t−1}_i, o^t_i)
  = α Σ_{is^{t−1}} b^{t−1}_{i,l}(is^{t−1}) Σ_{a^{t−1}_j} Pr(a^{t−1}_j | θ^{t−1}_{j,l−1}) T(s^{t−1}, a^{t−1}, s^t) O_i(s^t, a^{t−1}, o^t_i)
    × Σ_{o^t_j} O_j(s^t, a^{t−1}, o^t_j) τ(b^{t−1}_{j,l−1}, a^{t−1}_j, o^t_j, b^t_{j,l−1})   (3)

Compared with POMDPs, the interactive belief update in an I-POMDP takes two additional elements into account. First, the probability of the other agent's actions given its models needs to be computed, since the state now depends on both agents' actions (the second summation). Second, the modeling agent needs to update its beliefs based on the anticipation of what observations the other agent might get and how it updates (the third summation).

Similarly to POMDPs, the value associated with a belief state in an I-POMDP can be computed using value iteration:

V(θ_{i,l}) = max_{a_i ∈ A_i} { Σ_{is ∈ IS} b_{i,l}(is) ER_i(is, a_i) + γ Σ_{o_i ∈ Ω_i} P(o_i | a_i, b_{i,l}) V(⟨SE_{θ_i}(b_{i,l}, a_i, o_i), θ̂_i⟩) }   (4)

where ER_i(is, a_i) = Σ_{a_j} R_i(is, a_i, a_j) Pr(a_j | θ_{j,l−1}).

Then the optimal action, a*_i, for an infinite horizon criterion with discounting, is part of the set of optimal actions, OPT(θ_{i,l}), for the belief state:

OPT(θ_{i,l}) = argmax_{a_i ∈ A_i} { Σ_{is ∈ IS} b_{i,l}(is) ER_i(is, a_i) + γ Σ_{o_i ∈ Ω_i} P(o_i | a_i, b_{i,l}) V(⟨SE_{θ_i}(b_{i,l}, a_i, o_i), θ̂_i⟩) }   (5)

3 Sampling Algorithms

The Markov Chain Monte Carlo
(MCMC) method [6] is widely used to approximate probability distributions that are difficult to compute directly. Sequential versions of Monte Carlo methods, such as particle filters [1], work on sequential inference tasks, especially sequential decision making under the Markov assumption. At each time step, a particle filter draws samples (or particles) from a proposal distribution, commonly the conditional distribution p(x_t | x_{t−1}) of the current state x_t given the previous state x_{t−1}, then uses the observation function p(y_t | x_t) to compute importance weights for all particles and resamples them according to the weights.

The interactive particle filter (I-PF) was devised as a filtering algorithm for the interactive belief update in I-POMDPs; it generalizes the classic particle filter algorithm to multi-agent settings [2]. It uses the state transition function as the proposal distribution, as is usually done in a specific particle filter algorithm called the bootstrap filter [9]. However, due to the enormous belief space, the I-PF implementation assumes that the other agent's frame θ̂_j is known to the modeling agent; it therefore simplifies the belief update from S × Θ_{j,l−1} to the significantly smaller space S × {b_{j,l−1}}, where j represents the other agent and Θ_{j,l−1} is j's model space.

Our interactive belief update, described in Algorithms 1 and 2, however, generalizes I-POMDP's belief update to the larger intentional model space, which contains other agents' beliefs and transition, observation and reward functions. In the remaining part of this section, we will first give a brief introduction to our algorithms and discuss the motivations for each sampling step. Then we will show the major differences between our algorithm and the I-PF, since this generalization is nontrivial.
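To make the generic propose-weight-resample cycle described above concrete, here is a minimal bootstrap-filter sketch. The model below (a 1-D Gaussian random walk with a Gaussian observation likelihood) is purely illustrative and is not the I-PF itself; all names and numbers are our own assumptions:

```python
import numpy as np

def bootstrap_step(particles, propagate, likelihood, y, rng):
    """One bootstrap-filter step: propose from the transition model,
    weight by the observation likelihood, then resample."""
    proposed = propagate(particles, rng)          # x_t ~ p(x_t | x_{t-1})
    w = likelihood(y, proposed)                   # w_n = p(y_t | x_t^(n))
    w = w / w.sum()                               # normalize importance weights
    idx = rng.choice(len(proposed), size=len(proposed), p=w)
    return proposed[idx]                          # resample according to weights

# Illustrative 1-D Gaussian random-walk model.
rng = np.random.default_rng(0)
particles = rng.normal(0.0, 5.0, size=2000)                        # diffuse prior
propagate = lambda x, rng: x + rng.normal(0.0, 0.1, size=x.shape)  # random walk
likelihood = lambda y, x: np.exp(-0.5 * (y - x) ** 2)              # unit-variance obs
for y in [1.0, 1.1, 0.9, 1.0]:
    particles = bootstrap_step(particles, propagate, likelihood, y, rng)
print(float(particles.mean()))  # posterior mean settles near the observations
```

Using the transition model itself as the proposal, as here, is exactly the bootstrap-filter design choice [9] that the I-PF inherits.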
A concrete example of the algorithm is given in Figure 1 in the next section as well.

Algorithm 1: Interactive Belief Update
b̃^t_{k,l} = InteractiveBeliefUpdate(b̃^{t−1}_{k,l}, a^{t−1}_k, o^t_k, l), with l > 0
1  for each is^{(n),t−1}_k = ⟨s^{(n),t−1}, θ^{(n),t−1}_{−k,l−1}⟩ ∈ b̃^{t−1}_{k,l}:
2    sample a^{t−1}_{−k} ∼ P(a_{−k} | θ^{(n),t−1}_{−k,l−1})
3    sample s^{(n),t} ∼ T_k(s^t | s^{(n),t−1}, a^{t−1}_k, a^{t−1}_{−k})
4    for o^t_{−k} ∈ Ω_{−k}:
5      if l = 1:
6        b^{(n),t}_{−k,0} = Level0BeliefUpdate(θ^{(n),t−1}_{−k,0}, a^{t−1}_{−k}, o^t_{−k})
7        is^{(n),t}_k = ⟨s^{(n),t}, θ^{(n),t}_{−k,0}⟩
8      else:
9        b^{(n),t}_{−k,l−1} = InteractiveBeliefUpdate(b̃^{(n),t−1}_{−k,l−1}, a^{t−1}_{−k}, o^t_{−k}, l − 1)
10       θ^{(n),t}_{−k,l−1} = ⟨b^{(n),t}_{−k,l−1}, θ̂^{(n),t−1}_{−k,l−1}⟩
11       is^{(n),t}_k = ⟨s^{(n),t}, θ^{(n),t}_{−k,l−1}⟩
12     w^{(n)}_t = O^{(n)}_{−k}(o^t_{−k} | s^{(n),t}, a^{t−1}_k, a^{t−1}_{−k})
13     w^{(n)}_t = w^{(n)}_t × O_k(o^t_k | s^{(n),t}, a^{t−1}_k, a^{t−1}_{−k})
14     append ⟨is^{(n),t}_k, w^{(n)}_t⟩ to b̃^{temp}_{k,l}
15 normalize all w^{(n)}_t so that Σ^N_{n=1} w^{(n)}_t = 1
16 resample {is^{(n),t}_k} from b̃^{temp}_{k,l} according to the normalized {w^{(n)}_t}
17 resample θ^{(n),t}_{−k,l−1} ∼ N(θ^t_{−k,l−1} | θ^{(n),t−1}_{−k,l−1}, Σ)
18 return b̃^t_{k,l} = ⟨s^{(n),t}, θ^{(n),t}_{−k,l−1}⟩

Algorithm 1 requires as inputs the modeling agent's prior belief, b̃^{t−1}_{k,l}, which is represented as a set of n samples is^{(n),t−1}_k, the observation, o^t_k, along with the action, a^{t−1}_k, and the belief nesting level, l > 0.
Here k represents either agent i or j, and −k represents the other agent, j or i, correspondingly. We assume that the modeled agent's action set A_{−k}, observation set Ω_{−k} and optimality criterion OC_{−k} are known to all agents. We want to learn the other agent's initial belief about the physical state, b^0_{−k}, the transition function, T_{−k}, the observation function, O_{−k}, and the reward function, R_{−k}.

The initial belief samples, is^{(n),t−1}_k, are generated from the prior nested belief in a similar way as described in the I-PF literature [2], except that T^{(n)}_{−k}, O^{(n)}_{−k}, and R^{(n)}_{−k} are sampled from their prior distributions as well. Notice that T^{(n)}_{−k}, O^{(n)}_{−k}, and R^{(n)}_{−k} are all part of the frame, namely θ̂^{(n)}_{−k} = ⟨A_{−k}, Ω_{−k}, T^{(n)}_{−k}, O^{(n)}_{−k}, R^{(n)}_{−k}, OC_{−k}⟩, as appears in lines 7 and 11 of Algorithm 1.

With the initial belief samples, Algorithm 1 starts by propagating each sample forward in time and computing their weights (lines 1-15); it then resamples according to the weights and the similarity between models (lines 16-18). Intuitively, the samples associated with the actual observations perceived by agent k will gradually carry larger weights and be resampled more often; therefore they will approximately represent the exact belief. Specifically, for each is^{(n),t−1}_k, the algorithm samples the other agent's optimal actions a^{t−1}_{−k} given its model from P(A_{−k} | θ^{(n),t−1}_{−k,l−1}) (line 2), which equals 1/|OPT| if a_{−k} ∈ OPT and 0 otherwise. Then it samples the physical state s^{(n),t} using the state transition function T_k(s^t | s^{(n),t−1}, a^{t−1}_k, a^{t−1}_{−k}) (line 3).
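In code, the two sampling steps of lines 2 and 3 might look like the following sketch; the names and the toy transition tensor are our own assumptions, with OPT assumed to be precomputed by solving the sampled model:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_action(opt_actions, rng):
    """Line 2: P(a_{-k} | model) is uniform over the model's optimal set
    OPT and zero elsewhere, so sampling is a uniform draw from OPT."""
    return opt_actions[rng.integers(len(opt_actions))]

def sample_next_state(T, s, a_k, a_mk, rng):
    """Line 3: draw the next physical state from the categorical
    distribution T[s, a_k, a_mk, :]."""
    return int(rng.choice(T.shape[-1], p=T[s, a_k, a_mk]))

# Hypothetical 2-state, 2-action tensor T[s, a_k, a_mk, s']; rows sum to 1.
T = np.full((2, 2, 2, 2), 0.5)
T[:, 0, :, :] = [0.9, 0.1]   # own action 0 drives the system toward state 0
a = sample_action(["listen"], rng)
s_next = sample_next_state(T, s=0, a_k=0, a_mk=1, rng=rng)
print(a, s_next)
```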
Then, for each possible observation, if the current nesting level l is 1, it calls the level-0 belief update, described in Algorithm 2, to update the other agent's belief over the physical state, b^t_{−k,0} (lines 5 to 7); if l is greater than 1, it recursively calls itself at the lower level l − 1 (lines 8 to 11). The sample weights w^{(n)}_t are computed according to the observation likelihoods of the modeling and modeled agents (lines 12, 13). Lastly, the algorithm normalizes the weights (line 15), resamples the intermediate particles (line 16), and resamples another time from similar neighboring models using a Gaussian distribution to avoid divergence (line 17).

Algorithm 2: Level-0 Belief Update
b^t_{k,0} = Level0BeliefUpdate(θ^{t−1}_{k,0}, a^{t−1}_k, o^t_k)
1  get T_k and O_k from θ^{t−1}_{k,0}
2  P(a^{t−1}_{−k}) = 1/|A_{−k}|
3  for s^t ∈ S:
4    sum = 0
5    for s^{t−1} ∈ S:
6      for a^{t−1}_{−k} ∈ A_{−k}:
7        P(s^t | s^{t−1}, a^{t−1}_k) += T_k(s^t | s^{t−1}, a^{t−1}_k, a^{t−1}_{−k}) P(a^{t−1}_{−k})
8      sum += P(s^t | s^{t−1}, a^{t−1}_k) b^{t−1}_{k,0}(s^{t−1})
9    for a^{t−1}_{−k} ∈ A_{−k}:
10     P(o^t_k | s^t, a^{t−1}_k) += O_k(o^t_k | s^t, a^{t−1}_k, a^{t−1}_{−k}) P(a^{t−1}_{−k})
11   b^t_{k,0}(s^t) = sum × P(o^t_k | s^t, a^{t−1}_k)
12 normalize and return b^t_{k,0}

The level-0 belief update, described in Algorithm 2, takes the agent model, θ^{t−1}_{k,0}, action, a^{t−1}_k, and observation, o^t_k, as input arguments and returns the belief about the physical state, b^t_{k,0}. The other agent's actions are treated as noise (line 2), and the transition and observation functions are passed in within the first input argument, θ^{t−1}_{k,0}. For each possible action a^{t−1}_{−k}, it computes the actual state transition (line 7) and observation function (line 10) by marginalizing over others' actions, and returns the normalized belief b^t_{k,0}. Notice that the transition and observation functions, T_k(s^t | s^{t−1}, a^{t−1}_k, a^{t−1}_{−k}) and O_k(o^t_k | s^t, a^{t−1}_k, a^{t−1}_{−k}), contained in θ^{t−1}_{k,0}, depend on the particular model parameters of the actual agent at the 0th level.

Our interactive belief update algorithm differs in three major ways from the I-PF. First, in order to update the belief over the intentional model space of other agents, their initial beliefs, transition functions, observation functions and reward functions are all unknown and become samples: for instance, the set of n samples of other agents' intentional models is θ^{(n),t−1}_{−k,l−1} = ⟨b^{(n),t−1}_{−k,l−1}, A_{−k}, Ω_{−k}, T^{(n)}_{−k}, O^{(n)}_{−k}, R^{(n)}_{−k}, OC_{−k}⟩. The observation function of the modeled agent, O^{(n)}_{−k}(o^t_{−k} | s^{(n),t}, a^{t−1}_k, a^{t−1}_{−k}) in line 12 of Algorithm 1, is consequently randomized as well. Second, the transition and observation functions of the level-0 agent, in lines 7 and 10 of Algorithm 2, are passed in as input arguments which correspond to each model sample. Lastly, we add another resampling step in line 17 of Algorithm 1 to avoid divergence, by resampling the model samples from a Gaussian distribution whose mean is the current sample value. This additional resampling step is nontrivial, since empirically the samples diverge quickly due to the enormously enlarged sample space.

4 Experiments

We evaluate our algorithm on the multi-agent tiger problem [7] and the UAV reconnaissance problem [2]. The multi-agent tiger game is a generalization of the classic single agent tiger game [11].
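As an aside before the experiments, Algorithm 2 is in essence the standard POMDP belief update with the other agent's action marginalized out uniformly. A minimal vectorized sketch of that computation, under hypothetical two-state matrices of our own (not the paper's code):

```python
import numpy as np

def level0_belief_update(b, T, O, a_k, o_k):
    """POMDP belief update treating the other agent's action as uniform noise,
    mirroring Algorithm 2. b: belief over states, shape (S,);
    T[a_k, a_mk, s, s'] transition; O[a_k, a_mk, s', o] observation."""
    p_a = np.full(T.shape[1], 1.0 / T.shape[1])        # P(a_{-k}) = 1/|A_{-k}|
    T_marg = np.tensordot(p_a, T[a_k], axes=(0, 0))    # marginalized (S, S')
    O_marg = np.tensordot(p_a, O[a_k], axes=(0, 0))    # marginalized (S', O)
    b_new = O_marg[:, o_k] * (b @ T_marg)              # predict, then correct
    return b_new / b_new.sum()                         # normalize

# Hypothetical two-state example: sticky transitions, informative observations.
T = np.tile(np.array([[0.9, 0.1], [0.1, 0.9]]), (1, 2, 1, 1))    # shape (1, 2, 2, 2)
O = np.tile(np.array([[0.85, 0.15], [0.15, 0.85]]), (1, 2, 1, 1))
b_next = level0_belief_update(np.array([0.5, 0.5]), T, O, a_k=0, o_k=0)
print(b_next)   # belief shifts toward state 0 after observing o = 0
```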
It contains additional observations caused by others' actions, and the transition and reward functions involve others' actions as well. The UAV reconnaissance problem contains a 3x3 grid in which the agent (UAV) tries to capture a moving target [2].

4.1 Parameterization

The initial step of solving an I-POMDP in our approach is to parameterize other agents' models in terms of an I-POMDP or POMDP, depending on the modeling agent's strategy level. Then, the model parameters can be sampled and updated using the interactive belief update algorithm for solving the planning task.

Here we give an example of parameterization using the tiger problem; the UAV problem follows a similar process. For simplicity of presentation, assume there are two agents i and j in the game and the strategy level is 1 (we experiment with higher strategy levels in later sections). Then, for the two-agent tiger problem: IS_{i,1} = S × θ_{j,0}, where S = {tiger on the left (TL), tiger on the right (TR)} and θ_{j,0} = ⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩; A = A_i × A_j are joint actions of listen (L), open left door (OL) and open right door (OR); Ω_i = {growl from left (GL) or right (GR)} × {creak from left (CL), right (CR) or silence (S)}; T_i = T_j : S × A_i × A_j × S → [0, 1]; O_i : S × A_i × A_j × Ω_i → [0, 1]; R_i : IS × A_i × A_j → ℝ.

As mentioned before, we assume that A_j and Ω_j are known, and OC_j is an infinite horizon criterion with discounting. We want to recover the possible initial belief, b^0_j, about the physical state, the transition function, T_j, the observation function, O_j, and the reward function, R_j.
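As one concrete, entirely illustrative way to seed this learning, a prior sample of j's model is just a draw of the eight parameters named in Table 1, from which the corresponding stochastic matrices can be assembled. The priors and function names below are our own assumptions, not the paper's implementation:

```python
import numpy as np

def sample_candidate_model(rng):
    """Draw one candidate model of agent j: uniform priors on the five
    probability parameters, a wide Gaussian prior on the three reward levels."""
    b0, pT1, pT2, pO1, pO2 = rng.uniform(0.0, 1.0, size=5)
    rR1, rR2, rR3 = rng.normal(0.0, 50.0, size=3)
    return {"b0": b0, "pT1": pT1, "pT2": pT2, "pO1": pO1, "pO2": pO2,
            "rR1": rR1, "rR2": rR2, "rR3": rR3}

def transition_matrix(m, a_j):
    """P(s' | s, a_j) over (TL, TR): listening preserves the tiger's location
    with probability pT1; opening a door is governed by pT2 for either state."""
    pT1, pT2 = m["pT1"], m["pT2"]
    if a_j == "L":
        return np.array([[pT1, 1 - pT1], [1 - pT1, pT1]])
    p_tl = pT2 if a_j == "OL" else 1 - pT2
    return np.array([[p_tl, 1 - p_tl], [p_tl, 1 - p_tl]])

rng = np.random.default_rng(2)
particles = [sample_candidate_model(rng) for _ in range(1000)]
# Every sampled transition matrix is properly stochastic by construction.
assert all(np.allclose(transition_matrix(m, a).sum(axis=1), 1.0)
           for m in particles[:10] for a in ("L", "OL", "OR"))
```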
Thus the main idea of our experiment is to do Bayesian parametric learning with the help of our sampling algorithm.

Table 1: Parameters for the transition, observation and reward functions

Transition T_j:
A    S    p(TL)      p(TR)
L    TL   pT1        1 − pT1
L    TR   1 − pT1    pT1
OL   *    pT2        1 − pT2
OR   *    1 − pT2    pT2

Observation O_j:
A    S    p(GL)      p(GR)
L    TL   pO1        1 − pO1
L    TR   1 − pO1    pO1
OL   *    pO2        1 − pO2
OR   *    1 − pO2    pO2

Reward R_j:
S    A    R
*    L    rR1
TL   OL   rR2
TR   OR   rR2
TL   OR   rR3
TR   OL   rR3

We see in Table 1 that it is a large 8-dimensional space to learn from: b^0_j × pT1 × pT2 × pO1 × pO2 × rR1 × rR2 × rR3, where {b_j, pT1, pT2, pO1, pO2} ∈ [0, 1]^5 ⊂ ℝ^5 and {rR1, rR2, rR3} ∈ (−∞, +∞)^3.

Figure 1 illustrates the interactive belief update in the game described above, assuming the sample size is 8. The subscripts denote the corresponding agents and each dot represents a particular belief sample. The propagation step is implemented in lines 2 to 11 of Algorithm 1, the weighting step in lines 12 to 15, and the resampling step in lines 16 and 17. The belief update for a particular level-0 model sample, θ_j = ⟨0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10⟩, is solved using Algorithm 2.

4.2 Results

We first fix the modeled agent j to be a level-2 I-POMDP agent and experiment with different modeling approaches for agent i in order to compare their performance in terms of average reward.
We compare level-3, level-2 and level-1 intentional I-POMDP models with a subintentional model, in which agent j is assumed to choose its actions according to a fixed but unknown distribution and which is therefore called a frequency-based (fictitious play) model [5].

In Figure 2, we see that the intentional I-POMDP approaches achieve significantly higher reward as agent i perceives more observations; level-2 I-POMDP performs slightly better than level-1, while level-3 has high variance but at least competes with level-2. The subintentional approach has some learning ability but is not sophisticated enough to model a rational (level-2 intentional I-POMDP) agent; therefore its performance is worse than that of all the I-POMDP models.

Figure 1: An illustration of the interactive belief update algorithm using the tiger problem for two agents and level-1 nesting.

Figure 2: Performance comparison in terms of average reward per time step versus observation length. The plot is averaged over 5 runs and uses 2000 and 1000 samples for the tiger and UAV problems respectively. The vertical bars stand for standard deviations.

Figure 3: Learning quality, measured by KL-divergence, improves as the number of particles increases. It measures the difference between the ground truth of the model parameters and the learned posterior distributions. The vertical bars are the standard deviations.

In Figure 3 we show that the learning quality, in terms of the sum of the independent KL-divergences of each model parameter dimension, improves as the number of particles increases. It measures the difference between the ground truth of the model parameters and the learned posterior distributions by giving the relative entropy of the truth with respect to the posteriors.

Figure 4: Agent i learns agent j's most likely nesting level. Samples representing j's models of different nesting levels evolve as agent i perceives more observations.
A total of 1000 samples is used, and the experiments start from equal numbers of level-1, level-0 and frequency-based samples.

Then we fix the modeling agent i's strategy level to be 2 and observe the changes in j's samples, which represent different possible models or strategy levels. Specifically, we start from equal numbers of samples representing j as a level-1 I-POMDP, a level-0 POMDP, and a frequency-based agent, and then gradually learn that the majority of samples converge to, or become close to, the ground truth: j is a level-1 I-POMDP agent.

Table 2: Running time for the tiger and UAV problems using various numbers of samples

Tiger:
Belief Level   N=500              N=1000               N=2000
1              1.96s ± 0.43s      3.68s ± 1.01s        35.2s ± 2.82s
2              5m27.23s ± 5.19s   16m36.07s ± 10.84s   49m36.07s (single run)

UAV:
Belief Level   N=100              N=500                N=1000
1              4.86s ± 1.34s      12.31s ± 1.39s       2m1.43s ± 3.29s
2              2m43.1s ± 3.98s    9m53.7s ± 6.48s      36m19.5s ± 18.63s

Lastly, we report the running time of our sampling algorithm in Table 2. The computing machine has an Intel Core i5 2GHz and 8GB RAM, and runs macOS 10.13 and MATLAB R2017.

5 Conclusions and Future Work

We have described a novel approach to learning other agents' intentional models by performing the interactive belief update using Bayesian inference and Monte Carlo sampling methods. We show the correctness of our theoretical framework using the multi-agent tiger and UAV problems, in which it accurately learns others' models over the entire intentional model space, and it can be generalized to problems of larger scale in a straightforward manner.
Therefore, it provides a generalized Bayesian learning algorithm for multi-agent sequential decision making problems.

As for future research opportunities, the applications presented in this paper can be extended to more complex problems by leveraging emerging deep reinforcement learning (DRL) methods, which have already been used to solve POMDPs in a neural analogy [12]. DRL should also be capable of approximating key functions in I-POMDPs, and thus has the potential to serve as an efficient computational tool for I-POMDPs.

References

[1] Pierre Del Moral. Non-linear filtering: interacting particle resolution. Markov Processes and Related Fields, 2(4):555–581, 1996.

[2] Prashant Doshi and Piotr J Gmytrasiewicz. Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research, 34:297–337, 2009.

[3] Finale Doshi-Velez, David Pfau, Frank Wood, and Nicholas Roy. Bayesian nonparametric methods for partially-observable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):394–407, 2015.

[4] Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pages 3–14. Springer, 2001.

[5] Drew Fudenberg and David K Levine. The Theory of Learning in Games, volume 2. MIT Press, 1998.

[6] Walter R Gilks, Sylvia Richardson, and David J Spiegelhalter. Introducing Markov chain Monte Carlo. Markov Chain Monte Carlo in Practice, 1:19, 1996.

[7] Piotr J Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

[8] Piotr J Gmytrasiewicz and Edmund H Durfee. Rational coordination in multi-agent environments. Autonomous Agents and Multi-Agent Systems, 3(4):319–350, 2000.

[9] Neil J Gordon, David J Salmond, and Adrian FM Smith.
Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET, 1993.

[10] John C Harsanyi. Games with incomplete information played by "Bayesian" players, I–III: Part I. The basic model. Management Science, 14(3):159–182, 1967.

[11] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998.

[12] Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pages 4697–4707, 2017.

[13] Miao Liu, Xuejun Liao, and Lawrence Carin. The infinite regionalized policy representation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 769–776, 2011.

[14] Andrew Kachites McCallum and Dana Ballard. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, Dept. of Computer Science, 1996.

[15] Alessandro Panella and Piotr J Gmytrasiewicz. Bayesian learning of other agents' finite controllers for interactive POMDPs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2530–2536, 2016.

[16] Christos H Papadimitriou and John N Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
", "award": [], "sourceid": 2711, "authors": [{"given_name": "Yanlin", "family_name": "Han", "institution": "University of Illinois at Chicago"}, {"given_name": "Piotr", "family_name": "Gmytrasiewicz", "institution": "UIC"}]}