{"title": "Efficient Offline Communication Policies for Factored Multiagent POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1917, "page_last": 1925, "abstract": "Factored Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent's local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents' decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.", "full_text": "Ef\ufb01cient Of\ufb02ine Communication Policies for\n\nFactored Multiagent POMDPs\n\nJo\u02dcao V. Messias\n\nInstitute for Systems and Robotics\n\nInstituto Superior T\u00b4ecnico\n\nLisbon, Portugal\n\nMatthijs T.J. Spaan\n\nDelft University of Technology\n\nDelft, The Netherlands\n\nm.t.j.spaan@tudelft.nl\n\njmessias@isr.ist.utl.pt\n\nPedro U. Lima\n\nInstitute for Systems and Robotics\n\nInstituto Superior T\u00b4ecnico\n\nLisbon, Portugal\n\npal@isr.ist.utl.pt\n\nAbstract\n\nFactored Decentralized Partially Observable Markov Decision Processes (Dec-\nPOMDPs) form a powerful framework for multiagent planning under uncertainty,\nbut optimal solutions require a rigid history-based policy representation. In this\npaper we allow inter-agent communication which turns the problem in a central-\nized Multiagent POMDP (MPOMDP). 
We map belief distributions over state fac-\ntors to an agent\u2019s local actions by exploiting structure in the joint MPOMDP pol-\nicy. The key point is that when sparse dependencies between the agents\u2019 decisions\nexist, often the belief over its local state factors is suf\ufb01cient for an agent to un-\nequivocally identify the optimal action, and communication can be avoided. We\nformalize these notions by casting the problem into convex optimization form, and\npresent experimental results illustrating the savings in communication that we can\nobtain.\n\n1 Introduction\n\nIntelligent decision making in real-world scenarios requires an agent to take into account its limita-\ntions in sensing and actuation. These limitations lead to uncertainty about the state of environment,\nas well as how the environment will respond to performing a certain action. When multiple agents\ninteract and cooperate in the same environment, the optimal decision-making problem is particularly\nchallenging. For an agent in isolation, planning under uncertainty has been studied using decision-\ntheoretic models like Partially Observable Markov Decision Processes (POMDPs) [4]. Our focus\nis on multiagent techniques, building on the factored Multiagent POMDP model. In this paper, we\npropose a novel method that exploits sparse dependencies in such a model in order to reduce the\namount of inter-agent communication.\n\nThe major source of intractability for optimal Dec-POMDP solvers is that they typically reason over\nall possible histories of observations other agents can receive. In this work, we consider factored\nDec-POMDPs in which communication between agents is possible, which has already been explored\nfor non-factored models [10, 11, 15, 13] as well as for factored Dec-MDPs [12]. When agents share\ntheir observations at each time step, the decentralized problem reduces to a centralized one, known\nas a Multiagent POMDP (MPOMDP) [10]. 
In this work, we develop individual policies which map\nbeliefs over state factors to actions or communication decisions.\n\n1\n\n\fMaintaining an exact, factorized belief state is typically not possible in cooperative problems. While\nbounded approximations are possible for probabilistic inference [2], these results do not carry over\ndirectly to decision-making settings (but see [5]). Intuitively, even a small difference in belief can\nlead to a different action being taken. However, when sparse dependencies between the agents\u2019\ndecisions exist, often the belief over its local state factors is suf\ufb01cient for an agent to identify the\naction that it should take, and communication can be avoided. We formalize these notions as con-\nvex optimization problems, extracting those situations in which communication is super\ufb02uous. We\npresent experimental results showing the savings in communication that we can obtain, and the\noverall impact on decision quality.\n\nThe rest of the paper is organized as follows. First, Section 2 presents the necessary background\nmaterial. Section 3 presents the formalization of our method to associate belief points over state\nfactors to actions. Next, Section 4 illustrates the concepts with experimental results, and Section 5\nprovides conclusions and discusses future work.\n\n2 Background\n\nIn this section we provide background on factored Dec-POMDPs and Multiagent POMDPs.\n\nA factored Dec-POMDP is de\ufb01ned as the following tuple [8]:\n\nD = {1, ..., n} is the set of agents. Di will be used to refer to agent i;\nS = \u00d7iXi, i = 1, . . . , nf is the state space, decomposable into nf factors Xi \u2208 {1, ..., mi} which\nlie inside a \ufb01nite range of integer values. X = {X1, . . . , Xnf } is the set of all state factors;\nA = \u00d7iAi, i = 1, ..., n is the joint action space. 
At each step, every agent i takes an individual\n\naction ai \u2208 Ai, resulting in the joint action a = ha1, ..., ani \u2208 A;\n\nO = \u00d7iOi, i = 1, ..., n is the space of joint observations o = ho1, ..., oni, where oi \u2208 Oi are the\n\nindividual observations. An agent receives only its own observation;\n\nT : S \u00d7 S \u00d7 A \u2192 [0, 1] speci\ufb01es the transition probabilities Pr (s\u2032|s, a);\nO : O \u00d7 S \u00d7 A \u2192 [0, 1] speci\ufb01es the joint observation probabilities Pr (o|s\u2032, a);\nR : S \u00d7 A \u2192 R speci\ufb01es the reward for performing action a \u2208 A in state s \u2208 S;\nb0 \u2208 B is the initial state distribution. The set B is the space of all possible distributions over S;\nh is the planning horizon.\n\nThe main advantage of factored (Dec-)POMDP models over their standard formulation lies in their\nmore ef\ufb01cient representation. Existing methods for factored Dec-POMDPs can partition the decision\nproblem across local subsets of agents, due to the possible independence between their actions and\nobservations [8]. A natural state-space decomposition is to perform an agent-wise factorization, in\nwhich a state in the environment corresponds to a unique assignment over the states of individual\nagents. Note that this does not preclude the existence of state factors which are common to multiple\nagents.\n\nThe possibility of exchanging information between agents greatly in\ufb02uences the overall complexity\nof solving a Dec-POMDP. In a fully communicative Dec-POMDP, the decentralized model can be\nreduced to a centralized one, the so-called Multiagent POMDP (MPOMDP) [10]. An MPOMDP is\na regular single-agent POMDP but de\ufb01ned over the joint models of all agents. In a Dec-POMDP,\nat each t an agent i knows only ai and oi, while in an MPOMDP, it is assumed to know a and o.\nIn the latter case, inter-agent communication is necessary to share the local observations. 
Solving an MPOMDP is of a lower complexity class than solving a Dec-POMDP (PSPACE-complete vs. NEXP-complete) [1].

It is well-known that, for a given decision step t, the value function V^t of a POMDP is a piecewise linear, convex function [4], which can be represented as

    V^t(b^t) = max_{α ∈ Γ^t} α^T · b^t,    (1)

where Γ^t is a set of vectors (traditionally referred to as α-vectors). Every α ∈ Γ^t has a particular joint action a associated to it, which we will denote as φ(α). The transpose operator is here denoted as (·)^T. In this work, we assume that a value function is given for the Multiagent POMDP. However, this value function need not be optimal, nor stationary. Our techniques preserve the quality of the supplied value function, even if it is an approximation.

A joint belief state is a probability distribution over the set of states S, and encodes all of the information gathered by all agents in the Dec-POMDP up to a given time t:

    b^t(s) = Pr(s^t | o^{t-1}, a^{t-1}, o^{t-2}, a^{t-2}, ..., o^1, a^1, b^0) = Pr(X^t_1, ..., X^t_{n_f} | ·)    (2)

A factored belief state is a representation of this very same joint belief as the product of n_F assumed independent belief states over the state factors X_i, which we will refer to as belief factors:

    b^t = ×_{i=1}^{n_F} b^t_{F_i}    (3)

Every factor b^t_{F_i} is defined over a subset F_i ⊆ X of state factors, so that:

    b^t(s) ≃ Pr(F^t_1 | ·) Pr(F^t_2 | ·) ··· Pr(F^t_{n_F} | ·),    (4)

with F_i ∩ F_j = ∅, ∀ i ≠ j. 
A belief point over factors L which are locally available to the agent will be denoted b_L. The marginalization of b onto b_F is:

    b^t_F(F^t) = Pr(F^t | a^{1,...,t-1}, o^{1,...,t-1}) = Σ_{X^t \ F^t} Pr(X^t_1, X^t_2, ..., X^t_{n_f} | ·) = Σ_{X^t \ F^t} b^t(s^t),    (5)

which can be viewed as a projection of b onto the smaller subspace B_F:

    b_F = M^X_F b,    (6)

where M^X_F is a matrix in which M^X_F(u, v) = 1 if the assignments to all state factors contained in state u ∈ F are the same as in state v ∈ X, and 0 otherwise. This intuitively carries out the marginalization of points in B onto B_F.

3 Exploiting Sparse Dependencies in Multiagent POMDPs

In the implementation of Multiagent POMDPs, an important practical issue is raised: since the joint policy arising from the value function maps joint beliefs to joint actions, all agents must maintain and update the joint belief equivalently for their decisions to remain consistent. The amount of communication required to make this possible can then become problematically large. Here, we will deal with a fully-communicative team of agents, but we will be interested in minimizing the necessary amount of communication. Even if agents can communicate with each other freely, they might not need to always do so in order to act independently, or even cooperatively.

The problem of when and what to communicate has been studied before for Dec-MDPs [12], where factors can be directly observed with no associated uncertainty, by reasoning over the possible local alternative actions to a particular assignment of observable state features. 
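As a minimal sketch of the projection (6), assuming a toy problem with two binary state factors (the state enumeration and the numbers are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Joint state space: two binary factors X1, X2, enumerated as s = (x1, x2),
# with x1 the slower-varying index (an assumption of this sketch).
states = [(x1, x2) for x1 in range(2) for x2 in range(2)]

# Projection matrix M for F = {X1}: M[u, v] = 1 iff joint state v agrees
# with local state u on every factor in F (here, on X1 only).
M = np.array([[1.0 if v[0] == u else 0.0 for v in states] for u in range(2)])

b = np.array([0.1, 0.2, 0.3, 0.4])   # a joint belief over the 4 joint states
b_F = M @ b                          # marginal belief over X1
print(b_F)                           # [0.3 0.7]
```

Each row of M sums the joint-belief entries that agree with one local assignment, so b_F is exactly the marginal of b over the chosen factor.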
For MPOMDPs, this\nhad been approximated at runtime, but implied keeping track and reasoning over a rapidly-growing\nnumber of possible joint belief points [11].\n\nWe will describe a method to map a belief factor (or several factors) directly to a local action, or to a\ncommunication decision, when applicable. Our approach is the \ufb01rst to exploit, of\ufb02ine, the structure\nof the value function itself in order to identify regions of belief space where an agent may act inde-\npendently. This raises the possibility of developing more \ufb02exible forms for joint policies which can\nbe ef\ufb01ciently decoupled whenever this is advantageous in terms of communication. Furthermore,\nsince our method runs of\ufb02ine, it is not mutually exclusive with online communication-reduction\ntechniques: it can be used as a basis for further computations at runtime, thereby increasing their\nef\ufb01ciency.\n\n3\n\n\f3.1 Decision-making with factored beliefs\n\nNote that, as fully described in [2], the factorization (4) typically results in an approximation of the\ntrue joint belief, since it is seldom possible to decouple the dynamics of a MDP into strictly inde-\npendent subprocesses. The dependencies between factors, induced by the transition and observation\nmodel of the joint process, quickly develop correlations when the horizon of the decision problem\nis increased, even if these dependencies are sparse. Still, it was proven in [2] that, if some of these\ndependencies are broken, the resulting error (measured as the KL-divergence) of the factored belief\nstate, with respect to the true joint belief, is bounded. Unfortunately, even a small error in the belief\nstate can lead to different actions being selected, which may signi\ufb01cantly affect the decision quality\nof the multiagent team in some settings [5, 9]. 
However, in rapidly-mixing processes (i.e., models\nwith transition functions which quickly propagate uncertainty), the overall negative effect of using\nthis approximation is minimized.\n\nEach belief factor\u2019s dynamics can be described using a two-stage Dynamic Bayesian Network\n(DBN). For an agent to maintain, at each time step, a set of belief factors, it must have access\nto the state factors contained in a particular time slice of the respective DBNs. This can be ac-\ncomplished either through direct observation, when possible, or by requesting this information from\nother agents. In the latter case, it may be necessary to perform additional communication in order\nto keep belief factors consistent. The amount of data to be communicated in this case, as well as its\nfrequency, depends largely on the factorization scheme which is selected for a particular problem.\nWe will not be here concerned with the problem of obtaining a suitable partition scheme of the joint\nbelief onto its factors. Such a partitioning is typically simple to identify for multi-agent teams which\nexhibit sparsity of interaction. Instead we will focus on the amount of communication which is\nnecessary for the joint decision-making of the multi-agent team.\n\n3.2 Formal model\n\nWe will hereafter focus on the value function, and its associated quantities, at a given decision step t,\nand, for simplicity, we shall omit this dependency. However, we restate that the value function does\nnot need to be stationary \u2013 for a \ufb01nite-horizon problem, the following methods can simply be applied\nfor every t = 1, . . . , h.\n\n3.2.1 Value Bounds Over Local Belief Space\n\nRecall that, for a given \u03b1-vector, V\u03b1(b) = \u03b1 \u00b7 b represents the expected reward for selecting the\naction associated with \u03b1. 
Ideally, if this quantity could be mapped from a local belief point b_L, then it would be possible to select the best action for an agent based only on its local information. This is typically not possible since the projection (6) is non-invertible. However, as we will show, it is possible to obtain bounds on the achievable value of any given vector, in local belief space. The available information regarding V_α(b) in local space can be expressed in the linear forms:

    V_α(b) = α · b,    1^T_n b = 1,    M^X_L b = b_L,    (7)

where 1_n = [1 1 ... 1]^T ∈ R^n. Let m be the size of the local belief factor which contains b_L. Reducing this system, we can associate V_α(b) with b and b_L, having at least n − m free variables in the leading row, induced by the locally unavailable dimensions of b. The resulting equation can be rewritten as:

    V_α(b) = β · b + γ · b_L + δ,    (8)

with β ∈ R^n, γ ∈ R^m and δ ∈ R. By maximizing (or minimizing) the terms associated with the potentially free variables, we can use this form to establish the maximum (and minimum) value that can be attained at b_L.

Theorem 1. Let I_u = {v : M^X_L(u, v) = 1}, \bar{β} ∈ R^m : \bar{β}_i = max_{j ∈ I_i} β_j, i = 1, ..., m and \underline{β} ∈ R^m : \underline{β}_i = min_{j ∈ I_i} β_j, i = 1, ..., m. The maximum achievable value for a local belief point, b_L, according to α, is:

    \bar{V}_α(b_L) = (\bar{β} + γ) · b_L + δ.    (9)

Analogously, the minimum achievable value is

    \underline{V}_α(b_L) = (\underline{β} + γ) · b_L + δ.    (10)

Proof. First, we shall establish that \bar{V}_α(b_L) is an upper bound on V_α(b). The set I_i contains the indexes of the elements of b which marginalize onto (b_L)_i. From the definition of \bar{β} it follows that, ∀ b ∈ B:

    Σ_{j ∈ I_i} \bar{β}_i b_j ≥ Σ_{j ∈ I_i} β_j b_j, i = 1, ..., m  ⇔  \bar{β}_i (b_L)_i ≥ Σ_{j ∈ I_i} β_j b_j, i = 1, ..., m,

where we used the fact that Σ_{j ∈ I_i} b_j = (b_L)_i. Summing over all i, this implies that \bar{β} · b_L ≥ β · b. Using (8) and (9),

    \bar{β} · b_L + γ · b_L + δ ≥ β · b + γ · b_L + δ  ⇔  \bar{V}_α(b_L) ≥ V_α(b).

Next, we need to show that ∃ b ∈ B : \bar{V}_α(b_L) = V_α(b). Since 1^T_n b = 1 and b_i ≥ 0 ∀ i, β · b is a convex combination of the elements in β. Consequently,

    max_{b ∈ B} β · b = max_{b ∈ B} \bar{β} · M^X_L b = max_i \bar{β}_i.

Therefore, for b_m = arg max_{b ∈ B} β · b, we have that \bar{V}_α(M^X_L b_m) = V_α(b_m). The proof for the minimum achievable value \underline{V}_α(b_L) is analogous.

By obtaining the bounds (9) and (10), we have taken a step towards identifying the correct action for an agent to take, based on the local information contained in b_L. From their evaluation, the following remarks can be made: if α and α′ are such that \bar{V}_{α′}(b_L) ≤ \underline{V}_α(b_L), then α′ is surely not the maximizing vector at b; if this property holds for all α′ such that (φ(α′))_i ≠ (φ(α))_i, then by following the action associated with α, agent i will accrue at least as much value as with any other vector for all possible b subject to (6). That action can be safely selected without needing to communicate.

The complexity of obtaining the local value bounds for a given value function is basically that of reducing the system (7) for each vector. This is typically achieved through Gaussian Elimination, with an associated complexity of O(n(m + 2)^2) [3]. 
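To illustrate Theorem 1, the bound vectors \bar{β} and \underline{β} can be computed by a componentwise max/min over the index sets I_i. The sketch below assumes a toy four-state example with a binary local factor and, for simplicity, takes the residual terms γ and δ of (8) to be zero (all numbers are illustrative):

```python
import numpy as np

# Hypothetical example: one alpha-vector over four joint states; the local
# factor has two values, with index sets I_0 = {0, 1} and I_1 = {2, 3}.
alpha = np.array([1.0, 4.0, 2.0, 3.0])
I = [[0, 1], [2, 3]]   # joint-state indices marginalizing onto each local state

# Per Theorem 1 (gamma and delta taken as zero for illustration):
beta_hi = np.array([alpha[idx].max() for idx in I])   # \bar{beta}  = [4.0, 3.0]
beta_lo = np.array([alpha[idx].min() for idx in I])   # \under{beta} = [1.0, 2.0]

b_L = np.array([0.5, 0.5])            # a local belief point
print(beta_hi @ b_L, beta_lo @ b_L)   # upper / lower value bounds: 3.5 1.5
```

Any joint belief consistent with b_L yields a value for this alpha-vector inside the interval [1.5, 3.5], which is what the offline bound computation exploits.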
Note that the dominant term corresponds to the size of the local belief factor, which is usually exponentially smaller than n. This is repeated for all vectors, and if pruning is then done over the resulting set (the respective cost is O(|Γ|^2)), the total complexity is O(|Γ| n (m + 2)^2 + |Γ|^2). The pruning process used here is the same as what is typically done by POMDP solvers [14].

3.2.2 Dealing With Locally Ambiguous Actions

The definition of the value bounds (9) and (10) only allows an agent to act in atypical situations in which an action is clearly dominant in terms of expected value. However, this is often not the case, particularly when considering a large decision horizon, since the present effects of any given action on the overall expected reward are typically not pronounced enough for these considerations to be practical. In a situation where multiple value bounds are conflicting (i.e., \bar{V}_α(b_L) > \bar{V}_{α′}(b_L) and \underline{V}_α(b_L) < \underline{V}_{α′}(b_L)), an agent is forced to further reason about which of those actions is best.

In order to tackle this problem, let us assume that two actions a and a′ have conflicting bounds at b_L. Given Γ^a = {α ∈ Γ : (φ(α))_i = a} and the similarly defined Γ^{a′}, we will define the matrices A = [Γ^a_i]_{k×n}, i = 1, ..., |Γ^a| and A′ = [Γ^{a′}_i]_{k′×n}, i = 1, ..., |Γ^{a′}|. Then, the vectors v = Ab and v′ = A′b (in R^k and R^{k′} respectively) contain all possible values attainable at b through the vectors in Γ^a and Γ^{a′}. Naturally, we will be interested in the maximum of these values for each action. In particular, we want to determine if max_i v_i is greater than max_j v′_j for all possible b such that b_L = M^X_L b. If this is the case, then a should be selected as the best action, since it is guaranteed to provide a higher value at b_L than a′.

The problem of determining the minimum value of v − v′ at b_L can be expressed as the following set of Linear Programs (LPs) [6]. Note that x ⪰ y is here assumed to mean that x_i ≥ y_i ∀ i:

    ∀ i = 1, ..., |Γ^{a′}|:
        maximize    Γ^{a′}_i b − s
        subject to  A b ⪯ 1_k s,   b ⪰ 0_n,   M^X_L b = b_L,   1^T_n b = 1    (11)

If the solution b_opt to each of these LPs is such that max_i (A b_opt)_i ≥ max_j (A′ b_opt)_j, then action a can be safely selected based on b_L. If this is not the case for any of the solutions, then it is not possible to map the agent's best action solely through b_L. In order to disambiguate every possible action, this optimization needs to be carried out for all conflicting pairs of actions. However, a less computationally expensive alternative is to approximate the optimization (11) by a single LP (refer to [6] for more details):

        maximize    1^T_{k′} ξ
        subject to  A b ⪯ 1_k s,   A′ b = 1_{k′} s + ξ,   b ⪰ 0_n,   M^X_L b = b_L,   1^T_n b = 1    (12)

3.2.3 Mapping Local Belief Points to Communication Decisions

For an environment with only two belief factors, the method described so far could already incorporate an explicit communication policy: given the local belief b_L of an agent, if it is possible to unequivocally identify any action as being maximal, then that action can be safely executed without any loss of expected value. Otherwise, the remaining belief factor should be requested from other agents, in order to reconstruct b through (4), and map that agent's action through the joint policy. However, in most scenarios, it is not sufficient to know whether or not to communicate: equally important are the issues of what to communicate, and with whom.

Let us consider the general problem with n_F belief factors contained in the set F. 
In this case there are 2^{|F|−1} combinations of non-local factors which the agent can request. Our goal is to identify one such combination which contains enough information to disambiguate the agent's actions. Central to this process is the ability to quickly determine, for a given set of belief factors G ⊆ F, if there are no points in b_G with non-decidable actions. The exact solution to this problem would require, in the worst case, the solution of |Γ^a| × |Γ^{a′}| LPs of the form (11) for every pair of actions with conflicting value bounds. However, a modification of the approximate LP (12) allows us to tackle this problem efficiently:

        maximize    1^T_{k′} ξ + 1^T_k ξ′
        subject to  A b ⪯ 1_k s,      A′ b = 1_{k′} s + ξ,    M^X_L b = b_L,
                    A′ b′ ⪯ 1_{k′} s′,  A b′ = 1_k s′ + ξ′,   M^X_L b′ = b_L,
                    b ⪰ 0_n,   b′ ⪰ 0_n,   M^X_G b = M^X_G b′    (13)

The rationale behind this formulation is that any solution to the LP, in which max_i ξ_i > 0 and max_j ξ′_j > 0 simultaneously, identifies two different points b and b′ which map to the same point b_G in G, but share different maximizing actions a′ and a respectively. This implies that, in order to select an action unambiguously from the belief over G, no such solution may be possible.

Equipped with this result, we can now formulate a general procedure that, for a set of belief points in local space, returns the corresponding belief factors which must be communicated in order for an agent to act unambiguously. 
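A check in the spirit of LP (12) can be posed with an off-the-shelf LP solver. The following sketch assumes a hypothetical toy instance with a single α-vector per action (k = k′ = 1, so ξ is scalar); all matrices and numbers are made up for illustration. If the optimal ξ is positive, some joint belief consistent with b_L strictly prefers a′, so the local belief alone cannot resolve the action:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance: four joint states, binary local factor.
alpha_a  = np.array([1.0, 4.0, 2.0, 3.0])   # the single vector in Gamma^a
alpha_ap = np.array([2.0, 1.0, 3.0, 2.0])   # the single vector in Gamma^a'
M = np.array([[1.0, 1.0, 0.0, 0.0],         # projection M^X_L onto the factor
              [0.0, 0.0, 1.0, 1.0]])
b_L = np.array([0.5, 0.5])                  # the agent's local belief

# Decision variables z = [b (4 entries), s, xi]; maximize xi (negate for linprog).
c = np.array([0, 0, 0, 0, 0, -1.0])
A_ub = np.hstack([alpha_a, [-1.0, 0.0]])[None, :]   # A b - s <= 0
b_ub = np.array([0.0])
A_eq = np.vstack([
    np.hstack([alpha_ap, [-1.0, -1.0]]),    # A' b = s + xi
    np.hstack([M[0], [0.0, 0.0]]),          # M b = b_L
    np.hstack([M[1], [0.0, 0.0]]),
    np.hstack([np.ones(4), [0.0, 0.0]]),    # 1^T b = 1
])
b_eq = np.array([0.0, b_L[0], b_L[1], 1.0])
bounds = [(0, None)] * 4 + [(None, None), (None, None)]  # b >= 0; s, xi free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
xi = res.x[-1]
print(xi > 0)   # ambiguous: communication would be needed at this b_L
```

Here the optimum is attained at b = (0.5, 0, 0.5, 0), where a′ outvalues a, so the check reports ambiguity.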
We refer to this as obtaining the communication map for the problem. This procedure is as follows (a more detailed version is included in [6]): we begin by computing the value bounds of V over local factors L, and sampling N reachable local belief points b_L; for each of these points, if the value bounds of the best action are not conflicting (see Section 3.2.1), or any conflicting bounds are resolved by LP (12), we can mark b_L as safe, add it to the communication map, and continue on to the next point; otherwise, using LP (13), we search for the minimum set of non-local factors G which resolves all conflicts; we then associate b_L with G and add it to the map. During execution, an agent updates its local information b_L, finds the nearest neighbor point in the communication map, and requests the corresponding factors from the other agents. The agent then selects the action which exhibits the highest maximum value bound given the resulting information.

Figure 1: (a) Layout of the Relay-Small problem. (b) Layout of the Relay-Large problem. (c) Communication map for the Relay-Small problem.

4 Experiments

We now analyze the results of applying the aforementioned offline communication mapping process to three different MPOMDP environments, each with a different degree of interdependency between agents. The first and smallest of the test problems, shown in Figure 1a, is named the Relay-Small problem, and is mainly used for explanatory purposes. In this world each agent is confined to a two-state area. One of the agents possesses a package which it must hand over to the other agent, through the non-traversable opening between the rooms L1 and R1. 
Each agent can move randomly inside its own room (a Shuffle action), Exchange the package with the other agent, or Sense its environment in order to find the opening. An Exchange is only successful if both agents are in the correct position (L1, R1) and if both agents perform this action at the same time, which makes it the only available cooperative action. The fact that, in this problem, each belief factor is two-dimensional (each factor spans one of the rooms) allows us to visualize the results of our method. In Figure 2, we see that some of the agent's expected behavior is already contained in the value bounds over its local factor: if an agent is certain of being in room R1 (i.e., (b_{X_1})_1 = 0), then the action with the highest-valued bound is Shuffle. Likewise, an Exchange should only be carried out when the agent is certain of being in L1, but it is an ambiguous action since the agent needs to be sure that its teammate can cooperate. In Figure 1c we represent the communication map which was obtained offline through the proposed algorithm. Since there are only two factors, the agent only needs to make a binary decision of whether or not to communicate for a given local belief point. The belief points considered safe are marked as 0, and those associated with a communication decision are marked as 1. In terms of quantitative results, we see that roughly 30-40% of communication episodes are avoided in this simple example, without a significant loss of collected reward.

Another test scenario is the OneDoor environment of [7], which is further described in [6]. In this 49-state world, two agents lie inside opposite rooms, akin to the Relay-Small problem, but each agent has the goal of moving to the other room. There is only one common passage between both rooms, where the agents may collide. 
For shorter-horizon solutions, agents may not be able to reach their goal, and they communicate so as to minimize negative reward (collisions). For the infinite-horizon case, however, typically only one of the agents communicates, while waiting for its partner to clear the passage. Note that this relationship between the problem's horizon and the amount of communication savings does not hold for all of the problems. The proposed method exploits the invariance of local policies over subsets of the joint belief space, and this may arbitrarily change with the problem's horizon.

A larger example is displayed in Figure 1b. This is an adaptation of the Relay-Small problem (aptly named Relay-Large) to a setting in which each room has four different states, and each agent may be carrying a package at a given time. Agent D1 may retrieve new packages from position L1, and D2 can deliver them to L2, receiving for that a positive reward. There are a total of 64 possible states for the environment. Here, since the agents can act independently for a longer time, the communication savings are more pronounced, as shown in Table 1.

Table 1: Results of the proposed method for various environments. For settings assuming full and reduced communication, we show empirical control quality and online communication usage.

           Relay-Small                OneDoor                    Relay-Large
 h         Full Comm.   Red. Comm.    Full Comm.   Red. Comm.    Full Comm.    Red. Comm.
 6         15.4, 100%   14.8, 56.9%   0.35, 100%   0.30, 89.0%   27.4, 100%    25.8, 44.1%
 10        39.8, 100%   38.7, 68.2%   1.47, 100%   1.38, 76.2%   -19.7, 100%   -21.6, 62.5%
 ∞         77.5, 100%   73.9, 46.1%   2.31, 100%   2.02, 61.3%   134.0, 100%   129.7, 58.9%

Table 2: Running time (in seconds) of the proposed method in comparison to the Perseus point-based POMDP solver.

              Relay-Small          OneDoor              Relay-Large
 h            6     10     ∞       6     10     ∞       6       10      ∞
 Perseus      1.1   7.3    4.3     0.1   5.3    33.3    239.5   643.0   31.5
 Comm. Map    5.9   21.4   12.4    7.4   57.7   5.9     368.7   859.5   138.1

Figure 2: Value bounds for the Relay-Small problem (panels: "Value Bounds (Relay)" and "Pruned Value Bounds (Relay)"; actions: Shuffle, Exchange, Sense; x-axis: (b_{X_1})_1). The dashed lines indicate the minimum value bounds, and the filled lines represent the maximum value bounds, for each action.

Finally, we argue that the running time of the proposed algorithm is comparable to that of general POMDP solvers for these same environments. Even though both the solver and the mapper algorithms must be executed in sequence, the results in Table 2 show that they are typically both in the same order of magnitude.

5 Conclusions and Future Work

Traditional multiagent planning on partially observable environments mostly deals with fully-communicative or non-communicative situations. For a more realistic scenario where communication should be used only when necessary, state-of-the-art methods are only capable of approximating the optimal policy at run-time [11, 15]. 
Here, we have analyzed the properties of MPOMDP\nmodels which can be exploited in order to increase the ef\ufb01ciency of communication between agents.\nWe have shown that these properties hold, for various MPOMDP scenarios, and that the decision\nquality can be maintained while signi\ufb01cantly reducing the amount of communication, as long as the\ndependencies within the model are sparse.\n\nAlthough one of the main features of these techniques is that they may be applied to any given\nMPOMDP value function, in some situations this value function may be costly to obtain. As future\nwork, we will investigate methods for obtaining MPOMDP value functions that are easy to partition\nusing our techniques.\n\nAcknowledgments\n\nThis work was funded in part by Fundac\u00b8\u02dcao para a Ci\u02c6encia e a Tecnologia (ISR/IST pluriannual fund-\ning) through the PIDDAC Program funds and was supported by project CMU-PT/SIA/0023/2009\nunder the Carnegie Mellon-Portugal Program. J.M. was supported by a PhD Student Scholarship,\nSFRH/BD/44661/2008, from the Portuguese FCT POCTI programme. M.S. is funded by the FP7\nMarie Curie Actions Individual Fellowship #275217 (FP7-PEOPLE-2010-IEF).\n\n8\n\n\fReferences\n[1] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity\nof decentralized control of Markov decision processes. Mathematics of Operations Research,\n27(4):819\u2013840, 2002.\n\n[2] Xavier Boyen and Daphne Koller. Tractable inference for complex stochastic processes.\n\nProc. of Uncertainty in Arti\ufb01cial Intelligence, 1998.\n\nIn\n\n[3] X.G. Fang and G. Havas. On the worst-case complexity of integer gaussian elimination. In\nProceedings of the 1997 international symposium on Symbolic and algebraic computation, pages\n28\u201331. ACM, 1997.\n\n[4] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable\n\nstochastic domains. Arti\ufb01cial Intelligence, 101:99\u2013134, 1998.\n\n[5] David A. 
McAllester and Satinder Singh. Approximate planning for factored POMDPs using\n\nbelief state simpli\ufb01cation. In Proc. of Uncertainty in Arti\ufb01cial Intelligence, 1999.\n\n[6] J.V. Messias, M.T.J. Spaan, and P. U. Lima. Supplementary material for \u201cEf\ufb01cient Of\ufb02ine\n\nCommunication Policies for Factored Multiagent POMDPs\u201d. ISR/IST, 2011.\n\n[7] Frans A. Oliehoek, Matthijs T. J. Spaan, and Nikos Vlassis. Dec-POMDPs with delayed com-\nmunication. In Multi-agent Sequential Decision Making in Uncertain Domains, 2007. Workshop\nat AAMAS07.\n\n[8] Frans A. Oliehoek, Matthijs T. J. Spaan, Shimon Whiteson, and Nikos Vlassis. Exploiting\nIn Proc. of Int. Conference on Autonomous\n\nlocality of interaction in factored Dec-POMDPs.\nAgents and Multi Agent Systems, 2008.\n\n[9] P. Poupart and C. Boutilier. Value-directed belief state approximation for POMDPs. In Proc. of\n\nUncertainty in Arti\ufb01cial Intelligence, volume 130, 2000.\n\n[10] David V. Pynadath and Milind Tambe. The communicative multiagent team decision problem:\nAnalyzing teamwork theories and models. Journal of Arti\ufb01cial Intelligence Research, 16:389\u2013\n423, 2002.\n\n[11] M. Roth, R. Simmons, and M. Veloso. Decentralized communication strategies for coordinated\nmulti-agent policies. In Multi-Robot Systems: From Swarms to Intelligent Automata, volume IV.\nKluwer Academic Publishers, 2005.\n\n[12] Maayan Roth, Reid Simmons, and Manuela Veloso. Exploiting factored representations for\ndecentralized execution in multi-agent teams. In Proc. of Int. Conference on Autonomous Agents\nand Multi Agent Systems, 2007.\n\n[13] Matthijs T. J. Spaan, Frans A. Oliehoek, and Nikos Vlassis. Multiagent planning under uncer-\ntainty with stochastic communication delays. In Proc. of Int. Conf. on Automated Planning and\nScheduling, pages 338\u2013345, 2008.\n\n[14] Chelsea C. White. Partially observed Markov decision processes: a survey. 
Annals of Opera-\n\ntions Research, 32, 1991.\n\n[15] Feng Wu, Shlomo Zilberstein, and Xiaoping Chen. Multi-agent online planning with commu-\n\nnication. In Int. Conf. on Automated Planning and Scheduling, 2009.\n\n9\n\n\f", "award": [], "sourceid": 1083, "authors": [{"given_name": "Jo\u00e3o", "family_name": "Messias", "institution": null}, {"given_name": "Matthijs", "family_name": "Spaan", "institution": null}, {"given_name": "Pedro", "family_name": "Lima", "institution": null}]}