{"title": "Competition Adds Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 561, "page_last": 568, "abstract": "It is known that determinining whether a DEC-POMDP, namely, a cooperative partially observable stochastic game (POSG), has a cooperative strategy with positive expected reward is complete for NEXP. It was not known until now how cooperation affected that complexity. We show that, for competitive POSGs, the complexity of determining whether one team has a positive-expected-reward strategy is complete for the class NEXP with an oracle for NP.", "full_text": "Competition adds complexity\n\nJudy Goldsmith\n\nMartin Mundhenk\n\nDepartment of Computer Science\n\nFriedrich-Schiller-Universit\u00a8at Jena\n\nUniversity of Kentucky\n\nLexington, KY\n\nJena, Germany\n\nmundhenk@cs.uni-jena.de\n\ngoldsmit@cs.uky.edu\n\nAbstract\n\nIt is known that determinining whether a DEC-POMDP, namely, a cooperative\npartially observable stochastic game (POSG), has a cooperative strategy with pos-\nitive expected reward is complete for NEXP. It was not known until now how\ncooperation affected that complexity. We show that, for competitive POSGs, the\ncomplexity of determining whether one team has a positive-expected-reward strat-\negy is complete for NEXPNP.\n\n1 Introduction\n\nFrom online auctions to Texas Hold\u2019em, AI is captivated by multi-agent interactions based on com-\npetition. The problem of \ufb01nding a winning strategy harks back to the \ufb01rst days of chess programs.\nNow, we are starting to have the capacity to handle issues like stochastic games, partial informa-\ntion, and real-time video inputs for human player modeling. 
This paper looks at the complexity of computations involving the first two factors: partially observable stochastic games (POSGs).\n\nThere are many factors that could affect the complexity of different POSG models: Do the players, collectively, have sufficient information to reconstruct a state? Do they communicate or cooperate? Is the game zero-sum, or do the players' individual utilities depend on other players' utilities? Do the players even have models of other players' utilities?\n\nThe ultimate question is: what is the complexity of finding a winning strategy for a particular player, with no assumptions about joint observations or knowledge of other players' utilities? Since a special case of this is the DEC-POMDP, where finding an optimal (joint, cooperative) policy is known to be NEXP-hard [1], this problem cannot be any easier than NEXP.\nWe show that one variant of this problem is hard for the class NEXP^NP.\n\n2 Definitions and Preliminaries\n\n2.1 Partially observable stochastic games\n\nA partially observable stochastic game (POSG) describes a multi-player stochastic game with imperfect information by its states and the consequences of the players' actions on the system. We follow the definition from [2] and denote it as a tuple M = (I,S,s0,A,O,t,o,r), where\n\n• I is the finite set {1,2, . . . ,k} of agents (or players), S is a finite set of states, with distinguished initial state s0 ∈ S, A is a finite set of actions, and O is a finite set of observations\n• t : S × A^k × S → [0,1] is the transition probability function, where t(s,a1, . . . 
,ak,s′) is the probability that state s′ is reached from state s when each agent i chooses action ai\n• o : S × I → O is the observation function, where o(s,i) is the observation made in state s by agent i, and\n• r : S × A^k × I → Z is the reward function, where r(s,a1, . . . ,ak,i) is the reward gained by agent i in state s, when the agents take actions a1, . . . ,ak. (Z is the set of integers.)\n\nA POSG where all agents have the same reward function is called a decentralized partially observable Markov decision process (see [1]).\nLet M = (I,S,s0,A,O,t,o,r) be a POSG. A step of M is a transition from one state to another according to the transition probability function t. A run of M is a sequence of steps that starts in the initial state s0. The outcome of each step is probabilistic and depends on the actions chosen. For each agent, a policy describes how to choose actions depending on observations made during the run of the process. A (history-dependent) policy p chooses an action dependent on all observations made by the agent during the run of the process. This is described as a function p : O∗ → A, mapping each finite sequence of observations to an action.\nA trajectory q of length |q| = m for M is a sequence of states q = s1,s2, . . . ,sm (m ≥ 1, si ∈ S) which starts with the initial state of M, i.e. s1 = s0. Given policies p1, . . . ,pk, each trajectory q has a probability prob(q,p1, . . . ,pk). We will use some abbreviations in the sequel. For p1, . . . ,pk we will write p^k_1, and for p1(o(s1,1) · · · o(sj,1)), . . . ,pk(o(s1,k) · · · o(sj,k)) we will write p^k_1(q^j_1) accordingly. Then prob(q,p1, . . . ,pk) is defined by\n\nprob(q,p^k_1) = ∏_{i=1}^{|q|−1} t(si,p^k_1(q^i_1),si+1) .\n\nWe use T_l(s) to denote all length-l trajectories which start in the initial state s0 and end in state s. The expected reward R_i(s,l,p^k_1) obtained by agent i in state s after exactly l steps under policies p^k_1 is the reward obtained in s by the actions according to p^k_1, weighted by the probability that s is reached after l steps,\n\nR_i(s,l,p^k_1) = Σ_{q∈T_l(s), q=(s1,...,sl)} r(s,p^k_1(q^l_1),i) · prob(q,p^k_1) .\n\nA POSG may behave differently under different policies. The quality of a policy is determined by its performance, i.e. by the sum of expected rewards received on it. We use |M| to denote the size of the representation of M.1 The short-term performance for policies p^k_1 for agent i with POSG M is the expected sum of rewards received by agent i during the next |M| steps by following the policy p^k_1, i.e.\n\nperf_i(M,p^k_1) = Σ_{s∈S} R_i(s,|M|,p^k_1) .\n\nThe performance is also called the expected reward.\n\nAgents may cooperate or compete in a stochastic game. We want to know whether a stochastic game can be won by some agents. This is formally expressed in the following decision problems.\nThe cooperative agents problem for k agents:\n\ninstance: a POSG M for k agents\nquery: are there policies p1, . . . ,pk under which every agent has positive performance? (I.e. ∃p1, . . . ,pk : ∧_{i=1}^{k} perf_i(M,p^k_1) > 0?)\n\nThe competing agents problem for 2k agents:\n\ninstance: a POSG M for 2k agents\nquery: are there policies p1, . . . ,pk under which all agents 1,2, . . . ,k have positive performance independent of which policies agents k+1,k+2, . . . ,2k choose? (I.e. ∃p1, . . . ,pk ∀pk+1, . . . ,p2k : ∧_{i=1}^{k} perf_i(M,p^2k_1) > 0?)\n\nIt was shown by Bernstein et al. 
[1] that the cooperative agents problem for two or more agents is complete for NEXP.\n\n1 The size of the representation of M is the number of bits needed to encode the entire model, where the functions t, o, and r are encoded by tables. We do not consider smaller representations. In fact, smaller representations may increase the complexity.\n\n2.2 NEXP^NP\n\nA Turing machine M has exponential running time if there is a polynomial p such that for every input x, the machine M on input x halts after at most 2^p(|x|) steps. NEXP is the class of sets that can be decided by a nondeterministic Turing machine within exponential time. NEXP^NP is the class of sets that can be decided by a nondeterministic oracle Turing machine within exponential time, when a set in NP is used as an oracle. Similarly to the class NP^NP, it turns out that a NEXP^NP computation can be performed by an NEXP oracle machine that asks exactly one query to a coNP oracle and accepts if and only if the oracle accepts.\n\n2.3 Domino tilings\n\nDomino tiling problems are useful for reductions between different kinds of computations. They were proposed by Wang [3], and we use them according to the following definition.\n\nDefinition 2.1 We use [m] to denote the set {0,1,2, . . . ,m − 1}. A tile type T = (V,H) consists of two finite sets V,H ⊆ N × N. A T-tiling of an m-square (m ∈ N) is a mapping t : [m] × [m] → N that satisfies both of the following conditions.\n\n1. Every pair of neighboured tiles in the same row is in H.\nI.e. for all r ∈ [m] and c ∈ [m − 1], (t(r,c),t(r,c + 1)) ∈ H.\n\n2. Every pair of neighboured tiles in the same column is in V.\nI.e. 
for all r ∈ [m − 1] and c ∈ [m], (t(r,c),t(r + 1,c)) ∈ V.\n\nThe exponential square tiling problem is the set of all pairs (T,1^k), where T is a tile type and 1^k is a string consisting of k 1s (k ∈ N), such that there exists a T-tiling of the 2^k-square.\n\nIt was shown by Savelsbergh and van Emde Boas [4] that the exponential square tiling problem is complete for NEXP. We will consider the following variant, which we call the exponential Σ2 square tiling problem: given a pair (T,1^k), does there exist a row w of tiles and a T-tiling of the 2^k-square with final row w, such that there exists no T-tiling of the 2^k-square with initial row w?\nThe proof technique of Theorem 2.29 in [4], which translates Turing machine computations into tilings, is very robust, in the sense that simple variants of the square tiling problem can analogously be shown to be complete for different complexity classes. Together with the above characterization of NEXP^NP, it can be used to prove the following.\nTheorem 2.2 The exponential Σ2 square tiling problem is complete for NEXP^NP.\n\n3 Results\n\nPOSGs can be seen as a generalization of partially observable Markov decision processes (POMDPs) in that POMDPs have only one agent and POSGs allow for many agents. Papadimitriou and Tsitsiklis [5] proved that it is PSPACE-complete to decide the cooperative agents problem for POMDPs. The result of Bernstein et al. [1] shows that, in the case of history-dependent policies, the complexity of POSGs is greater than the complexity of POMDPs. We show that this difference does not appear when stationary policies are considered instead of history-dependent policies. For POMDPs, the problem is NP-complete [6]. A stationary policy is a mapping O → A from observations to actions. 
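The "check" half of the guess-and-check argument used in the proof below is a polynomial-time evaluation of fixed stationary policies. A minimal sketch in Python, assuming a hypothetical tabular encoding (the names `trans`, `obs`, `rew` and the cumulative-reward reading of the performance are illustrative, not the paper's notation):

```python
def performance(S, k, s0, trans, obs, rew, policies, horizon):
    """Expected total reward per agent over `horizon` steps when agent i
    plays the stationary policy policies[i] : observation -> action.
    trans[(s, acts, s2)] is a transition probability, obs[(s, i)] an
    observation, rew[(s, acts, i)] a reward (all hypothetical tables)."""
    dist = {s0: 1.0}                       # distribution over current states
    total = [0.0] * k
    for _ in range(horizon):
        nxt = {}
        for s, p in dist.items():
            # stationary: each action depends only on the current observation
            acts = tuple(policies[i][obs[(s, i)]] for i in range(k))
            for i in range(k):
                total[i] += p * rew[(s, acts, i)]
            for s2 in S:                   # push probability mass forward
                q = trans.get((s, acts, s2), 0.0)
                if q:
                    nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return total
```

The loop touches each table entry at most `horizon` times, so the running time is polynomial in the size of the tabular representation, as the containment argument requires.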
Whenever the same observation is made, the same action is chosen by a stationary policy.\n\nTheorem 3.1 For any k ≥ 2, the cooperative agents problem for k agents for stationary policies is NP-complete.\n\nProof We start by proving NP-hardness. A POSG with only one agent is a POMDP. The problem of deciding, for a given POMDP M, whether there exists a stationary policy such that the short-term performance of M is greater than 0 is NP-complete [6]. Hence, the cooperative agents problem for stationary policies is NP-hard.\n\nIt remains to show containment in NP. Let M = (I,S,s0,A,O,t,o,r) be a POSG. We assume that t is represented in a straightforward way as a table. Let p1, . . . ,pk be a sequence of stationary policies for the k agents. This sequence can be straightforwardly represented using no more space than the representation of t takes. Under a fixed sequence of policies, the performance of the POSG for all of the agents can be calculated in polynomial time. Using a guess-and-check approach (guess the stationary policies and evaluate the POSG), this shows that the cooperative agents problem for stationary policies is in NP. □\n\nIn the same way we can characterize the complexity of a problem that we will need in the proof of Lemma 3.3.\n\nCorollary 3.2 The following problem is coNP-complete.\n\ninstance: a POSG M for k agents\nquery: do all agents under every stationary policy have positive performance? (I.e. ∀ stationary p1, . . . ,pk : ∧_{i=1}^{k} perf_i(M,p^k_1) > 0?)\n\nThe cooperative agents problem was shown to be NEXP-complete by Bernstein et al. [1]. Not surprisingly, if the agents compete, the problem becomes harder.\n\nLemma 3.3 For every k ≥ 1, the competing agents problem for 2k agents is in NEXP^NP.\nProof The basic idea is as follows. We guess policies p1,p2, . . . ,pk for agents 1,2, . . . 
,k, and construct a POSG that "implements" these policies and leaves open the actions chosen by agents k+1, . . . ,2k.\nThis new POSG has states for all short-term trajectories through the original POSG. Therefore, its size is exponential in the size of the original POSG. Because the history is stored in every state, and the POSG is loop-free, it turns out that the new POSG can be taken as a POMDP for which a (joint) policy with positive reward is searched. This problem is known to be NP-complete.\nLet M = (I,S,s0,A,O,t,o,r) be a POSG with 2k agents, and let p1, . . . ,pk be short-term policies for M. We define a k-agent POSG M′ = (I′,S′,s′0,A,O′,t′,o′,r′) as follows.2 In M′, we have as agents those of M whose policies are not fixed, i.e. I′ = {k+1, . . . ,2k}. The set of states of M′ is the cross product of states from M and all trajectories up to length |M| over S, i.e. S′ = S × S^(≤|M|+1). The meaning of state (s,u) ∈ S′ is that state s can be reached on a trajectory u (that ends with s) through M with the fixed policies. The initial state is s′0 = (s0,s0). The state (s0,e) is taken as a special sink state. After |M| + 2 steps, the sink state is entered in M′ and it is not left thereafter. All rewards gained in the sink state are 0. Now for the transition probabilities. If s is reached on trajectory u in M and the actions a1, . . . ,ak are according to the fixed policies p1, . . . ,pk, then the probability of reaching state s′ on trajectory us′ according to t in M is the same as that of reaching (s′,us′) in M′ from (s,u). In the formal description, the sink state has to be considered, too.\n\nt′((s,u),ak+1, . . . ,a2k,(ŝ,û)) =\n  0, if u ≠ e and uŝ ≠ û\n  t(s,p1(o(us,1)), · · · ,pk(o(us,k)),ak+1, . . . ,a2k,ŝ), if û = uŝ, |û| ≤ |M|, u ≠ e\n  1, if |u| = |M| + 1 or u = e, and û = e\n\nThe observation in M′ is the sequence of observations made in the trajectory that is contained in each state, i.e. o′((s,w)) = o(w), where o(e) is any element of O. Finally, the rewards. Essentially, we are interested in the rewards obtained by the agents 1,2, . . . ,k. The rewards obtained by the other agents have no impact on this, only the actions the other agents choose. Therefore, agent i obtains the rewards in M′ that are obtained by agent i − k in M. In this way, the agents k+1, . . . ,2k obtain in M′ the same rewards that are obtained by agents 1,2, . . . ,k in M, and this is what we are interested in. This results in r′((s,u),ak+1, . . . ,a2k,i) = r(s,p1(o(u,1)), · · · ,pk(o(u,k)),ak+1, . . . ,a2k,i − k) for i = k+1, . . . ,2k.\n\n2 S^(≤|M|) denotes the set of sequences of up to |M| elements from S. The empty sequence is denoted by e. For w ∈ S^(≤|M|) we use o(w,i) to describe the sequence of observations made by agent i on trajectory w. The concatenation of sequences u and w is denoted uw. We do not distinguish between elements of sets and sequences of one element.\n\nNotice that the size of M′ is exponential in the size of M. The sink state in M′ is the only state that lies on a loop. This means that on all trajectories through M′, the sink state is the only state that may appear more than once. All states other than the sink state contain the full history of how they are reached. Therefore, there is a one-to-one correspondence between history-dependent policies for M and stationary policies for M′ (with regard to horizon |M|). Moreover, the corresponding policies have the same performances.\nClaim 1 Let p1, . . . 
,p2k be short-term policies for M, and let ˆpk+1, . . . ,ˆp2k be their corresponding stationary policies for M′. For |M| steps and i = 1,2, . . . ,k, perf_i(M,p^2k_1) = perf_{i+k}(M′,ˆp^2k_{k+1}).\n\nThus, this yields an NEXP^NP algorithm to decide the competing agents problem. The input is a POSG M for 2k agents. In the first step, the policies for the agents 1,2, . . . ,k are guessed. This takes nondeterministic exponential time. In the second step, the POSG M′ is constructed from the input M and the guessed policies. This takes exponential time (in the length of the input M). Finally, the oracle is queried whether M′ has positive performance for all agents under all stationary policies. This problem belongs to coNP (Corollary 3.2). Hence, the algorithm shows the competing agents problem to be in NEXP^NP. □\n\nLemma 3.4 For every k ≥ 2, the competing agents problem for 2k agents is hard for NEXP^NP.\nProof We give a reduction from the exponential Σ2 square tiling problem to the competing agents problem.\nLet (T,1^k) be an instance of the exponential Σ2 square tiling problem, where T = (V,H) is a tile type. We will show how to construct a POSG M with 4 agents from it, such that (T,1^k) is a positive instance of the exponential Σ2 square tiling problem if and only if (1) agents 1 and 2 have a tiling for the 2^k-square with final row w such that (2) agents 3 and 4 have no tiling for the 2^k-square with initial row w.\n\nThe basic idea for checking tilings with POSGs for two agents stems from Bernstein et al. [1], but we give a slight simplification of their proof technique, and in fact have to extend it for four agents later on. The POSG is constructed so that on every trajectory each agent sees a position in the square. This position is chosen by the process. The only action of the agent that has impact on the process is putting a tile on the given position. 
In fact, the same position is observed by the agents in different states of the POSG. From a global point of view, the process splits into two parts. The first part checks whether both agents know the same tiling, without checking that it is a correct tiling. In the state where the agents are asked to put their tiles on the given position, a high negative reward is obtained if the agents put different tiles on that position. "High negative" means that, if there is at least one trajectory on which such a reward is obtained, then the performance of the whole process will be negative. The second part checks whether the tiling is correct. The idea is to give both agents neighboured positions in the square and to ask each which tile she puts on that position. Notice that the agents do not know in which part of the process they are. This means that they do not know whether the other agent is asked for the same position, or for its upper or right neighbour. This is why the agents cannot cheat the process. A high negative reward will be obtained if the agents' tiles do not fit together.\nFor the first part, we construct a POSG Pk for two agents that allows both agents to make the same sequence of observations consisting of 2k bits. This sequence is randomly chosen, and encodes a position in a 2^k × 2^k grid. At the end, state same is reached, at which no observation is made. At this state, it will be checked whether both agents put the same tile at this position (see later on). The task of Pk is to provide both agents with the same position. Figure 1 shows an example for a 2^4 × 2^4-square. The initial state is s4. Dashed arrows indicate transitions with probability 1/2, independent of the actions. The observation of agent 1 is written on the left-hand side of the states, and the observation of agent 2 on the right-hand side. In s4, the agents make no observation. 
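The positions handed out by these gadgets are exactly randomized spot checks of the two conditions of Definition 2.1. As a reference point, a direct checker of a complete tiling might look as follows (a sketch; the dictionary encoding of the mapping t is an assumption, not from the paper):

```python
def is_tiling(t, m, H, V):
    """Definition 2.1: t maps [m] x [m] to tile numbers; every horizontally
    neighboured pair must lie in H, every vertically neighboured pair in V."""
    for r in range(m):
        for c in range(m - 1):
            if (t[(r, c)], t[(r, c + 1)]) not in H:
                return False
    for r in range(m - 1):
        for c in range(m):
            if (t[(r, c)], t[(r + 1, c)]) not in V:
                return False
    return True
```

The POSG cannot run this global loop; instead it samples one position or adjacent pair at random and punishes a violated constraint with a large negative reward.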
In Pk both agents always make the same observations.\n\nThe second part is more involved. The goal is to provide both agents with neighboured positions in the square. Eventually, it is checked whether the tiles they put on the neighboured positions are according to the tile type T. Because the positions are encoded in binary, we can make use of the following fact about consecutive binary numbers. Let u = u1 . . . uk and w = w1 . . . wk be the binary representations of numbers nu and nw. If nw = nu + 1, then for some index l it holds that (1) ui = wi for i = 1,2, . . . ,l − 1, (2) wl = 1 and ul = 0, and (3) wj = 0 and uj = 1 for j = l + 1, . . . ,k.\n\n[Figure 1: P4. Figure 2: C3,4. Figure 3: L3,4.]\n\nThe POSG Cl,k is intended to provide the agents with two neighboured positions in the same row, where l is the index of the leftmost bit in which the two column encodings differ. (The C stands for column.) Figure 2 shows an example for the 2^4-square. The "final state" of Cl,k is the state hori, from which it is checked whether the agents put horizontally fitting tiles together.\nIn the same way, a POSG Rl,k can be constructed (R stands for row), whose task is to check whether two tiles in neighboured rows correspond to a correct tiling. This POSG has the final state vert, from which it is checked whether two tiles fit vertically.\n\nFinally, we have to construct the last part of the POSG. 
It consists of the states same, hori, vert (as mentioned above), good, bad, and sink. All transitions between these states are deterministic (i.e. with probability 1). From state same the state good is reached if both agents take the same action – otherwise bad is reached. From state hori the state good is reached if action a1 by agent 1 and a2 by agent 2 make a pair (a1,a2) in H, i.e. in the set of horizontally correct pairs of tiles – otherwise bad is reached. Similarly, from state vert the state good is reached if action a1 by agent 1 and a2 by agent 2 make a pair (a1,a2) in V. All these transitions are with reward 0. From state good the state sink is reached on every action with reward 1, and from state bad the state sink is reached on every action with reward −(2^(2k+2)). When the state sink is reached, the process stays there on any action, and all agents obtain reward 0. All rewards are the same for both agents. (This part can be seen in the overall picture in Figure 4.)\nFrom these POSGs we construct a POSG T2,k that checks whether two agents know the same correct tiling for a 2^k × 2^k square, as described above. There are 2k + 1 parts of T2,k. The initial state of each part can be reached with one step from the initial state s0 of T2,k. The parts of T2,k are as follows.\n\n• Pk with initial state s (checks whether the two agents have the same tiling)\n• For each l = 1,2, . . . ,k, we take Cl,k. Let cl be the initial state of Cl,k.\n• For each l = 1,2, . . . ,k, we take Rl,k. Let rl be the initial state of Rl,k.\n\n[Figure 4: T2,k]\n\nThere are 2^(2k) + 2 · Σ_{l=1}^{k} 2^k · 2^(l−1) =: tr(k) trajectories with probability > 0 through T2,k. Notice that tr(k) < 2^(2k+2). 
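The choice of penalty is tight against this bound: each trajectory will receive probability 1/tr(k), so a single reward of −(2^(2k+2)) outweighs up to tr(k) − 1 rewards of 1. A quick numeric sanity check of the count and the bound (a sketch, not part of the construction):

```python
def tr(k):
    # 2^(2k) trajectories through Pk, plus 2^k * 2^(l-1) through each of
    # the parts C(l,k) and R(l,k), for l = 1, ..., k
    return 2 ** (2 * k) + 2 * sum(2 ** k * 2 ** (l - 1) for l in range(1, k + 1))

def one_bad_trajectory_performance(k):
    # tr(k) - 1 trajectories end in good (reward 1), one ends in bad
    # (reward -(2^(2k+2))); all trajectories are equally likely
    n = tr(k)
    return ((n - 1) - 2 ** (2 * k + 2)) / n
```

Because tr(k) = 3 · 2^(2k) − 2^(k+1) < 2^(2k+2), the result is negative for every k, which is exactly the "high negative" property the construction needs.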
From the initial state s0 of T2,k, each of the initial states of the parts is reachable independent of the actions chosen by the agents. We give transition probabilities to the transitions from s0 to each of the initial states of the parts in such a way that eventually each trajectory has the same probability.\n\nt(s0,a1,a2,s′) =\n  2^(2k)/tr(k), if s′ = s, i.e. the initial state of Pk\n  2^(k+l−1)/tr(k), if s′ ∈ {rl, cl | l = 1,2, . . . ,k}\n\nIn the initial state s0 and in the initial states of all parts, the observation e is made. When a state same, hori, or vert is reached, each agent has made 2k + 3 observations, where the first and last are e and the remaining 2k are each in {0,1}. Such a state is the only one where the actions of the agents have impact on the process. Because of the partial observability, they cannot know in which part of T2,k they are. The agents can win if they both know the same correct tiling and interpret the sequence of observations as the position in the grid they are asked to put a tile on. On the other hand, if both agents know different tilings or the tiling they share is not correct, then at least one trajectory will end in the state bad and has reward −(2^(2k+2)). The structure of the POSG is given in Figure 4.\n\nClaim 2 Let (T,1^k) be an instance of the exponential square tiling problem.\n\n(1) There exists a polynomial-time algorithm that on input (T,1^k) outputs T2,k.\n\n(2) There exists a T-tiling of the 2^k-square if and only if there exist policies for the agents under which T2,k has performance > 0.\n\nPart (1) is straightforward. Part (2) is not much harder. If there exists a T-tiling of the 2^k-square, both agents use the same policy according to this tiling. Under these policies, state bad will not be reached. This guarantees performance > 0 for both agents. 
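This forward direction can be made concrete: a tiling induces a single shared policy that decodes the 2k observed bits into a position and answers with the tile there, and under that policy every check state is passed. A small simulation (the row-bits-then-column-bits encoding is an assumption for illustration):

```python
def tiling_policy(tiling, k):
    """Policy induced by a tiling of the 2^k-square: interpret the 2k
    observed bits as (row, column), answer with the tile at that position."""
    def policy(bits):
        row = int("".join(map(str, bits[:k])), 2)
        col = int("".join(map(str, bits[k:])), 2)
        return tiling[(row, col)]
    return policy

def passes_all_checks(tiling, k, H, V):
    """Replay the check states on every position: `same` asks both agents
    the same position, `hori`/`vert` ask horizontally/vertically neighboured
    positions; return False as soon as some trajectory would reach `bad`."""
    p = tiling_policy(tiling, k)
    enc = lambda r, c: [int(b) for b in format(r, "0%db" % k) + format(c, "0%db" % k)]
    n = 2 ** k
    for r in range(n):
        for c in range(n):
            if p(enc(r, c)) != p(enc(r, c)):                              # same
                return False
            if c + 1 < n and (p(enc(r, c)), p(enc(r, c + 1))) not in H:   # hori
                return False
            if r + 1 < n and (p(enc(r, c)), p(enc(r + 1, c))) not in V:   # vert
                return False
    return True
```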
For the other direction: if there exist policies for the agents under which T2,k has performance > 0, then state bad is not reached. Hence, both agents use the same policy. It can be shown inductively that this policy "is" a T-tiling of the 2^k-square.\n\nThe POSG for the competing agents problem with 4 agents consists of three parts. The first part is a copy of T2,k. It is used to check whether the first square can be tiled correctly (by agents 1 and 2). In this part, the negative rewards are increased in a way that guarantees the performance of the POSG to be negative whenever agents 1 and 2 do not correctly tile their square. The second part is a modified copy of T2,k. It is used to check whether the second square can be tiled correctly (by agents 3 and 4). Whenever state bad is left in this copy, reward 0 is obtained, and whenever state good is left, reward −1 is obtained. The third part checks whether agent 1 puts the same tiles into the last row of its square as agent 3 puts into the first row of its square. (See L3,4 in Figure 3 as an example.) If this succeeds, the performance of the third part equals 0; otherwise it has performance 1. These three parts run in parallel.\n\nIf agents 1 and 2 have a tiling for the first square, the performance of the first part equals 1.\n\n• If agents 3 and 4 are able to continue this tiling through their square, the performance of the second part equals −1 and the performance of the third part equals 0. In total, the performance of the POSG under these policies equals 0.\n\n• If agents 3 and 4 are not able to continue this tiling through their square, then the performance of part 2 and part 3 is strictly greater than −1. 
In total, the performance of the POSG under these policies is > 0.\n\nLemmas 3.3 and 3.4 together yield completeness of the competing agents problem.\n\nTheorem 3.5 For every k ≥ 2, the competing agents problem for 2k agents is complete for NEXP^NP. □\n\n4 Conclusion\n\nWe have shown that competition makes life\u2014and computation\u2014more complex. However, in order to do so, we needed teamwork. It is not yet clear what the complexity is of determining the existence of a good strategy for Player I in a 2-person POSG, or a 1-against-many POSG.\nThere are other variations that can be shown to be complete for NEXP^NP, a complexity class that, shockingly, has not been well explored. We look forward to further results about the complexity of POSGs, and to additional NEXP^NP-completeness results for familiar AI and ML problems.\n\nReferences\n\n[1] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Math. Oper. Res., 27(4):819\u2013840, 2002.\n[2] E. Hansen, D. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), pages 709\u2013715, 2004.\n[3] Hao Wang. Proving theorems by pattern recognition II. Bell System Technical Journal, 40:1\u201342, 1961.\n[4] M. Savelsbergh and P. van Emde Boas. Bounded tiling, an alternative to satisfiability. In Gerd Wechsung, editor, 2nd Frege Conference, volume 20 of Mathematische Forschung, pages 354\u2013363. Akademie Verlag, Berlin, 1984.\n[5] C.H. Papadimitriou and J.N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441\u2013450, 1987.\n[6] Martin Mundhenk, Judy Goldsmith, Christopher Lusena, and Eric Allender. Complexity results for finite-horizon Markov decision process problems. 
Journal of the ACM, 47(4):681\u2013720, 2000.\n", "award": [], "sourceid": 599, "authors": [{"given_name": "Judy", "family_name": "Goldsmith", "institution": null}, {"given_name": "Martin", "family_name": "Mundhenk", "institution": null}]}