{"title": "Learning to Correlate in Multi-Player General-Sum Sequential Games", "book": "Advances in Neural Information Processing Systems", "page_first": 13076, "page_last": 13086, "abstract": "In the context of multi-player, general-sum games, there is a growing interest in solution concepts involving some form of communication among players, since they can lead to socially better outcomes with respect to Nash equilibria and may be reached through learning dynamics in a decentralized fashion. In this paper, we focus on coarse correlated equilibria (CCEs) in sequential games. First, we complete the picture on the complexity of finding social-welfare-maximizing CCEs by proving that the problem is not in Poly-APX, unless P = NP, in games with three or more players (including chance). Then, we provide simple arguments showing that CFR---working with behavioral strategies---may not converge to a CCE in multi-player, general-sum sequential games. In order to amend this issue, we devise two variants of CFR that provably converge to a CCE. The first one (CFR-S) is a simple stochastic adaptation of CFR which employs sampling to build a correlated strategy, whereas the second variant (called CFR-Jr) enhances CFR with a more involved reconstruction procedure to recover correlated strategies from behavioral ones. 
Experiments on a rich testbed of multi-player, general-sum sequential games show that both CFR-S and CFR-Jr are dramatically faster than the state-of-the-art algorithms to compute CCEs, with CFR-Jr also being a good heuristic to find socially-optimal CCEs.", "full_text": "Learning to Correlate in Multi-Player General-Sum Sequential Games

Andrea Celli* (Politecnico di Milano, andrea.celli@polimi.it)
Alberto Marchesi* (Politecnico di Milano, alberto.marchesi@polimi.it)
Tommaso Bianchi (Politecnico di Milano, tommaso4.bianchi@mail.polimi.it)
Nicola Gatti (Politecnico di Milano, nicola.gatti@polimi.it)

Abstract

In the context of multi-player, general-sum games, there is a growing interest in solution concepts involving some form of communication among players, since they can lead to socially better outcomes with respect to Nash equilibria and may be reached through learning dynamics in a decentralized fashion. In this paper, we focus on coarse correlated equilibria (CCEs) in sequential games. First, we complete the picture on the complexity of finding social-welfare-maximizing CCEs by proving that the problem is not in Poly-APX, unless P = NP, in games with three or more players (including chance). Then, we provide simple arguments showing that CFR—working with behavioral strategies—may not converge to a CCE in multi-player, general-sum sequential games. In order to amend this issue, we devise two variants of CFR that provably converge to a CCE. The first one (CFR-S) is a simple stochastic adaptation of CFR which employs sampling to build a correlated strategy, whereas the second variant (called CFR-Jr) enhances CFR with a more involved reconstruction procedure to recover correlated strategies from behavioral ones.
Experiments on a rich testbed of multi-player, general-sum sequential games show that both CFR-S and CFR-Jr are dramatically faster than the state-of-the-art algorithms to compute CCEs, with CFR-Jr also being a good heuristic to find socially-optimal CCEs.

1 Introduction

A number of recent studies explore relaxations of the classical notion of equilibrium (i.e., the Nash equilibrium (NE) [30]) which make it possible to model communication among the players [3, 14, 34]. Communication naturally brings about the possibility of playing correlated strategies. These are customarily modeled through a trusted external mediator who privately recommends actions to the players [1]. In particular, a correlated strategy is a correlated equilibrium (CE) if each player has no incentive to deviate from the recommendation, assuming the other players would not deviate either. A popular variation of the CE is the coarse correlated equilibrium (CCE), which only prevents deviations made before knowing the recommendation [28]. In sequential games, CEs and CCEs are well-suited for scenarios where the players have limited communication capabilities and can only communicate before the game starts, such as, e.g., military settings where field units have no time or means of communicating during a battle, collusion in auctions where communication is illegal during bidding, and, in general, any setting with costly communication channels or blocking environments.

* Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

CCEs present a number of appealing properties. A CCE can be reached through simple (no-regret) learning dynamics in a decentralized fashion [17, 22], and, in several classes of games (such as, e.g., normal-form and succinct games [31, 26]), it can be computed exactly in time polynomial in the size of the input.
Furthermore, an optimal (i.e., social-welfare-maximizing) CCE may provide arbitrarily larger welfare than an optimal CE, which, in turn, may provide arbitrarily better welfare than an optimal NE [12]. Although the problem of finding an optimal CCE is NP-hard for some game classes (such as, e.g., graphical, polymatrix, congestion, and anonymous games [3]), Roughgarden [34] shows that the CCEs reached through regret-minimizing procedures have near-optimal social welfare when the (λ, µ)-smoothness condition holds. This happens, e.g., in some specific auctions, in congestion games, and even in Bayesian settings, as shown by Hartline et al. [23]. Thus, decentralized computation via learning dynamics, computational efficiency, and welfare optimality make the CCE one of the most interesting solution concepts for practical applications. However, the problem of computing CCEs has been addressed only for some specific games with particular structures [3, 23]. In this work, we study how to compute CCEs in the general class of games which are sequential, general-sum, and multi-player. This is a crucial advancement in CCE computation, as sequential games provide a model of strategic interaction which is richer and closer to real-world situations than the normal form.

In sequential games, it is known that, when there are two players and no chance moves, an optimal CCE can be computed in polynomial time [12]. Celli et al. [12] also provide an algorithm (with no polynomiality guarantees) to compute solutions in multi-player games, using a column-generation procedure with a MILP pricing oracle. As for computing approximate CCEs, in the normal-form setting, any Hannan consistent regret-minimizing procedure for simplex decision spaces may be employed to approach the set of CCEs [5, 13]—the most common of such techniques is regret matching (RM) [4, 22].
However, approaching the set of CCEs in sequential games is more demanding. One could represent the sequential game with its equivalent normal form and apply RM to it. However, this would result in a guarantee on the cumulative regret which is exponential in the size of the game tree (see Section 2). Thus, reaching a good approximation of a CCE could require an exponential number of iterations. The problem of designing learning algorithms that avoid the construction of the normal form has been successfully addressed in sequential games for the two-player, zero-sum setting. This is done by decomposing the overall regret locally at the information sets of the game [15]. The most widely adopted of such approaches are counterfactual regret minimization (CFR) [43] and CFR+ [39, 38], which originated variants such as those introduced by Brown and Sandholm [9] and Brown et al. [10]. These techniques were the key to many recent remarkable results [6, 7, 8?]. However, these algorithms work with players' behavioral strategies rather than with correlated strategies, and, thus, they are not guaranteed to approach CCEs in general-sum games, even with two players. The only known theoretical guarantee of CFR when applied to multi-player, general-sum games is that it excludes dominated actions [19]. Some works also attempt to apply CFR to multi-player, zero-sum games, see, e.g., [32].

Original contributions First, we complete the picture on the computational complexity of finding an optimal CCE in sequential games, showing that the problem is inapproximable (i.e., not in Poly-APX), unless P = NP, in games with three or more players (chance included). In the rest of the paper, we focus on how to compute approximate CCEs in multi-player, general-sum, sequential games using no-regret-learning procedures.
We start by pointing out simple examples where CFR-like algorithms available in the literature cannot be directly employed for our purpose, as they only provide players' average behavioral strategies, whose product is not guaranteed to converge to an approximate CCE. However, we show how CFR can be easily adapted to approach the set of CCEs in multi-player, general-sum sequential games by resorting to sampling procedures (we call the resulting, naïve algorithm CFR-S). Then, we design an enhanced version of CFR (called CFR-Jr) which computes an average correlated strategy guaranteed to converge to an approximate CCE with a bound on the regret sub-linear in the size of the game tree. The key component of CFR-Jr is a polynomial-time algorithm which constructs, at each iteration, the players' normal-form strategies by working directly on the game tree, avoiding building the (exponential-sized) normal-form representation. We evaluate the scalability of CFR-S and CFR-Jr on a rich testbed of multi-player, general-sum sequential games. Both algorithms solve instances which are orders of magnitude larger than those solved by previous state-of-the-art CCE-finding techniques. Moreover, CFR-Jr proved to be a good heuristic to compute optimal CCEs, returning nearly-socially-optimal solutions in all the instances of our testbeds. Finally, we also test our algorithms against CFR in multi-player, general-sum games, showing that, in several instances of our testbed, CFR does not converge to a CCE and returns solutions providing a social welfare considerably lower than that achieved with CFR-S and CFR-Jr.

2 Preliminaries

In this section, we introduce some basic concepts which are used in the rest of the paper (see Shoham and Leyton-Brown [36] and Cesa-Bianchi and Lugosi [13] for further details).

2.1 Extensive-form games and relevant solution concepts

We focus on extensive-form games (EFGs) with imperfect information and perfect recall.
We denote the set of players as P ∪ {c}, where c is the Nature (chance) player (representing exogenous stochasticity), who selects actions with a fixed, known probability distribution. H is the set of nodes of the game tree, and a node h ∈ H is identified by the ordered sequence of actions from the root to the node. Z ⊆ H is the set of terminal nodes, which are the leaves of the game tree. For every h ∈ H \ Z, we let P(h) be the unique player who acts at h and A(h) be the set of actions she has available. We write h·a to denote the node reached when a ∈ A(h) is played at h. For each player i ∈ P, u_i : Z → R is the payoff function. We denote by Δ the maximum range of payoffs in the game, i.e., Δ = max_{i∈P} (max_{z∈Z} u_i(z) − min_{z∈Z} u_i(z)).

We represent imperfect information using information sets (from here on, infosets). Any infoset I belongs to a unique player i, and it groups nodes which are indistinguishable for that player, i.e., A(h) = A(h′) for any pair of nodes h, h′ ∈ I. I_i denotes the set of all of player i's infosets, which form a partition of {h ∈ H | P(h) = i}. We denote by A(I) the set of actions available at infoset I. In perfect-recall games, the infosets are such that no player forgets information once acquired.

We denote by π_i a behavioral strategy of player i, which is a vector defining a probability distribution at each of player i's infosets. Given π_i, we let π_{i,I} be the (sub)vector representing the probability distribution at I ∈ I_i, with π_{i,I,a} denoting the probability of choosing action a ∈ A(I).

An EFG has an equivalent tabular (normal-form) representation. A normal-form plan for player i is a vector σ_i ∈ Σ_i = ×_{I∈I_i} A(I) which specifies an action for each of player i's infosets.
Then, an EFG is described through a |P|-dimensional matrix specifying a utility for each player at each joint normal-form plan σ ∈ Σ = ×_{i∈P} Σ_i. The expected payoff of player i, when she plays σ_i ∈ Σ_i and the opponents play normal-form plans σ_{−i} ∈ Σ_{−i} = ×_{j∈P, j≠i} Σ_j, is denoted, with an overload of notation, by u_i(σ_i, σ_{−i}). Finally, a normal-form strategy x_i is a probability distribution over Σ_i. We denote by X_i the set of the normal-form strategies of player i. Moreover, X denotes the set of joint probability distributions defined over Σ.

We also introduce the following notation. We let ρ^{π_i} be a vector in which each component ρ^{π_i}_z is the probability of reaching the terminal node z ∈ Z, given that player i adopts the behavioral strategy π_i and the other players play so as to reach z. Similarly, given a normal-form plan σ_i ∈ Σ_i, we define the vector ρ^{σ_i}. Moreover, with an abuse of notation, ρ^{π_i}_I and ρ^{σ_i}_I denote the probability of reaching infoset I ∈ I_i. Finally, Z(σ_i) ⊆ Z is the subset of terminal nodes which are (potentially) reachable if player i plays according to σ_i ∈ Σ_i.

The classical notion of CE by Aumann [1] models correlation via the introduction of an external mediator who, before the play, draws a joint normal-form plan σ* ∈ Σ according to a publicly known x* ∈ X, and privately communicates each recommendation σ*_i to the corresponding player. After observing their recommended plan, each player decides whether to follow it or not. A CCE is a relaxation of the CE, defined by Moulin and Vial [28], which enforces protection against deviations which are independent of the sampled joint normal-form plan.

Definition 1.
A CCE of an EFG is a probability distribution x* ∈ X such that, for every i ∈ P and every σ′_i ∈ Σ_i, it holds:

$$\sum_{\sigma_i \in \Sigma_i} \sum_{\sigma_{-i} \in \Sigma_{-i}} x^*(\sigma_i, \sigma_{-i}) \left( u_i(\sigma_i, \sigma_{-i}) - u_i(\sigma'_i, \sigma_{-i}) \right) \geq 0.$$

CCEs differ from CEs in that a CCE only requires that following the suggested plan be a best response in expectation, before the recommended plan is actually revealed. In both equilibrium concepts, the entire probability distribution according to which recommendations are drawn is revealed before the game starts. After that, each player commits to playing a normal-form plan (see Appendix A for further details on the various notions of correlated equilibrium in EFGs). An NE [30] is a CCE which can be written as a product of players' normal-form strategies x*_i ∈ X_i. In conclusion, an ε-CCE is a relaxation of a CCE in which every player has an incentive to deviate less than or equal to ε (the same definition holds true for ε-CE and ε-NE).

2.2 Regret and regret minimization

In the online convex optimization framework [42], each player i plays repeatedly against an unknown environment by making a series of decisions x_i^1, x_i^2, ..., x_i^t. In the basic setting, the decision space of player i is the whole normal-form strategy space X_i. At iteration t, after selecting x_i^t, player i observes a utility u_i^t(x_i^t). The cumulative external regret of player i up to iteration T is defined as

$$R_i^T = \max_{\hat{x}_i \in X_i} \sum_{t=1}^{T} u_i^t(\hat{x}_i) - \sum_{t=1}^{T} u_i^t(x_i^t). \quad (1)$$

A regret minimizer is a function providing the next strategy x_i^{t+1} of player i on the basis of the past history of play and the observed utilities up to iteration t.
A desirable property for regret minimizers is Hannan consistency [21], which requires that lim sup_{T→∞} (1/T) R_i^T ≤ 0, i.e., that the cumulative regret grows at a sublinear rate in the number of iterations T.

In an EFG, the regret can be defined at each infoset. After T iterations, the cumulative regret for not having selected action a ∈ A(I) at infoset I ∈ I_i (denoted by R_I^T(a)) is the cumulative difference in utility that player i would have experienced by selecting a at I instead of following the behavioral strategy π_i^t at each iteration t up to T. Then, the regret for player i at infoset I ∈ I_i is defined as R_I^T = max_{a∈A(I)} R_I^T(a). Moreover, we let R_I^{T,+}(a) = max{R_I^T(a), 0}.

RM [22] is the most widely adopted regret-minimizing scheme when the decision space is X_i (e.g., in normal-form games). In the context of EFGs, RM is usually applied locally at each infoset, where the player selects a distribution over the available actions proportionally to their positive regrets. Specifically, at iteration T + 1, player i selects actions a ∈ A(I) according to the following probability distribution:

$$\pi_{i,I,a}^{T+1} = \begin{cases} \dfrac{R_I^{T,+}(a)}{\sum_{a' \in A(I)} R_I^{T,+}(a')} & \text{if } \sum_{a' \in A(I)} R_I^{T,+}(a') > 0, \\ \dfrac{1}{|A(I)|} & \text{otherwise.} \end{cases}$$

Playing according to RM at each iteration guarantees, at iteration T, R_I^T ≤ Δ √|A(I)| √T [13]. CFR [43] is an anytime algorithm to compute ε-NEs in two-player, zero-sum EFGs. CFR minimizes the external regret R_i^T by employing RM locally at each infoset. In two-player, zero-sum games, if both players have cumulative regrets such that (1/T) R_i^T ≤ ε, then their average behavioral strategies are a 2ε-NE [41].
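As a concrete illustration of the RM update rule above, the following self-contained sketch runs RM in self-play on a small two-player game; the 2×2 payoff matrices are our own toy choice (a Prisoner's-Dilemma-style game), not an instance from the paper:

```python
def regret_matching(cum_regret):
    """One step of regret matching (RM): normalize the positive parts of the
    cumulative regrets; fall back to the uniform distribution if none is positive."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n

# Toy self-play demo (hypothetical payoffs, chosen for illustration only).
A = [[3.0, 0.0], [5.0, 1.0]]  # row player's payoffs
B = [[3.0, 5.0], [0.0, 1.0]]  # column player's payoffs
r1, r2 = [0.0, 0.0], [0.0, 0.0]  # cumulative regrets of the two players
T = 10000
for _ in range(T):
    p1, p2 = regret_matching(r1), regret_matching(r2)
    # expected utility of each pure action against the opponent's mixed strategy
    u1 = [sum(A[a][b] * p2[b] for b in range(2)) for a in range(2)]
    u2 = [sum(B[a][b] * p1[a] for a in range(2)) for b in range(2)]
    ev1 = sum(p1[a] * u1[a] for a in range(2))
    ev2 = sum(p2[b] * u2[b] for b in range(2))
    for a in range(2):
        r1[a] += u1[a] - ev1  # instantaneous regret of action a
        r2[a] += u2[a] - ev2
avg_regret = max(max(r1), max(r2), 0.0) / T
```

By the bound quoted above, the average regret `avg_regret` is guaranteed to vanish at rate Δ√|A|/√T, so a longer run yields a tighter ε in the ε-CCE guarantee.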
CFR+ is a variation of classical CFR which exhibits better practical performance [39]. However, it uses alternation (i.e., it alternates which player updates her regrets at each iteration), which complicates the theoretical analysis required to prove convergence [15, 11].

3 Hardness of approximating optimal CCEs

We address the following question: given an EFG, can we find a social-welfare-maximizing CCE in polynomial time? As shown by Celli et al. [12], the answer is yes in two-player EFGs without Nature. Here, we give a negative answer to the question in the remaining cases, i.e., two-player EFGs with Nature (Theorem 1) and EFGs with three or more players without Nature (Theorem 2). Specifically, we provide an even stronger negative result: there is no polynomial-time approximation algorithm which finds a CCE whose value approximates that of a social-welfare-maximizing CCE up to any polynomial factor in the input size, unless P = NP.2 We prove our results by means of a reduction from SAT, a well-known NP-complete problem [18], which reads as follows.

2 Formally, an r-approximation algorithm A for a maximization problem is such that OPT/APX ≤ r, where OPT is the value of an optimal solution to the problem instance and APX is the value of the solution returned by A. See [2] for additional details on approximation algorithms.

Figure 1: Left: Example of game for the reduction of Theorem 1, where V = {x, y, z}, C = {φ1, φ2, φ3}, φ1 = x, φ2 = x̄ ∨ y, and φ3 = x̄ ∨ ȳ. Right: Example of game for the reduction of Theorem 2, with V and C as before. In the figures, • denotes player 1, ◦ player 2, and □ Nature/player 3.

Definition 2 (SAT).
Given a finite set C of clauses defined over a finite set V of variables, is there a truth assignment to the variables which satisfies all clauses?

For clarity, Figure 1 shows concrete examples of the EFGs employed in the reductions of Theorems 1 and 2. Here, we only provide proof sketches; full proofs are reported in Appendix B.

Theorem 1. Given a two-player EFG with Nature, the problem of computing a social-welfare-maximizing CCE is not in Poly-APX unless P = NP.3

Proof sketch. An example of our reduction from SAT is provided on the left of Figure 1. Its main idea is the following: player 2 selects a truth assignment to the variables, while player 1 chooses a literal for each clause in order to satisfy it. It can be proved that there exists a CCE in which each player gets utility 1 if and only if SAT is satisfiable (as player 1 selects a_I); otherwise, player 1 plays a_O in any CCE and the social welfare is ε. Assume there is a polynomial-time poly(η)-approximation algorithm A. If SAT is satisfiable, A would return a CCE with social welfare at least 2/poly(η). Since, for η sufficiently large, it holds that 2/poly(η) > 1/2^η, A would allow us to decide in polynomial time whether SAT is satisfiable, leading to a contradiction unless P = NP.

Theorem 2. Given a three-player EFG without Nature, the problem of computing a social-welfare-maximizing CCE is not in Poly-APX unless P = NP.

Proof sketch. An example of our reduction from SAT is provided on the right of Figure 1. It is based on the same idea as the previous proof, where the uniform probability distribution played by Nature is simulated by a particular game gadget (requiring a third player).

3 Poly-APX is the class of optimization problems admitting a polynomial-time poly(η)-approximation algorithm, where poly(η) is a polynomial function of the input size η [2].

4 CFR in multi-player general-sum sequential games

In this section, we first highlight why CFR cannot be directly employed when computing CCEs of general-sum games.
Then, we show a simple way to amend it.

4.1 Convergence to CCEs in general-sum games

When players follow strategies recommended by a regret minimizer, the empirical frequency of play approaches the set of CCEs [13]. Suppose that, at time t, the players play a joint normal-form plan σ^t ∈ Σ drawn according to their current strategies. Then, the empirical frequency of play after T iterations is defined as the joint probability distribution x̄^T ∈ X such that x̄^T(σ) := |{t ≤ T : σ^t = σ}| / T for every σ ∈ Σ. However, vanilla CFR and its most popular variations (such as, e.g., CFR+ [39] and DCFR [9]) do not keep track of the empirical frequency of play, as they only keep track of the players' average behavioral strategies. This ensures that the strategies are compactly represented, but it is not sufficient to recover a CCE in multi-player, general-sum games.

Figure 2: Left: Game where x̄_1^T ⊗ x̄_2^T does not converge to a CCE, with payoffs as follows (row player's payoff listed first):

        σL      σR
σL    1, 1    1, 0
σR    0, 1    1, 1

Right: Approximation attained by x̄^T and x̄_1^T ⊗ x̄_2^T when RM is applied to a variation of the Shapley game (see Appendix C).
Indeed, it is possible to show that, even in normal-form games, if the players play according to some regret-minimizing strategies, then the product distribution x ∈ X resulting from the players' (marginal) average strategies may not converge to a CCE. In order to see this, we provide the following simple example.

Example Consider the two-player normal-form game depicted on the left in Figure 2. At iteration t, let the players' strategies x_1^t and x_2^t be such that x_1^t(σL) = x_2^t(σL) = (t + 1) mod 2. Clearly, u_1^t(x_1^t) = u_2^t(x_2^t) = 1 for any t. For both players, at iteration t, the regret of not having played σL is 0, while the regret of σR is −1 if and only if t is even, otherwise it is 0. As a result, after T iterations, R_1^T = R_2^T = 0, and, thus, x_1^t and x_2^t minimize the cumulative external regret. The players' average strategies x̄_1^T = (1/T) Σ_{t=1}^T x_1^t and x̄_2^T = (1/T) Σ_{t=1}^T x_2^t converge to (1/2, 1/2) as T → ∞. However, x ∈ X such that x(σ) = 1/4 for every σ ∈ Σ is not a CCE of the game. Indeed, a player is always better off playing σL, obtaining a utility of 1, while she only gets 3/4 if she chooses to stick to x. We remark that x̄^T converges, as T → ∞, to x ∈ X with x(σL, σL) = x(σR, σR) = 1/2, which is a CCE.

The example above employs handpicked regret-minimizing strategies, but similar examples can be easily found when applying common regret minimizers. As an illustrative case, Figure 2 shows, on the right, that, even with a simple variation of the Shapley game (see Appendix C), the outer product of the average strategies x̄_1^T ⊗ x̄_2^T obtained via RM does not converge to a CCE as T → ∞.
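The example can be checked numerically. The sketch below encodes a payoff table consistent with the example's description (coordinated play yields 1 to both players, and unilaterally switching to σL always yields 1) and evaluates the deviation incentive of Definition 1 for both the uniform product distribution and the empirical frequency concentrated on the diagonal; the encoding and function names are our own:

```python
from itertools import product

# Payoff table of the 2x2 example game: keys are joint plans with
# 0 = sigma_L and 1 = sigma_R, values are the pairs (u1, u2).
U = {(0, 0): (1, 1), (0, 1): (1, 0),
     (1, 0): (0, 1), (1, 1): (1, 1)}

def cce_gap(x):
    """Largest incentive to deviate from the joint distribution x
    (Definition 1); a nonpositive gap means x is a CCE."""
    gap = 0.0
    for i in (0, 1):
        stick = sum(p * U[s][i] for s, p in x.items())
        for dev in (0, 1):
            dev_value = sum(
                p * U[(dev, s[1]) if i == 0 else (s[0], dev)][i]
                for s, p in x.items())
            gap = max(gap, dev_value - stick)
    return gap

uniform = {s: 0.25 for s in product((0, 1), repeat=2)}  # product of marginals
diagonal = {(0, 0): 0.5, (1, 1): 0.5}                   # empirical frequency
```

Here `cce_gap(uniform)` is 1 − 3/4 = 0.25 (not a CCE), while `cce_gap(diagonal)` is 0 (a CCE), matching the discussion in the example.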
It is clear that the same issue may (and does, see Figures 3 and 5) happen when directly applying CFR to general-sum EFGs.

4.2 CFR with sampling (CFR-S)

Motivated by the previous examples, we describe a simple variation of CFR guaranteeing approachability of the set of CCEs even in multi-player, general-sum EFGs. Vanilla CFR proceeds as follows (see Subsection 2.2 for the details): at each iteration t, and for each infoset I ∈ I_i, player i observes the realized utility for each action a ∈ A(I), and then computes π_{i,I}^t according to standard RM. Once π_{i,I}^t has been computed, it is used by the regret minimizers of the infosets on the path from the root to I so as to compute the observed utilities. We propose CFR with sampling (CFR-S) as a simple way to keep track of the empirical frequency of play. The basic idea is to let each player i, at each t, draw σ_i^t according to her current strategy.

Algorithm 1 CFR-S for player i
1: function CFR-S(Γ, i)
2:   Initialize a regret minimizer for each I ∈ I_i
3:   t ← 1
4:   while t < T do
5:     σ_i^t ← RECOMMEND(I_∅)
6:     Observe u_i^t(σ_i) := u_i(σ_i, σ_{−i}^t)
7:     UPDATE(I_∅, σ_i^t, u_i^t)
8:     t ← t + 1

Algorithm 1 describes the structure of CFR-S, where the function RECOMMEND builds a normal-form plan σ_i^t by sampling, at each I ∈ I_i, an action in A(I) according to the π_i^t computed via RM, and UPDATE updates the average regrets local to each regret minimizer by propagating utilities according to σ_i^t. Each player i experiences utilities depending, at each t, on the sampled plans σ_{−i}^t (Line 6). The joint normal-form plans σ^t := (σ_i^t, σ_{−i}^t) can be easily stored to compute the empirical frequency of play.
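The bookkeeping added by CFR-S can be sketched as follows. The dictionary-based encoding of behavioral strategies and the fixed per-infoset distributions are simplifying assumptions of this sketch (in CFR-S the distributions would come from the per-infoset regret minimizers of Algorithm 1):

```python
import random
from collections import Counter

def sample_plan(behavioral, rng):
    """The RECOMMEND step, sketched: draw a normal-form plan by sampling one
    action at every infoset. `behavioral` maps an infoset label to a list of
    action probabilities (a toy encoding of ours)."""
    return tuple(
        rng.choices(range(len(probs)), weights=probs)[0]
        for infoset, probs in sorted(behavioral.items())
    )

rng = random.Random(0)
counts = Counter()
T = 1000
for _ in range(T):
    # fixed distributions stand in for the RM strategies at iteration t
    plan_p1 = sample_plan({"I1": [0.7, 0.3]}, rng)
    plan_p2 = sample_plan({"J1": [0.5, 0.5]}, rng)
    counts[(plan_p1, plan_p2)] += 1  # store the joint normal-form plan

# empirical frequency of play after T iterations (the distribution approaching
# the set of CCEs when the strategies are regret-minimizing)
empirical = {joint: c / T for joint, c in counts.items()}
```

Storing only the counter keeps the memory footprint proportional to the number of distinct joint plans actually sampled, rather than to |Σ|.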
We state the following (see Appendix C for detailed proofs):

Theorem 3. The empirical frequency of play x̄^T obtained with CFR-S converges to a CCE almost surely, for T → ∞.

Moreover, the average regret vanishes at rate O(T^{−1/2}). This result is in line with the approach of Hart and Mas-Colell [22] in normal-form games. Despite its simplicity, we show (see Section 6 for an experimental evaluation) that it is possible to achieve better performance via a smarter reconstruction technique that keeps CFR deterministic, avoiding any sampling step.

5 CFR with joint distribution reconstruction (CFR-Jr)

We design a new method—called CFR with joint distribution reconstruction (CFR-Jr)—to enhance CFR so as to approach the set of CCEs in multi-player, general-sum EFGs. Differently from the naïve CFR-S algorithm, CFR-Jr does not sample normal-form plans, thus avoiding any stochasticity. The main idea behind CFR-Jr is to keep track of the average joint probability distribution x̄^T ∈ X arising from the regret-minimizing strategies built with CFR. Formally, x̄^T = (1/T) Σ_{t=1}^T x^t, where x^t ∈ X is the joint probability distribution defined as the product of the players' normal-form strategies at iteration t. At each t, CFR-Jr computes π_i^t with CFR's update rules, and then constructs a strategy x_i^t ∈ X_i which is realization equivalent (i.e., it induces the same probability distribution on the terminal nodes, see [36] for a formal definition) to π_i^t. We do this efficiently by directly working on the game tree, without resorting to the normal-form representation. The strategies x_i^t are then employed to compute x^t.
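The reconstruction step just described can be sketched on a toy tree as follows. The encoding of terminal-node reach probabilities (`omega`), the plan-to-leaves map (`plans`), and the tie-breaking are simplifying assumptions of this sketch, not the paper's implementation:

```python
def reconstruct(omega, plans):
    """Greedy reconstruction of a normal-form strategy realization-equivalent
    to a behavioral strategy (a sketch of CFR-Jr's inner step). `omega` maps
    each terminal node to the reach probability induced by the player's
    behavioral strategy; `plans` maps each normal-form plan to the set of
    terminal nodes it can reach."""
    omega = dict(omega)  # work on a copy; the caller's vector is untouched
    x = {}
    while any(w > 1e-12 for w in omega.values()):
        # plan maximizing the minimum remaining probability over its leaves
        best = max(plans, key=lambda s: min(omega[z] for z in plans[s]))
        w = min(omega[z] for z in plans[best])
        if w <= 1e-12:
            break
        x[best] = x.get(best, 0.0) + w   # assign mass w to the chosen plan
        for z in plans[best]:
            omega[z] -= w                # at least one leaf is zeroed here
    return x

# Toy one-player tree: at I1 play a (prob 0.6) -> I2, where c (0.5) reaches z1
# and d (0.5) reaches z2; playing b (prob 0.4) reaches z3 directly.
omega = {"z1": 0.3, "z2": 0.3, "z3": 0.4}
plans = {("a", "c"): {"z1"}, ("a", "d"): {"z2"},
         ("b", "c"): {"z3"}, ("b", "d"): {"z3"}}
x = reconstruct(omega, plans)
```

The returned `x` is a distribution over plans inducing the same leaf probabilities as `omega`, with support size bounded by the number of terminal nodes, in line with the guarantees stated for the paper's Algorithm 2.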
The pseudocode of CFR-Jr is provided in Appendix D.

Algorithm 2 Reconstruct x_i from π_i
1: function NF-STRATEGY-RECONSTRUCTION(π_i)
2:   X ← ∅                              ▷ X is a dictionary defining x_i
3:   ω_z ← ρ^{π_i}_z for all z ∈ Z
4:   while ω > 0 do
5:     σ̄_i ← argmax_{σ_i∈Σ_i} min_{z∈Z(σ_i)} ω_z
6:     ω̄ ← min_{z∈Z(σ̄_i)} ω_z
7:     X ← X ∪ (σ̄_i, ω̄)
8:     ω ← ω − ω̄ ρ^{σ̄_i}
9:   return x_i built from the pairs in X

Algorithm 2 shows a polynomial-time procedure to compute a normal-form strategy x_i ∈ X_i realization equivalent to a given behavioral strategy π_i. The algorithm maintains a vector ω which is initialized with the probabilities of reaching the terminal nodes by playing π_i (Line 3), and it works by iteratively assigning probability to normal-form plans so as to induce the same distribution as ω over Z.4 In order for this to work, at each iteration, the algorithm must pick a normal-form plan σ̄_i ∈ Σ_i which maximizes the minimum (remaining) probability ω_z over the terminal nodes z ∈ Z(σ̄_i) reachable when playing σ̄_i (Line 5). Then, the probabilities ω_z for z ∈ Z(σ̄_i) are decreased by the minimum (remaining) probability ω̄ corresponding to σ̄_i, and σ̄_i is assigned probability ω̄ in x_i. The algorithm terminates when the vector ω is zeroed, returning a normal-form strategy x_i realization equivalent to π_i. This is formally stated by the following result, which also provides a polynomial (in the size of the game tree) upper bound on the run time of the algorithm and on the support size of the returned normal-form strategy x_i.5

Theorem 4.
Algorithm 2 outputs a normal-form strategy x_i ∈ X_i realization equivalent to a given behavioral strategy π_i, and it runs in time O(|Z|²). Moreover, x_i has support size at most |Z|.

Intuitively, the result in Theorem 4 (its full proof is in Appendix D) relies on the crucial observation that, at each iteration, there is at least one terminal node z ∈ Z whose corresponding probability ω_z is zeroed during that iteration. The algorithm is guaranteed to terminate since each ω_z is never negative, which is the case given how the normal-form plans are selected (Line 5) and since the game has perfect recall. This guarantees that the algorithm terminates in at most |Z| iterations.

Finally, the following theorem (whose full proof is in Appendix D) proves that the average distribution x̄^T obtained with CFR-Jr approaches the set of CCEs. Formally:

Theorem 5. If (1/T) R_i^T ≤ ε for each player i ∈ P, then x̄^T obtained with CFR-Jr is an ε-CCE.

This is a direct consequence of the connection between regret-minimizing procedures and CCEs, and of the fact that x̄^T is obtained by averaging the products of normal-form strategies which are realization equivalent to the regret-minimizing behavioral strategies obtained with CFR.

4 Vector ω is a realization-form strategy, as defined by Farina et al. [14, Definition 2].
5 Given a normal-form strategy x_i ∈ X_i, its support is defined as the set of σ_i ∈ Σ_i such that x_i(σ_i) > 0.

Table 1: Comparison between the run time and the social welfare of CFR-S, CFR-Jr (for various levels of accuracy α), and the CG algorithm. General-sum instances are marked with ⋆. Results of CFR-S are averaged over 50 runs. We generated 20 instances for each Rp-d family.
> 24h means that the execution of the algorithm was stopped before its completion, after running for 24 hours.

6 Experimental evaluation

We experimentally evaluate CFR-Jr, comparing its performance with that of CFR-S, CFR, and the state-of-the-art algorithm for computing optimal CCEs (denoted by CG) [12].6 This algorithm is a variation of the simplex method employing a column generation technique based on a MILP pricing oracle (we use the GUROBI 8.0 MILP solver). Notice that directly applying RM on the normal form is not feasible, as |Σ| > 10^20 even for the smallest instances. Further results are in Appendix E.

Setup  We conduct experiments on parametric instances of three-player Kuhn poker games [27], three-player Leduc hold'em poker games [37], two/three-player Goofspiel games [33], and some randomly generated general-sum EFGs. The two-player, zero-sum versions of these games are standard benchmarks for imperfect-information game solving. In Appendix E, we describe their multi-player, general-sum counterparts. Each instance is identified by parameters p and r, which denote, respectively, the number of players and the number of ranks in the deck of cards. For example, a three-player Kuhn game with rank four is denoted by Kuhn3-4, or K3-4. We use different tie-breaking rules for the Goofspiel instances (denoted by A, DA, DH, AL; see Appendix E). Moreover, Rp-d denotes a random game instance with p players and depth of the game tree d.

Convergence  We evaluate the run time required by the algorithms to find an approximate CCE. The results are provided in Table 1, which reports the run time needed by CFR-S and CFR-Jr to achieve solutions with different levels of accuracy, and the time needed by CG for reaching an equilibrium.7 The accuracy α of the ε-CCEs reached is defined as α = ε/Δ. 
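To make this accuracy measure concrete, the following minimal sketch (the helper name is ours, not the paper's code) recovers ε from the players' cumulative regrets as in Theorem 5, and normalizes it, assuming Δ denotes the range of attainable utilities:

```python
# Illustrative sketch: accuracy alpha = eps / delta of the average profile.
# Assumption (ours): delta is the range of attainable utilities.

def cce_accuracy(cum_regrets, T, u_min, u_max):
    """cum_regrets[i] is player i's cumulative regret R_i^T after T iters.

    By Theorem 5, the average profile is an eps-CCE for
    eps = max_i R_i^T / T; alpha normalizes eps by delta = u_max - u_min.
    """
    eps = max(r / T for r in cum_regrets)
    delta = u_max - u_min
    return eps / delta

# e.g., cumulative regrets (12.0, 8.0, 20.0) after T = 10000 iterations,
# utilities in [0, 4]: eps = 0.002, so alpha = 0.0005
alpha = cce_accuracy([12.0, 8.0, 20.0], 10_000, 0.0, 4.0)
```

Normalizing by the payoff range makes the α thresholds in Table 1 comparable across games with different utility scales.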
Both CFR-S and CFR-Jr consistently outperform CG, as the latter fails to find a CCE within the 24-hour time limit in all instances except the smallest ones (with fewer than 100 infosets). We also compare the convergence rates of CFR-S and CFR-Jr to that of CFR in multi-player, general-sum game instances. As expected, our experiments show that, in many instances, CFR fails to converge to a CCE. For instance, Figure 3, on the left, shows the performance of CFR-Jr, CFR-S (mean plus/minus standard deviation), and CFR over G2-4-DA in terms of accuracy α. CFR performs dramatically worse than CFR-S and CFR-Jr, and it exhibits a non-convergent behavior, with α remaining stuck above 4 · 10^−2.

6 The only other known algorithm to compute a CCE is by Huang and von Stengel [24] (see also [26] for an amended version). However, this algorithm relies on the ellipsoid method, which is inefficient in practice [20].
7 Table 1 only accounts for algorithms with guaranteed convergence to a CCE (recall that CFR is not guaranteed to converge in multi-player, general-sum EFGs). The original version of the CG algorithm computes an optimal CCE. 
For our tests, we modified it to stop when a feasible solution is reached.

| Game | Tree size (#infosets) | CFR-S α=0.05 | CFR-S α=0.005 | CFR-S α=0.0005 | CFR-S swAPX/swOPT | CFR-Jr α=0.05 | CFR-Jr α=0.005 | CFR-Jr α=0.0005 | CFR-Jr swAPX/swOPT | CG |
|---|---|---|---|---|---|---|---|---|---|---|
| K3-6 | 72 | 1.41s | 9h15m | >24h | - | 1.03s | 13.41s | 11m21s | - | 3h47m |
| K3-7 | 84 | 4.22s | 17h11m | >24h | - | 2.35s | 14.33s | 51m27s | - | 14h37m |
| K3-10 | 120 | 22.69s | >24h | >24h | - | 7.21s | 72.78s | 4h11m | - | >24h |
| L3-4 | 1200 | 10m33s | >24h | >24h | - | 1m15s | 6h10s | >24h | - | >24h |
| L3-6 | 2664 | 2h5m | >24h | >24h | - | 2m40s | 11h19m | >24h | - | >24h |
| L3-8 | 4704 | 13h55m | >24h | >24h | - | 20m22s | >24h | >24h | - | >24h |
| G2-4-A⋆ | 4856 | 10m31s | >24h | >24h | 0.979 | 20m23m | 11h4m | >24h | 0.994 | >24h |
| G2-4-DA⋆ | 4856 | 2m1s | 3h28m | 4h17m | 0.918 | 1m36s | 56m6s | >24h | 0.976 | >24h |
| G2-4-DH⋆ | 4856 | 1m19s | 2h7m | 3h28m | 0.918 | 1m51s | 1h5m | >24h | 0.976 | >24h |
| G2-4-AL⋆ | 4856 | 2m3s | 1h33m | 4h20m | 0.919 | 1m48s | 55m43s | >24h | 0.976 | >24h |
| G3-4-A⋆ | 98508 | 1h33m | >24h | >24h | 0.996 | 1h3m | 4h13m | >24h | 0.999 | >24h |
| G3-4-DA⋆ | 98508 | 1h13m | >24h | >24h | 0.987 | 12m18s | 1h50m | >24h | 1.000 | >24h |
| G3-4-DH⋆ | 98508 | 47m33s | 19h40m | >24h | 0.886 | 16m38s | 4h8m | 15h27m | 1.000 | >24h |
| G3-4-AL⋆ | 98508 | 32m34s | 15h32m | 17h30m | 0.692 | 1h21m | 5h2s | >24h | 0.730 | >24h |
| R3-12⋆ | 3071 | 1m44s | 35m38s | 3h8m | 0.907 | 16.94s | 3m19s | 24m6s | 0.897 | >24h |
| R3-15⋆ | 24542 | 21m30s | 4h28m | 7h50m | 0.924 | 3m34s | 14m53s | 3h3m | 0.931 | >24h |

Figure 3: Left: Convergence rate attained in G2-4-DA. Right: Social welfare attained in G2-4-DA.

Social welfare  Table 1 shows, for the general-sum games, the social welfare approximation ratio between the social welfare of the solutions returned by the algorithms (swAPX) and an upper bound on the optimal social welfare (swOPT). In particular, swOPT is the maximum sum of players' utilities, which, while not guaranteed to be achievable by a CCE, is always greater than or equal to the social welfare of an optimal CCE.8 Interestingly, the approximation ratio provided by CFR-Jr is always better than that of CFR-S. Moreover, the social welfare guaranteed by CFR-Jr is always nearly optimal, which makes it a good heuristic to compute optimal CCEs. Reaching a socially good equilibrium is crucial, in practice, to make correlation credible. 
Figure 3, on the right, details the performance of CFR-Jr, CFR-S (mean plus/minus standard deviation), and CFR over G2-4-DA in terms of social welfare approximation ratio. CFR performs worse than the other two algorithms. This shows that not only does CFR fail to converge to a CCE, but it is also a poor heuristic for finding social-welfare-maximizing equilibria in multi-player, general-sum games.

7 Conclusions and future work

In this paper, we proved that finding an optimal (i.e., social-welfare-maximizing) CCE is not in Poly-APX, unless P = NP, in general-sum EFGs with two players and chance or with multiple players. We proposed CFR-Jr as an appealing remedy to the problem of computing correlated strategies in multi-player, general-sum settings, on game instances well beyond toy problems. In the future, it would be interesting to further study how to approximate CCEs in other classes of structured games such as, e.g., polymatrix games and congestion games. Moreover, a CCE strategy profile could be employed as a starting point to approximate tighter solution concepts which admit some form of correlation. This could be the case, e.g., of the TMECor [14], which is used to model collusive behaviors and interactions involving teams. Finally, it would be interesting to further investigate whether it is possible to define regret-minimizing procedures for general EFGs leading to refinements of the CCEs, such as CEs and EFCEs. This begets new challenging problems in the study of how to minimize regret in structured games.

Acknowledgments
We would like to thank Gabriele Farina for his helpful feedback. This work has been partially supported by the Italian MIUR PRIN 2017 Project ALGADIMAR "Algorithms, Games, and Digital Markets".

References
[1] Robert J Aumann. Subjectivity and correlation in randomized strategies. 
Journal of Mathematical Economics, 1(1):67–96, 1974.

[2] Giorgio Ausiello, Pierluigi Crescenzi, Giorgio Gambosi, Viggo Kann, Alberto Marchetti-Spaccamela, and Marco Protasi. Complexity and approximation: Combinatorial optimization problems and their approximability properties. Springer Science & Business Media, 2012.

8 Let us remark that we cannot employ the social welfare value of an optimal CCE, as we would need to compute it with the CG algorithm, which, as shown in Table 1, does not scale to large game instances.

[3] Siddharth Barman and Katrina Ligett. Finding any nontrivial coarse correlated equilibrium is hard. In ACM Conference on Economics and Computation (EC), pages 815–816, 2015.

[4] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.

[5] Avrim Blum and Yishay Mansour. Learning, regret minimization, and equilibria. In Algorithmic Game Theory. Cambridge University Press, 2007.

[6] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, 2015.

[7] Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems (NeurIPS), pages 689–699, 2017.

[8] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.

[9] Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 1829–1836, 2019.

[10] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. 
arXiv preprint arXiv:1811.00164, 2018.

[11] Neil Burch, Matej Moravcik, and Martin Schmid. Revisiting CFR+ and alternating updates. CoRR, abs/1810.11542, 2018.

[12] Andrea Celli, Stefano Coniglio, and Nicola Gatti. Computing optimal ex ante correlated equilibria in two-player sequential games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 909–917, 2019.

[13] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[14] Gabriele Farina, Andrea Celli, Nicola Gatti, and Tuomas Sandholm. Ex ante coordination and collusion in zero-sum multi-player extensive-form games. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[15] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Online convex optimization for sequential decision processes and extensive-form games. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 1917–1925, 2019.

[16] F. Forges. Five legitimate definitions of correlated equilibrium in games with incomplete information. Theory and Decision, 35(3):277–310, Nov 1993.

[17] Dean P Foster and Rakesh V Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1-2):40, 1997.

[18] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. WH Freeman and Company, 1979.

[19] Richard Gibson. Regret minimization in non-zero-sum games with applications to building champion multiplayer computer poker agents. CoRR, abs/1305.0034, 2013.

[20] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.

[21] James Hannan. Approximation to Bayes risk in repeated play. 
Contributions to the Theory of Games, 3:97–139, 1957.

[22] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

[23] Jason Hartline, Vasilis Syrgkanis, and Eva Tardos. No-regret learning in Bayesian games. In Advances in Neural Information Processing Systems (NeurIPS), pages 3061–3069, 2015.

[24] Wan Huang and Bernhard von Stengel. Computing an extensive-form correlated equilibrium in polynomial time. In International Workshop on Internet and Network Economics, pages 506–513. Springer, 2008.

[25] Amir Jafari, Amy Greenwald, David Gondek, and Gunes Ercal. On no-regret learning, fictitious play, and Nash equilibrium. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 226–233, 2001.

[26] Albert Xin Jiang and Kevin Leyton-Brown. Polynomial-time computation of exact correlated equilibrium in compact games. Games and Economic Behavior, 91:347–359, 2015.

[27] Harold W Kuhn. A simplified two-person poker. Contributions to the Theory of Games, 1:97–103, 1950.

[28] H. Moulin and J-P Vial. Strategically zero-sum games: the class of games whose completely mixed equilibria cannot be improved upon. International Journal of Game Theory, 7(3):201–221, 1978.

[29] R. B. Myerson. Multistage games with communication. Econometrica, 54(2):323–358, 1986.

[30] J. Nash. Non-cooperative games. Annals of Mathematics, pages 286–295, 1951.

[31] Christos H Papadimitriou and Tim Roughgarden. Computing correlated equilibria in multi-player games. Journal of the ACM (JACM), 55(3):14, 2008.

[32] Nick Abou Risk and Duane Szafron. Using counterfactual regret minimization to create competitive multiplayer poker agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 159–166, 2010.

[33] Sheldon M Ross. 
Goofspiel—the game of pure strategy. Journal of Applied Probability, 8(3):621–625, 1971.

[34] Tim Roughgarden. Intrinsic robustness of the price of anarchy. In ACM Symposium on Theory of Computing (STOC), pages 513–522, 2009.

[35] Lloyd Shapley. Some topics in two-person games. Advances in Game Theory, 52:1–29, 1964.

[36] Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.

[37] Finnegan Southey, Michael H. Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and D. Chris Rayner. Bayes' bluff: Opponent modelling in poker. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 550–558, 2005.

[38] Oskari Tammelin. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.

[39] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold'em. In International Joint Conferences on Artificial Intelligence (IJCAI), pages 645–652, 2015.

[40] Bernhard von Stengel and Françoise Forges. Extensive-form correlated equilibrium: Definition and computational complexity. Mathematics of Operations Research, 33(4):1002–1022, 2008.

[41] Kevin Waugh. Abstraction in large extensive games. Master's thesis, University of Alberta, 2009.

[42] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), pages 928–936, 2003.

[43] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. 
In Advances in Neural Information Processing Systems (NeurIPS), pages 1729–1736, 2008.