{"title": "Convergence and No-Regret in Multiagent Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 209, "page_last": 216, "abstract": null, "full_text": " Convergence and No-Regret in Multiagent\n Learning\n\n\n\n Michael Bowling\n Department of Computing Science\n University of Alberta\n Edmonton, Alberta\n Canada T6G 2E8\n bowling@cs.ualberta.ca\n\n\n\n\n Abstract\n\n Learning in a multiagent system is a challenging problem due to two key\n factors. First, if other agents are simultaneously learning then the envi-\n ronment is no longer stationary, thus undermining convergence guaran-\n tees. Second, learning is often susceptible to deception, where the other\n agents may be able to exploit a learner's particular dynamics. In the\n worst case, this could result in poorer performance than if the agent was\n not learning at all. These challenges are identifiable in the two most com-\n mon evaluation criteria for multiagent learning algorithms: convergence\n and regret. Algorithms focusing on convergence or regret in isolation\n are numerous. In this paper, we seek to address both criteria in a single\n algorithm by introducing GIGA-WoLF, a learning algorithm for normal-\n form games. We prove the algorithm guarantees at most zero average\n regret, while demonstrating the algorithm converges in many situations\n of self-play. We prove convergence in a limited setting and give empir-\n ical results in a wider variety of situations. These results also suggest\n a third new learning criterion combining convergence and regret, which\n we call negative non-convergence regret (NNR).\n\n\n\n1 Introduction\n\nLearning to select actions to achieve goals in a multiagent setting requires overcoming a\nnumber of key challenges. One of these challenges is the loss of the stationarity assumption\nwhen multiple agents are learning simultaneously. 
Another challenge is guaranteeing that the learner cannot be deceptively exploited by another agent. Both of these challenges distinguish the multiagent learning problem from traditional single-agent learning, and have been gaining attention as multiagent applications continue to proliferate.

In single-agent learning tasks, it is reasonable to assume that the same action from the same state will result in the same distribution over outcomes, both rewards and next states. In other words, the environment is stationary. In a multiagent task with other learning agents, the outcomes of an agent's action will vary with the changing policies of the other agents. Since most of the convergence results in reinforcement learning depend upon the environment being stationary, convergence is often difficult to obtain in multiagent settings.

The desirability of convergence has recently been contested. We offer some brief insight into this debate in the introduction of the extended version of this paper [1].

Equilibrium learners [2, 3, 4] are one method of handling the loss of stationarity. These algorithms learn joint-action values, which are stationary, and in certain circumstances guarantee these values converge to Nash (or correlated) equilibrium values. Using these values, the player's strategy corresponds to the player's component of some Nash (or correlated) equilibrium. This convergence of strategies is guaranteed nearly independently of the actions selected by the other agents, including when other agents play suboptimal responses. Equilibrium learners, therefore, can fail to learn best-response policies even against simple non-learning opponents.1 Best-response learners [5, 6, 7] are an alternative approach that seeks to learn best responses while still considering whether the resulting algorithm converges in some form. 
These approaches usually examine convergence in self-play, and have included both theoretical and experimental results.

The second challenge is the avoidance of exploitation. Since learning strategies dynamically change their action selection over time, it is important to know that this change cannot be exploited by a clever opponent. A deceptive strategy may \"lure\" a dynamic strategy away from a safe choice in order to switch to a strategy where the learner receives much lower reward. For example, Chang and Kaelbling [8] demonstrated that the best-response learner PHC [7] could be exploited by a particular dynamic strategy. One method of measuring whether an algorithm can be exploited is the notion of regret. Regret has been explored both in game theory [9] and computer science [10, 11]. Regret measures how much worse an algorithm performs compared to the best static strategy, with the goal of guaranteeing at most zero average regret, i.e., no-regret, in the limit.

These two challenges result in two completely different criteria for evaluation: convergence and no-regret. In addition, they have almost exclusively been explored in isolation. For example, equilibrium learners can have arbitrarily large average regret. On the other hand, no-regret learners' strategies rarely converge in self-play [12], even in the simplest of games.2 In this paper, we seek to explore these two criteria in a single algorithm for learning in normal-form games. In Section 2 we present a more formal description of the problem and the two criteria. We also examine key related work in applying gradient ascent algorithms to this learning problem. In Section 3 we introduce GIGA-WoLF, an algorithm with both regret and convergence properties. 
The algorithm is followed by theoretical and experimental analyses in Sections 4 and 5, respectively, before concluding.

2 Online Learning in Games

A game in normal form is a tuple (n, A1...n, R1...n), where n is the number of players in the game, Ai is a set of actions available to player i (A = A1 × ... × An), and Ri : A → R is a mapping from joint actions to player i's reward. The problem of learning in a normal-form game is one of repeatedly selecting an action and receiving a reward, with a goal of maximizing average reward against an unknown opponent. If there are two players then it is convenient to write a player's reward function as a |A1| × |A2| matrix. Three example normal-form games are shown in Table 1.

Table 1: Examples of games in normal form (rows are player one's actions, columns player two's).

(a) Matching Pennies (actions H, T):
    R1 = [  1  -1
           -1   1 ],   R2 = -R1

(b) Tricky Game (actions A, B):
    R1 = [ 0  3
           1  2 ],     R2 = [ 3  2
                              0  1 ]

(c) Rock-Paper-Scissors (actions R, P, S):
    R1 = [  0  -1   1
            1   0  -1
           -1   1   0 ],   R2 = -R1

Unless stated otherwise we will assume the learning algorithm is player one. In the context of a particular learning algorithm and a particular opponent, let rt ∈ R^|A1| be the vector of actual rewards that player one would receive at time t for each of its actions. Let xt ∈ PD(A1) be the algorithm's strategy at time t, selected from the set of probability distributions over actions. So, player one's expected payoff at time t is (rt · xt). Let 1a be the probability distribution that assigns probability 1 to action a ∈ A1.

1 This work is not restricted to zero-sum games and our use of the word \"opponent\" refers simply to other players in the game.
2 A notable exception is Hart and Mas-Colell's algorithm, which guarantees the empirical distribution of play converges to that of a correlated equilibrium. Neither strategies nor expected values necessarily converge, though.
Lastly, we will assume the reward for any action is bounded by rmax, and therefore ||rt||² ≤ |A1| r²max.

2.1 Evaluation Criteria

One common evaluation criterion for learning in normal-form games is convergence. There are a number of different forms of convergence that have been examined in the literature. These include, roughly increasing in strength: average reward (i.e., Σt (rt · xt)/T), empirical distribution of actions (i.e., Σt xt/T), expected reward (i.e., (rt · xt)), and strategies (i.e., xt). We focus in this paper on convergence of strategies, as this implies the other three forms of convergence as well. In particular, we will say an algorithm converges against a particular opponent if and only if limt→∞ xt = x*.

The second common evaluation criterion is regret. Total regret3 is the difference between the maximum total reward of any static strategy given the past history of play and the algorithm's total reward,

    RT ≡ max_{a∈A1} Σ_{t=1}^{T} ((rt · 1a) − (rt · xt)).

Average regret is just the asymptotic average of total regret, limT→∞ RT/T. An algorithm has no-regret if and only if the average regret is less than or equal to zero against all opponent strategies. The no-regret property makes a strong claim about the performance of the algorithm: the algorithm's expected average reward is at least as large as the expected average reward any static strategy could have achieved. In other words, the algorithm is performing at least as well as any static strategy.

2.2 Gradient Ascent Learning

Gradient ascent is a simple and common technique for finding parameters that optimize a target function. In the case of learning in games, the parameters represent the player's strategy, and the target function is expected reward. We will examine three recent results evaluating gradient ascent learning algorithms in normal-form games.

Singh, Kearns, and Mansour [6] analyzed infinitesimal gradient ascent (IGA) in two-player, two-action games, e.g., Table 1(a) and (b). 
They examined the resulting strategy trajectories and payoffs in self-play, demonstrating that strategies do not always converge to a Nash equilibrium, depending on the game. They proved, instead, that average payoffs converge (a weaker form of convergence) to the payoffs of the equilibrium. WoLF-IGA [7] extended this work to the stronger form of convergence, namely convergence of strategies, through the use of a variable learning rate. Using the WoLF (\"Win or Learn Fast\") principle, the algorithm chooses a larger step size when the current strategy has less expected payoff than the equilibrium strategy. This results in strategies converging to the Nash equilibrium in a variety of games, including all two-player, two-action games.4 Zinkevich [11] looked at gradient ascent using the evaluation criterion of regret. He first extended IGA beyond two-player, two-action games. His algorithm, GIGA (Generalized Infinitesimal Gradient Ascent), updates strategies using an unconstrained gradient, and then projects the resulting strategy vector back into the simplex of legal probability distributions,

    xt+1 = P(xt + ηt rt)   where   P(x) = argmin_{x′ ∈ PD(A1)} ||x − x′||,   (1)

where ηt is the step size at time t, and ||·|| is the standard L2 norm. He proved GIGA's total regret is bounded by

    RT ≤ √T + |A| r²max (√T − 1/2).   (2)

Since GIGA is identical to IGA in two-player, two-action games, we also have that GIGA achieves the weak form of convergence in this subclass of games.

3 Our analysis focuses on expectations of regret (total and average), similar to [10, 11]. Note, though, that for any self-oblivious behavior, including GIGA-WoLF, average regret of at most zero in expectation implies universal consistency, i.e., regret of at most ε with high probability [11].
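Equation (1)'s update-then-project step can be made concrete. The following is a minimal sketch, not the paper's implementation; it assumes the standard sort-based algorithm for the Euclidean projection onto the probability simplex (the paper defines P only as an argmin):

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of x onto the probability simplex
    (the argmin in Equation (1)), via the standard sort-based method."""
    u = np.sort(x)[::-1]                     # sorted descending
    css = np.cumsum(u)
    # largest index rho with u[rho] * (rho+1) > css[rho] - 1
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(x - theta, 0.0)

def giga_step(x, r, eta):
    """One GIGA update: unconstrained gradient step, then projection."""
    return project_simplex(x + eta * r)
```

Starting from the uniform strategy, a positive reward on one action shifts probability toward that action while the projection keeps xt+1 a valid distribution.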
It is also true, though, that GIGA's strategies do not converge in self-play even in simple games like Matching Pennies.

In the next section, we present an algorithm that simultaneously achieves GIGA's no-regret result and part of WoLF-IGA's convergence result. We first present the algorithm and then analyze these criteria both theoretically and experimentally.

3 GIGA-WoLF

GIGA-WoLF is a gradient-based learning algorithm that internally keeps track of two different gradient-updated strategies, xt and zt. The algorithm chooses actions according to the distribution xt, but updates both xt and zt after each iteration. The update rules consist of three steps.

    (1)  x̂t+1 = P(xt + ηt rt)
    (2)  zt+1 = P(zt + ηt rt / 3)
         δt+1 = min( 1, ||zt+1 − zt|| / ||zt+1 − x̂t+1|| )
    (3)  xt+1 = x̂t+1 + δt+1 (zt+1 − x̂t+1)

Step (1) updates xt according to GIGA's standard gradient update and stores the result as x̂t+1. Step (2) updates zt in the same manner, but with a smaller step size. Step (3) makes a final adjustment on xt+1 by moving it toward zt+1. The magnitude of this adjustment is limited by the change in zt that occurred in step (2).

A key factor in understanding this algorithm is the observation that a strategy b receives higher reward than a strategy a if and only if the gradient at a is in the direction of b (i.e., rt · (b − a) > 0). Therefore, the step (3) adjustment is in the direction of the gradient if and only if zt received higher reward than xt. Notice also that, as long as xt is not near the boundary, the change due to step (3) is of lower magnitude than the change due

4 WoLF-IGA may, in fact, be a limited variant of the extragradient method [13] for variational inequality problems. The extragradient algorithm is guaranteed to converge to a Nash equilibrium in self-play for all zero-sum games. 
Like WoLF-IGA, though, it does not have any known regret guarantees, and more importantly it requires the other players' payoffs to be known.

to step (1). Hence, the combination of steps (1) and (3) results in a change with two key properties. First, the change is in the direction of positive gradient. Second, the magnitude of the change is larger when zt received higher reward than xt. So, we can interpret the update rule as a variation on the WoLF principle of \"win or learn fast\", i.e., the algorithm is learning faster if and only if its strategy x is losing to strategy z. GIGA-WoLF is a major improvement on the original presentation of WoLF-IGA, where winning was determined by comparison with an equilibrium strategy that was assumed to be given. Not only is less knowledge required, but the use of a GIGA-updated strategy to determine winning will allow us to prove guarantees on the algorithm's regret.

In the next section we present a theoretical examination of GIGA-WoLF's regret in n-player, n-action games, along with a limited guarantee of convergence in two-player, two-action games. In Section 5, we give experimental results of learning using GIGA-WoLF, demonstrating that convergence extends beyond the theoretical analysis presented.

4 Theoretical Analysis

We begin by examining GIGA-WoLF's regret against an unknown opponent strategy. We will prove the following bound on average regret.

Theorem 1 If ηt = 1/√t, the regret of GIGA-WoLF is

    RT ≤ 2√T + |A| r²max (2√T − 1).

Therefore, limT→∞ RT/T ≤ 0, hence GIGA-WoLF has no-regret.

Proof. We begin with a brief overview of the proof. We will find a bound on the regret of the strategy xt with respect to the dynamic strategy zt. Since zt is unmodified GIGA, we already have a bound on the regret of zt with respect to any static strategy. 
Hence, we can bound the regret of xt with respect to any static strategy.

We start by examining the regret of xt with respect to zt, using a similar analysis to Zinkevich's [11]. Let ρxz_t refer to the difference in expected payoff between zt and xt at time t, and Rxz_T be the sum of these differences, i.e., the total regret of xt with respect to zt:

    ρxz_t ≡ rt · (zt − xt),        Rxz_T ≡ Σ_{t=1}^{T} ρxz_t.

We will use the following potential function, Φt ≡ (xt − zt)²/2ηt. We can examine how this potential changes with each step of the update. ΔΦ¹t, ΔΦ²t, and ΔΦ³t refer to the change in potential caused by steps (1), (2), and (3), respectively. ΔΦ⁴t refers to the change in potential caused by the learning rate change from ηt−1 to ηt. This gives us the following equations for the potential change.

    ΔΦ¹t+1 = 1/(2ηt) ((x̂t+1 − zt)² − (xt − zt)²)
    ΔΦ²t+1 = 1/(2ηt) ((x̂t+1 − zt+1)² − (x̂t+1 − zt)²)
    ΔΦ³t+1 = 1/(2ηt) ((xt+1 − zt+1)² − (x̂t+1 − zt+1)²)
    ΔΦ⁴t+1 = (1/(2ηt+1) − 1/(2ηt)) (xt+1 − zt+1)²
    ΔΦt+1 = ΔΦ¹t+1 + ΔΦ²t+1 + ΔΦ³t+1 + ΔΦ⁴t+1

Notice that if δt+1 = 1 then xt+1 = zt+1. Hence Φt+1 = 0, and ΔΦ²t+1 + ΔΦ³t+1 ≤ 0. If δt+1 < 1, then ||xt+1 − x̂t+1|| = ||zt+1 − zt||. Notice also that in this case xt+1 is co-linear with and between x̂t+1 and zt+1. So,

    ||x̂t+1 − zt+1|| = ||x̂t+1 − xt+1|| + ||xt+1 − zt+1||
                    = ||zt+1 − zt|| + ||xt+1 − zt+1||.

We can bound the left-hand side with the triangle inequality,

    ||x̂t+1 − zt+1|| ≤ ||x̂t+1 − zt|| + ||zt − zt+1||,

so that ||xt+1 − zt+1|| ≤ ||x̂t+1 − zt||. So regardless of δt+1, ΔΦ²t+1 + ΔΦ³t+1 ≤ 0. Hence, ΔΦt+1 ≤ ΔΦ¹t+1 + ΔΦ⁴t+1.

We will now use this bound on the change in the potential to bound the regret of xt with respect to zt. We know from Zinkevich's analysis that

    (x̂t+1 − zt)² − (xt − zt)² ≤ −2ηt rt · (zt − xt) + η²t r²t.

Therefore,

    ρxz_t ≤ −1/(2ηt) ((x̂t+1 − zt)² − (xt − zt)² − η²t r²t)
          = −ΔΦ¹t+1 + (ηt/2) r²t
          ≤ −ΔΦt+1 + ΔΦ⁴t+1 + (ηt/2) r²t.

Since we assume rewards are bounded by rmax, we can bound r²t by |A| r²max. 
Summing up regret and using the fact that ηt = 1/√t, we get the following bound:

    Rxz_T ≤ Σ_{t=1}^{T} ( −ΔΦt + ΔΦ⁴t ) + (|A| r²max / 2) Σ_{t=1}^{T} ηt
          ≤ √T + |A| r²max (√T − 1/2).

We know that GIGA's regret with respect to any static strategy is bounded by the same value (see Inequality 2). Hence,

    RT ≤ 2√T + |A| r²max (2√T − 1),

as claimed.

The second criterion we want to consider is convergence. As with IGA, WoLF-IGA, and other algorithms, our theoretical analysis will be limited to two-player, two-action general-sum games. We further limit ourselves to the situation of GIGA-WoLF playing \"against\" GIGA. These restrictions are a limitation of the proof method, which uses a case-by-case analysis that is combinatorially impractical for the case of self-play. This is not necessarily a limitation on GIGA-WoLF's convergence. This theorem, along with the empirical results we present in Section 5, gives a strong sense of GIGA-WoLF's convergence properties. The full proof can be found in [1].

Theorem 2 In a two-player, two-action repeated game, if one player follows the GIGA-WoLF algorithm and the other follows the GIGA algorithm, then their strategies will converge to a Nash equilibrium.

5 Experimental Analysis

We have presented here two theoretical properties of GIGA-WoLF relating to guarantees on both regret and convergence. There have also been extensive experimental results performed with GIGA-WoLF in a variety of normal-form games [1]. We summarize the results here. The purpose of these experiments was to demonstrate the theoretical results from the previous section, as well as to explore the extent to which the results (convergence, in particular) can be generalized. 
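The three-step update of Section 3 translates directly into code. The sketch below is illustrative, not the paper's implementation; it assumes the standard sort-based Euclidean projection onto the simplex for the operator P, which the paper defines only abstractly as an argmin:

```python
import numpy as np

def project_simplex(x):
    # Standard sort-based Euclidean projection onto the simplex (the operator P).
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(x - theta, 0.0)

def giga_wolf_step(x, z, r, eta):
    """One GIGA-WoLF update (steps 1-3) given reward vector r and step size eta."""
    x_hat = project_simplex(x + eta * r)        # step (1): GIGA update of x
    z_new = project_simplex(z + eta * r / 3.0)  # step (2): slower update of z
    denom = np.linalg.norm(z_new - x_hat)
    delta = 1.0 if denom == 0 else min(1.0, np.linalg.norm(z_new - z) / denom)
    x_new = x_hat + delta * (z_new - x_hat)     # step (3): move x toward z
    return x_new, z_new
```

Calling giga_wolf_step repeatedly, with each player's rt computed from the game matrix and the opponent's current strategy, produces the kind of self-play trajectories discussed below.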
In that vein, we examined the same suite of normal-form games used in experiments with WoLF-PHC, the practical variant of WoLF-IGA [7].

One of the requirements of GIGA-WoLF (and GIGA) is knowledge of the entire reward vector rt, which requires knowledge of the game and observation of the opponent's action. In practical situations, one or both of these are unlikely to be available. Instead, only the reward of the selected action is likely to be observable. We relaxed this requirement in these experiments by providing GIGA-WoLF (and GIGA) with only estimates of the gradient from stochastic approximation. In particular, after selecting action a and receiving reward r̂a, we update the current estimate of action a's component of the reward vector,

    rt+1 = rt + αt (r̂a − 1a · rt) 1a,

where αt is the learning rate. This is a standard method of estimation commonly used in reinforcement learning (e.g., Q-learning).

For almost all of the games explored, including two-player, two-action games as well as n-action zero-sum games, GIGA-WoLF's strategies converged in self-play to equilibrium strategies of the game. GIGA's strategies, on the other hand, failed to converge in self-play over the same suite of games. These results are nearly identical to the PHC and WoLF-PHC experiments over the same games. A prototypical example of these results is provided in Figure 1(a) and (b), showing strategy trajectories while learning in Rock-Paper-Scissors. GIGA's strategies do not converge, while GIGA-WoLF's strategies do converge. GIGA-WoLF also played directly against GIGA in this game, resulting in convergence, but with a curious twist. The resulting expected and average payoffs are shown in Figure 1(c). Since both are no-regret learners, average payoffs are guaranteed to go to zero, but the short-term payoff highly favors GIGA-WoLF. 
This result raises an interesting question about the relative short-term performance of no-regret learning algorithms, which needs to be explored further.

[Figure 1: strategy-trajectory plots with axes Pr(Rock) vs. Pr(Paper) for (a) GIGA and (b) GIGA-WoLF, and (c) reward over iterations for GIGA v. GIGA-WoLF.]

Figure 1: Trajectories of joint strategies in Rock-Paper-Scissors when both players use GIGA (a) or GIGA-WoLF (b). Also shown (c) are the expected and average payoffs of the players when GIGA and GIGA-WoLF play against each other.

GIGA-WoLF did not lead to convergence in all of the explored games. The \"problematic\" Shapley's game, on which many similarly convergent algorithms fail, also resulted in non-convergence for GIGA-WoLF. On the other hand, this game has the interesting property that both players, when using GIGA-WoLF (or GIGA), actually achieve negative regret. In other words, the algorithms are outperforming any static strategy to which they could have converged. This suggests a new desirable property for future multiagent (or online) learning algorithms: negative non-convergence regret (NNR). An algorithm has NNR if it satisfies the no-regret property and either (i) achieves negative regret or (ii) its strategy converges. This property combines the criteria of regret and convergence, and GIGA-WoLF is a natural candidate for achieving this compelling result.

6 Conclusion

We introduced GIGA-WoLF, a new gradient-based algorithm that we believe is the first to simultaneously address two criteria: no-regret and convergence. We proved GIGA-WoLF has no-regret. We also proved that in a small class of normal-form games, GIGA-WoLF's strategy when played against GIGA will converge to a Nash equilibrium. 
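The stochastic reward-vector estimate used in the experiments of Section 5 can be sketched as follows (a minimal sketch; the function name and argument names are illustrative, not from the paper):

```python
import numpy as np

def update_reward_estimate(r_est, a, reward, alpha):
    """Move action a's component of the estimated reward vector toward the
    observed reward, as in Section 5: r_{t+1} = r_t + alpha*(r_hat_a - r_t[a])*1_a."""
    r_new = r_est.copy()
    r_new[a] += alpha * (reward - r_est[a])
    return r_new
```

The true reward vector rt is then replaced by this running estimate in the GIGA and GIGA-WoLF updates.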
We summarized experimental results of GIGA-WoLF playing in a variety of zero-sum and general-sum games. These experiments verified our theoretical results and exposed two interesting phenomena that deserve further study: the short-term performance of no-regret learners and the new desideratum of negative non-convergence regret. We expect GIGA-WoLF and these results to be the foundation for further understanding of the connections between the regret and convergence criteria.

References

[1] Michael Bowling. Convergence and no-regret in multiagent learning. Technical Report TR04-11, Department of Computing Science, University of Alberta, 2004.

[2] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163, 1994.

[3] Junling Hu and Michael P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 242–250, 1998.

[4] Amy Greenwald and Keith Hall. Correlated Q-learning. In Proceedings of the AAAI Spring Symposium Workshop on Collaborative Learning Agents, 2002.

[5] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 746–752, 1998.

[6] Satinder Singh, Michael Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548, 2000.

[7] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215–250, 2002.

[8] Yu-Han Chang and Leslie Pack Kaelbling. Playing is believing: the role of beliefs in multi-agent learning. 
In Advances in Neural Information Processing Systems 14, 2001.

[9] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.

[10] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In 36th Annual Symposium on Foundations of Computer Science, pages 322–331, 1995.

[11] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, pages 928–936, 2003.

[12] Amir Jafari, Amy Greenwald, David Gondek, and Gunes Ercal. On no-regret learning, fictitious play, and Nash equilibrium. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 226–233, 2001.

[13] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
", "award": [], "sourceid": 2673, "authors": [{"given_name": "Michael", "family_name": "Bowling", "institution": null}]}