{"title": "Re-evaluating evaluation", "book": "Advances in Neural Information Processing Systems", "page_first": 3268, "page_last": 3279, "abstract": "Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation -- since there is no harm (computational cost aside) from including all available tasks and agents.", "full_text": "Re-evaluating Evaluation\n\nDavid Balduzzi*\n\nKarl Tuyls*\n\nJulien Perolat*\n\nThore Graepel*\n\nAbstract\n\n\u201cWhat we observe is not nature itself, but nature exposed to our method of\nquestioning.\u201d \u2013 Werner Heisenberg\nProgress in machine learning is measured by careful evaluation on problems of\noutstanding common interest. However, the proliferation of benchmark suites\nand environments, adversarial attacks, and other complications has diluted the\nbasic evaluation model by overwhelming researchers with choices. Deliberate\nor accidental cherry picking is increasingly likely, and designing well-balanced\nevaluation suites requires increasing effort. In this paper we take a step back\nand propose Nash averaging. 
The approach builds on a detailed analysis of the\nalgebraic structure of evaluation in two basic scenarios: agent-vs-agent and\nagent-vs-task. The key strength of Nash averaging is that it automatically adapts to\nredundancies in evaluation data, so that results are not biased by the incorporation\nof easy tasks or weak agents. Nash averaging thus encourages maximally inclusive\nevaluation \u2013 since there is no harm (computational cost aside) from including all\navailable tasks and agents.\n\n1 Introduction\n\nEvaluation is a key driver of progress in machine learning, with e.g. ImageNet [1] and the Arcade\nLearning Environment [2] enabling subsequent breakthroughs in supervised and reinforcement\nlearning [3, 4]. However, developing evaluations has received little systematic attention compared to\ndeveloping algorithms. Immense amounts of compute are continually expended smashing algorithms\nand tasks together \u2013 but the results are almost never used to evaluate and optimize evaluations. In a\nstriking asymmetry, results are almost exclusively applied to evaluate and optimize algorithms.\nThe classic train-and-test paradigm on common datasets, which has served the community well [5], is\nreaching its limits. Three examples suffice. Adversarial attacks have complicated evaluation, raising\nquestions about which attacks to test against [6\u20139]. Training agents far beyond human performance\nwith self-play means they can only really be evaluated against each other [10, 11]. The desire to build\nincreasingly general-purpose agents has led to a proliferation of environments: Mujoco, DM Lab,\nOpenAI Gym, Psychlab and others [12\u201315].\nIn this paper we pause to ask, and partially answer, some basic questions about evaluation: Q1. Do\ntasks test what we think they test? Q2. When is a task redundant? Q3. Which tasks (and agents)\nmatter the most? Q4. 
How should evaluations be evaluated?\nWe consider two scenarios: agent vs task (AvT), where algorithms are evaluated on suites of datasets\nor environments; and agent vs agent (AvA), where agents compete directly as in Go and StarCraft.\nOur goal is to treat tasks and agents symmetrically \u2013 with a view towards, ultimately, co-optimizing\nagents and evaluations. From this perspective AvA, where the task is (beating) another agent, is\nespecially interesting. Performance in AvA is often quantified using Elo ratings [16] or the closely\nrelated TrueSkill [17]. There are two main problems with Elo. Firstly, Elo bakes in the assumption\nthat relative skill is transitive; but Elo is meaningless \u2013 it has no predictive power \u2013 in cyclic games\nlike rock-paper-scissors. Intransitivity has been linked to biodiversity in ecology, and may be useful\nwhen evolving populations of agents [18\u201321]. Secondly, an agent\u2019s Elo rating can be inflated by\ninstantiating many copies of an agent it beats (or conversely). This can cause problems when Elo\nguides hyper-optimization methods like population-based training [22]. Similarly, the most important\ndecision when constructing a task-suite is which tasks to include. It is easy, and all too common, to\nbias task-suites in favor of particular agents or algorithms.\n\n*DeepMind. Email: { dbalduzzi | karltuyls | perolat | thore }@google.com\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1.1 Overview\n\nSection 2 presents background information on Elo and tools for working with antisymmetric matrices,\nsuch as the Schur decomposition and combinatorial Hodge theory. 
A major theme underlying the\npaper is that the fundamental algebraic structure of tournaments and evaluation is antisymmetric [23].\nTechniques specific to antisymmetric matrices are less familiar to the machine learning community\nthan approaches like PCA that apply to symmetric matrices and are typically correlation-based.\nSection 3 presents a unified approach to representing evaluation data, where agents and tasks are\ntreated symmetrically. A basic application of the approach results in our first contribution: a\nmultidimensional Elo rating (mElo) that handles cyclic interactions. We also sketch how the Schur\ndecomposition can uncover latent skills and tasks, providing a partial answer to Q1. We illustrate\nmElo on the domain of training an AlphaGo agent [24].\nThe second contribution of the paper is Nash averaging, an evaluation method that is invariant\nto redundant tasks and agents (see section 4). The basic idea is to play a meta-game on evaluation\ndata [25]. The meta-game has a unique maximum entropy Nash equilibrium. The key insight of the\npaper is that the maxent Nash adapts automatically to the presence of redundant tasks and agents. The\nmaxent Nash distribution thus provides a principled answer to Q2 and Q3: which tasks and agents do\nand do not matter is determined by a meta-game. Finally, expected difficulty of tasks under the Nash\ndistribution on agents yields a partial answer to Q4. The paper concludes by taking a second look at\nthe performance of agents on Atari. We find that, under Nash averaging, human performance ties\nwith the best agents, suggesting better-than-human performance has not yet been achieved.\n\n1.2 Related work\n\nLegg and Hutter developed a definition of intelligence which, informally, states \u201cintelligence measures\nan agent\u2019s ability to achieve goals in a wide range of environments\u201d [26, 27]. 
Formally, they consider\nall computable tasks weighted by algorithmic complexity [28\u201330]. Besides being incomputable, the\ndistribution places (perhaps overly) heavy weight on the simplest tasks.\nA comprehensive study of performance metrics for machine learning and AI can be found in [31\u201335].\nThere is a long history of psychometric evaluation in humans, some of which has been applied\nin artificial intelligence [36\u201338]. Bradley-Terry models provide a general framework for pairwise\ncomparison [39]. Researchers have recently taken a second look at the Arcade Learning Environment [2]\nand introduced new performance metrics [40]. However, the approach is quite particular. Recent\nwork using agents to evaluate games has somewhat overlapping motivation with this paper [41\u201345].\nItem response theory is an alternative, and likely complementary, approach to using agents to evaluate\ntasks [46] that has recently been applied to study the performance of agents on the Arcade Learning\nEnvironment [47].\nOur approach draws heavily on work applying combinatorial Hodge theory to statistical ranking [48]\nand game theory [49\u201351]. We also draw on empirical game theory [52, 53], by using a meta-game to\n\u201cevaluate evaluations\u201d, see section 4. Empirical game theory has been applied to domains like poker\nand continuous double auctions, and has recently been extended to asymmetric games [54\u201358]. von\nNeumann winners in the dueling bandit setting and NE-regret are related to Nash averaging [59\u201362].\n\n2 Preliminaries\n\nNotation. Vectors are column vectors. $\\mathbf{0}$ and $\\mathbf{1}$ denote the constant vectors of zeros and ones\nrespectively. We sometimes use subscripts to indicate dimensions of vectors and matrices, e.g. $r_{n \\times 1}$\nor $S_{m \\times n}$, and sometimes their entries, e.g. $r_i$ or $S_{ij}$; no confusion should result. The unit vector with\na 1 in coordinate $i$ is $e_i$. 
Proofs and code are in the appendix.\n\n2.1 The Elo rating system\nSuppose $n$ agents play a series of pairwise matches against each other. Elo assigns a rating $r_i$ to each\nplayer $i \\in [n]$ based on their wins and losses, which we represent as an $n$-vector $r$. The predicted\nprobability of $i$ beating $j$ given their Elo ratings is\n\n$$\\hat{p}_{ij} := \\frac{10^{r_i/400}}{10^{r_i/400} + 10^{r_j/400}} = \\sigma(\\alpha r_i - \\alpha r_j), \\quad \\text{where } \\sigma(x) = \\frac{1}{1 + e^{-x}} \\text{ and } \\alpha = \\frac{\\log(10)}{400}.$$\n\nThe constant $\\alpha$ is not important in what follows, so we pretend $\\alpha = 1$. Observe that only the\ndifference between Elo ratings affects win-loss predictions. We therefore impose that Elo ratings sum\nto zero, $r^\\top \\mathbf{1} = 0$, without loss of generality. Define the loss,\n\n$$\\ell_{\\text{Elo}}(p_{ij}, \\hat{p}_{ij}) = -p_{ij} \\log \\hat{p}_{ij} - (1 - p_{ij}) \\log(1 - \\hat{p}_{ij}), \\quad \\text{where } \\hat{p}_{ij} = \\sigma(r_i - r_j)$$\n\nand $p_{ij}$ is the true probability of $i$ beating $j$. Suppose the $t$-th match pits player $i$ against $j$, with\noutcome $S^t_{ij} = 1$ if $i$ wins and $S^t_{ij} = 0$ if $i$ loses. Online gradient descent on $\\ell_{\\text{Elo}}$ obtains\n\n$$r_i^{t+1} \\leftarrow r_i^t - \\eta \\cdot \\nabla_{r_i} \\ell_{\\text{Elo}}(S^t_{ij}, \\hat{p}^t_{ij}) = r_i^t + \\eta \\cdot (S^t_{ij} - \\hat{p}^t_{ij}).$$\n\nChoosing learning rate $\\eta = 16$ or $32$ recovers the updates introduced by Arpad Elo in [16].\nThe win-loss probabilities predicted by Elo ratings can fail in simple cases. For example, rock, paper\nand scissors will all receive the same Elo ratings. Elo\u2019s predictions are $\\hat{p}_{ij} = \\frac{1}{2}$ for all $i, j$ \u2013 and so\nElo has no predictive power for any given pair of players (e.g. paper beats rock with $p = 1$).\nWhat are the Elo update\u2019s fixed points? Suppose we batch matches to obtain empirical estimates\nof the probability of player $i$ beating $j$: $\\bar{p}_{ij} = \\frac{\\sum_n S^n_{ij}}{N_{ij}}$, where $N_{ij}$ is the number of matches played between $i$ and $j$. As the number of matches approaches infinity,\nthe empirical estimates approach the true probabilities $p_{ij}$.\nProposition 1. Elo ratings are at a stationary point under batch updates iff the matrices of empirical\nprobabilities and predicted probabilities have the same row-sums (or, equivalently, the same column-sums):\n\n$$\\nabla_{r_i} \\Big[ \\sum_j \\ell_{\\text{Elo}}(\\bar{p}_{ij}, \\hat{p}_{ij}) \\Big] = 0 \\;\\; \\forall i \\quad \\text{iff} \\quad \\sum_j \\bar{p}_{ij} = \\sum_j \\hat{p}_{ij} \\;\\; \\forall i.$$\n\nMany different win-loss probability matrices result in identical Elo ratings. The situation is analogous\nto how many different joint probability distributions can have the same marginals. We return to this\ntopic in section 3.1.\n\n2.2 Antisymmetric matrices\nWe recall some basic facts about antisymmetric matrices. Matrix $A$ is antisymmetric if $A + A^\\top = 0$.\nAntisymmetric matrices have even rank and imaginary eigenvalues $\\{\\pm i \\lambda_j\\}_{j=1}^{\\mathrm{rank}(A)/2}$. Any antisymmetric\nmatrix $A$ admits a real Schur decomposition:\n\n$$A_{n \\times n} = Q_{n \\times n} \\cdot \\Lambda_{n \\times n} \\cdot Q^\\top_{n \\times n},$$\n\nwhere $Q$ is orthogonal and $\\Lambda$ consists of zeros except for $(2 \\times 2)$ diagonal blocks of the form\n\n$$\\begin{pmatrix} 0 & \\lambda_j \\\\\\\\ -\\lambda_j & 0 \\end{pmatrix}.$$\n\nThe entries of $\\Lambda$ are real numbers, found by multiplying the eigenvalues of $A$ by $i = \\sqrt{-1}$.\nProposition 2. Given matrix $S_{m \\times n}$ with rank $r$ and singular value decomposition $U_{m \\times r} D_{r \\times r} V^\\top_{r \\times n}$,\nconstruct the antisymmetric matrix\n\n$$A_{(m+n) \\times (m+n)} = \\begin{pmatrix} 0_{m \\times m} & S_{m \\times n} \\\\\\\\ -S^\\top_{n \\times m} & 0_{n \\times n} \\end{pmatrix}.$$\n\nThen the thin Schur decomposition of $A$ is $Q_{(m+n) \\times 2r} \\Lambda_{2r \\times 2r} Q^\\top_{2r \\times (m+n)}$, where the nonzero pairs\nin $\\Lambda_{2r \\times 2r}$ are $\\pm$ the singular values in $D_{r \\times r}$ and\n\n$$Q_{(m+n) \\times 2r} = \\begin{pmatrix} u_1 & 0_{m \\times 1} & \\cdots & u_r & 0_{m \\times 1} \\\\\\\\ 0_{n \\times 1} & v_1 & \\cdots & 0_{n \\times 1} & v_r \\end{pmatrix}.$$\n\nCombinatorial Hodge theory is developed by analogy with differential geometry, see [48\u201351].\nConsider a fully connected graph with vertex set $[n] = \\{1, \\ldots, n\\}$. Assign a flow $A_{ij} \\in \\mathbb{R}$ to each\nedge of the graph. 
The flow in the opposite direction $ji$ is $A_{ji} = -A_{ij}$, so flows are just $(n \\times n)$\nantisymmetric matrices. The flow on a graph is analogous to a vector field on a manifold.\nThe combinatorial gradient of an $n$-vector $r$ is the flow $\\mathrm{grad}(r) := r \\mathbf{1}^\\top - \\mathbf{1} r^\\top$. Flow $A$ is a gradient\nflow if $A = \\mathrm{grad}(r)$ for some $r$, or equivalently if $A_{ij} = r_i - r_j$ for all $i, j$. The divergence of a\nflow is the $n$-vector $\\mathrm{div}(A) := \\frac{1}{n} A \\cdot \\mathbf{1}$. The divergence measures the contribution to the flow of each\nvertex, considered as a source. The curl of a flow is the three-tensor $\\mathrm{curl}(A)_{ijk} = A_{ij} + A_{jk} - A_{ik}$.\nFinally, the rotation is $\\mathrm{rot}(A)_{ij} = \\frac{1}{n} \\sum_{k=1}^{n} \\mathrm{curl}(A)_{ijk}$.\nTheorem (Hodge decomposition, [48]). (i) $\\mathrm{div} \\circ \\mathrm{grad}(r) = r$ for any $r$ satisfying $r^\\top \\mathbf{1} = 0$.\n(ii) $\\mathrm{div} \\circ \\mathrm{rot}(A) = 0_{n \\times 1}$ for any flow $A$. (iii) $\\mathrm{rot} \\circ \\mathrm{grad}(r) = 0_{n \\times n}$ for any vector $r$.\n(iv) The vector space of antisymmetric matrices admits an orthogonal decomposition\n\n$$\\{\\text{flows}\\} = \\{\\text{antisymmetric matrices}\\} = \\mathrm{im}(\\mathrm{grad}) \\oplus \\mathrm{im}(\\mathrm{rot})$$\n\nwith respect to the standard inner product $\\langle A, B \\rangle = \\sum_{ij} A_{ij} B_{ij}$. Concretely, any antisymmetric\nmatrix decomposes as\n\n$$A = \\text{transitive component} + \\text{cyclic component} = \\mathrm{grad}(r) + \\mathrm{rot}(A), \\quad \\text{where } r = \\mathrm{div}(A).$$\n\nSneak peek. The divergence recovers Elo ratings or just plain average performance depending on\nthe scenario. The Hodge decomposition separates transitive (captured by averages or Elo) from\ncyclic interactions (rock-paper-scissors), and explains when Elo ratings make sense. The Schur\ndecomposition is a window into the latent skills and tasks not accounted for by Elo and averages.\n\n3 On the algebraic structure of evaluation\n\nThe Schur decomposition and combinatorial Hodge theory provide a unified framework for analyzing\nevaluation data in the AvA and AvT scenarios. 
In this section we provide some basic tools and present\na multidimensional extension of Elo that handles cyclic interactions.\n\n3.1 Agents vs agents (AvA)\nIn AvA, results are collated into a matrix of win-loss probabilities based on relative frequencies.\nConstruct $A = \\mathrm{logit}\\, P$ with $A_{ij} := \\log \\frac{p_{ij}}{1 - p_{ij}}$. Matrix $A$ is antisymmetric since $p_{ij} + p_{ji} = 1$.\nWhen can Elo correctly predict win-loss probabilities? The answer is simple in logit space:\nProposition 3. (i) If probabilities $P$ are generated by Elo ratings $r$ then the divergence of its logit is\n$r$. That is,\n\n$$\\text{if } p_{ij} = \\sigma(r_i - r_j) \\;\\; \\forall i, j \\quad \\text{then} \\quad \\mathrm{div}(\\mathrm{logit}\\, P) = \\Big( \\frac{1}{n} \\sum_{j=1}^{n} (r_i - r_j) \\Big)_{i=1}^{n} = r.$$\n\n(ii) There is an Elo rating that generates probabilities $P$ iff $\\mathrm{curl}(\\mathrm{logit}\\, P) = 0$. Equivalently, iff\n\n$$\\log \\frac{p_{ij}}{p_{ji}} + \\log \\frac{p_{jk}}{p_{kj}} + \\log \\frac{p_{ki}}{p_{ik}} = 0 \\quad \\text{for all } i, j, k.$$\n\nElo is, essentially, a uniform average in logit space. Elo\u2019s predictive failures are due to the cyclic\ncomponent $\\tilde{A} := \\mathrm{rot}(\\mathrm{logit}\\, P)$ that uniform averaging ignores.\nMultidimensional Elo (mElo$_{2k}$). Elo ratings bake in the assumption that relative skill is transitive.\nHowever, there is no single dominant strategy in games like rock-paper-scissors or (arguably)\nStarCraft. Rating systems that can handle intransitive abilities are therefore necessary. An obvious\napproach is to learn a feature vector $w$ and a rating vector $r_i$ per player, and predict $\\hat{p}_{ij} = \\sigma(r_i^\\top w - r_j^\\top w)$.\nUnfortunately, this reduces to the standard Elo rating since $r_i^\\top w$ is a scalar.\nHandling intransitive abilities requires learning an approximation to the cyclic component $\\tilde{A}$.\nCombining the Schur and Hodge decompositions allows us to construct low-rank approximations that extend\nElo. Note that antisymmetric matrices have even rank. Consider\n\n$$A_{n \\times n} = \\mathrm{grad}(r) + \\tilde{A} \\approx \\mathrm{grad}(r) + C^\\top \\begin{pmatrix} 0 & 1 & \\\\\\\\ -1 & 0 & \\\\\\\\ & & \\ddots \\end{pmatrix} C =: \\mathrm{grad}(r) + C^\\top_{n \\times 2k} \\Omega_{2k \\times 2k} C_{2k \\times n},$$\n\nwhere the rows of $C$ are orthogonal to each other, to $r$, and to $\\mathbf{1}$. The larger $2k$, the better the\napproximation. Let mElo$_{2k}$ assign each player Elo rating $r_i$ and $2k$-dimensional vector $c_i$. Vanilla\nElo uses $2k = 0$. The mElo$_{2k}$ win-loss prediction is\n\n$$\\text{mElo}_{2k}: \\quad \\hat{p}_{ij} = \\sigma\\big( r_i - r_j + c_i^\\top \\cdot \\Omega_{2k \\times 2k} \\cdot c_j \\big), \\quad \\text{where } \\Omega_{2k \\times 2k} = \\sum_{i=1}^{k} \\big( e_{2i-1} e_{2i}^\\top - e_{2i} e_{2i-1}^\\top \\big).$$\n\nOnline updates can be computed by gradient descent, see section E, with orthogonality enforced.\n\n3.2 Application: predicting win-loss probabilities in Go\nElo ratings are widely used in Chess and Go. We compared the predictive capabilities of Elo and the\nsimplest extension mElo$_2$ on eight Go algorithms taken from extended data table 9 in [24]: seven\nvariants of AlphaGo, and Zen. The Frobenius norms and logistic losses are $\\|P - \\hat{P}\\|_F = 0.85$ and\n$\\ell_{\\log} = 1.41$ for Elo vs the empirical probabilities and $\\|P - \\hat{P}_2\\|_F = 0.35$ and $\\ell_{\\log} = 1.27$ for mElo$_2$.\nTo better understand the difference, we zoom in on three algorithms that were observed to interact\nnon-transitively in [58]: $\\alpha_v$ with value net, $\\alpha_p$ with policy net, and Zen. 
Elo\u2019s win-loss predictions\nare poor (Table Elo: Elo incorrectly predicts both that $\\alpha_p$ likely beats $\\alpha_v$ and $\\alpha_v$ likely beats Zen),\nwhereas mElo$_2$ (Table mElo$_2$) correctly predicts likely winners in all cases (Table empirical), with\nmore accurate probabilities. Each entry is the probability that the row player beats the column player:\n\nElo | $\\alpha_v$ | $\\alpha_p$ | Zen\n$\\alpha_v$ | - | 0.41 | 0.58\n$\\alpha_p$ | 0.59 | - | 0.67\nZen | 0.42 | 0.33 | -\n\nempirical | $\\alpha_v$ | $\\alpha_p$ | Zen\n$\\alpha_v$ | - | 0.7 | 0.4\n$\\alpha_p$ | 0.3 | - | 1.0\nZen | 0.6 | 0.0 | -\n\nmElo$_2$ | $\\alpha_v$ | $\\alpha_p$ | Zen\n$\\alpha_v$ | - | 0.72 | 0.46\n$\\alpha_p$ | 0.28 | - | 0.98\nZen | 0.55 | 0.02 | -\n\n3.3 Agents vs tasks (AvT)\nIn AvT, results are represented as an $(m \\times n)$ matrix $S$: rows are agents, columns are tasks, entries are\nscores (e.g. accuracy or total reward). Subtract the total mean, so the sum of all entries of $S$ is zero.\nWe recast both agents and tasks as players and construct an antisymmetric $(m + n) \\times (m + n)$ matrix.\nLet $s = \\frac{1}{n} S \\cdot \\mathbf{1}_{n \\times 1}$ and $d = -\\frac{1}{m} S^\\top \\cdot \\mathbf{1}_{m \\times 1}$ be the average skill of each agent and the average\ndifficulty of each task. Define $\\tilde{S} = S - (s \\cdot \\mathbf{1}^\\top - \\mathbf{1} \\cdot d^\\top)$. Let $r$ be the concatenation of $s$ and $d$. We\nconstruct the antisymmetric matrix\n\n$$A_{(m+n) \\times (m+n)} = \\mathrm{grad}(r) + \\underbrace{\\begin{pmatrix} 0_{m \\times m} & \\tilde{S}_{m \\times n} \\\\\\\\ -\\tilde{S}^\\top_{n \\times m} & 0_{n \\times n} \\end{pmatrix}}_{\\tilde{A}} = \\begin{pmatrix} \\mathrm{grad}(s) & S_{m \\times n} \\\\\\\\ -S^\\top_{n \\times m} & \\mathrm{grad}(d) \\end{pmatrix}.$$\n\nThe top-right block of $A$ is agent performance on tasks; the bottom-left is task difficulty for agents.\nThe top-left block compares agents by their average skill on tasks; the bottom-right compares tasks\nby their average difficulty for agents. Average skill and difficulty explain the data if the score of\nagent $i$ on task $j$ is $S_{ij} = s_i - d_j$, the agent\u2019s skill minus the task\u2019s difficulty, for all $i, j$. Paralleling\nproposition 3, averages explain the data, $S = s \\mathbf{1}^\\top - \\mathbf{1} d^\\top$, iff $\\mathrm{curl}(A) = 0$.\nThe failure of averages to explain performance is encapsulated in $\\tilde{S}$ and $\\tilde{A}$. By proposition 2, the\nSVD of $\\tilde{S}$ and Schur decomposition of $\\tilde{A}$ are equivalent. If the SVD is $\\tilde{S}_{m \\times n} = U_{m \\times r} D_{r \\times r} V^\\top_{r \\times n}$\nthen the rows of $U$ represent the latent abilities exhibited by agents and the rows of $V$ represent the\nlatent problems posed by tasks.\n\n4 Invariant evaluation\n\nEvaluation is often based on metrics like average performance or Elo ratings. Unfortunately, two (or\ntwo hundred) tasks or agents that look different may test/exhibit identical skills. Overrepresenting\nparticular tasks or agents introduces biases into averages and Elo \u2013 biases that can only be detected\npost hoc. Humans must therefore decide which tasks or agents to retain, to prevent redundant agents\nor tasks from skewing results. At present, evaluation is not automatic and does not scale. To be\nscalable and automatic, an evaluation method should always benefit from including additional agents\nand tasks. Moreover, it should adjust automatically and gracefully to redundant data.\n\nDefinition 1. An evaluation method maps data to a real-valued function on players (that is, agents\nor agents and tasks):\n\n$$E : \\{\\text{evaluation data}\\} = \\{\\text{antisymmetric matrices}\\} \\to \\big[ \\text{players} \\to \\mathbb{R} \\big].$$\n\nDesired properties. An evaluation method should be:\n\nP1. Invariant: adding redundant copies of an agent or task to the data should make no difference.\nP2. Continuous: the evaluation method should be robust to small changes in the data.\nP3. Interpretable: hard to formalize, but the procedure should agree with intuition in basic cases.\n\nElo and uniform averaging over tasks are examples of evaluation methods that invariance excludes.\n\n4.1 Nash averaging\nThis section presents an evaluation method satisfying properties P1, P2, P3. We discuss AvA here,\nsee section D for AvT. 
Given antisymmetric logit matrix $A$, define a two-player meta-game with\npayoffs $\\mu_1(p, q) = p^\\top A q$ and $\\mu_2(p, q) = p^\\top B q$ for the row and column meta-players, where\n$B = A^\\top$. The game is symmetric because $B = A^\\top$ and zero-sum because $B = -A$.\nThe row and column meta-players pick \u2018teams\u2019 of agents. Their payoff is the expected log-odds\nof their respective team winning under the joint distribution. If there is a dominant agent that has\nbetter than even odds of beating the rest, both players will pick it. In rock-paper-scissors, the only\nunbeatable-on-average team is the uniform distribution. In general, the value of the game is zero and\nthe Nash equilibria are teams that are unbeatable in expectation.\nA problem with Nash equilibria (NE) is that they are not unique, which forces the user to make\nchoices and undermines interpretability [63, 64]. Fortunately, for zero-sum games there is a natural\nchoice of Nash:\nProposition 4 (maxent NE). For antisymmetric $A$ there is a unique symmetric Nash equilibrium\n$(p^*, p^*)$ solving $\\max_{p \\in \\Delta_n} \\min_{q \\in \\Delta_n} p^\\top A q$ with greater entropy than any other Nash equilibrium.\nMaxent Nash is maximally indifferent between players with the same empirical performance.\nDefinition 2. The maxent Nash evaluation method for AvA is\n\n$$E_m : \\{\\text{evaluation data}\\} = \\{\\text{antisymmetric matrices}\\} \\xrightarrow{\\text{maxent NE}} \\big[ \\text{players} \\xrightarrow{\\text{Nash average}} \\mathbb{R} \\big],$$\n\nwhere $p^*_A$ is the maxent Nash equilibrium and $n_A := A \\cdot p^*_A$ is the Nash average.\nInvariance to redundancy is best understood by looking at an example; for details see section C.\nExample 1 (invariance). Consider two logit matrices, where the second adds a redundant copy of\nagent C to the first:\n\nA | A | B | C\nA | 0.0 | 4.6 | -4.6\nB | -4.6 | 0.0 | 4.6\nC | 4.6 | -4.6 | 0.0\n\nand\n\nA' | A | B | C1 | C2\nA | 0.0 | 4.6 | -4.6 | -4.6\nB | -4.6 | 0.0 | 4.6 | 4.6\nC1 | 4.6 | -4.6 | 0.0 | 0.0\nC2 | 4.6 | -4.6 | 0.0 | 0.0\n\nThe maxent Nash for $A$ is $p^*_A = (\\frac{1}{3}, \\frac{1}{3}, \\frac{1}{3})$. It is easy to check that $(\\frac{1}{3}, \\frac{1}{3}, \\frac{\\alpha}{3}, \\frac{1-\\alpha}{3})$ is Nash for $A'$ for\nany $\\alpha \\in [0, 1]$ and thus the maxent Nash for $A'$ is $p^*_{A'} = (\\frac{1}{3}, \\frac{1}{3}, \\frac{1}{6}, \\frac{1}{6})$. Maxent Nash automatically\ndetects the redundant agents C1, C2 and distributes C\u2019s mass over them equally.\nUniform averaging is not invariant to adding redundant agents; concretely $\\mathrm{div}(A) = 0$ whereas\n$\\mathrm{div}(A') = (-1.15, 1.15, 0, 0)$, falsely suggesting agent B is superior. In contrast, $n_A = 0_{3 \\times 1}$ and\n$n_{A'} = 0_{4 \\times 1}$ (the zero-vectors have different sizes because there are different numbers of agents).\nNash averaging correctly reports no agent is better than the rest in both cases.\nTheorem 1 (main result for AvA; the main result for AvT is analogous, see section D). The maxent NE has the following properties:\n\n[Figure 1 panels, Nash probability. Environments with nonzero mass (left to right): centipede, asterix, private_eye, double_dunk, montezuma; other environments have p = 0. Agents with nonzero mass (left to right): distribDQN, rainbow, human, prior, popart; other agents have p = 0.]\n
Figure 1: (A) The Nash $p^*_a$ assigned to agents; (B) the Nash $p^*_e$ assigned to environments.\n\nP1. Invariant: Nash averaging, with respect to the maxent NE, is invariant to redundancies in $A$.\nP2. Continuous: If $p^*$ is a Nash for $\\hat{A}$ and $\\epsilon = \\|A - \\hat{A}\\|_{\\max}$ then $p^*$ is an $\\epsilon$-Nash for $A$.\nP3. Interpretable: (i) The maxent NE on $A$ is the uniform distribution, $p^* = \\frac{1}{n} \\mathbf{1}$, iff the meta-game is cyclic, i.e. $\\mathrm{div}(A) = 0$. (ii) If the meta-game is transitive, i.e. $A = \\mathrm{grad}(r)$, then the maxent NE is the uniform distribution on the player(s) with highest rating(s) \u2013 there could be a tie.\n\nSee section C for proof and formal definitions.\nFor interpretability, if $A = \\mathrm{grad}(r)$ then the transitive rating is all that matters: Nash averaging measures performance against the best player(s). If $\\mathrm{div}(A) = 0$ then no player is better than any other. Mixed cases cannot be described in closed form.\nThe continuity property is quite weak: theorem 1 (P2) shows the payoff is continuous, in the sense that a team that is unbeatable for $\\hat{A}$ is $\\epsilon$-beatable for nearby $A$. Unfortunately, Nash equilibria themselves can jump discontinuously when $A$ is modified slightly. Perturbed best response converges to a more stable approximation to Nash [65, 66] that unfortunately is not invariant.\nExample 2 (continuity). Consider the cyclic and transitive logit matrices\n\n$$C = \\begin{pmatrix} 0 & 1 & -1 \\\\\\\\ -1 & 0 & 1 \\\\\\\\ 1 & -1 & 0 \\end{pmatrix} \\quad \\text{and} \\quad T = \\begin{pmatrix} 0 & 1 & 2 \\\\\\\\ -1 & 0 & 1 \\\\\\\\ -2 & -1 & 0 \\end{pmatrix}.$$\n\nThe maxent Nash equilibria and Nash averages of $C + \\epsilon T$ are\n\n$$p^*_{C + \\epsilon T} = \\begin{cases} \\left( \\frac{1+\\epsilon}{3}, \\frac{1-2\\epsilon}{3}, \\frac{1+\\epsilon}{3} \\right) & \\text{if } 0 \\le \\epsilon \\le \\frac{1}{2} \\\\\\\\ (1, 0, 0) & \\text{if } \\frac{1}{2} < \\epsilon \\end{cases} \\quad \\text{and} \\quad n_{C + \\epsilon T} = \\begin{cases} (0, 0, 0) & \\text{if } 0 \\le \\epsilon \\le \\frac{1}{2} \\\\\\\\ (0, -1-\\epsilon, 1-2\\epsilon) & \\text{if } \\frac{1}{2} < \\epsilon \\end{cases}$$\n\nThe maxent Nash is the uniform distribution over agents in the cyclic case ($\\epsilon = 0$), and is concentrated on the first player when it dominates the others ($\\epsilon > \\frac{1}{2}$). When $0 < \\epsilon < \\frac{1}{2}$ the optimal team has most mass on the first and last players. Nash jumps discontinuously at $\\epsilon = \\frac{1}{2}$.\n\n4.2 Application: re-evaluation of agents on the Arcade Learning Environment\nTo illustrate the method, we re-evaluate the performance of agents on Atari [2]. Data is taken from results published in [67\u201370]. Agents include rainbow, dueling networks, prioritized replay, pop-art, DQN, count-based exploration and baselines like human, random-action and no-action. The 20 agents evaluated on 54 environments are represented by matrix $S_{20 \\times 54}$. It is necessary to standardize units across environments with quite different reward structures: for each column we subtract the min and divide by the max so scores lie in $[0, 1]$.\nWe introduce a meta-game where the row meta-player aims to pick the best distribution $p^*_a$ on agents and the column meta-player aims to pick the hardest distribution $p^*_e$ on environments, see section D for details. We find a Nash equilibrium using an LP-solver; it should be possible to find the maxent Nash using the algorithm in [71, 72]. The Nash distributions are shown in figure 1. The supports of the distributions are the \u2018core agents\u2019 and the \u2018core environments\u2019 that form unexploitable teams. See appendix for tables containing all skills and difficulties.\nFigure 2A shows the skill of agents under uniform $\\frac{1}{n} S \\cdot \\mathbf{1}$ and Nash $S \\cdot p^*_e$ averaging over environments; panel B shows the difficulty of environments under uniform $-\\frac{1}{m} S^\\top \\cdot \\mathbf{1}$ and Nash $-S^\\top \\cdot p^*_a$ averaging over agents. There is a tie for top between the agents with non-zero mass \u2013 including human. This follows by the indifference principle for Nash equilibria: strategies with support have equal payoff.\n\n[Figure 2 panels: (A) skill of agents and (B) difficulty of environments, each under uniform average and Nash average. Environments shown (left to right): centipede, asterix, private_eye, double_dunk, montezuma. Agents shown (left to right): distribDQN, rainbow, human, prior, popart.]\nFigure 2: Comparison of uniform and Nash averages. (A) Skill of agents by uniform $\\frac{1}{n} S \\cdot \\mathbf{1}$ and Nash $S \\cdot p^*_e$ averaging over environments. (B) Difficulty of environments under uniform $-\\frac{1}{m} S^\\top \\cdot \\mathbf{1}$ and Nash $-S^\\top \\cdot p^*_a$ averaging over agents. Agents and environments are sorted by Nash-averages.\n\nOur results suggest that the better-than-human performance observed on the Arcade Learning Environment is because ALE is skewed towards environments that (current) agents do well on, and contains fewer environments testing skills specific to humans. Solving the meta-game automatically finds a distribution on environments that evens out the playing field and, simultaneously, identifies the most important agents and environments.\n\n5 Conclusion\n\nA powerful guiding principle when deciding what to measure is to find quantities that are invariant to naturally occurring transformations. The determinant is computed over a basis \u2013 however, the determinant is invariant to the choice of basis since $\\det(G^{-1} A G) = \\det(A)$ for any invertible matrix $G$. Noether\u2019s theorem implies the dynamics of a physical system with symmetries obeys a conservation law. The speed of light is fundamental because it is invariant to the choice of inertial reference frame.\nOne must have symmetries in mind to talk about invariance. What are the naturally occurring symmetries in machine learning? The question admits many answers depending on the context, see e.g. [73\u201379]. In the context of evaluating agents, which are typically built from neural networks, it is unclear a priori whether two seemingly different agents \u2013 based on their parameters or hyperparameters \u2013 are actually different. Further, it is increasingly common that environments and tasks are parameterized \u2013 or are learning agents in their own right, see self-play [10, 11], adversarial attacks [6\u20139], and automated curricula [80]. The overwhelming source of symmetry when evaluating learning algorithms is therefore redundancy: different agents, networks, algorithms, environments and tasks that do basically the same job.\nNash evaluation computes a distribution on players (agents, or agents and tasks) that automatically adjusts to redundant data. It thus provides an invariant approach to measuring agent-agent and agent-environment interactions. In particular, Nash averaging encourages a maximally inclusive approach to evaluation: computational cost aside, the method should only benefit from including as many tasks and agents as possible. Easy tasks or poorly performing agents will not bias the results. As such Nash averaging is a significant step towards more objective evaluation.\nNash averaging is not always the right tool. Firstly, it is only as good as the data: garbage in, garbage out. Nash decides which environments are important based on the agents provided to it, and conversely. As a result, the method is blind to differences between environments that do not make a difference to agents and vice versa. Nash-based evaluation is likely to be most effective when applied to a diverse array of agents and environments. Secondly, for good or ill, Nash averaging removes control from the user. One may have good reason to disagree with the distribution chosen by Nash. Finally, Nash is a harsh master. It takes an adversarial perspective and may not be the best approach to, say, constructing automated curricula \u2013 although boosting is a related approach that works well [81, 82]. It is an open question whether alternate invariant evaluations can be constructed, game-theoretically or otherwise.\n\nAcknowledgements. We thank Georg Ostrovski, Pedro Ortega, Jos\u00e9 Hern\u00e1ndez-Orallo and Hado van Hasselt for useful feedback.\n\nReferences\n[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, \u201cImagenet: A large-scale hierarchical image\n\ndatabase,\u201d in CVPR, 2009.\n\n[2] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, \u201cThe arcade learning environment: An evaluation\n\nplatform for general agents,\u201d J. Artif. Intell. Res., vol. 47, pp. 253\u2013279, 2013.\n\n[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImagenet classification with deep convolutional neural\n\nnetworks,\u201d in Advances in Neural Information Processing Systems (NIPS), 2012.\n\n[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. 
Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, \u201cHuman-level control through deep reinforcement learning,\u201d Nature, vol. 518, pp. 529\u2013533, 02 2015.\n[5] D. Donoho, \u201c50 years of Data Science,\u201d in Based on a presentation at the Tukey Centennial workshop, 2015.\n[6] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, \u201cIntriguing properties of neural networks,\u201d in arXiv:1312.6199, 2013.\n[7] F. Tram\u00e8r, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, \u201cEnsemble Adversarial Training: Attacks and Defenses,\u201d in ICLR, 2018.\n[8] A. Kurakin, I. Goodfellow, S. Bengio, Y. Dong, F. Liao, M. Liang, T. Pang, J. Zhu, X. Hu, C. Xie, J. Wang, Z. Zhang, Z. Ren, A. Yuille, S. Huang, Y. Zhao, Y. Zhao, Z. Han, J. Long, Y. Berdibekov, T. Akiba, S. Tokui, and M. Abe, \u201cAdversarial Attacks and Defences Competition,\u201d in arXiv:1804.00097, 2018.\n[9] J. Uesato, B. O\u2019Donoghue, A. van den Oord, and P. Kohli, \u201cAdversarial Risk and the Dangers of Evaluating Against Weak Attacks,\u201d in ICML, 2018.\n[10] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, \u201cMastering the game of Go without human knowledge,\u201d Nature, vol. 550, pp. 354\u2013359, 2017.\n[11] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, \u201cMastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,\u201d in arXiv:1712.01815, 2017.\n[12] E. Todorov, T. Erez, and Y.
Tassa, \u201cMujoco: A physics engine for model-based control,\u201d in IROS, 2012.\n[13] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. K\u00fcttler, A. Lefrancq, S. Green, V. Vald\u00e9s, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen, \u201cDeepMind Lab,\u201d in arXiv:1612.03801, 2016.\n[14] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, \u201cOpenAI Gym,\u201d 2016.\n[15] J. Z. Leibo, C. de Masson d\u2019Autume, D. Zoran, D. Amos, C. Beattie, K. Anderson, A. G. Casta\u00f1eda, M. Sanchez, S. Green, A. Gruslys, S. Legg, D. Hassabis, and M. M. Botvinick, \u201cPsychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents,\u201d in arXiv:1801.08116, 2018.\n[16] A. E. Elo, The Rating of Chess players, Past and Present. Ishi Press International, 1978.\n[17] R. Herbrich, T. Minka, and T. Graepel, \u201cTrueSkill: a Bayesian skill rating system,\u201d in NIPS, 2007.\n[18] M. Frean and E. R. Abraham, \u201cRock-scissors-paper and the survival of the weakest,\u201d Proc. R. Soc. Lond. B, no. 268, pp. 1323\u20131327, 2001.\n[19] B. Kerr, M. A. Riley, M. W. Feldman, and B. J. M. Bohannan, \u201cLocal dispersal promotes biodiversity in a real-life game of rock\u2013paper\u2013scissors,\u201d Nature, no. 418, pp. 171\u2013174, 2002.\n[20] R. A. Laird and B. S. Schamp, \u201cCompetitive Intransitivity Promotes Species Coexistence,\u201d The American Naturalist, vol. 168, no. 2, 2006.\n[21] A. Szolnoki, M. Mobilia, L.-L. Jiang, B. Szczesny, A. M. Rucklidge, and M. Perc, \u201cCyclic dominance in evolutionary games: a review,\u201d J R Soc Interface, vol. 11, no. 100, 2014.\n[22] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K.
Kavukcuoglu, \u201cPopulation based training of neural networks,\u201d CoRR, vol. abs/1711.09846, 2017.\n[23] D. Balduzzi, S. Racani\u00e8re, J. Martens, J. Foerster, K. Tuyls, and T. Graepel, \u201cThe mechanics of n-player differentiable games,\u201d in ICML, 2018.\n[24] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, \u201cMastering the game of Go with deep neural networks and tree search,\u201d Nature, vol. 529, no. 7587, pp. 484\u2013489, 2016.\n[25] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Perolat, D. Silver, and T. Graepel, \u201cA Unified Game-Theoretic Approach to Multiagent Reinforcement Learning,\u201d in NIPS, 2017.\n[26] S. Legg and M. Hutter, \u201cA universal measure of intelligence for artificial agents,\u201d in IJCAI, 2005.\n[27] S. Legg and J. Veness, \u201cAn Approximation of the Universal Intelligence Measure,\u201d in Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, 2013.\n[28] R. J. Solomonoff, \u201cA formal theory of inductive inference I, II,\u201d Inform. Control, vol. 7, no. 1-22, 224-254, 1964.\n[29] A. N. Kolmogorov, \u201cThree approaches to the quantitative definition of information,\u201d Problems Inform. Transmission, vol. 1, no. 1, pp. 1\u20137, 1965.\n[30] G. J. Chaitin, \u201cOn the length of computer programs for computing finite binary sequences,\u201d J Assoc. Comput. Mach., vol. 13, pp. 547\u2013569, 1966.\n[31] C. Ferri, J. Hern\u00e1ndez-Orallo, and R. Modroiu, \u201cAn experimental comparison of performance measures for classification,\u201d Pattern Recognition Letters, no. 30, pp. 27\u201338, 2009.\n[32] J. Hern\u00e1ndez-Orallo, P. Flach, and C.
Ferri, \u201cA Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss,\u201d JMLR, no. 13, pp. 2813\u20132869, 2012.\n[33] J. Hern\u00e1ndez-Orallo, The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press, 2017.\n[34] J. Hern\u00e1ndez-Orallo, \u201cEvaluation in artificial intelligence: from task-oriented to ability-oriented measurement,\u201d Artificial Intelligence Review, vol. 48, no. 3, pp. 397\u2013447, 2017.\n[35] R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore, \u201cPMLB: a large benchmark suite for machine learning evaluation and comparison,\u201d BioData Mining, vol. 10, p. 36, Dec 2017.\n[36] C. Spearman, \u201c\u2018General Intelligence,\u2019 objectively determined and measured,\u201d Am. J. Psychol., vol. 15, no. 201, 1904.\n[37] A. Woolley, C. Fabris, A. Pentland, N. Hashmi, and T. Malone, \u201cEvidence for a Collective Intelligence Factor in the Performance of Human Groups,\u201d Science, no. 330, pp. 686\u2013688, 2010.\n[38] S. Bringsjord, \u201cPsychometric artificial intelligence,\u201d Journal of Experimental & Theoretical Artificial Intelligence, vol. 23, no. 3, pp. 271\u2013277, 2011.\n[39] D. R. Hunter, \u201cMM algorithms for generalized Bradley-Terry models,\u201d Annals of Statistics, vol. 32, no. 1, pp. 384\u2013406, 2004.\n[40] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling, \u201cRevisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents,\u201d Journal of Artificial Intelligence Research (JAIR), vol. 61, pp. 523\u2013562, 2018.\n[41] A. Liapis, G. N. Yannakakis, and J. Togelius, \u201cTowards a Generic Method of Evaluating Game Levels,\u201d in Artificial Intelligence in Digital Interactive Entertainment (AIIDE), 2013.\n[42] B. Horn, S. Dahlskog, N.
Shaker, G. Smith, and J. Togelius, \u201cA Comparative Evaluation of Procedural Level Generators in the Mario AI Framework,\u201d in Foundations of Digital Games, 2014.\n[43] T. S. Nielsen, G. Barros, J. Togelius, and M. J. Nelson, \u201cGeneral video game evaluation using relative algorithm performance profiles,\u201d in EvoApplications, 2015.\n[44] F. de Mesentier Silva, S. Lee, J. Togelius, and A. Nealen, \u201cAI-based Playtesting of Contemporary Board Games,\u201d in Foundations of Digital Games (FDG), 2017.\n[45] V. Volz, J. Schrum, J. Liu, S. M. Lucas, A. M. Smith, and S. Risi, \u201cEvolving Mario Levels in the Latent Space of a Deep Convolutional Generative Adversarial Network,\u201d in GECCO, 2018.\n[46] R. K. Hambleton, H. Swaminathan, and H. J. Rogers, Fundamentals of item response theory. Sage Publications, 1991.\n[47] F. Mart\u00ednez-Plumed and J. Hern\u00e1ndez-Orallo, \u201cAI results for the Atari 2600 games: difficulty and discrimination using IRT,\u201d in Workshop on Evaluating General-Purpose AI (EGPAI at IJCAI), 2017.\n[48] X. Jiang, L.-H. Lim, Y. Yao, and Y. Ye, \u201cStatistical ranking and combinatorial Hodge theory,\u201d Math. Program., Ser. B, vol. 127, pp. 203\u2013244, 2011.\n[49] O. Candogan, I. Menache, A. Ozdaglar, and P. A. Parrilo, \u201cFlows and Decompositions of Games: Harmonic and Potential Games,\u201d Mathematics of Operations Research, vol. 36, no. 3, pp. 474\u2013503, 2011.\n[50] O. Candogan, A. Ozdaglar, and P. A. Parrilo, \u201cNear-Potential Games: Geometry and Dynamics,\u201d ACM Trans Econ Comp, vol. 1, no. 2, 2013.\n[51] O. Candogan, A. Ozdaglar, and P. A. Parrilo, \u201cDynamics in near-potential games,\u201d Games and Economic Behavior, vol. 82, pp. 66\u201390, 2013.\n[52] W. E. Walsh, D. C. Parkes, and R.
Das, \u201cChoosing samples to compute heuristic-strategy nash equilibrium,\u201d in Proceedings of the Fifth Workshop on Agent-Mediated Electronic Commerce, 2003.\n[53] M. P. Wellman, \u201cMethods for empirical game-theoretic analysis,\u201d in Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, pp. 1552\u20131556, 2006.\n[54] S. Phelps, S. Parsons, and P. McBurney, \u201cAn Evolutionary Game-Theoretic Comparison of Two Double-Auction Market Designs,\u201d in Agent-Mediated Electronic Commerce VI, Theories for and Engineering of Distributed Mechanisms and Systems, AAMAS Workshop, pp. 101\u2013114, 2004.\n[55] S. Phelps, K. Cai, P. McBurney, J. Niu, S. Parsons, and E. Sklar, \u201cAuctions, Evolution, and Multi-agent Learning,\u201d in AAMAS and 7th European Symposium on Adaptive and Learning Agents and Multi-Agent Systems (ALAMAS), pp. 188\u2013210, 2007.\n[56] M. Ponsen, K. Tuyls, M. Kaisers, and J. Ramon, \u201cAn evolutionary game-theoretic analysis of poker strategies,\u201d Entertainment Computing, vol. 1, no. 1, pp. 39\u201345, 2009.\n[57] D. Bloembergen, K. Tuyls, D. Hennes, and M. Kaisers, \u201cEvolutionary dynamics of multi-agent learning: A survey,\u201d J. Artif. Intell. Res. (JAIR), vol. 53, pp. 659\u2013697, 2015.\n[58] K. Tuyls, J. Perolat, M. Lanctot, J. Z. Leibo, and T. Graepel, \u201cA Generalised Method for Empirical Game Theoretic Analysis,\u201d in AAMAS, 2018.\n[59] M. Dudik, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi, \u201cContextual Dueling Bandits,\u201d in COLT, 2015.\n[60] A. Balsubramani, Z. Karnin, R. E. Schapire, and M. Zoghi, \u201cInstance-dependent Regret Bounds for Dueling Bandits,\u201d in COLT, 2016.\n[61] P. R. Jordan, C. Kiekintveld, and M. P.
Wellman, \u201cEmpirical game-theoretic analysis of the TAC supply chain game,\u201d in 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007), Honolulu, Hawaii, USA, May 14-18, 2007, p. 193, 2007.\n[62] P. R. Jordan, Practical Strategic Reasoning with Applications in Market Games. PhD thesis, 2010.\n[63] J. von Neumann and O. Morgenstern, Theory of Games and Economic Behavior. Princeton University Press, Princeton NJ, 1944.\n[64] J. F. Nash, \u201cEquilibrium Points in n-Person Games,\u201d Proc Natl Acad Sci U S A, vol. 36, no. 1, pp. 48\u201349, 1950.\n[65] J. Hofbauer and W. H. Sandholm, \u201cOn the global convergence of stochastic fictitious play,\u201d Econometrica, vol. 70, no. 6, pp. 2265\u20132294, 2002.\n[66] W. H. Sandholm, Population Games and Evolutionary Dynamics. MIT Press, 2010.\n[67] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, \u201cDueling Network Architectures for Deep Reinforcement Learning,\u201d in ICML, 2016.\n[68] H. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, \u201cLearning values across many orders of magnitude,\u201d in NIPS, 2016.\n[69] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, \u201cCount-Based Exploration with Neural Density Models,\u201d in ICML, 2017.\n[70] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, \u201cRainbow: Combining Improvements in Deep Reinforcement Learning,\u201d in AAAI, 2018.\n[71] L. E. Ortiz, R. E. Schapire, and S. M. Kakade, \u201cMaximum entropy correlated equilibrium,\u201d in Technical Report TR-2006-21, CSAIL MIT, 2006.\n[72] L. E. Ortiz, R. E. Schapire, and S. M. Kakade, \u201cMaximum entropy correlated equilibria,\u201d in AISTATS, 2007.\n[73] P. Diaconis, Group Representations in Probability and Statistics. Institute of Mathematical Statistics, 1988.\n[74] Y. LeCun, L. Bottou, Y.
Bengio, and P. Haffner, \u201cGradient-based learning applied to document recognition,\u201d Proc. IEEE, vol. 86, no. 11, pp. 2278\u20132324, 1998.\n[75] R. Kondor and T. Jebara, \u201cA kernel between sets of vectors,\u201d in ICML, 2003.\n[76] R. Kondor, \u201cGroup theoretical methods in machine learning,\u201d in PhD dissertation, 2008.\n[77] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, \u201cDeep Sets,\u201d in NIPS, 2017.\n[78] J. Hartford, D. R. Graham, K. Leyton-Brown, and S. Ravanbakhsh, \u201cDeep Models of Interactions Across Sets,\u201d in ICML, 2018.\n[79] R. Kondor, Z. Lin, and S. Trivedi, \u201cClebsch\u2013Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network,\u201d in NIPS, 2018.\n[80] S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and R. Fergus, \u201cIntrinsic Motivation and Automatic Curricula via Asymmetric Self-Play,\u201d in ICLR, 2017.\n[81] Y. Freund and R. E. Schapire, \u201cA Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,\u201d Journal of Computer and System Sciences, 1996.\n[82] R. Schapire and Y. Freund, Boosting: Foundations and Algorithms. MIT Press, 2012.\n[83] R. J. Vandenberg and C. E. Lance, \u201cA Review and Synthesis of the Measurement Invariance Literature: Suggestions, Practices, and Recommendations for Organizational Research,\u201d Organizational Research Methods, vol. 3, no. 1, pp. 4\u201370, 2000.\n", "award": [], "sourceid": 1669, "authors": [{"given_name": "David", "family_name": "Balduzzi", "institution": "DeepMind"}, {"given_name": "Karl", "family_name": "Tuyls", "institution": "DeepMind"}, {"given_name": "Julien", "family_name": "Perolat", "institution": "DeepMind"}, {"given_name": "Thore", "family_name": "Graepel", "institution": "DeepMind"}]}