NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6526
Title:Modelling the Dynamics of Multiagent Q-Learning in Repeated Symmetric Games: a Mean Field Theoretic Approach

Reviewer 1

The problem being tackled is that of describing the dynamics of Q-learning policies (through Q-values). This problem has been tackled before in two-player settings, and this paper's main claim is its extension to n-player settings. The paper is well written and does seem to tackle a problem not currently reported elsewhere; however, I find the work lacking on two fronts: i) it is too derivative and incremental. Given the small contribution on the technical side, I would have liked a strong experimental section, but that is not the case; and ii) experimental validation of the model is insufficient. The experiments do not show anything insightful and (as I argue below) are not very useful. To improve this work, I suggest the authors focus on developing good insights into the use of these methods for the design of new multiagent RL algorithms. In its current form I really do not see how the results presented here could be used in any way to design new algorithms. Moreover, there are already published works on multiagent RL in the mean field setting [Mguni et al. AAAI 2018, Mguni et al. AAMAS 2019], which, by the way, are not even cited here; they should be used to validate this paper's models.

Reviewer 2

Let me start with a global comment. I very much enjoyed reading this paper. I found it well written (apart from typos and some English sentence constructions that are a bit heavy) and interesting. It relates to a modern sub-field of reinforcement learning, multi-agent learning, which lacks theory compared with single-agent RL. The paper introduces a mean-field analysis of a large population of agents playing simple symmetric matrix games against each other, so that, as the population gets large, each player effectively plays against a single "mean" player. The theory is very well explained in spite of the conciseness of the paper, and the numerical experiments are very convincing. I recommend publication in NeurIPS.

Let me give some more detailed comments (unfortunately line numbers are not present in the PDF; they are usually very convenient for pointing out mistakes, so the authors will have to find the places I'm referring to):

- typos:
  page 2 (of AN individual agent may...; we introduce a probability...; of individual agentS under...).
  page 3 (Q-value WITH probability 1).
  page 4 (for AN individual agent; bar x_t should be bold; in fact irrelevant to the identity... -> in fact independent of the agents, but only depends on the current Q-values; such that given the settings of.. -> such that given...).
  page 5 (around the point q_t travel ALONG THE Z_i-AXIS; adjacent box B'={... for all j in {1,...k}/{i} } YOU MISS THE /{i}; of of the box).
  page 6 (recap -> recall; unfair for the player that plays H -> that plays D; table 2 (PD): lower-left corner should be 5,0).
  page 8 (of AN individual Q-learner under A n-agent setting; dominates THE choice..)
- please say in table 1 whether the row player's payoff is the first or the second number in each entry of the payoff matrix.
- algo 1: initializations are missing: theta=0, t=0 (maybe others?)
- my main question: from (1) to (2), because the game is stateless, the greedy term max Q (last term in (1)) disappears from the Q update equation. But I do not clearly see why, and this seems crucial for the analysis in the paper. This point, even if evident to the authors, should be clearly explained.
- below (3): "where a=a_j.." is a useless sentence, as there is no a written above.
- eq (4) is wrong as such: the second t+1 index on the left-hand side should be t, and either 1) a should be replaced by a_j everywhere (the clearer solution to me, as the left-hand side as it is written right now does not depend on j, while the right-hand side does, which is strange), or 2) you should state explicitly near (4) that a=a_j (the sentence below (3)). The same comment applies to the next equations, like (5) etc., where a should be a_j.

Overall it's a very good paper from which I learned a lot. I recommend acceptance after my comments have been taken into account; in particular, the authors should explain better how to go from (1) to (2). It would also have been nice to provide the code for the experiments. I have read the authors' rebuttal and taken it into account.
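
To make the question about going from (1) to (2) concrete, here is one possible reading, written as a sketch. It assumes that Eq. (1) is the standard Q-learning update and that the stateless case amounts to dropping the state argument and setting the discount factor \gamma to 0; the symbols \alpha (learning rate) and r_t (reward) are likewise assumed, and the paper may justify the step differently.

```latex
% Assumed form of Eq. (1): standard Q-learning with learning rate \alpha and discount \gamma
Q_{t+1}(s_t, a_t) = (1-\alpha)\, Q_t(s_t, a_t)
                    + \alpha \left[ r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') \right]

% Stateless repeated game: there is a single state, so the state argument is dropped.
% Under the additional assumption \gamma = 0, the bootstrap term vanishes:
Q_{t+1}(a_t) = Q_t(a_t) + \alpha \left[ r_t - Q_t(a_t) \right]
```

Note that dropping the state alone would leave a term \gamma \max_{a'} Q_t(a') in the update, so under this reading the disappearance of the greedy term rests on the choice \gamma = 0, which is the step that would need to be stated explicitly.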

Reviewer 3

Originality: I am not an expert on RL systems and so it is somewhat difficult for me to judge this topic. I would largely defer to other reviewers here. However, I really enjoyed the exposition and found the line of research original and interesting.

Quality: To me, it seemed that the quality of the research was high. I was able to follow the theoretical development and found it compelling. I strongly believe that I could sit down and reproduce the work without much difficulty. One place where some more exposition / details might be useful is the experimental portion. Could the authors elaborate on exactly how the simulations were performed? A few more minor questions: 1) Is eq. 5 (especially the conversion from Q_{t+1} - Q_t) only valid in the \eta\to0 limit? 2) In eq. 8, is the series convergent? Would it be worth analyzing the \Delta x^2 correction to make sure it is subleading? Are there any conditions aside from n, m\to\infty that need to be satisfied here? 3) Eq. 9 -> 10. It seems that in eq. 9 there are n decoupled equations describing the per-agent Q-values. However, I don't see how to get from there to a single agent-independent equation. I.e., if two agents start with different Q-values, surely their Q-values will follow different trajectories? Is this just saying that "without loss of generality we can consider a single agent"? That is a statement I agree with.

Clarity: This paper was very clearly written. There are a few places where the language is not idiomatic, but I was very impressed as a non-expert by how easy the paper was to follow. One minor question: in eq. 4 & 5, should we be considering a_j, since x_{j,t} is the probability given to action a_j?

Significance: Naively, it seems to me that this work could be quite significant given increased interest in ever larger multi-agent games. Having said that, the action / state space considered here is small. It would be nice to include more examples, including some that actually have state associated with them.
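
On the question above of how the simulations were performed, the sketch below shows one way such an experiment could be set up. It is not the authors' code. The random pairing scheme, the Boltzmann (softmax) policy, the Prisoner's Dilemma payoff values, and all names and hyperparameters (`simulate`, `boltzmann`, `alpha`, `tau`, `n_agents`) are assumptions made for illustration.

```python
import numpy as np

# Assumed Prisoner's Dilemma payoffs for the row player:
# PD[a, b] is the reward for playing a against b (0 = cooperate, 1 = defect).
PD = np.array([[3.0, 0.0],
               [5.0, 1.0]])


def boltzmann(q, tau):
    """Softmax policy over Q-values with temperature tau (row-wise)."""
    z = (q - q.max(axis=-1, keepdims=True)) / tau  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)


def simulate(payoff, n_agents=100, steps=2000, alpha=0.1, tau=1.0, seed=0):
    """Stateless Q-learners, randomly paired each round in a symmetric game.

    Returns the final Q-values (n_agents x n_actions) and the population-mean policy.
    """
    assert n_agents % 2 == 0, "random pairing assumes an even number of agents"
    rng = np.random.default_rng(seed)
    n_actions = payoff.shape[0]
    q = np.zeros((n_agents, n_actions))
    idx = np.arange(n_agents)
    half = n_agents // 2

    for _ in range(steps):
        probs = boltzmann(q, tau)
        # Sample one action per agent by inverting the cumulative distribution.
        u = rng.random(n_agents)
        actions = (probs.cumsum(axis=1) < u[:, None]).sum(axis=1)
        actions = np.minimum(actions, n_actions - 1)  # guard against rounding

        # Pair agents uniformly at random.
        perm = rng.permutation(n_agents)
        opp = np.empty(n_agents, dtype=int)
        opp[perm[:half]] = perm[half:]
        opp[perm[half:]] = perm[:half]

        # Each agent's reward depends on its own action and its opponent's action.
        rewards = payoff[actions, actions[opp]]

        # Stateless update with no bootstrap term: Q(a) <- Q(a) + alpha * (r - Q(a)).
        q[idx, actions] += alpha * (rewards - q[idx, actions])

    return q, boltzmann(q, tau).mean(axis=0)


if __name__ == "__main__":
    q_values, mean_policy = simulate(PD)
    print("population-mean policy (cooperate, defect):", mean_policy)
```

Comparing the population-mean policy from such a run against the trajectory predicted by the mean-field equations, for several values of `n_agents`, is the kind of detail that would make the experimental section easier to reproduce.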