NeurIPS 2020

### Review 1

Summary and Contributions: This paper studies continuous-time fictitious play (CTFP) in mean-field games (MFGs) and shows that the exploitability metric (which characterizes the quality of an approximate Nash equilibrium) converges to 0 at a rate of O(1/t) in various settings, including finite-horizon games (with and without common noise) and the discounted infinite-horizon case. The authors also numerically test two discrete-time implementations of CTFP, one with Backward Induction and the other with Q-learning, and show that these algorithms converge as expected to an interpretable population state distribution on various MFG examples (discretized linear-quadratic, beach bar with and without common noise, and maze). The major contributions are establishing the convergence rate of CTFP on generic MFGs and numerically testing the performance of two discrete-time algorithms motivated by fictitious play.
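For readers unfamiliar with the two notions the review leans on, exploitability and fictitious-play averaging can be illustrated with a minimal sketch (this is a hypothetical toy, not the paper's algorithm): a one-shot crowd-aversion game on a few locations, where each location's payoff decreases with the mass of players choosing it (a monotone game), and the averaged distribution's exploitability shrinks roughly like O(1/t).

```python
import numpy as np

b = np.array([1.0, 0.5, 0.25, 0.0])     # hypothetical location qualities

def exploitability(mu):
    """Best-deviation payoff minus average payoff under distribution mu."""
    r = b - mu                           # crowd-averse (monotone) reward r(x, mu)
    return r.max() - mu @ r

mu_bar = np.full(len(b), 1.0 / len(b))   # averaged population distribution
for j in range(1, 2001):
    br = np.zeros(len(b))
    br[np.argmax(b - mu_bar)] = 1.0      # pure best response to the current average
    mu_bar += (br - mu_bar) / (j + 1)    # fictitious-play running average

# mu_bar approaches the unique equilibrium (mass concentrates on the better
# locations until their payoffs equalize) and exploitability(mu_bar) -> 0.
```

At the equilibrium of this toy game the supported locations all yield the same payoff, so no unilateral deviation is profitable, which is exactly the condition exploitability measures.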

Strengths: The major strengths of this paper are: 1. It establishes the convergence rate of CTFP for generic MFGs under mild assumptions (monotonicity), and the results apply to both finite-horizon MFGs (with and without common noise) and infinite-horizon discounted MFGs. 2. It provides rich numerical tests of two discrete-time instantiations of CTFP, one with Backward Induction and the other with Q-learning, on a wide range of examples. The empirical findings mostly match the theoretical results, and the interpretations (e.g., those of the learned population state distributions) are inspiring.

Correctness: Yes. I think the claims and method are correct, and the empirical methodology is also sound as far as I understand.

Clarity: Yes, mostly, apart from the drawbacks mentioned above in the weaknesses section.

Relation to Prior Work: Yes, mostly. I would suggest that the authors compare more explicitly with the CTFP literature (e.g., [68, 87]) and state more clearly how the settings considered in that literature (referred to by the authors as “simpler games”) differ from the setting considered in this paper. See the additional feedback for more comments on this point.

Reproducibility: No

Additional Feedback:

================== post rebuttal ====================================

I have read the authors' rebuttal and the other reviews. I think the authors have shown sufficient effort in addressing most of the raised questions, and I am satisfied with most of their explanations (provided that they make the edits promised in the rebuttal). So I have decided to raise my score from 5 to 6.

That being said, I still have some small concerns and points of confusion. Firstly, the authors mention in the rebuttal that "As we use \bar{\mu}^j, we don’t need to select the policy uniformly over previously obtained policies". However, in Algorithm 1, an averaged policy \bar{\pi}_j is still computed and used. Secondly, the authors mention that "We also introduce a new theory of common noise for the two practical algs". However, if I understand correctly, the theory of CTFP with common noise is not related to the practical (discrete-time) algorithms. In general, the authors should be careful when adding such comments in the revision, as they may introduce new confusion for readers.

Finally, regarding the common-noise setting, I noticed that one of the key differences is that the policies (and mean fields) depend on the noise sequence. So in order to apply these policies, the realization of the noise sequence has to be known a priori. In reality, how would one know the noise sequence in advance (for example, whether the bar is closed or open at some time step in the example of Section 6.2)? The authors should add some comments on the applicability of the common-noise policies in practice.

======================================================

In general, I think the convergence rate result for CTFP is clean, and the numerical experiments are thorough.
However, as mentioned above, the presentation of the CTFP theory (the central result) needs substantial improvement, and there is a non-negligible gap between the theory and the numerical experiments. So I would prefer to give a more conservative rating at present, but the authors are welcome to argue and explain more in the rebuttal to address my concerns. Last but not least, please find some suggestions on typo fixes, writing improvements, and so on below:

1. Appendix F seems disconnected from the other parts of the paper, including both the main text and the rest of the appendix. The authors may want to connect it with the main text or the other appendices to make the thesis of this section easier to understand.
2. In line 38, “see Appx. E” should be “see Appx. C”. Similarly, in line 137, “see Appx. C” should be “see Appx. E”.
3. In line 39, the authors may want to explain more precisely what they mean by “simpler games”.
4. In lines 41-42, the authors claim that they derive the “first time convergence properties … MFGs with common noises”. Do you mean that there are existing results for CTFP on MFGs without common noise (which is the main result of Section 3)? If not, the authors may want to state this more clearly.
5. In line 55, the authors assume that the state and action spaces are finite, but in a closely related paper [51] such finiteness assumptions are not made. The authors may want to comment on why this assumption is needed and whether it is essential to the proofs.
6. In Algorithm 1, line 4, what is the exploitability e_{j-1} used for? It does not seem to be used anywhere in the algorithm. Is it just used to determine whether to terminate the algorithm?
7. In line 106, “proof in A” would be better as “proof in Appx. A”.
8. In Theorem 1, “… of the system, $\forall t\geq 1$ …” would be better as “… of the system, \textit{i.e.}, $\forall t\geq 1$ …”.
9. In Section 4.1, the quantities \Delta_n and \bar{\mu}_n are not explained until about 6 lines after their introduction; the explanations should be moved earlier. Also, what does “nodes” mean in “… move up to M nodes …”?
10. In Section 4.2 (numerical results), the authors may want to specify the chosen bar location so that readers can interpret the plots even better.
11. In Section 5, there are several notational inconsistencies concerning \mu_n(x|\Xi). The authors sometimes use \mu_0(x,\Xi_0) and \mu_n(x,\Xi), and sometimes even write \mu_{n+1}^{\pi}(x’|\Xi.\xi) and so on. These should be unified.
12. In line 218, “dtistribution” should be “distribution”.
13. In line 257, “metrics” should be “metric”.
14. In line 260, “Application” should be “Applications”.
15. In the appendix, Algorithms 2 and 3, “p(\dot|x_n^ka_n^k)” should be “p(\cdot|x_n^k,a_n^k)”. Also, references [65] and [66] are identical, and one of them should be removed.

### Review 2

Summary and Contributions: The authors study a continuous-time Fictitious Play learning algorithm in discrete-state mean-field games. A common noise, i.e., a stochastic perturbation of the probability densities, is considered. They connect learning in games with mean-field games. They also design several numerical experiments in both model-based and model-free settings.

Strengths: The authors introduce the common noise from mean-field games into Fictitious Play. There are two potential benefits of the proposed approach. On the one hand, they apply noise to the probability density of players, which may have advantages in reinforcement learning problems. On the other hand, the connection between mean-field games and reinforcement learning is fruitful. The mean-field game also introduces physics-informed dynamics for the Markov process; a particular example is the Fokker-Planck equation. From this perspective, this connection suggests new directions for future studies.
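For concreteness, in a discrete-state MFG the Fokker-Planck equation mentioned above reduces to the forward Kolmogorov update of the population distribution under a fixed policy. A minimal sketch, with a hypothetical 3-state transition matrix standing in for the dynamics induced by a policy pi:

```python
import numpy as np

# Forward (discrete Fokker-Planck / Kolmogorov) update of the population
# distribution under a fixed policy: mu_{n+1} = mu_n @ P_pi.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.1, 0.9]])   # hypothetical transition matrix under pi

mu = np.array([1.0, 0.0, 0.0])       # the whole population starts in state 0
for _ in range(50):
    mu = mu @ P_pi                   # mass flows along the Markov chain
```

Since this particular chain is doubly stochastic, the population mass spreads toward the uniform distribution; a common noise would perturb this flow by making P_pi itself depend on a shared random sequence.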

Weaknesses: The authors emphasize examples from mean-field games. Is it possible to demonstrate several classical reinforcement learning examples and illustrate the connection?

Correctness: The method is correct.

Clarity: The paper is well written.

Relation to Prior Work: There are also several relevant machine learning works on primal-dual algorithms for mean-field games: Lin et al., “APAC-Net: Alternating the Population and Agent Control via Two Neural Networks to Solve High-Dimensional Stochastic Mean Field Games”; Ruthotto et al., “A machine learning framework for solving high-dimensional mean field game and mean field control problems”.

Reproducibility: Yes

Additional Feedback: The paper is an attempt to connect mean-field games with reinforcement learning. The examples here are mainly from classical mean-field games; in other words, the game dynamics are at a continuous level, consisting of the Fokker-Planck equation and the Hamilton-Jacobi equation. Generalizing these game dynamics to a broader reinforcement learning setting would be an interesting direction. It may be important to consider some classical reinforcement learning examples and their connection with MFG theory. I have read all of the responses. The authors have addressed my concerns. A good paper.

### Review 3

Summary and Contributions: The paper studies the performance of a continuous-time fictitious play algorithm in the context of finite-horizon (or discounted infinite-horizon) Mean Field Games (MFGs). Such games have found increasing use in modeling a variety of contexts, from oligopoly dynamics, advertising, and sponsored search markets to ridesharing and two-sided platform design. Given their widespread use in modeling, a learning algorithm for finding the Nash equilibrium in such games has important practical implications. Under some monotonicity conditions on the reward functions of the agents, the authors show that in the continuous-time fictitious play algorithm, the averaged sequence of policies converges to the unique Nash equilibrium at rate O(1/t). This is shown by exhibiting that an exploitability metric is a strong Lyapunov function for the learning dynamics. The authors illustrate their analytical results through a set of numerical examples, including one with common noise affecting all agents' state transitions.

After author response: I read through the author response. Regarding the response to (3), I agree that most convergence rate results for fictitious play are not for the last iterate but for the averaged policies. I was pointing to the fact that this is not what Theorem 1 in the submission claims: Theorem 1 claims convergence for the last iterate, which is not what is shown in the proof in the supplement, where the overlines denoting the average are dropped from the notation "to alleviate the presentation". Re: point (1): I continue to maintain that the results are incremental over those in [68] (in the paper). The consideration of common noise does not pose a significant challenge in my opinion, as the analysis is for the mean-field game (and hence for the aggregate distribution), and the convergence is over the fictitious play time t, not over the game time n. For instance, in the proof in the supplement, the terms for different game times n never interact, so in essence the results only require the Fokker-Planck equations for the aggregate distribution. Re: point (2): based on the authors' response and the other reviewers' comments, I am willing to accept that this assumption might be milder than others in the literature.
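The distinction drawn above between averaged-policy and last-iterate convergence can be made concrete with a hypothetical crowd-aversion toy game (not the paper's setting) whose equilibrium can be written down directly: the mixed equilibrium distribution equalizes payoffs on its support and has zero exploitability, while any pure best response, however late in a fictitious-play run it is produced, concentrates all mass on one location and stays highly exploitable.

```python
import numpy as np

b = np.array([1.0, 0.5, 0.25, 0.0])    # hypothetical location qualities

def exploitability(mu):
    r = b - mu                          # crowd-averse reward under mu
    return r.max() - mu @ r

# Averaged policy: the mixed equilibrium distribution of this toy game; the
# supported locations all yield payoff 0.25, so no deviation is profitable.
mu_star = np.array([0.75, 0.25, 0.0, 0.0])

# Last iterate: a pure best response puts all mass on one location, which
# makes that location congested and a deviation strictly profitable.
pure = np.array([1.0, 0.0, 0.0, 0.0])
```

Here `exploitability(mu_star)` is 0 while `exploitability(pure)` is bounded away from 0, mirroring why convergence results for fictitious play are typically stated for the averaged policies.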

Strengths: 1. The authors allow for common noise to affect the state transitions of all agents in their model. 2. The use of an exploitability metric to capture the convergence of the learning algorithm.

Weaknesses: 1. The results seem incremental, and lack substantial novelty. 2. The monotonicity conditions to achieve O(1/t) convergence rate also imply uniqueness of equilibria, and thus seem to be quite strong. 3. The convergence is only established for the averaged policies, and not for the last-iterate.

Correctness: Yes.

Clarity: The notation for the averaged policies ($\bar{\pi}$) is replaced by the non-averaged $\pi$ in the statement of Theorem 1, but this change is only made explicit in the proof in Appendix A (there is a passing reference right before the theorem statement). The paper would improve if this fact were made transparent.

Relation to Prior Work: Yes.

Reproducibility: Yes