__ Summary and Contributions__: This paper studies continuous-time fictitious play (CTFP) in mean-field games (MFGs), and shows that the exploitability metric (which characterizes the quality of an approximate Nash equilibrium) converges to 0 at a rate of O(1/t) in various settings, including both the finite-horizon case (with and without common noise) and the discounted infinite-horizon case. The authors also numerically test two discrete-time implementations of CTFP, one with Backward Induction and the other with Q-learning, and show that these algorithms converge as expected to an interpretable population state distribution on various MFG examples (discretized linear-quadratic, beach bar with and without common noise, and maze).
The major contributions are to establish the convergence rate of CTFP on generic MFGs and to numerically test the performance of two discrete-time algorithms motivated by fictitious play.

__ Strengths__: The major strengths of this paper are listed below:
1. It establishes the convergence rate of CTFP for generic MFGs with mild assumptions (monotonicity), and the results apply to both finite horizon MFGs (with and without common noises) and infinite horizon discounted MFGs.
2. It provides rich numerical tests of two discrete-time instantiations of CTFP, one with Backward Induction and the other with Q-learning, on a wide range of examples. The empirical findings mostly match the theoretical results and the interpretations (e.g., those on the learned population state distributions) are inspiring.

__ Weaknesses__: The major weaknesses of this paper are listed here. More details can be found below.
1. The presentation of the CTFP theory is confusing and inconsistent. In particular, Section 3 needs some substantial improvement.
2. There is a gap between the CTFP theory and the practical algorithms based on Backward Induction and Q-learning, and the details of the two algorithms are not clearly described and explained in the main text. Although some ingredients of the algorithms are written out in more detail in the appendix, it is still unclear how each algorithm works as a whole.
Below we make some more detailed comments on the aforementioned points (and beyond):
1. In Section 2, the authors introduce two discrete-time instantiations of CTFP, one with Backward Induction and the other with Q-learning. However, the description of these algorithms is too brief in the main text, which makes it hard to understand what is tested in the various numerical experiments, especially for general readers (of the NeurIPS community, for example). Moreover, although the best-response subproblem and the population distribution calculation are described in more detail in the Appendix (Algorithms 2, 3, 4, 5), the authors do not provide sufficient detail on how they are put together into the two algorithms that are numerically tested. For example, how is the obtained \hat{\mu}_n^{\pi} in Algorithm 3 used in Algorithm 2? Also, in Algorithm 2, the input states "Start with a fixed distribution \mu = and …": what is \mu equal to? And what is the relationship among \mu, \mu_0, \mu_{k-1} and \hat{\mu}_n^{\pi}?
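For what it is worth, the relationship I would expect among these quantities is the standard fictitious-play incremental average of mean fields. Below is a minimal sketch, with hypothetical names, assuming mean fields are stored as arrays over states; this is my reading of the standard scheme, not the authors' code:

```python
import numpy as np

def fp_average(mu_bar_prev, mu_new, j):
    """Incremental fictitious-play average of mean fields:
    mu_bar_j = ((j - 1) * mu_bar_{j-1} + mu_j) / j,
    so that mu_bar_j is the uniform average of mu_1, ..., mu_j."""
    return ((j - 1) * mu_bar_prev + mu_new) / j
```

Starting from mu_bar equal to the first computed mean field and folding in each new best-response mean field this way would make the roles of \mu_0, \mu_{k-1} and \hat{\mu}_n^{\pi} explicit.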
2. Again for the two practical algorithms, I wonder why the authors do not employ the randomized policy selection (uniformly over previously obtained policies) and the incremental averaging of the mean field used in [51]. The authors may want to add some explanation of this. In addition, I noticed that the practical algorithms use greedy policies after obtaining the Q-values. However, as noted in [64] and [Angiuli, A., Fouque, J. P., & Laurière, M. (2020). Unified Reinforcement Q-Learning for Mean Field Game and Control Problems. arXiv preprint arXiv:2006.13912], some exploration trade-off (using a soft-max operator or \epsilon-greedy strategies) should generally be adopted to avoid instability in the learning process, especially in the finite state-action space setting considered in the current paper. The authors may want to double-check this, and should otherwise explain why such randomization is unnecessary in their numerical examples. Note also that for the examples in Sections 4.1 and 7, the exploitability never goes below 10^0, which does not really indicate convergence. So the authors may want to check whether adding soft-max or \epsilon-greedy smoothing when defining the next policy \pi from the learned Q-values in Algorithms 2 and 4 improves convergence. In any case, this kind of inexactness or randomization is not covered by the idealized CTFP theory studied in this paper, which limits the contribution of the work.
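To make the suggestion concrete, here is a minimal sketch of the two smoothing schemes (soft-max and \epsilon-greedy) applied to a tabular Q-function; the function names and array layout are my own assumptions, not the authors' implementation:

```python
import numpy as np

def softmax_policy(q_values, temperature=1.0):
    """Soft-max smoothing of a Q-table of shape (n_states, n_actions)."""
    logits = q_values / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def epsilon_greedy_policy(q_values, epsilon=0.1):
    """Epsilon-greedy smoothing: epsilon mass spread uniformly over actions,
    the remaining 1 - epsilon placed on the greedy action in each state."""
    n_states, n_actions = q_values.shape
    policy = np.full((n_states, n_actions), epsilon / n_actions)
    policy[np.arange(n_states), q_values.argmax(axis=1)] += 1.0 - epsilon
    return policy
```

Either of these could replace the pure argmax step when defining the next policy from the learned Q-values.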
3. On a related point, no (new) theory is provided for the two practical algorithms, leaving a non-negligible gap between theory and practice in this paper. This could be partly resolved by considering (vanishing) inexactness or perturbation errors in CTFP and establishing the corresponding convergence (rate) theory.
4. In Section 3, the authors fail to make a smooth transition from the discrete-time fictitious play for MFGs at the end of Section 2 to the continuous-time setting. It would be very helpful to clearly specify how the authors move from the updates in Algorithm 1 to the CTFP at the beginning of Section 3. Also, although t=1 is also used as a "watershed" in [68], which studies (mostly) the same CTFP dynamics, the authors should explain why they need to initialize with t<1 and then state the results only for t\geq 1. In particular, is the choice of 1 essential? And why can't we start with t=0 and derive results for all t>0?
5. Again in Section 3, the presentation of the CTFP setting and analysis is hard to follow and contains several inconsistencies. In particular, the best-response policy is not formally defined, and the authors first state that \pi_n^{BR,t} is a best-response policy (line 104) and then state that it is chosen arbitrarily for t\leq 1 (line 107), which seems inconsistent. In addition, it is quite unclear why it matters to introduce the equation involving \pi_n^t before stating Theorem 1. I checked Appendix A and found it much easier to follow. The authors should rewrite Section 3 more carefully, incorporating the necessary details from Appendix A, so that it can be understood without reading the whole of Appendix A.
6. In Section 5, the assumption on the common noise sequence is not clearly stated. Is it allowed to be arbitrary (e.g., history-dependent or even non-Markovian), or does it have to be stationary and i.i.d.? This is important especially for understanding the numerical experiments in Section 6 (and in particular why they are different from, and potentially more nontrivial than, the previous experiments without common noise). To improve this, the authors should also specify the choice of the \xi_n sequence in Sections 6.1 and 6.2 in detail, so that readers can easily understand the difference. Otherwise, for Section 6.1, readers may wonder why one cannot regard \rho\xi_n+\sqrt{1-\rho^2}\epsilon_n as a dummy placeholder for the original \epsilon_n term (without common noise), with no essential difference. In general, without a clear explanation of the common noise sequence, readers may feel that this is no more than an MFG setting with non-stationary transition probabilities and random non-stationary rewards.
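To illustrate the point: marginally, \rho\xi_n+\sqrt{1-\rho^2}\epsilon_n is standard Gaussian for any \rho (so per agent it is indeed indistinguishable from the original \epsilon_n); what changes is that the shared factor \xi_n correlates agents with coefficient \rho^2 and thus shifts the whole population distribution. A minimal simulation sketch, with hypothetical names and my own assumptions about how the noise would be drawn:

```python
import numpy as np

def correlated_noise(rho, n_steps, n_agents, rng=None):
    """Simulate rho * xi_n + sqrt(1 - rho^2) * eps_n for a population of agents.

    xi_n is the common noise: drawn once per time step and shared by all agents.
    eps_n is idiosyncratic: drawn independently for each agent.
    Each agent's noise is marginally standard normal for any rho, but two
    distinct agents' noises have correlation rho**2.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    xi = rng.standard_normal((n_steps, 1))           # common noise, shared
    eps = rng.standard_normal((n_steps, n_agents))   # idiosyncratic noise
    return rho * xi + np.sqrt(1.0 - rho**2) * eps
```

Making this kind of construction explicit in Sections 6.1 and 6.2 would dispel the "dummy placeholder" reading.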

__ Correctness__: Yes. I think the claims and method are correct, and the empirical methodology is also correct as far as I understand.

__ Clarity__: Yes, mostly, apart from the drawbacks mentioned above in the weaknesses section.

__ Relation to Prior Work__: Yes, mostly. I would suggest that the authors compare more explicitly with the CTFP literature (e.g., [68, 87]) and state more clearly the difference between the settings considered in the literature (referred to by the authors as "simpler games") and the setting considered in this paper. See the additional feedback for more comments related to this point.

__ Reproducibility__: No

__ Additional Feedback__: ================== post rebuttal ====================================
I have read the authors' rebuttal and the other reviews. I think the authors have shown sufficient effort in addressing most of the raised questions, and I am satisfied with most of their explanations (provided they make the edits promised in the rebuttal). So I have decided to raise my score from 5 to 6. That being said, I still have some small concerns and points of confusion. Firstly, the authors mentioned in the rebuttal that "As we use \bar{\mu}^j, we don’t need to select the policy uniformly over previously obtained policies". However, in Algorithm 1, an averaged policy \bar{\pi}_j is still computed and used. Secondly, the authors mentioned that "We also introduce a new theory of common noise for the two practical algs". However, the theory of CTFP with common noise is not related to the practical (discrete-time) algorithms, if I understand correctly. In general, the authors may want to be more careful in their revision when adding such comments, which may introduce new confusion for readers. Finally, regarding the common noise setting, I noticed that one of the key differences is that the policies (and mean fields) depend on the noise sequence. So in order to apply these policies, the realization of the noise sequence has to be known a priori. In reality, how do people know the noise sequence in advance (for example, whether the bar is closed or open at some time step in the example in Section 6.2)? The authors should add some comments on the applicability of the common noise policies in reality.
======================================================
In general, I think the convergence rate result for CTFP is clean and good, and the numerical experiments are thorough. However, as mentioned above, the presentation of the CTFP theory (the central result) needs substantial improvement, and there is a non-negligible gap between the theory and the numerical experiments. So I prefer to provide a more conservative rating at present, but the authors are welcome to argue and explain more to address my concerns in the rebuttal.
Last but not least, please find some suggestions on typo fixing, writing improvement and so on below:
1. Appendix F seems disconnected from the other parts of the paper, including both the main text and the appendix. The authors may want to connect it with the main text or other parts of the appendix to make the thesis of this section easier to understand.
2. In line 38, “see Appx. E” should be “see Appx. C”. Similarly, in line 137, “see Appx. C” should be “see Appx. E”.
3. In line 39, the authors may want to explain more about what they mean exactly by “simpler games”.
4. In lines 41-42, the authors claim that they derive the “first time convergence properties … MFGs with common noises”. Do you mean that there have been existing results for CTFP on MFGs without common noises (which is the main result of Section 3)? If not, the authors may want to state this more clearly.
5. In line 55, the authors assume that the state and action spaces are finite. But in the closely related paper [51], such finiteness assumptions are not made. The authors may want to comment on why this is needed and whether it is essential to the proofs.
6. In Algorithm 1, line 4, what is the exploitability e_{j-1} used for? It does not seem to be used anywhere in the algorithm. Is it just used to determine whether to terminate the algorithm?
7. In line 106, “proof in A” should better be “proof in Appx. A”.
8. In Theorem 1, “… of the system, $\forall t\geq 1$ …” should better be “… of the system, \textit{i.e.}, $\forall t\geq 1$ …”.
9. In Section 4.1, the quantities \Delta_n and \bar{\mu}_n are not explained until about 6 lines after their introduction; these explanations should be moved earlier. Also, what does “nodes” mean in “… move up to M nodes …”?
10. In Section 4.2, numerical results, the authors may want to specify the bar location chosen so that the readers can interpret the plots even better.
11. In Section 5, there are several notational inconsistencies in terms of \mu_n(x|\Xi). The authors sometimes use \mu_0(x,\Xi_0) and \mu_n(x,\Xi), and sometimes even write \mu_{n+1}^{\pi}(x’|\Xi.\xi) and so on. These should be unified.
12. In line 218, “dtistribution” should be “distribution”.
13. In line 257, “metrics” should be “metric”.
14. In line 260, “Application” should be “Applications”.
15. In Algorithms 2 and 3 in the appendix, “p(\dot|x_n^ka_n^k)” should be “p(\cdot|x_n^k,a_n^k)”. Also, references [65] and [66] are identical, and one of them should be removed.

__ Summary and Contributions__: The authors study a continuous-time Fictitious Play learning algorithm in discrete-state mean-field games. A common noise, i.e., a stochastic perturbation of the probability densities, is considered. They connect learning in games with mean-field games. They also design several numerical experiments in both model-based and model-free settings.

__ Strengths__: The authors introduce the common noise from mean-field games into Fictitious Play. There are two potential benefits of the proposed approach. On the one hand, they apply noise to the probability density of players, which may have advantages in reinforcement learning problems. On the other hand, the connection between mean-field games and reinforcement learning is fruitful. The mean-field game also introduces physics-informed dynamics for the Markov process; a particular example is the Fokker-Planck equation. From this perspective, the connection opens new directions for future studies.

__ Weaknesses__: The authors emphasize examples from mean-field games. Is it possible to demonstrate several classical reinforcement learning examples and illustrate the connection?

__ Correctness__: The method is correct.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: There are also several relevant machine learning works on primal-dual algorithms for mean-field games:
Lin et al. APAC-Net: Alternating the Population and Agent Control via Two Neural Networks to Solve High-Dimensional Stochastic Mean Field Games.
Ruthotto et al. A machine learning framework for solving high-dimensional mean field game and mean field control problems.

__ Reproducibility__: Yes

__ Additional Feedback__: The paper is an attempt to connect mean-field games with reinforcement learning. The examples here are mainly from classical mean-field games; in other words, the game dynamics are at a continuous level, consisting of the Fokker-Planck equation and the Hamilton-Jacobi equation. Generalizing these game dynamics to a broader reinforcement learning setting would be an interesting direction. It may be important to consider some classical reinforcement learning examples and their connection with MFG theory.
I have read the authors' response. It addresses my concerns. A good paper.

__ Summary and Contributions__: The paper studies the performance of a continuous-time fictitious play algorithm in the context of finite (or discounted infinite) horizon Mean Field Games (MFGs). Such games have found increasing use in modeling a variety of contexts, from oligopoly dynamics, advertising and sponsored search markets, to ridesharing and two-sided platform design. Given their widespread use in modeling, a learning algorithm for finding the Nash equilibrium in such games has important practical implications.
Under some monotonicity conditions on the reward functions of the agents, the authors show that in the continuous-time fictitious play algorithm, the averaged sequence of policies converges to the unique Nash equilibrium at rate O(1/t). This is shown by exhibiting that an exploitability metric is a strong Lyapunov function for the learning dynamics. The authors illustrate their analytical results through a set of numerical examples, including one with common noise affecting all agents' state transitions.
After author response:
I read through the author response. Regarding the response to (3), I agree that most convergence rate results for fictitious play are not for the last iterate but for the averaged policies. I was pointing out that this is not what Theorem 1 in the submission claims -- Theorem 1 claims convergence for the last iterate, which is not what is shown in the proof in the supplement, where the overlines denoting the average are dropped from the notation "to alleviate the presentation". Re: point (1): I continue to maintain that the results are incremental over those in [68] (in the paper) -- the consideration of common noise does not pose a significant challenge in my opinion, as the analysis is for the mean-field game (and hence for the aggregate distribution) and the convergence is over the fictitious play time t, not over the game time n. For instance, in the proof in the supplement, the terms for different game times n never interact -- so in essence the results only require the Fokker-Planck equations for the aggregate distribution. Re: point (2): based on the authors' response and other reviewers' comments, I am willing to accept that this might be milder than other assumptions in the literature.

__ Strengths__: 1. The authors allow for common noise to affect the state transitions of all agents in their model.
2. The use of an exploitability metric to capture the convergence of the learning algorithm.

__ Weaknesses__: 1. The results seem incremental, and lack substantial novelty.
2. The monotonicity conditions to achieve O(1/t) convergence rate also imply uniqueness of equilibria, and thus seem to be quite strong.
3. The convergence is only established for the averaged policies, and not for the last-iterate.

__ Correctness__: Yes.

__ Clarity__: The notation for the averaged policies ($\bar{\pi}$) is replaced by the non-averaged $\pi$ in the statement of Theorem 1, but this change is only made explicit in the proof in Appendix A (there is a passing reference right before the theorem statement). The paper would improve if this fact were made transparent.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: