NeurIPS 2019
Reviewer 1
The framework proposed in this paper addresses the problem of multi-agent online decision-making when the population size is large. The problem is well motivated by practical examples such as online ad auctions, and the stated results are nice.

Some references to background knowledge should be provided:
1. In Section 2.1, it is claimed that the N-player game is hard, but this claim needs a supporting reference or some justification.
2. In Section 3, what is the definition of the Wasserstein distance? (A standard definition is recalled after this review, for reference.)

There are also some confusing notations and possible typos:
1. In line 67, what is s^t? It seems the authors mean s_t.
2. In line 78, the definition of the indicator function is unclear.

It would be good if the authors could state their notation in a separate section.
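For reference on the Wasserstein question above, this is the standard textbook definition (not quoted from the paper under review, which presumably applies it to distributions over the state or state-action space):

% First-order Wasserstein distance between probability measures \mu and \nu
% on a metric space (\mathcal{X}, d); \Gamma(\mu, \nu) denotes the set of
% couplings, i.e., joint distributions whose marginals are \mu and \nu.
W_1(\mu, \nu)
  = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y) \, \mathrm{d}\gamma(x, y)
  = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left| \int_{\mathcal{X}} f \, \mathrm{d}\mu - \int_{\mathcal{X}} f \, \mathrm{d}\nu \right|,
% where the second equality is the Kantorovich-Rubinstein duality.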
Reviewer 2
This paper considers learning in mean-field games (MFGs). MFGs take the limit of an infinite number of agents, which are considered indistinguishable. Based on a motivating example consisting of a repeated ad auction problem, the authors introduce a "general" mean-field game (GMFG), a model-free version of the standard MFG. The authors revisit standard Q-learning and a soft version of it, and provide convergence and complexity results for these algorithms. The methods are compared numerically on the auction problem with a recently proposed approach and show better performance.

The paper addresses a relevant problem in mean-field games, a body of work that is becoming increasingly popular. It is overall well written and provides interesting results. I have the following concerns with this paper:

- Stationary vs. infinite-horizon policies: it is unclear what assumptions are required for the stationary case. The original formulation [15] considers the finite-horizon (time-dependent policies) setting. The authors just remove the time index and go with that. This should be clarified.

- Incremental contribution: looking at [12,18], the theoretical contribution seems too incremental, the main difference being the use of state-action distributions (to define Q-values) instead of the state distribution only. In this regard, I think the term General Mean Field Games (GMFG) is confusing, since what is presented is just a different (model-free) way of solving the same class of mean-field games. It is unclear in what way GMFGs are more general than, e.g., the MFGs in [25].

- Soft version of Q-learning: the literature on the complexity/convergence of the soft version of Q-learning used in the paper is only marginally cited. For example, the results of Asadi & Littman, "An Alternative Softmax Operator for Reinforcement Learning", show potential divergence when using the Boltzmann operator, as seems to be used in this paper. (A small numerical illustration of this issue follows this review.)

- Generality of the repeated ad auction problem: the method is only illustrated on the auction problem, which is interesting but limited. It reduces winning an auction to the game intensity parameter M, which seems to simplify the problem significantly. It would be desirable to see how the method performs on other popular benchmarks for MFGs.

- Related work on MFGs: the authors are aware of the very recent AAMAS paper [25], but avoid relating the two works. There are other relevant papers that should also be discussed:
  * Mean field stochastic games with binary actions: Stationary threshold policies. M. Huang and Y. Ma. CDC 2017.
  * Decentralised Learning in Systems With Many, Many Strategic Agents. D. Mguni et al. AAAI 2018.

Minor:
- I found the notion of Nash equilibrium used in this paper confusing. Since an NE involves an optimization between different agents, it is clear that an individual agent optimizes its cost given the joint distribution; however, the "population side" merely infers the joint distribution that is "compatible" with the individual optimal policy.
- line 121: a^M should be a^M_t
- line 152 (and 419): P(\cdot|\cdot, ...) should be P(\cdot|s_t, ...)
- line 211: use a different symbol for \alpha, since \alpha is already used for the marginal
- Algorithm 1, line 1: where is the parameter c?
- Algorithm 1, line 3: better not to mention the model, since this is a model-free step
- line 251: "iteration" -> "iterations"
- line 261: "It learning accuracy" -> "Its learning accuracy"
- line 408: remove the subindex $t$ of $\pi_t$ in $\pi_t(s_t,\mathcal{L})$

-----------------------
POST REBUTTAL

After the authors' rebuttal and the discussion I lean towards acceptance. However, I think the paper needs substantial editing to clarify the setting, especially stationary vs. non-stationary and finite-horizon. It should be better clarified how the time-independent solutions found by the proposed algorithm are captured in a non-stationary setting. The paper also lacks an in-depth comparison with papers [31] and [15], describing differences in the settings (e.g., function approximation, stationarity) and in the obtained results.
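For context on the Boltzmann-operator concern above: the issue highlighted by Asadi & Littman is that the Boltzmann softmax is not a non-expansion, so generalized value iteration / Q-learning built on it can fail to converge, whereas their mellowmax operator is a non-expansion. The following is a minimal numerical sketch of that failure (generic Python, not code from the paper; the vectors and temperature are chosen only to exhibit the violation):

import numpy as np

def boltzmann(q, beta):
    # Boltzmann ("softmax-weighted") backup: sum_i q_i e^{beta q_i} / sum_i e^{beta q_i}
    w = np.exp(beta * (q - q.max()))  # shift by max(q) for numerical stability
    return float(np.dot(q, w) / w.sum())

def mellowmax(q, omega):
    # Mellowmax of Asadi & Littman (2017): (1/omega) * log(mean_i e^{omega q_i})
    m = q.max()
    return float(np.log(np.mean(np.exp(omega * (q - m)))) / omega + m)

# A non-expansion B must satisfy |B(x) - B(y)| <= max_i |x_i - y_i| for all x, y.
beta = 2.0
x = np.array([0.0, 1.0])
y = np.array([-0.5, 1.5])
gap = float(np.max(np.abs(x - y)))                         # 0.5
print(abs(boltzmann(x, beta) - boltzmann(y, beta)), gap)   # ~0.58 > 0.5: violated
print(abs(mellowmax(x, beta) - mellowmax(y, beta)), gap)   # ~0.45 <= 0.5: respected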
Reviewer 3
1. This work seems to be the first to propose a convergent RL algorithm for MFGs that is able to approximately find the mean-field equilibrium. Moreover, the explanation of why the naive application of Q-learning fails is convincing.

2. The paper is overall well written, but the presentation can be further improved. For example, in Section 2.3 it would be better to directly state the repeated auction problem as an MFG by specifying what $L$ is and how the reward function and transition probability depend on the mean-field distribution. Currently, the reward function in (3) involves a lot of notation and it is unclear how to write it as a function of s, a, and L. Another example is Theorem 2, where it would be better to carefully define each quantity instead of deferring the definitions to the appendix.

3. In Section 4 the authors focus on the stationary solution, which is time-independent. However, the existence and uniqueness theorem only guarantees a solution that might depend on time. Is it possible to directly show that the MFG solution is time-independent?

4. Given L, the optimal policy of an MDP can be non-unique; in this case, the left-hand side of (5) might be large. Is there any intuition for why this assumption holds?

5. My major concern is the epsilon-net. Since the space of all distributions over S \times A is large, the size of this epsilon-net might be huge. In practice, how would one construct this epsilon-net? (A simple construction is sketched after this review.) In addition, the construction of this epsilon-net yields a gap $\phi(\epsilon)$, which appears in the parameter $c$. In practice, how should the parameter $c$ be chosen?

6. The number of iterations in Theorem 12 seems complicated. Is there an easy way to see the order of the iteration complexity?

7. The existence of mean-field equilibria in discrete-time finite MFGs has been studied in the literature; see "Markov-Nash Equilibria in Mean-Field Games with Discounted Cost".
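On the epsilon-net question in point 5, here is a minimal sketch of the standard construction for a finite state-action space, assuming distances are measured in l_1 (total variation). The function name and the toy sizes below are mine, purely for illustration; the point is that the net size C(k + d - 1, d - 1), with d = |S||A| and k = ceil(d / epsilon), grows very quickly in d, which is exactly the practical concern raised above.

from itertools import combinations
from math import ceil, comb

def simplex_eps_net(d, eps):
    """Yield an l_1 epsilon-net of the probability simplex in R^d.

    Grid points have coordinates that are integer multiples of 1/k with
    k = ceil(d / eps); rounding any distribution to this grid moves each
    coordinate by at most 1/k, hence by at most d/k <= eps in l_1.
    """
    k = ceil(d / eps)
    # Stars and bars: each grid point splits k mass units among d coordinates.
    for cuts in combinations(range(k + d - 1), d - 1):
        parts, prev = [], -1
        for c in cuts:
            parts.append(c - prev - 1)
            prev = c
        parts.append(k + d - 2 - prev)
        yield tuple(p / k for p in parts)

# Toy example: |S| = 2, |A| = 2, so d = |S||A| = 4, and eps = 0.5 gives k = 8.
d, eps = 4, 0.5
net = list(simplex_eps_net(d, eps))
k = ceil(d / eps)
print(len(net), comb(k + d - 1, d - 1))  # both 165; already beyond 10^12 once d ~ 20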