NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 7573
Title:No-Regret Learning in Unknown Games with Correlated Payoffs

Reviewer 1

The paper considers learning in the following model: the player repeatedly plays the same game, which is not known a priori. The player gets feedback only about her own (noisy) value for the realized outcome, but also observes the actions of the other players. Two important assumptions are made about the game: 1. a regularity assumption that the payoff function has low complexity with respect to a kernel, and 2. that the observed payoff is perturbed by Gaussian noise on top of the originally unknown value. The paper offers an elegant algorithm combining kernelized multi-armed bandit algorithms (from ICML'10 and ICML'17) with multiplicative weights to get a no-regret strategy. The main idea is to use UCB-style upper bounds inside a multiplicative-weights algorithm, which moves the performance of the algorithm closer to that under full-information feedback.

Section 4 of the paper offers empirical evidence of the better performance of the algorithm on three different classes of examples. Unfortunately, a number of important details of these examples are deferred to the supplementary material, but there is still a convincing case for the improved performance of the algorithm.

I read the authors' response. The response to my question (what application the proposed model is a good fit for) is not great. True, one can perhaps get information on the number of agents traveling a route, but the more natural source of information there is actually the delay (say, as offered by Google Maps). And true, one can observe properties of others' bids in the auction setting, but that may hold for the winning bidder, not for all participants and not for all of their actions. I view the paper as a first step in an interesting direction, hence my relatively high rating of 7.
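The "UCB-style upper bounds inside a multiplicative-weights algorithm" idea described above can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: the RBF kernel, the simple GP posterior, the payoff function, and all parameter values are my own assumptions, made only to show how optimistic reward estimates for every own action (at the observed opponent action) feed a full-information-style exponential-weights update.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.3):
    # Squared-exponential kernel on joint (own action, opponent action) pairs.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def gp_ucb(X_train, y_train, X_query, beta=2.0, noise=0.1):
    # Standard GP posterior mean/variance; optimistic estimate = mean + beta * std.
    K = rbf_kernel(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    Ks = rbf_kernel(X_query, X_train)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y_train
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)  # prior variance k(x,x)=1
    return mu + beta * np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
A = np.linspace(0.0, 1.0, 10)   # learner's (discretized) action set
eta = 0.5                        # multiplicative-weights learning rate
weights = np.ones(len(A))
X_hist, y_hist = [], []

def true_reward(a, b):
    # Unknown to the learner; used here only to simulate noisy payoff feedback.
    return np.exp(-4.0 * (a - b) ** 2)

for t in range(30):
    p = weights / weights.sum()
    a = rng.choice(A, p=p)                       # play from the MW distribution
    b = rng.uniform()                            # opponents' action is OBSERVED
    y = true_reward(a, b) + 0.1 * rng.normal()   # own noisy payoff is observed
    X_hist.append([a, b]); y_hist.append(y)
    # Optimistic (UCB) payoff estimate for EVERY own action against the realized
    # opponent action -- this is what emulates full-information feedback.
    Xq = np.column_stack([A, np.full(len(A), b)])
    ucb = gp_ucb(np.array(X_hist), np.array(y_hist), Xq)
    weights *= np.exp(eta * np.clip(ucb, 0.0, 1.0))  # exponential-weights update
```

The key contrast with EXP3-style bandit updates is the last two lines: every own action gets an (estimated) reward update each round, not just the one that was played.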

Reviewer 2

EDIT after rebuttal: The paper proposes a new feedback model. However, the algorithmic idea and analysis are not new. This is probably the weakest point of the paper. So it boils down to whether the question itself is useful (e.g., has useful applications). I agree that 7 is a generous score. My primary reason was that this seemed like an interesting direction: to understand how much the full-information feedback can be relaxed while keeping the same bounds. The stated motivations in the rebuttal are pretty weak; in fact, even if feasible, I would not release the number of agents traveling a route due to privacy issues (e.g., if a route is sensitive, revealing whether > 0 agents chose it gives away too much information). On the other hand, I thought that robust Bayesian optimization, in the paper, seemed like a reasonable one.

This paper considers no-regret learning dynamics in multi-player games. The authors consider the setting where one player uses a no-regret algorithm that receives a noisy reward for the chosen action and observes the actions chosen by the other players (but not their rewards). They further make a regularity assumption on the rewards, required by their solution, which uses a Gaussian process (GP) to compute UCB estimates of the rewards from the other players' actions. The key result is to show that under this restrictive feedback, in many applications one can get as fast a convergence rate as with full feedback, and always faster than with bandit feedback. They argue that this is a natural form of feedback in many applications and that full feedback is too restrictive in practice. They propose an algorithm, prove regret bounds, and evaluate it on simulated and real-world transportation datasets.

My comments on this paper are as follows. I think the feedback model they consider and the goal of the paper are novel.
It is interesting to understand how much feedback one really needs to get fast bounds, and the trade-off between feedback and convergence rate is a research direction worth pursuing. The algorithm seems to be a rather simple modification of the usual exponential-weights approach, after appropriately estimating the rewards of the other players from their actions using standard tools from GPs. Thus, the algorithm, and hence the analysis, seems to be a straightforward marriage between the techniques in these two areas. However, they provide *extensive* experimental evaluation on simulated and real-world transportation datasets. They also show an application to Bayesian optimization previously considered in prior work and compare their algorithm with the one from that work. The lack of significant novelty in theory is made up for by the extensive experimental evaluation. As a bonus, this paper is very well-written and easy to read. Weighing the pros and cons, I lean towards acceptance of this paper.

I had a question for the authors. In the bandit setting with 2-player games, when both players play no-regret algorithms, faster convergence rates can be obtained (for instance [1]). Is something similar possible for this model of feedback?

In summary, the pros and cons are as follows:

Pros:
- Novel feedback model
- Extensive experimental evaluation
- Paper extremely well-written

Cons:
- Algorithm and analysis are rather straightforward once the problem is correctly set up.

[1] Syrgkanis, Agarwal, Luo, Schapire. Fast convergence of regularized learning in games. NIPS 2015.

Reviewer 3

The paper is well-written and technically sound. As mentioned below, the significance is less clear, as the kernel assumption is not well justified.

Detailed comments:

(1) Page 2, Table 1: Just based on what is in the table, it is really unclear how GP-MW improves over EXP3. The settings of EXP3 and GP-MW differ in several aspects, and it is unclear which of the differences causes the difference in regret bound. For example, what is the optimal regret bound for EXP3's setting plus the kernel assumption? And what is the optimal regret bound for GP-MW's setting without the kernel assumption? To make a fair comparison between EXP3 and GP-MW, the authors could also discuss how GP-MW performs when all opponents have only one action to choose from and there is no kernel assumption.

(2) The authors do not spend enough effort justifying why the kernel assumption makes sense. Do most forms of games have this property? Not everyone is an expert on this assumption, and it is really hard to judge the result without further justification of it.

(3) In the regret definition in this paper, a_t^{-i} is a fixed sequence. This means the paper does not consider that other players may use adaptive strategies and that the learner's chosen actions can affect the other players' actions. It would be better to make this clear.

(4) Line 170: Is the improvement mainly due to the kernel assumption?

(5) In the experiments, although some of the data is from the real world, the rewards are generated using kernels, so the experiments also do not justify the kernel assumption.

I have read the author response. "The kernel assumption is standard in the kernelized literature" does not sound like a good justification for the assumption. I agree that it is technically convenient. I would hope the authors can provide more justification from a game-theory perspective, i.e., show examples of games that are standard in game theory and work well with the kernel assumption.
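For reference, the regret notion that point (3) refers to is presumably the standard external regret of player $i$ measured against the realized opponent sequence, something like

\[
R_T^i \;=\; \max_{a \in \mathcal{A}_i} \sum_{t=1}^{T} r^i\big(a,\, a_t^{-i}\big) \;-\; \sum_{t=1}^{T} r^i\big(a_t^i,\, a_t^{-i}\big),
\]

where the comparator action $a$ is evaluated against the same fixed sequence $a_t^{-i}$ that the opponents actually played. This is exactly why the definition does not capture opponents reacting adaptively to a counterfactual choice of $a$.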