NeurIPS 2020

Learning to Incentivize Other Learning Agents


Review 1

Summary and Contributions: This paper presents a new framework for multi-agent reinforcement learning that allows agents to incentivize other agents by giving out their own "rewards". An algorithm is also proposed for effective policy learning within this new framework, and empirical results are shown on several MARL testbeds.

Strengths: The framework of allowing agents to give out reward is a novel and interesting contribution to the whole MARL community, which has great potential for solving a wide range of problems, such as credit assignment, cooperation, emergent behavior, etc. The analysis and derivation of the algorithm are neat, clear, and insightful. In general, I like this paper.

Weaknesses: I have two concerns: (1) baselines and (2) scalability. (1) Regarding the baselines, although I do think the current experiments are sound, I would be interested to see comparisons with other MARL approaches beyond the LOLA-style second-order optimization approaches. IA is a good baseline and it is nice to see that LIO outperforms IA, but I do think the results would be more convincing if more benchmark algorithms were included. For example, does social influence solve the problem (https://arxiv.org/pdf/1810.08647.pdf)? Mutual information can also be viewed as an approximation of accounting for other agents' future policy changes and has shown strong performance in harvest/cleanup with a large number of agents. Can we simply learn a value function conditioned on the rewards received by different agents (in the same spirit as DDPG), so that we can avoid second-order gradients? These questions arose as I read the paper, and I believe more in-depth discussion/experiments would further consolidate the contribution of this work. (2) Regarding scalability, it is a bit unfortunate that the biggest experiment in this paper includes only 3 agents. I can easily imagine that the “bi-level” optimization scheme makes the algorithm inefficient as the total number of agents increases, but it remains critical for readers to understand the asymptotic performance of the algorithm --- even if it doesn’t work, a discussion of the limitations and potential improvements would be extremely useful. I strongly recommend that the authors include some analysis on environments with N>3 (e.g., N=5, 7, 10).
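To make the value-function suggestion concrete, here is a minimal sketch of what I have in mind (purely illustrative, not from the paper): each agent's critic is simply conditioned on the incentives it receives, so no differentiation through other agents' updates is needed. All names (IncentiveAwareCritic, obs_dim, n_agents) are hypothetical.

```python
# Illustrative sketch only: a first-order alternative where each agent's critic
# is conditioned on the incentives it receives from other agents, avoiding
# second-order gradients through other agents' learning updates.
import torch
import torch.nn as nn

class IncentiveAwareCritic(nn.Module):
    def __init__(self, obs_dim: int, n_agents: int, hidden: int = 64):
        super().__init__()
        # Input: the agent's observation plus the incentives it received
        # from the other (n_agents - 1) agents this step.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_agents - 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, received_incentives: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, received_incentives], dim=-1))

# Example: value estimate for one agent given its observation and the
# rewards handed to it by the other two agents.
critic = IncentiveAwareCritic(obs_dim=8, n_agents=3)
obs = torch.randn(1, 8)
incentives = torch.tensor([[0.5, 0.0]])  # incentives received from the other two agents
value = critic(obs, incentives)
```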

Correctness: The claims and methods are sound and rigorous.

Clarity: The paper is well-written and easy to follow.

Relation to Prior Work: The discussions are clear enough.

Reproducibility: Yes

Additional Feedback: I like this paper, but, as I mentioned above, I would strongly suggest that the authors include more benchmarks (e.g., social influence) and results with more agents (e.g., 5 or 7 agents in cleanup, as the social influence paper did) to make the paper stronger. I can further increase my score if my concerns are addressed.
=============== After Rebuttal ===============
I have checked the comments and have increased my score based on the additional experiments with the Social Influence baseline and with more agents. I strongly encourage the authors to include these two additional experiments and add some discussion of scalability in the final version to make the paper much stronger.


Review 2

Summary and Contributions: The authors extend multi-agent social dilemmas to include a channel for sharing reward. They develop a new agent that, in addition to learning to optimize its own reward, learns an influence function to shape the behavior of others. The authors compare this approach to other agent-shaping techniques (LOLA in particular) and show empirically that, when these new agents are allowed to influence each other by transferring reward, they reliably achieve a high collective reward.

Strengths: The paper is clearly written. The new toy environment (Escape Room) is a great pedagogical and testing tool for demonstrating the power of this approach. The experiments are well executed. The quantitative and analytic results are clear, and the qualitative analyses of the influence functions give good insight into the learned policies. I thought the detailed comparison to LOLA was well done and helps situate these results within that line of work. While the idea that side payments can enable more robust cooperation in repeated games is well known, the authors nicely demonstrate that naive implementations fail to realize the full potential of these methods. The introduction motivates these challenges well.

Weaknesses: I'm not sure how broadly interested the NeurIPS community will be in these results. I would like to see a greater attempt to explain specific ways these techniques could be used in a more scaled-up context. What might the currency of reward be in the real world (or even a simulated game world)? Which assumptions made in LIO would not hold or would require relaxing? Why did the agent start rewarding cleaning up but miss it at 40K+ time steps? It is at least worth speculating about in the paper.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The rebuttal improves an already strong submission. Thank you for the response.


Review 3

Summary and Contributions: The paper proposes a framework where agents can shape other agents’ behaviors by directly rewarding other agents. The authors separate the task policy from the reward-giving policy. Each agent learns its own incentive function by accounting for its impact on the learning of the recipients and, through them, the impact on its own extrinsic objective. The task policy is learned via RL. In experiments, agents seem to divide into selfish agents (“winners”) and selfless agents (“cooperators”). Cooperators seem to learn not to send incentives to winners, to avoid the cost of sending an incentive.
UPDATE AFTER AUTHOR RESPONSE: I have read the author response and think that it sufficiently clarifies some of the questions I had. I am happy to keep my score as is.
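For concreteness, here is a rough sketch of the bi-level structure described in the summary above (my own illustration, not the authors' exact algorithm): the recipient updates its task policy on environment reward plus received incentives, and the giver updates its incentive parameters by differentiating its own extrinsic return through that update. Scalars stand in for the full policy and incentive networks; all names are hypothetical.

```python
# Illustrative toy sketch of the bi-level incentive update (not the paper's code).
import torch

theta = torch.tensor(0.0, requires_grad=True)  # recipient's task-policy parameter
eta = torch.tensor(0.0, requires_grad=True)    # giver's incentive-function parameter
lr_theta, lr_eta = 0.1, 0.1

def recipient_objective(theta, eta):
    # Toy stand-in: environment reward plus the incentive sent by the giver,
    # both written as differentiable functions of the parameters.
    env_reward = -(theta - 1.0) ** 2
    incentive = eta * theta
    return env_reward + incentive

def giver_extrinsic_return(theta):
    # Toy stand-in for the giver's own extrinsic objective, which depends on
    # how the recipient behaves (i.e., on theta).
    return -(theta - 2.0) ** 2

# Inner step: recipient's simulated gradient update, kept in the graph.
inner_grad = torch.autograd.grad(recipient_objective(theta, eta), theta, create_graph=True)[0]
theta_new = theta + lr_theta * inner_grad

# Outer step: the giver differentiates its extrinsic return through theta_new w.r.t. eta.
outer_grad = torch.autograd.grad(giver_extrinsic_return(theta_new), eta)[0]
with torch.no_grad():
    eta += lr_eta * outer_grad
```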

Strengths:
+ The ability to directly influence other agents via reward-giving is a novel improvement over opponent-shaping algorithms like LOLA and SOS. I suspect that this paper will be cited by many multi-agent learning works in the future.
+ The paper is clearly written.

Weaknesses:
- It would be helpful if the paper’s definition of “decentralized” were stated explicitly in the main text rather than in a footnote. Another way of defining “decentralized” is that agents do not have access to the global state and actions of other agents during both training and execution, whereas LIO appears to use such information.
- Systematically studying the impact of the cost of incentivization on performance would have been a helpful analysis (e.g., for various values of \alpha, what are the reward incentives each agent receives, and what is the collective return?). It seems that the roles of “winners” and “cooperators” emerge because the cost of rewarding the other agent becomes high for the cooperators. If this cost were lower, the roles would presumably be less distinct, causing the collective return to be much lower.
- In Figure 5d, more explanation of why the Winner receives about the same incentive as the Cooperator to pull the lever would be helpful; it doesn’t match how the plot is described on lines 286-287.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: This is perhaps the most relevant paper I have seen (and it has not been cited): Inducing Cooperation through Reward Reshaping based on Peer Evaluations in Deep Multi-Agent Reinforcement Learning by Hostallero et al. AAMAS 2020. Since this paper is recent, I do not think it would be fair to expect this work to be one of the baselines in the experiments. But I do think it is worth discussing how the current work is different from Hostallero et al. in the Related Works section.

Reproducibility: Yes

Additional Feedback: