Four knowledgeable referees reviewed this paper. After conducting initial reviews, reading the authors’ rebuttal (which resolved some concerns, but not the core concerns of two of the reviewers), and discussing the paper, the reviewers did not agree on an outcome. Two of the reviewers concluded that this is a ground-breaking paper (simple and elegant). The other two reviewers were perhaps somewhat intrigued, but did not feel the paper was yet ready for publication.

For example, during the discussion phase, R4 (a very accomplished and well-respected researcher in the field) made valid points about the paper’s weaknesses: “So all this leads me to suggest that there needs to be a better context, more related work and a better way to situate the paper in related arenas, e.g., provide some sort of a framework to back up the findings. I understand the issue of limited space, but given the amount of literature in this area, I feel that the paper doesn’t do a good enough job explaining its findings in context.” and “This paper is different from other topics (e.g., papers looking at fairness in ML) in that this topic of pro-social behaviors in PD/IPD has seen decades of previous work including in multiagent systems, it is important in my view to situate this work better in context and provide more theoretical basis. The Kahn & Murnighan (1993) paper also discusses uncertainty in rewards. There are other papers that have discussed such uncertainty in PD/IPD such as the one I mentioned.”

I read the paper in detail and considered the reviewers’ comments and the authors’ rebuttal. I can see how experts can have differing opinions about the extent and validity of the paper’s contributions when viewed from different perspectives. From the perspective of getting deep RL algorithms to cooperate (including various forms of reciprocity and “team formation”) in repeated prisoner’s dilemmas, this potentially represents a nice achievement. And RUSP *seems* to be simple and quite compelling (though it isn’t at all clear to me how robust it is). From this perspective, the two reviewers in favor of the paper rightly appreciated the method’s success in producing compelling behavior in several domains.

Yet not everything is about deep RL, which has repeatedly been shown to be a tool with extreme limitations in multi-agent reinforcement learning outside the context of zero-sum games. As R4 points out, the study of how rational agents learn cooperative behavior (and reciprocate, form teams, etc.; a question studied extensively for many years in many disciplines, including AI and the NeurIPS community) is not just about current RL methods. When the paper is viewed from this broader perspective, it is (though interesting) somewhat dissatisfying. The approach does not seem to be theoretically justified, nor do the presented results confirm the claims: they provide some evidence, but they do not thoroughly evaluate the strengths and weaknesses of the approach or its robustness. From this perspective, reviewers can reasonably worry that accepting the paper could open up a firestorm of misinformation that could side-track the progress of the field.

Overall, I think the approach and results described in the paper are compelling and could have a good impact on the field, so I believe it could be accepted at NeurIPS. That said, I hope the authors will exhibit honesty and care as they present and ground their claims and results in the final version of the paper.
In particular, I strongly urge the authors to provide satisfactory *context* and *a theoretical basis or argument* for their approach and results. Without such context and theoretical basis, the paper risks coming across as unprincipled hackery. [In saying all this perhaps overly bluntly, I do not wish to demean the paper's approach or results in any way. I simply hope the authors will take all of the reviewers' comments seriously in order to improve the final version of their paper.]

====

A couple of other points I had after reading the paper:

- Like R1, I do not find the Oasis results convincing; they seem to indicate that the method might not achieve what is hoped. The mean reward and total deaths appear to be the same in all games, and I didn’t quite understand the authors’ conclusions from the results presented. R1 noted that they hoped the authors would improve these results (though that is a somewhat worrisome comment to me, as it suggests a need to fine-tune and tweak results until the desired outcome is achieved rather than demonstrating generalizable knowledge).

- Some of the presentation is confusing. I’m not sure what a “training” iteration represents, nor was the setup of the games fully clear to me. Perhaps I just missed the explanation in the main paper, or it is covered in the appendix (a description in the main paper seems desirable to me, though I understand space limitations).