Review for NeurIPS paper: Combining Deep Reinforcement Learning and Search for Imperfect-Information Games

NeurIPS 2020

Combining Deep Reinforcement Learning and Search for Imperfect-Information Games

Review 1

Summary and Contributions: The paper describes a framework for self-play reinforcement learning and search in imperfect-information games. The main idea is to extend to imperfect-information games some known techniques for the perfect information setting via the notion of public belief space. The authors show that, under some assumptions, the algorithm converges to a Nash equilibrium in two-players zero-sum games. Experimental results show that the algorithm achieves good performances in standard benchmarks for imperfect-information games.

Strengths: The paper sets an interesting new direction for equilibrium-finding techniques in imperfect-information games. It presents a solid set of both theoretical a experimental results. The proposed framework consists of several ideas which are not new per se. However, I think it was not obvious how to effectively combine components from the perfect/imperfect information worlds (in theory and in practice).

Weaknesses: The proposed framework is mainly putting together existing concepts (such as the notion of PBS) and showing how and why the framework can actually work. I believe the framework is not radically changing the way we think about RL in imperfect-information games.

Correctness: I haven't checked the proofs but claims seem reasonable and the experimental evaluation is sound. Theorem 1 is not entirely clear at the present state. Could you provide more intuition on what it implies in practice?

Clarity: Yes. The paper is (almost always) clear and easy to follow.

Relation to Prior Work: Yes. The authors do a good job in explaining the relation between their work and the existing literature.

Reproducibility: Yes

Additional Feedback: Do you have any intuition or additional experiments on how this framework would behave in settings with more than two players? I think this would be an interesting direction for future research.

Review 2

Summary and Contributions: This paper presents a novel framework of combining deep reinforcement learning and search for imperfect information games. I believe this is an interesting work. The authors developed a new poker AI called ReBeL and evaluate the method on heads-up no-limited Texas Hold'em. The experimental results show superhuman performance although they only play against one top human.

Strengths: The authors introduced a novel framework on RL+search two-player zero-sum imperfect-information games. They claim a theoretical proof on safe search techniques and introduce a new subgame decomposition technique CFR-AVG, which is a stronger version of CFR-D.

Weaknesses: why not play against more top human on poker. It's hard to say ReBeL achieved superhuman when only playing against one top human. HUNL is really a large-scale imperfect information game. However, testing the algorithm on standard benchmark poker, such as Leduc Hold'em, will be necessary. There is a high doorsill of entering the community on imperfect information game and reproducing algorithms on large-scale game. Open source on standard small-size game will be helpful.

Correctness: Do you test Algorithm 1 using a warm-starting policy? What's the performance? I wonder what's the performance on multi-players imperfect information games? Is the metric LBR is comparable across different methods?

Clarity: The notations in Section 3 are difficult to follow. Are you work defined on simultaneous-moving game? The transition function T is defined on joint actions. Do you mean the system will return a state after all agents taking actions? It's hard to build the connection in poker. please give a brief introduction on Section 4. Typos: line 24: one-ply -> one-play?

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: ################################### After reading the response: The authors have answer part of my concerns. Thanks.

Review 3

Summary and Contributions: The paper introduces a general RL+Search algorithm called ReBel, which can converges to a Nash equilibrium in two-player zero-sum games. Rather than the needs of expert domain knowledge in AlphaZero and other RL+Search algorithm, ReBeL can achieve an effective performance with much less expert knowlegde. The main contributions of the paper can be concluded into three points: 1. Proved the connection between PBS gradients and infostate values. 2. The paper proposed an effective RL+Search algorithm, ReBeL, in imperfect-information games 3. Fictitious Linear Optimistic Play

Strengths: 1. According to the papre, ReBeL is a novel method to deal with two-player zero-sum imperfect-information games. And it may be able to be used to solve other inperfect-information problem. 2. The paper did a very detailed describtion to the POMDP's definition and the correctness of the RL+Search method theoretically. 3. The mathematical proof shows that learning V1 with the search algorith in the paper is sufficient. 4. The method is novel and the experiment design is typical, according to result, the ReBeL has a good performance in HUNL and TEH games with limited expert knowledge. 5. And the domain of this paper, multi-agents RL in imperfect-information, has high relevance to NIPS. 6. The experiment compared the module with human player, which is a strong evidence of the exploitability of ReBeL.

Weaknesses: 1. The paper only studied two-player situation, the performence of ReBeL in multi-player games is still to be confirm. 2. There are not enough comparison with other relative method in Section2. 3. The author only did experiment on two typical games, the ReBeL's performance on more complex problem. Especially when the game has bigger depth which will cause huge inputs of the value and policy function. 4. The theortical proof had only been considered two-player situation.

Correctness: According to the best of my knowledge after reading this paper, I believe the mathematical proof is correct and the empirical methodology makes sense.

Clarity: The paper is written with a clearly goal, and well explained the reason to use PBS, CFR-AVE and so on. And each parts of ReBeL module are introduced with rich details, the experiment design, hyper-parameters setting are also included.

Relation to Prior Work: The method in this paper is mainly compared with AlphaZero. And the paper indeed included a lot of prior work, but I think there should be more detailed comparsion with prior work.

Reproducibility: Yes

Additional Feedback: This work proposed a effective method to deal with two-player zero-sum imperfect-information games, and I suggest the author do more test on more complex multi-agent games. Also, the performance in games except zero-sum type still remains to research. It's also mentioned that ReBeL can work on perfect-information games, because it's a special casr of imperfect-information games, doing some experiments to compare ReBeL with prior method will be more convincing.

Review 4

Summary and Contributions: This work presents a novel method for solving imperfect-information zero-sum games, using less expert knowledge with respect to previous work. The method extends the current RL+Search paradigm to imperfect-information games by employing Public Belief States (PBS) to represent both the game state portion accessible by all players and probability distributions over private states pertaining individual players and hidden from the others. The method is validated on three imperfect-information games: Heads-up no-limit Texas hold'em, Turn endgame hold'em, and Liar's dice. Experiments demonstrate competitiveness with state-of-the-art algorithms on several benchmark tasks, including games against a top human poker player.

Strengths: - The method allows to extend applicability of RL+Search, up to now successful on perfect-information games only, to imperfect-information ones with two players - The method makes use of less expert domain knowledge with respect to state-of-the-art poker agents - The method is suitable for time-efficient decision making (i.e., interactive play against a human) - The method is validated against several benchmarks, including a top human player, showing potential for superhuman performance - Theoretical contributions pertaining key aspects of the proposed method are also introduced. See App. A for an exhaustive list. - The paper is well-written

Weaknesses: - More in-depth discussions of the experimental results may be beneficial - The significance of the claim on super-human performance after playing against a single human (although top player) may require further validation, especially for a high-variance game like poker

Correctness: - Proofs on the main theoretical contributions of the paper are provided, and the method follows a sound rationale - The empirical methodology is well-presented in Section 6 and in the Appendix - The experimental results for games against a single human top player seem robust enough for indicating potential for superhuman performance. They may not be sufficient for strongly concluding about superhuman performance in general.

Clarity: The paper is very well-written, introducing all relevant concepts with a convincing gradation of complexity and rich examples. The Appendix is also well-structured and provides extensive details.

Relation to Prior Work: Related works are, to my knowledge, extensively presented and discussed. The contribution is clearly identified with respect to the state of the art.

Reproducibility: Yes

Additional Feedback: Comments to rebuttal: The authors satisfactorily addressed the few doubts and questions I posed in my positive review. I confirm my score and recommend acceptance. ================= Comments: - The use of $\Delta$ for indicating probability distributions on sets (e.g. of infostates or actions) was not immediately clear to me at a first encounter. I do not know whether it can be considered standard notation, but it may be useful to introduce it explicitly. - A few words on the choice of the exploitability metric in the experiments may be useful for better interpretation of the results. This may be done either in Sec. 6 or in the App. Typos - L171: at public state --> at the public state