Reviews: Learning to Correlate in Multi-Player General-Sum Sequential Games

The originality: It is interesting to learn the coarse correlated equilibrium of general-sum extensive form game with multi-player setting, and to reveal the computational complexity of the problem is important. From the technical point of view, some proof techniques are not so novel. For example, main parts of Theorem 3 follow the previous work [15]. The quality: Theorem 4 shows that the time complexity of Algoirhm2, which is a subroutine of CFR-Jr, and the time-complexity of CFR-Jr is not provided. In addition, there is no theorem that shows how good \epsilon CFR-Jr can provide. The clarity: It is well-written. The paper puts some appropriate examples and intuitive explanations. The significance: Theorems 2 and 3 seem to be important for machine learning and game theory communities. I cannot find how important Theorems 4 and 5 are. Below, I describe some questions mainly for the quality and the significance: (i) Does Theorem 3 mean that CFR-S does not necessarily converge to the optimal CCE? If so, it would be better to show the bound between the optimal CCE and a CCE to which CFR-S converges because an objective of the paper is to find a social-welfare-maximizing CCE. (ii) Theorem 4 provides a time complexity about the reconstruction. What is the time-complexity of CFR-Jr? (iii) At line 7 in Algorithm 2, if we put a pair (\bar{\sigma}_i,\bar{w}) and X already includes the same \bar{\sigma}_i, how do we build x_i from X? (iv) I cannot find if CFR-Jr approaches to the set of CCEs because Theorem 5 only indicates the relation between the regret and CCEs. There is no theorem showing how small \epsilon CFR-Jr can obtain. Thus, I want to know, given the number of iteration T, can we bound \epsilon parameterized by T? (v) In the experimental evaluation, how do you find the optimal social welfare SW_{Opt}?

Reviewer 2

I thoroughly enjoyed this paper. The paper contributes important results and algorithms with guarantees that are clearly the new state-of-the art. I have two relatively minor criticisms: - The algorithm CFR-Jr is rather straight-forward, as in, it is the obvious/natural way to get CFR to converge to approximate CCE, and the guarantees are a straight-forward consequence of properties of regret minimization. - CFR-S is the most unnatural CFR variant I have ever read. Why would one sample profiles if the values need to be updated in portions of the space that are not reachable anyway? One would think to resort to sampling because the problem is too large and to cut down the state space, but my understanding is that all the information states require updating even if they are not reached in the joint sample, which is why Theorem 3 can hold without any exploration. Is there a use case for this algorithm, when would it be preferred to CFR-Jr? There is oddly no discussion of EFCE. Does a similar notion not exist for CCE? [19] shows that dominated actions are removed by CFR. i would assume this is preserved in CFR-Jr; is that true? Or is it only be a property of the marginal strategies computed by CFR? If it's true, does it imply that the normal form CCEs filter out dominated strategies as well? It would be nice to see more graphs like the one in Figure 3 (across more of the domains), as it shows what the convergence and welfare looks like over time. If accepted for publication, I would encourage the authors to add more of these to the appendix.

Reviewer 3

**Originality** The complexity result is new, but the proof is quite similar to previously known results. The fact that the average of the strategies produced by CFR converges to CCE is well known and the goal of the algorithmic part of this paper is mainly to compactly represent the average of the strategy profiles from individual iterations. The transformation of behavioral to mixed strategy, which is the key component of the proposed method is interesting and I am not aware of a similar construction in previous literature. **Significance** Computing social welfare maximizing CCE is a relevant problem and proving its complexity is a reasonably significant result. I am unsure about the significance of the algorithmic results. The most natural version of the algorithm I could think of is to run CFR for T iterations and store the current behavioral strategies form all the iterations as the result. This should not require more than T|Z| storage. When playing, the mediator would just pick a random iteration and have the players play according to (a sample from) the behavior strategies from that iteration. The added value of CFR-Jr introduced in this paper seems to be only in transforming the result to a slightly more standard form. Furthermore, it is not obvious that the more standard representation of the strategy does not require more storage space than the naïve algorithm. The comparison of the sizes would better be shown theoretically, but it should be evaluated at least empirically. Similarly, I do not see any point in the CFR-S algorithm. It does not seem to be better than the naive solution or CFR-Jr in any aspect, so I am not sure what is the point in presenting it. It should be either clarified or the whole discussion of CFR-S should be removed. The experimental comparison of runtime is not surprising, since it compares a polynomial and exponential algorithm. However, the evaluation of the distance from the social welfare optimum is interesting. **Quality** The paper seems to be correct. My main concern is how the optimal CCE is computed in cases where CG did not finish and why are the results not reported for Kuhn poker even though all algorithms finished in time. The size of the final strategy representation should be reported in the experiments. **Clarity** The paper is reasonably clear and easy to read. However, there are several drawbacks: The paper is not very well motivated with respect to the naive algorithm described above. It is not very clear that the only concern for the algorithm is the compact representation of average strategy. It is not clear how the approximation of the social welfare is measured. **Questions** How is the optimal CCE computed in cases where CG runs for >24h. Is it still used? Why is the more exact time not reported? What is the typical size of the final solution in relation to the worst case, which seems to be T|Z|^2? What is the size of the normal formed mixed strategy profile compared to the original behavior strategy profile? ** After Rebuttal ** Thank you for your response. It helped to clarify my main concerns. I still believe that CFR-S is not worth the space in the paper and the paper should rather very carefully analyze the size of the computed solutions, at least to the extent of the rebuttal response. If this discussion is added, I would be for accepting the paper.

Paper ID:	7160
Title:	Learning to Correlate in Multi-Player General-Sum Sequential Games

Reviewer 1

Reviewer 2

Reviewer 3