Reviews: Multi-Agent Common Knowledge Reinforcement Learning

My two biggest complaints center on 1) the illustrative single-step matrix game of section 4.1 and figure 3 and 2) the practical applications of MACKRL.

1) Since the primary role of the single-step matrix game in section 4.1 is illustrative, it should be much clearer what is going on. How are all 3 policies parameterized? What information does each have access to? What is the training data?

First, let's focus on the JAL policy. As presented up until this point in the paper, JAL means centralized training *and* execution. If one is in a setting where centralized execution is practical, then JAL can be used and should perform at least as well as MACKRL. The advantage of MACKRL is that it can be applied where decentralized execution is required. Therefore, the advantage of MACKRL is in wider applicability, not in performance on equal footing. So what is happening in Figure 3, middle? As far as I can reverse engineer from the results, the "JAL" model here can only leverage common knowledge; that is, it only knows the matrix game when the common knowledge bit is set to 1. It does not receive access to the second, independent coin flip that the other two models have access to. I find this quite misleading. Intuitively, in a JAL model, any knowledge that *one* agent has should be available to *all* agents; that is, all knowledge is common, since execution is centralized. That is clearly not what was done here. This is important, because as presented, the implication is that MACKRL is better than JAL in a direct comparison, whereas I believe the advantage of MACKRL is in being able to be applied in settings where JAL cannot.

Second, let's focus on the IAC policy. It is confusing that performance increases as p(common knowledge) increases, since independent agents cannot leverage this. Of course, the reason is that, as p(common knowledge) increases, so does p(independent knowledge). This is a confusing representation.
I believe the results would be better presented if parameterized such that these two numbers varied independently. That is, one could imagine flipping coins for the two agents independently, and then with probability p letting them look at each other's coins. Then, the probability that an agent knows the game is decorrelated from whether there is common knowledge, and I think the results would be more clearly presented this way.

In any case, as parameterized, the IAC results are still strange. For example, why do the IAC agents not achieve 100% performance when p(common knowledge)=1? In this case, the game is always revealed to both agents. Of course, they do not "know" the other agent knows the game, but even treating the other agent as part of the environment, learning an optimal strategy here should be easy. For example, in game 1, the agents could blindly coordinate on choosing action 1 (or 5), and in game 2, one would "specialize" in action 1 and the other in action 5. This can be learned without modeling the other agent. So why didn't the IAC agents learn this? Well, they probably would have if they were *only* trained with p(common knowledge)=1. I have to assume the authors trained the same agents over the whole range of p(common knowledge), so the task has a sort of hidden meta-learning element that isn't stated explicitly. It should be.

2) I agree with the authors that common knowledge is an interesting and important concept for coordinating multi-agent learning. However, as presented, it seems somewhat brittle. First, for any (pseudo-)stochastic elements of the (hierarchical or primitive action) policies, the agents must have access to a shared random seed. But how is this coordinated? If all agents are deployed at the same time and either perform the same number of draws from the random number generator or have access to a synchronized clock, depending on how the RNG is implemented, then this should work fine.
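As a minimal sketch of the synchronized case just described, assuming each agent holds its own Python `random.Random` generator initialized with the same seed (the seed value and the `sample_actions` helper are hypothetical and not taken from the paper), identical seeds and identical draw counts yield identical samples:

```python
import random


def sample_actions(seed, n_steps, n_actions=5, extra_draws=0):
    """Draw one pseudo-random action index per step from a private RNG.

    `extra_draws` models an agent that has already consumed some random
    numbers before the steps of interest (hypothetical desynchronization).
    """
    rng = random.Random(seed)  # each agent owns its own generator object
    for _ in range(extra_draws):
        rng.randrange(n_actions)  # discard draws made "earlier"
    return [rng.randrange(n_actions) for _ in range(n_steps)]


SHARED_SEED = 1234  # hypothetical value agreed on before deployment

# Two agents deployed together, making the same number of draws.
agent_a = sample_actions(SHARED_SEED, n_steps=10)
agent_b = sample_actions(SHARED_SEED, n_steps=10)
in_sync = agent_a == agent_b  # same seed and draw count give the same samples

# An agent that has made one additional draw sees the same stream, shifted.
agent_c = sample_actions(SHARED_SEED, n_steps=10, extra_draws=1)
reference = sample_actions(SHARED_SEED, n_steps=11)
is_shifted = agent_c == reference[1:]
```

The `extra_draws` argument illustrates how sensitive this is: the agents agree only as long as their draw counts match exactly.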
However, one imagines that, in reality, agents will often be deployed at different times and perhaps execute actions, and therefore random number draws, at not quite exactly synchronized intervals. Thus, the assumption of access to shared random number draws could quickly break down. That said, if this is the biggest challenge, perhaps that isn't so bad, because communicating a single scalar among a group of (mostly) decentralized agents doesn't sound too challenging in many domains. Still, I wish this had been addressed in the paper.

Second, the bar for common knowledge in this setting seems quite high, since it is assumed to be infinitely recursive and certain; that is, knowledge that I 100% know that you 100% know that I 100% know and so on. Of course, many (most?) forms of common knowledge fall short of this. The knowledge may be finite-order (e.g. I know that you know, but not beyond that) or held with uncertainty. The authors try to address the latter with Figure 3 right, but it's not clear to me what I should learn from this. Performance seems to degrade to the level of independent learners after just 10% corruption. Also, I don't know what their "correlated sampling" approach is, and the authors don't explain it in the main text (another sentence or two in the main text would be nice). I would like to have seen independent agents *infer* what other agents know (both from direct observations and reasoning about behavior) and then act upon that (probabilistic) common knowledge.

Some conceptual clarifications:

3) Figure 1 - agent C can see both A and B and see that they see each other. Can agent C use the knowledge that A and B may/will make a joint decision in its own decision? As I understand the framework presented, the answer is "no", but could it be expanded to allow for this?

4) line 142 - I can imagine situations where, conditioning only on common knowledge, a pair of agents might choose a joint decision.
However, one agent might have private knowledge that makes it clear they will do better acting independently. For example, perhaps two agents are working together to lift a rock to retrieve food (the rock is too heavy for just one agent alone to lift, so they must coordinate). They are looking at each other and the intersection of their fields of vision is the small space between them. Conditioning on this shared field of view, continuing the present activity is the best decision. However, a man-eating tiger is creeping up behind one of the agents, in full view of the other agent. It is clear the two agents should instead run or defend themselves in some way, but since the tiger's presence is not yet common knowledge, the greedy operation of the MACKRL hierarchical policy will probably choose the joint action. Am I understanding this correctly? Should MACKRL include an "independent override" ability for agents to reject the joint action and take an independent one when the expected value difference is very high?

5) line 86 - the setting here is fully cooperative with shared reward and centralized training. Could a variant of MACKRL be useful when agents' incentives are not fully aligned?

More minor comments and questions:

6) paragraph 1 - capitalization of "Joint action learning" and "Independent Learning" should be matched
7) sentence on lines 36-38 - would be nice to have an example or citation
8) line 36 - "... MACKRL uniquely occupies a middle ground between IL and JAL..." - This seems way too strong. Plenty of methods fall in this middle ground - anything that models other agents explicitly instead of treating them as part of the environment (e.g. Raileanu et al 2018 Modeling Others using Oneself in Multi-Agent Reinforcement Learning, which should probably also be cited).
9) lines 80-81: why are action spaces agent-specific but not observation spaces?
(U_env has an "a" superscript, but not Z)
10) line 94 - \citet -> \citep
11) lines 97 and 101 - please be clearer about what kind of mathematical object tau_t^G is. It is, for example, unclear what \geq means here.
12) line 163 - another heuristic one might leverage is greedy pairwise selection based on spatial proximity
13) line 173 and eqn 1 - why is the state value function conditioned on last actions as well? This doesn't seem standard.
14) eqn 1 - J_t should be introduced in words.
15) Figure 4 - number of runs shouldn't be in the legend since it is the same in all cases. Remove it from the legend and put it once in the caption.
16) Figure 5 - in either the caption or main text discussion, the authors might want to comment on the total number of partitions the "full" version is choosing from, to place the restricted selector version in context.
17) line 295 - missing a space between sentences
18) citations - Rabinowitz et al 2018 Machine Theory of Mind focuses on inferring what agents know from observing their behavior. Relevant to inferred common knowledge.

UPDATE: Thanks to the authors for their helpful rebuttal. In particular, thanks for clarifying the matrix game example; I think the presentation is much clearer now. I've raised my score from a 6 to a 7.

Paper ID:	5259
Title:	Multi-Agent Common Knowledge Reinforcement Learning
