| Paper ID | 4443 |
|---|---|
| Title | A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning |

The approach works for cases where a given agent-task pairing defines a fixed policy, so it is hierarchical in nature. It formulates planning as a constrained agent-task matching problem with singleton and pairwise terms, optimized by a constrained LP or QP. The formulation is scalable (polynomial) and generalizes to larger variants of the scene.

The novelty:
-- The pairwise agent-task formulation is similar to [22], but focused on agent-task matching in the policy network. Two different scoring schemes are proposed (Direct and PEM), the latter of which is somewhat novel.
-- The idea that, through structured matching, training on smaller tasks generalizes to larger tasks, together with evidence to this effect.
-- Experiments showing the intuitive result that, for matching problems, QUAD or LP combined with the simpler DM features is superior to PEM features. PEM has higher expressivity, but does not learn as well and does not generalize as well.

Rather compelling videos of the resulting strategies are shared in the supplementary material. The paper is clear, all details are described, and code is shared.

It would help to elaborate on why exactly AMAX-PEM and QUAD-PEM did not converge. Is that a result of the optimization, of the model collapsing, or both?
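To make the matching formulation in this review concrete, the following is a minimal sketch, not the authors' implementation. It scores an assignment as a sum of singleton (agent-task) terms and pairwise terms and picks the best one by brute-force enumeration, which stands in here for the paper's LP/QP solver and is only feasible for tiny instances. The functions `unary_score` and `pairwise_score` are hypothetical hand-written stand-ins for the learned scoring models; agents and tasks are assumed to be points on a line.

```python
from itertools import product


def unary_score(agent, task):
    # Hypothetical stand-in for a learned agent-task score: closer is better.
    return -abs(agent - task)


def pairwise_score(task_i, task_k):
    # Hypothetical stand-in for a learned pairwise term:
    # penalize two agents that pick the same task index.
    return -1.0 if task_i == task_k else 0.0


def best_assignment(agents, tasks):
    """Exhaustively search assignments (one task index per agent).

    Brute force replaces the LP/QP solver for illustration only.
    """
    best, best_value = None, float("-inf")
    n = len(agents)
    for choice in product(range(len(tasks)), repeat=n):
        value = sum(unary_score(a, tasks[j]) for a, j in zip(agents, choice))
        value += sum(
            pairwise_score(choice[i], choice[k])
            for i in range(n)
            for k in range(i + 1, n)
        )
        if value > best_value:
            best, best_value = choice, value
    return best, best_value


# The same scoring functions apply unchanged to a larger instance.
print(best_assignment([0.0, 10.0], [1.0, 9.0]))            # ((0, 1), -2.0)
print(best_assignment([0.0, 5.0, 10.0], [1.0, 4.0, 9.0]))  # ((0, 1, 2), -3.0)
```

Because the scores depend only on per-pair inputs, the second call reuses them on more agents and tasks without any change, which is the structural reason the review gives for generalization from smaller to larger scenes.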

Originality: In the context of multi-agent collaboration (MAC) problems, the idea of learning a scoring model and then using it as input to a quadratic or linear program to perform task assignment may be original.

Quality: Results are adequately presented and analyzed.

Clarity: The paper is fairly well written. I think enough information is provided in the paper and supplementary material to allow the results to be reproduced.

Significance: The work presented is limited in scope to MAC-type problems, but has some useful real-world applications.

This paper proposes a new multi-task hierarchical reinforcement learning algorithm. The high-level policy assigns tasks by solving a linear programming problem (or a quadratic programming problem), and the low-level policy is pre-defined. The biggest contribution of this paper is that it removes the limitation on the number of agents and the number of tasks by modeling the multi-task assignment problem as an optimization problem, which is based on the correlation between agents and tasks and the correlation between tasks. After the correlations are learned on a simple task, one only needs to re-solve the optimization problem on the complex task, without retraining, thus achieving zero-shot generalization.

Contributions:
1. The collaboration patterns between agents in the multi-task problem, such as creating subgroups of agents or spreading agents across tasks at the same time, are expressed as constraints added to the optimization problem corresponding to the high-level policy, thereby achieving better collaboration between agents.
2. In order to learn the relationships between agents and tasks, and between tasks, this paper proposes a variety of learning models. These models are trained with a reinforcement learning algorithm that treats the correlations as a continuous action space in an MDP.
3. Because they introduce different degrees of inductive bias, the relationship learning models differ in fitting ability and generalization ability. Through rich experiments, this paper verifies how combinations of these models with the different optimization problems affect the algorithm's fitting ability and generalization ability.

However, there are some problems in this paper. First, the paper focuses on the generalization ability of the reinforcement learning algorithm. The conclusion in the paper shows that, in order to have better generalization ability, a model with more inductive bias cannot be used. Does this affect the final performance of the algorithm after it has been transferred to a new task? Does the algorithm have the ability to continue training on new tasks? Can the algorithm reach the performance of current state-of-the-art algorithms after continued training?

Second, the positional embedding model proposed in this paper uses the same embedding learning method as [35]. The only difference is the subsequent use of these embeddings. The paper should explicitly mention the relationship with [35] in this section.

Finally, this paper introduces correlated exploration to enhance the exploration of the agents in the multi-agent multi-task environment, but the effectiveness of this method is not verified in the experiments.

There is also a small mistake in the paper. The title of [25] was changed to “Learning when to Communicate at Scale in Multiagent Cooperative and Competitive Tasks”.