Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The authors propose a Fair-Efficient Network to better train decentralized multi-agent reinforcement learning systems on tasks that involve resource allocation. In particular, they introduce a shaping reward and a hierarchical model, which they train with PPO on three new reinforcement learning environments (the code for which is made available). Their model outperforms several baselines, and ablation studies demonstrate the usefulness of the hierarchical nature of the model.

The aims of the work are clear and well-stated. However, there are significant omissions in the review of related literature. Several papers have studied fairness in the context of common resources in multi-agent reinforcement learning prior to this work, namely https://arxiv.org/abs/1803.08884 and https://arxiv.org/abs/1811.05931, and related works cited therein. Although this impacts the originality of the paper, the methods used to generate fairness are different in this work, so with an improved literature review that draws contrasts to the previous work, the paper could be greatly strengthened.

However, the choice of fair-efficient reward appears fairly arbitrary. The equation in Section 3.1 could be replaced by many other options that also satisfy the criteria of Propositions 1 and 2. This is a weakness of the work, and the authors would do well to present a cogent argument for the functional form chosen.

The hierarchical model is the greatest strength of the paper. The authors derive an interesting information-theoretic optimization goal for the sub-policies. The results in Figure 6 are particularly striking. Indeed, it would be interesting to see whether merely using the hierarchical model in conjunction with some of the other baselines obviates the need for the fair-efficient reward structure. Comments or experiments in this direction would strengthen the paper.

In general there are some infelicities in wording which could be ameliorated by a proof-read.
Moreover, the first two pages are fairly repetitive and could be condensed. On the other hand, the descriptions of the experiments are clear and concise. More details of the hyperparameters and seeds chosen for the reinforcement learning training and models should be provided before publication for the purpose of reproducibility. Error bars and confidence intervals are provided in the results, but currently without sufficient explanation.

=== Response to authors:

I was impressed by the response of the authors. They have clearly taken the reviewers' feedback into account and make cogent arguments for the benefits of their method. They have also provided comparisons against prior work and demonstrated the improvements that their work can bring. Moreover, it is now much clearer to me how both the hierarchy and the intrinsic motivation are beneficial and indeed complementary. Therefore I have increased my score by 2 points, and argue for the acceptance of this paper.
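For concreteness, the "fairly arbitrary" reward criticized above could take many shapes. Below is one illustrative functional form with the properties the review describes (an efficiency term that grows with average utility, discounted by each agent's deviation from the group mean). This is a hedged sketch with assumed constants `c` and `eps`; it is not the paper's actual Section 3.1 equation, merely one of the "many other options" a reviewer might have in mind.

```python
import numpy as np

def fair_efficient_reward(utilities, i, c=1.0, eps=0.1):
    """Illustrative fair-efficient shaping reward for agent i.

    Combines an efficiency term (the group's mean utility, scaled by an
    assumed constant c) with a fairness penalty (agent i's relative
    deviation from the mean). The specific form here is a sketch, not
    the equation from the paper under review.
    """
    u_bar = np.mean(utilities)                      # efficiency term
    if u_bar == 0:
        return 0.0
    deviation = abs(utilities[i] / u_bar - 1.0)     # fairness penalty
    return (u_bar / c) / (eps + deviation)
```

Note how the reward is maximized when the agent's utility equals the group mean, which is one way (of many) to satisfy joint fairness/efficiency criteria such as those in Propositions 1 and 2.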
Summary: The authors propose a novel HRL algorithm (named FEN) for training fair and efficient policies in MARL settings. They design a new type of reward that takes both efficiency and fairness into consideration. FEN contains one meta-controller and a few sub-policies. The controller learns to optimize the fair-efficient reward, while one sub-policy is trained to optimize the external reward (from the environment) and the other sub-policies provide diverse but fair behavior. They show that their method learns fair and efficient behavior at the same time and outperforms relevant baselines on 3 different tasks: job scheduling, the Matthew effect, and manufacturing plant. They also show that the agents achieve Pareto efficiency and that fairness is guaranteed in infinite-horizon sequential decision making if all agents play optimal policies (that maximize their own fair-efficient reward).

Strengths:
- the paper is generally clear and well-structured
- the proposed approach is novel as far as I know
- I believe the authors address an important topic in multi-agent reinforcement learning: designing agents that are both efficient and fair at the same time. I think people in the community would be interested in this work and could build on top of it.
- their method has some theoretical guarantees, which makes it quite appealing. It is also grounded in game-theoretic aspects.
- the approach is thoroughly evaluated on 3 different tasks and shows significant gains in fairness without losing in overall performance.
- the authors do ablation studies to emphasize the importance of using a hierarchical model and also the effectiveness of the Gossip version of the model.

Weaknesses:
- the paper is missing some references to other papers addressing fairness in MARL, such as Hughes et al. 2018 and Freire et al. 2019, and other related work on prosocial artificial agents, such as Peysakhovich et al. 2018.
- the paper could benefit from comparisons against other baselines using fairness and prosocial ideas, such as the ones proposed by Hughes et al. 2018 and Peysakhovich et al. 2018
- I find the use of "decentralized training" to be not entirely correct, given that the agents need access to the utility of all (or at least some, in the Gossip version) agents in order to compute their own rewards. This is generally private information, so I wouldn't consider it fully decentralized. While the Gossip version of the model, which only uses the utilities of neighboring agents, helps to relax some of these constraints, the question of how these rewards can be obtained in a setting with truly independent learners remains. Please clarify if there is a misunderstanding on my part.

Other Comments:
- it wasn't very clear to me from the paper what happens at test time. Is it the same as during training, in that the meta-controller picks one of the policies to act?
- it would be interesting to look at the behavior of the controller to gain a better understanding of when it decides to pick the sub-policy that maximizes external reward and when it picks the sub-policies that maximize the fairness reward.
- in particular, it would be valuable to understand how the different types of policies are balanced and what factors influence the trade-off between the sub-policy maximizing external reward and those with diverse fair behavior (i.e. current reward, observation, training stage, environment, etc.)
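To make the test-time question above concrete, the hierarchy the review describes (a meta-controller that periodically selects one of several sub-policies, with one sub-policy dedicated to external reward) can be sketched as follows. All names, shapes, the linear stand-in policies, and the selection interval are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class FENSketch:
    """Illustrative sketch of a meta-controller over sub-policies.

    Sub-policy 0 is imagined as the one maximizing external reward;
    the others stand in for the diverse-but-fair sub-policies. Linear
    maps replace the actual neural networks for brevity.
    """

    def __init__(self, n_subpolicies=4, obs_dim=8, n_actions=5, interval=10):
        self.interval = interval  # controller re-selects every `interval` steps (assumed)
        self.controller_w = rng.normal(size=(obs_dim, n_subpolicies))
        self.sub_w = rng.normal(size=(n_subpolicies, obs_dim, n_actions))
        self.current = 0
        self.t = 0

    def act(self, obs):
        # At the controller's timescale, pick which sub-policy acts next.
        if self.t % self.interval == 0:
            self.current = int(np.argmax(obs @ self.controller_w))
        self.t += 1
        # The selected sub-policy produces the environment action.
        logits = obs @ self.sub_w[self.current]
        return int(np.argmax(logits))
```

Logging `self.current` over an episode in such a setup is exactly the kind of controller-behavior analysis the comment above asks for.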
This paper assumes that the reward of each agent is independent of the others and that the overall reward is additive. This limits its applicability to more general multi-agent systems in which multiple agents share the reward of common goals.

This work uses the coefficient of variation (CV) of agents' utilities to measure fairness. However, it is not clear how such a fairness measure is achieved with the proposed fair-efficient reward. There should be a formal proof that the CV is minimized given the decomposition of the reward. Propositions 1 and 2 are uninformative.

It is not clear why the switch of sub-policies is necessary. As stated in the beginning of Section 3.2, if other agents change their behaviors, an agent may need to perform different actions at the same state to maximize its fair-efficient reward. This seems incorrect, because all the agents are trained by the same algorithm and so should already be coordinated.

It is true that this is a multi-objective optimization problem with fairness and efficiency. However, the agents must eventually make a choice that balances fairness and efficiency, so these two objectives could be combined directly and the controller would be unnecessary.

The decentralized training is only for one agent. It is not clear how agents with policies trained in a decentralized fashion can coordinate to achieve fairness.
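For reference, the coefficient of variation mentioned above is simply the standard deviation of the agents' utilities divided by their mean; lower is fairer, and 0 means all agents received identical utility. A minimal sketch (using the sample standard deviation, one common convention; the paper's exact normalization may differ):

```python
import numpy as np

def coefficient_of_variation(utilities):
    """CV of agents' utilities: std / mean (0 = perfectly equal)."""
    u = np.asarray(utilities, dtype=float)
    mean = u.mean()
    if mean == 0:
        return 0.0
    # ddof=1 gives the sample standard deviation; population std (ddof=0)
    # is an equally plausible convention.
    return u.std(ddof=1) / mean
```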