NeurIPS 2020

Scalable Multi-Agent Reinforcement Learning for Networked Systems with Average Reward

Review 1

Summary and Contributions: This paper presents the counterpart of the results of Qu, Wierman and Li (2019) for average reward MARL (the work of Qu et al. (2019) discusses discounted reward MARL). The main theoretical result is Theorem 1. The paper also presents a Scalable Actor-Critic Algorithm based on the theoretical results.

Strengths: The theoretical results seem to be sound. The topic should be of interest to some researchers in MARL.

Weaknesses: The main weakness in my opinion is the significance of the work. While the setting of discounted reward MARL sounds natural and readily acceptable to me, I would like to be convinced that average reward MARL can model many meaning situations. Unfortunately arguments along this line are not strong in the paper. Even for the example of APs and users in Section 4, I do not see why it should be an average award case. Further comments after rebuttal: I have read the authors' responses and reviews from other reviewers. I believe the authors could have included the points in their rebuttal in the submission. Actually, I wish to see even stronger justifications than the authors have written in the rebuttal.

Correctness: Yes generally. However, it is disappointing to see that the authors do not explain the conclusions from the experiment, particularly Figure 2, in the main text. I agree that space constraints are always an issue for NIPS paper :-), but I believe that readers should not be required to read the appendices and supplementary materials unless they want to know the tedious details or read the complicated proofs.

Clarity: No really. The abstract is a bit misleading. The significance of the work (consideration of the average reward cases) and the results (theorems 1 and 2) could have been more clearly not stated and argued.

Relation to Prior Work: It would be more interesting to compare and contrast the new results in this paper with the counterparts in the work of Qu et al. (2019).

Reproducibility: Yes

Additional Feedback: I hope this paper can be rewritten so that the main points, in particular its significance, can be emphasised.

Review 2

Summary and Contributions: The paper introduces scalable actor-critic (SAC) method for networked MARL where agent's values are dependent on the local interaction with nearby agents. It aims to maximize the global average expected reward per time step instead of the more popular RL objective of maximizing expected discounted reward. It is shown that by exploting exponential decay property SAC learns near optimal individual policy for optimizing the average reward and its complexity scale with state-action space size of local neighbourhood. POST REBUTTAL: I have read other reviews and authors response and they have addressed my comments. I am convinced with their explanation about the difference between underlying framework of the proposed method and that of mean field RL. Also, there are very few MARL literature which deals with average reward settings. Overall, I am retaining my score.

Strengths: The novelty of the paper is to provide a scalable learning method for average reward settings with guarantee of small performance loss. The work is relevant to a number of real world applications such as social networks, communication networks, transportation networks etc. Following are the highlights of the paper - The problem formulation is clear and despite having so many variables in the proofs, the mathematical notations are wisely chosen and are unambiguous. - Error bound is provided on the truncated Q-function using exponential decay property. - Policy gradient is approximated using gradients of localized policy. - The error between true gradient and approximated gradient is shown to be bounded. - It is also shown that the exponential decay property holds true in case of discounted reward under the assumptions. - Convergence guarantee has been provided for the proposed SAC algorithm - Experiment result on a wireless network with multiple access points show that SAC significantly outperforms a baseline based on localized ALOHA protocol. The proofs appears to be correct and I liked the way few assumptions have been used to provide theoretical guarantees. Overall I think the paper should be accepted for publication.

Weaknesses: - Why J(\theta) has been deducted from immediate reward to compute for the Q-values (line 136)? A citation or explanation will be helpful. - To compute approximate gradient (Algorithm 1, line 7), it is assumed that agents have access to Q-values of neighbours, which is not clearly mentioned in the paper. - ALOHA protocol is used as baseline which seems too simplified to me. Comparision with other learning algorithm suited for average reward setting will make the results more convincing.

Correctness: The technical content of the paper appears correct to me. The authors have used exponential decay property nicely to show that the Q values and gradients can be efficiently approximated.

Clarity: The paper is well written and is easy to follow.

Relation to Prior Work: Authors have discussed about competitive MARL and independent learning in multi-agent setting and have explained that their work is relevant to cooperative MARL with a more structured setting where each agent has its own state that it acts upon. Also the work deals with maximizing expecetd average reward which is different than other existing work dealing with discounted reward. I feel discussion on existing work which deals with local interaction of agents (for example mean field RL) should be included.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper is concerned with interaction graphs that factor the transition model in multi-agent RL. The authors extend the set for which the Exponential Decay Property holds (Th.1) and prove tighter bounds for these MDP (Co.1). Based on these insights, they propose a tabular average reward variant of the Scalable Actor-Critic algorithm (confusingly named SAC) and prove that the it converges in the limit exponentially with the considered graph-distance $\kappa$. The algorithm is evaluated in a simple multiple access point experiment, which shows that SAC outperforms localized ALOHA, and that considering at least one hop ($\kappa=1$) improves over decentralized value functions ($\kappa=0$). The paper appears to be a straight extension of Qu et al. (2019) to the average reward case, with the exception of Theorem and Corollary 1, which seem to be genuinely novel. POST-REBUTTAL: I have read the reviews and author response. The authors have addressed my questions in the most part sufficiently. In my view Theorem 2 does not really depend on Theorem 1: it assumes the exponential decay property, which Theorem 1 only widens. But if the authors are willing to clarify this, I am satisfied. Based on the rebuttal I raised my score to 7.

Strengths: The discussion of the average reward case of interaction graphs is quite interesting and Theorem/Corollary 1 appears a valuable contribution. The paper contains extensive additional analysis in the supplementary material.

Weaknesses: The reviewer has not read Qu et al. (2019) and did not check the detailed proof, but at first glance Theorem 2 appears somewhat derivative, as it apparently does not make use of the insights of Theorem 1. Furthermore, while tabular methods are an excellent way for theoretical analysis, the paper would have been much stronger with a neural network implementation of Algorithm 1.

Correctness: The reviewer only had time to check the proof of Theorem 1, which is mostly correct. The theorem requires the additional assumption $\bar r \geq 0$, though, as the term can not be moved out of the absolute operator otherwise. As the authors attached 22 pages of proofs and further analysis, it might be better to submit to a journal where reviewers generally have more time checking proofs.

Clarity: The paper is well written, but it is not easy to see the immediate relationship between sections 3.1 and 3.2. It should also clarify the differences to Qu et al. (2019) better, who also define an algorithm called SAC. The authors should make explicit which parts of Theorem 2 are from prior publications and introduce a unique name for the presented method. On a side note, SAC also stands for Soft Actor-Critic (, one of the most popular algorithms in deep RL, which increases the confusion. Lastly, Figure 5 should be in the main paper (instead of Figure 1), as it adds substantial insight (even though it is the same insight as Figure 1 in Qu et al., 2019).

Relation to Prior Work: The related literature section discusses a wide range of related works, but the reviewer was missing current works on collaborative MARL with neural network approximation. For an overview, the authors are referred to: Oroojlooy Jadid, A. and Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. CoRR,abs/1908.03963, 2019. URL abs/1908.03963 The reviewer was also missing a discussion of the similarities to Sparse Collaborative Q-learning: Kok, J. R. and Vlassis, N. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7(Sep):1789–1828, 2006.

Reproducibility: Yes

Additional Feedback: The broader impact statement could be more fleshed out: which assumptions would be violated, e.g., in a smart infrastructure system? What could go wrong if SAC is still applied?

Review 4

Summary and Contributions: This paper is concerned with the scalability of multiagent reinforcement learning. It proves that local interactions and exponential decay can be exploited in order to build scalable MARL methods for networked systems with local neighbor approximation. It proposes a scalable actor-critic method with convergence guarantee, and performs numerical experiments in a wireless network environment.

Strengths: This paper proposes novel formulations of MARL in a graph with localized property and exponential decay, and identifies a general condition so that the exponential decay holds. Both of them come with concrete proof. This paper also proposes a scalable actor-critic method based on aforementioned theory, with convergence guarantee.

Weaknesses: While this paper stresses the scalability and richness of the proposed method, the numerical experiment, howeer is very limited and only contains one wireless network scenario with 5x5 grid. the number of users are not large scale and also no baselines are provided in order to compare the performance. As such, it is very difficult to assess the main claim from the paper.

Correctness: Whiel the method is derived from previous theortical work and constains certain novelty theoretically, the empirical experiment is, however, inadequate and does not support the claim due to the following reasons: 1) lack of strong baselines, suggest to compare the mean-field based solution (such as Gu et al 2020), which is certainly efficient in the grid setting. should also include indepent learner and factorised methods. 2) the tasks chosen are limited. check Yang, Yaodong, et al. "Mean field multi-agent reinforcement learning." ICML18 for varous more challenging tasks. 3) I will be keen to see more thorough ablation studies on such as a) varous network structures b) number of agents.

Clarity: mostly clear. a minor comment: 1) Eq. 2 is miss-leading. It is not multiagent formulation but rather a single agent formuation as it assumes there is a global optimizer who sees everything and optimimzes the joint policy \theta.

Relation to Prior Work: Yes. some important relevant work on mean-field approach to multiagent learning is missing. The paper should compare with them both theoretically and experimentally.

Reproducibility: Yes

Additional Feedback: In the numerical experiment, is it possible to simply increase the size of the grid (like 500x500 or even larger) to show the scalibility of the proposed method and compare it against other strong baselines such as mean-field approaches?