NeurIPS 2020

Cooperative Multi-player Bandit Optimization

Review 1

Summary and Contributions: The paper presents a decentralized algorithm for finding stationary points in cooperative multiagent problems, with no gradient information.

Strengths: The problem domain is very general, and could be of interest to a very wide community.

Weaknesses: There are a few steps in the long argument of correctness that might need to be expanded for clarity.

Correctness: Does this work in practice, or are there are numerical issues? The algorithm doesn't seem so complicated that it can't be tested. How does it do in some test domain, like one of the three example problems described? This work very much feels like it would be improved through empirical validation. line 159: doesn't variance in this gradient estimate also approach infinity as delta decreases? Maybe include a quick argument for why this isn't a problem. I tried a tiny experiment doing gradient descent on x^+y^2 using this gradient estimate, and it was extremely sensitive to step size and delta, with many choices leading to floating point overflow, or no progress. Are the perturbations being independently generated across agents? Expanding the argument of [1], sketched in the paragraph at line 156, to the distributed case would seem to require coordination of Z_n across the agents and the perturbation in \tilde(x). Otherwise, there would be two random variables -- one for the perturbation used in the other agent's utility computation, and another in my gradient estimate -- and my expection of the would seem to be 0. How does Lipschitz factor L fit into the Theorem 1 bounds? Given gradients are estimated from past points, there must surely be some be L-based upper bound on eta_0 to limit how quickly u_m can have shifted in the M steps.

Clarity: The paper is generally well written: the problem and the general approach taken is clear. It is possible to follow the outline of the correctness argument.

Relation to Prior Work: Looking at a random sample of the prior work cited in the paper, they seem relevant. I'm not sufficiently close to this field to know if there are significant works that should have been cited.

Reproducibility: Yes

Additional Feedback: The word "optimal", for example as used in the abstract "converges with probability 1 to the optimal action profile" seems like the wrong choice to describe a stationary point, when used without further qualifiers. Lipchitz -> Lipschitz line 159, 160: seems like this should be Z_n(t)/delta(t)u_m(x+delta(t)Z(t)) or some other indication of a distinction between \tilde{x} and x. line 169: "where all players' gradients are zero." This might be more clear if it is instead stated as "the gradient of u_n is zero with respect to x_n". line 199: "This is described in Step 2(d) and Step 3(a)" Step 2(e)? ---- Comments after feedback Thank you. Adding some experimental results is helpful, and adding the same clarifications in the response to the paper would make for a much easier read. I share some of the concern about M as an unknown parameter. It should be completely clear that one either needs to know M, or be able to choose a large M based on the graph size, or just select some estimate M and accept that one would get results, unless the graph process was sufficiently adversarial. Possibly just a quick sketch of how one might proceed, given some unknown interaction graph.

Review 2

Summary and Contributions: This paper considers a co-operative multi-agent multi-bandit setup for distributed optimization with limited communication. In the particular setup, there are N agents who are co-operatively trying to optimize the sum of their utility function. At each time-step, each of the agents takes an action based on the history of feedback they have received. After playing the action they observe the reward (which is a function of their action and other people's actions) but not the individual actions itself. Additionally, they can gather more information (such as the action played by other players) using communication. There is a communication graph among the players which tells how each of the players can communicate, which is an ergodic Markov chain. At time t, if there is an edge between two players, this implies that they can communicate their actions at this time-step. The goal is to design a distributed learning algorithm that mazimizes the total utility. The authors design a bandit-algorithm for this problem and show that this algorithm converges to the best optimal value obtained by a gradient ascent algorithm that can be used in the presence of full-feedback. In particular, if the utility function is concave, gradient ascent algorithm converges to the global optimal and thus, their distributed algorithm also converges to the global optimal with probability 1.

Strengths: The strengths of this paper are as follows: (1) The paper is theoretical. Bandit feedback about the actions played by the other players makes the problem challenging. Every agent has to make an estimate about the set of actions played by the other players. Moreover, not all agents communicate the action set they have played. Even if they communicate, it may not reach immediately will only reach after a few time-steps based on the graph. Thus, the algorithm needs to be sophisticated and account for this delay. Moreover, due to the nature of this graph some agents' action might come more often than certain other agent's and the algorithm needs to account for this bias. They do this by considering a version of the IPS on the prob. of communication using the past samples. (2)The problem is interesting and well-motivated. In the single-agent setting with no-interaction, this problem has been studied in both the full-feedback and the bandit-feedback models. In the setting with multiagent interaction, previously only an algorithm was known when the objective function (i.e., the utility function) was already known. Thus, this work fill the gap in the literature for the bandit feedback. (3) This work is relevant to the NeurIPS community since optimization is a central topic. This work combines areas of multi-armed bandits, zeroth order optimization and cooperative game theory each of which is central to this community with multiple applictions.

Weaknesses: The main weakness of this paper stems from the fact that it is not well-written. As a result I think this is not yet ready for publication. Here are particular comments that makes the paper weak in its current form. (1) Some of the mathematical statements are not fully rigoros. First, in definition 1, the evolution of the communication graph is completely unclear. The notation is too confusing (and probably meaningless as written). The variable G is I think overloaded. What does the notation \union_{G \in S_G} G even mean? S_G is a set of states. Then isn't \union_{G \in S_G} G just S_G? Please make this definition precise. Without this the communication model is not clear. (2) Along those lines, how does G(t) and G(t+1) differ? Is it the case that the transition probability of going from graph at time t and graph at time t+1 has no restrictions except for the fact that the markov chain over the set of graphs should be ergodic? And given a state at time t it can go to t+1 and the algorithm does not know about it? In general, to me the significance of the communication graph changing over time as a Markov chain does not seem well-motivated and a needless complication. The central idea is that the communication graph needs to be connected and eventually an agent will get the actions of agents through transitivity. I think the presentation can be simplified considering a static connected graph and then may be expanded to the more general notion with precise mathematical formulation. Right now, in the pursuit of considering a fully general model, the basic idea is not clear/precise. (3) The main theorem statement is not precise either. First, it states that it holds for some large M. There is not quantification on how large it should be. The idea is also that it should be smaller than N? (this doesn't seem right to me.) So then what is the range of M? In particular, why should M be smaller than N. What if the algorithm just considers all the past N rounds? In the simpler static graph setting it seems to be the right thing to do, since by then you can learn all the actions of an agent from N steps ago. So the algorithm will be lacking "reality" by N. And if N << T, this shouldn't really hurt. I guess going from a static graph to a time-evolving graph it might be the case that for some pair of agent the probability that they will exchange information may be tiny and thus, the expected number of times they should wait till they communicate is large. This is why it is critical to state the lower-bound on M! (4) Finally, the main theorem opnly shows that it converges with probability 1. One of the usual quantities od interest for bandit algorithms is also the rate of convergence. Can you characterize this in the theorem statement?

Correctness: The paper has some lack of rigor in some of the definitions/theorem statements (see above). I believe that eventually things may work out and be correct, but I am unable to assess it at this point.

Clarity: Please see my response in the weakness section. I think the paper is not written clearly for a reader to fully grasp all the mathematical details. In addition there are also a couple of typos. Page 2 line 77: "a the" -> "the" Page 3 line 115: "algorithm each player" -> "algorithm where each player" (In general that statement is too long and the grammar seems sketchy. May be break it into two sentences?)

Relation to Prior Work: Yes, the paper clearly describes the relationship to prior work and how the contirbutions in this paper compare. Overall, based on the prior work, the problem considered is novel.

Reproducibility: Yes

Additional Feedback: POST Rebuttal: Thanks a lot for answering my concerns and questions in detail! I would highly appreciate it if this was also clarified and reflected in the final version of the paper. Having said that, there are some more concerns that should be addressed in the paper. In particular, if G(t) is unknown, it is not clear how one would choose M. So the algorithm is not well-defined as stated. I see one of two possible ways to fix this and make the algorithm precise. First, either assume that a lower-bound on M is known but not the graph generating process itself. This is still a weaker assumption that knowing G(t), but not practical. Second, assume that G(t) is known. This does not take away the main contribution of this paper and results. Or may be state that either of these needs to to be assumed for the algorithms to work. A discussion of this in the final version will make this paper much more clear! Having said this, I overall think the ideas in this paper are valuable and interesting. To this end, I am increasing my score to a 6, with the hope that my concerns above are sufficiently addressed in the final version.

Review 3

Summary and Contributions: The paper contributes an algorithm and a proof of convergence for a multi-agent problem setting with bandit feedback and interaction between agents. Specifically, the paper assumes a twice differentiable reward function, convex action sets, a random time-varying communication graph that follows an ergodic Markov chain and show that the proposed algorithm converges to a stationary point of gradient ascent with probability one.

Strengths: The paper both proposes a problem setting that is novel and carries significant practical relevance and proposes an algorithmic solution to this problem setting. The paper is well-written.

Weaknesses: The algorithm only necessarily achieves a global optimum if the utility function is concave. Given the difficulty of the multi-agent setting, this "weakness" is reasonable. The theory does not cover convergence time. Small things: - Typo: Line 76 Our approach - Function definition on line 91 is awkward, u_n(x_1, \dots, x_N) is a value --- not a function. There are no experiments.

Correctness: The reviewer did not check the proofs.

Clarity: Yes.

Relation to Prior Work: Yes, the paper does a good job at clarifying how the problem setting being addressed differs from problem settings addressed in existing work.

Reproducibility: Yes

Additional Feedback: Although the reviewer lacks familiarity with the most closely related work and does not make an attempt to assess the difficulty of the proofs, the reviewer is willing to defend a high assessment for the submission because of the importance of the problem setting and the apparent absence of existing algorithms addressing it (see Table 1). The reviewer remains open to the rebuttal of the authors and the opinions of the other reviewers. POST REBUTTAL: The reviewer lowered the assessed score in light of the some nuance about picking M when G(t) is unknown. The reviewer maintains the opinion that the submission deserves to be accepted.

Review 4

Summary and Contributions: The paper considers a team of cooperative players that act in a networked environment, and whose goal is to learn to play together the action profile that maximizes the sum of their local rewards. The caveat is that players cannot observe the actions or rewards of other players, and can only receive information about the rewards of its neighbors in the environment graph, which is modeled as a time-varying graph following an ergodic Markov chain. The authors introduce a novel distributed algorithm that manages to converge to a stationary point with probability 1, even if the players take noisy gradient steps that are based on incomplete information about the rewards of their peers in the environment graph.

Strengths: 1. The authors deal with a novel setting and propose a novel distributed algorithm with convergence guarantees. The work is solid and contains an interesting theoretical result with a technical proof. 2. The analysis distinguishes between two different types of bias. The first is the gradient bias and noise due to the finite sampling radius; the second is the communication bias and noise which results from the time-varying graph and the fact that a player is more likely to interact more frequently with its peers in the graph. The authors discuss a very interesting interplay between the two types of bias, and shows how to address it. 3. The paper is very well written. The main submission discusses the setting in good detail, provides motivating examples, and gives an overview of the algorithm by discussing its different steps. The heavy math details are contained in the appendix. As a result, readers who may not be familiar with the mathematical tools can still understand the main text, get to know what this work is about, and realize its challenges.

Weaknesses: 1. The setting is perhaps not totally realistic. The time-varying graph seems totally fine and is true in many practical applications. On the other hand, in most applications if we know the reward of the peers, then it may be likely that we also know the actions they took. Having only access to the reward but not the action of the peers may not be so natural. 2. It would have been nice to have a simulation/experimental validation of the convergence of the algorithm, even in small or medium-sized graphs. For instance, ti would be nice to see how fast the algorithm converges in simple, controlled settings.

Correctness: I was not able to follow all proofs, but the ones I did looked correct to me. Lemma 3 on the communication bias and noise seems to be more involved than the other results.

Clarity: As I mentioned in the paper strengths, I think the paper is very well written and easy to follow, even by non-expert readers.

Relation to Prior Work: Related work is discussed, and the authors also explain how their work is positioned in the broader literature by providing a high-level taxonomy.

Reproducibility: Yes

Additional Feedback: 1. The fact that the time-varying graph follows an ergodic Markov chain is quite convenient for the proof. I was wondering whether it would have been possible to use a different assumption on the graph and still manage to achieve convergence (not necessarily a weaker or a stronger assumption, just different). --UPDATE-- The rebuttal addressed my concerns. With that said, some reviewers raised concerns about M when G(t) is unknown. The authors should shed light on this point in the updated version.