Reviews: Decentralized Cooperative Stochastic Bandits

The authors propose a decentralized algorithm for stochastic bandits. The key idea is to use information that is mixed to build the upper confidence interval of each arm. The regret bound of their algorithm equals the optimal regret for N agents plus a term depending on the spectral gap of the communication graph, and the communication cost is O(K) per round for each agent. Besides, their empirical results also show the good performance of their algorithm. I am concerned about a tighter lower bound relevant to the graph structure. I mean, the graph structure may cause communication latency, which results in a higher regret. However, the lower bound proposed in the paper doesn't indicate the dependence on graph structure, which is derived by comparing with the centralized UCB algorithm for TN steps. I believe that a tighter lower bound relevant to graph structure can indicate whether the regret can be further improved. I also wonder whether the communication cost can be improved. In DDUCB, each agent needs to communicate with other agents every round. As a result, the communication cost is linear in T. Is it possible to communicate only in certain rounds without loss of the regret? For example, for a single-agent stochastic MAB problem, we can only update the mean rewards and counter of an arm when the number of pulling that arm is doubled. This will not cause much loss of the regret. I don't know whether this idea can be used to reduce the communication complexity. Overall, I believe this paper is interesting and well-organized, and I vote for accepting it.

Reviewer 2

This paper studies the decentralized stochastic bandits, where N players can pull any arm of their choice at each time step. They then receive a corresponding reward (two players pulling the same arm can receive two different rewards in this setting) and can communicate after each time step through a given (possibly unknown) communication graph. This problem has already been studied, especially by Landgren et al., who used a gossip based algorithm. The authors here propose an improved gossip algorithm which yields a lower regret while requiring less prior knowledge of the graph. The paper uses interesting techniques such as gossip acceleration. Unfortunately, this work presents several drawbacks: - The results seem incremental as the used techniques are similar to those of Landgren et al. - Despite the authors claims, I am not convinced that this new algorithm brings a significant regret improvement (explained below). - I have a general feeling that some parts of the paper are unnecessary complex. They bring a lot of confusion without being significant for the results (or while they can be rewritten in a simpler way). For these different reasons, I think that this paper can still be improved. ------------------------------------------------------ Major comments: 1. Theorem 3.2.6 (given in appendix) gives the same asymptotic upper bound for the unaccelerated version of DDUCB. Then, what is the significance of acceleration in DDUCB ? From the proof, it seems that it will only improve the regret by some constant factor. This point is the main example of unnecessary complexity I mentioned above. If the acceleration only improves the regret by some constant, I think it is not worth the complexity it adds to the paper, and it could be delayed to the appendix (and bring the simpler unaccelerated version to the main text). All of this is supported by the experiments, where the unaccelerated version is comparable to the accelerated one. 2. What I understand from the paragraph at the beginning of page 5 is that the novelties of this paper compared to the work of Landgren et al. are the following: - the acceleration technique (which is not so useful if theorem 3.2.6 is correct) - the observations are not directly mixed to the statistics but we wait some time instead - the mathemetical tools for the proof of the upper bound are different This is why I consider this work incremental, especially if the acceleration is not so useful. 3. My other point is that I am still not convinced of the significance of improvement compared to the previous work. This point is discussed in Remark 3.4: the authors claim they significantly improve the incurred regret. But when looking at the details of their claim (see Appendix C), they show it by considering that in a cycle graph, the communication matrix P will be 1/2 above and below the diagonal and 0 elsewhere. But why would we consider such a matrix P ? Intuitively, the matrix with 1/3 below, above and on the diagonal will give better communication. After looking at "On Distributed Cooperative Decision-Making in Multiarmed Bandits" , it indeed seems that they will choose 1 - kappa on the diagonal and kappa/2 above and below for some kappa in (0, 1]. They do not seem to precise which kappa to choose then, but I do not believe that 1 is then optimal in the cycle graph case. Also, which P is used in the experiments ? With the given parameters and the P you mention for cycle graphs, I believe the difference in regret should be larger between coop-UCB and DDUCB for the cycle graphs. Furthermore, the two values for N (200 and 225) are too close to really observe the behavior with N. ------------------------------------------------------------------ Minor comments 4. Page 3 line 133. You say that the size of each message could be more than poly(K). This is true for an arbitrary d but you then claim that d is actually set as \sqrt{K}, so it is in poly(K). Furthermore, you forgot an important additional complexity in the cited paper: there are delayed feedbacks. 5. page 4 line 155. Do you have a practical justification of why you do not want to build global structures ? To me, this would be the optimal way to build a decentralized algorithm in your setting, so a solid justification of why we do not want it (in practice) would be welcome. 6. page 4 line 163: this is not really an interval. This is actually the upper bound of the confidence interval. 7. Page 4: don't we need P to have positive coefficients as well ? 8. Page 5 line 194: this is for v in the simplex. I don't think m/N and n/N are in the simplex. The link between the two points is not clear here. 9. typo Page 5 line 195 : "run 2K running consensus" 10. For algorithm 2, I think it would be easier to describe it using directly the coefficients of q_C instead of the w here. We here need to read the appendix to know the link between q_C and w, so I think it would be better to directly put q_C here and refer to the appendix for its coefficients. This unnecessarily complexifies the algorithm to me. 11. Also, the index i seems useless for w here (and in the appendix, you add an index k to w, which also seems useless). 12. Page 7 line 285: why is there a +1 ? Here it is unclear whether this is a heuristic to give some intuition of what the lower bound should be or a proof of what the lower bound is. 13. Page 8: I find the paragraph "Variants of DDUCB" very dense and unclear for some parts and this part does not really enter in the structure of the paper. I think it can be shortened/delayed to the appendix. The details could then be made clearer in the appendix, and you would save some space in the main text. 14. Page 1 supp: is the s_t line 482 the same as line 486 ? I believe yes but this is unclear 15. I do not understand what is shown in Section D of the appendix. This section is very confusing, and needs to be rewritten in a clearer way. 16. It seems there are format errors in the pseudocodes of the appendix 17. How did you choose the parameter gamma of coop-UCB for your experiments ? 18. I think it could be good for the reader to add (in the appendix) a table, giving examples of the used communication matrix P and the corresponding lambda_2 for some classical graphs. ------------------- In their answer, the authors pointed out the regret improvement brought by the acceleration method and thus answered to my major concern. They also answered to some of my other concerns (no global structure, the regret is significantly better than Landgren et al.) and I thus decide to raise my score in consequence. I hope the authors will take into account all the reviewer comments for the camera ready version (in case of acceptance).

Reviewer 3

Paper ID:	2550
Title:	Decentralized Cooperative Stochastic Bandits

-- In general, distributed algorithms are measured in terms of the size of messages or by the number of messages. This is not the case here: the only measure taken into account is the regret. I do no understand why the nodes then just not send a vector of O($K$) reals that corresponds to some estimation of the reward of arms. Why can every agent not broadcast the following information: for every agent i (if he knows it), for each arm a , the number of times arm a has been selected by agent i, plus estimation of the reward for a. If these informations are available (and of course updated) why matrix P is required? One must take into accounts the cost of communications. Without these taken into considerations, this is very hard to compare the various algorithms. -- the paragraph that starts at line 227 (page 5) should be rewritten. -- How matrix P is computed? Is that expected to be some prior knowledge of the network, or should it be computed in distributed manner? If so, it would be good to consider also the related communication costs (evaluate the quantity of informations that will transit). -- Page 4, line 190

''See [37] for a proof and for a discussion on how to choose matrix P ''. Seeing the role played by the matrix P in this article, the intuition about its meaning should be explained in the current article, otherwise the current article is of very weak interest. I hence disagree that the current paper refers to other articles about how it is built and chosen. -- The work related to experimentation is very limited (even with the details in appendix). There is no explanation why the algorithm has good performances compared to other algorithms. My opinion is that the topologies that are used are graphs whose nodes have mainly a same type of neighborhood. It would have been good to understand the performance with respect to the ratio

''number of vertices/diameter`''. Other comments/questions just listed: --- In the experimental part, the topologies are very regular graphs and that means that matrix $P$ is very regular (an probably too regular). Would it be possible (if the paper is accepted) to do some experimentations on less regular graphs. --- Are the agents supposed to have distinct identifiers? If this is not the case, I do not understand how the algorithm is working. There exist some impossibility results: for example : Julien M. Hendrickx, John N. Tsitsiklis: Fundamental limitations for anonymous distributed systems with broadcast communications. Allerton 2015: 9-16

Reviewer 1

Reviewer 2

Reviewer 3