{"title": "Decentralized Cooperative Stochastic Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 4529, "page_last": 4540, "abstract": "We study a decentralized cooperative stochastic multi-armed bandit problem with K arms on a network of N agents. In our model, the reward distribution of each arm is the same for each agent and rewards are drawn independently across agents and time steps. In each round, each agent chooses an arm to play and subsequently sends a message to her neighbors. The goal is to minimize the overall regret of the entire network. We design a fully decentralized algorithm that uses an accelerated consensus procedure to compute (delayed) estimates of the average of rewards obtained by all the agents for each arm, and then uses an upper confidence bound (UCB) algorithm that accounts for the delay and error of the estimates. We analyze the regret of our algorithm and also provide a lower bound. The regret is bounded by the optimal centralized regret plus a natural and simple term depending on the spectral gap of the communication matrix. Our algorithm is simpler to analyze than those proposed in prior work and it achieves better regret bounds, while requiring less information about the underlying network. It also performs better empirically.", "full_text": "Decentralized Cooperative Stochastic Bandits\n\nDavid Mart\u00ednez-Rubio\n\nDepartment of Computer Science\n\nUniversity of Oxford\n\nOxford, United Kingdom\n\ndavid.martinez@cs.ox.ac.uk\n\nVarun Kanade\n\nDepartment of Computer Science\n\nUniversity of Oxford\n\nOxford, United Kingdom\nvarunk@cs.ox.ac.uk\n\nPatrick Rebeschini\n\nDepartment of Statistics\n\nUniversity of Oxford\n\nOxford, United Kingdom\n\npatrick.rebeschini@stats.ox.ac.uk\n\nAbstract\n\nWe study a decentralized cooperative stochastic multi-armed bandit problem with\nK arms on a network of N agents. 
In our model, the reward distribution of each\narm is the same for each agent and rewards are drawn independently across agents\nand time steps. In each round, each agent chooses an arm to play and subsequently\nsends a message to her neighbors. The goal is to minimize the overall regret of the\nentire network. We design a fully decentralized algorithm that uses an accelerated\nconsensus procedure to compute (delayed) estimates of the average of rewards\nobtained by all the agents for each arm, and then uses an upper con\ufb01dence bound\n(UCB) algorithm that accounts for the delay and error of the estimates. We analyze\nthe regret of our algorithm and also provide a lower bound. The regret is bounded\nby the optimal centralized regret plus a natural and simple term depending on the\nspectral gap of the communication matrix. Our algorithm is simpler to analyze than\nthose proposed in prior work and it achieves better regret bounds, while requiring\nless information about the underlying network. It also performs better empirically.\n\n1\n\nIntroduction\n\nThe multi-armed bandit (MAB) problem is one of the most widely studied problems in online learning.\nIn the most basic setting of this problem, an agent has to pull one among a \ufb01nite set of arms (or\nactions), and she receives a reward that depends on the chosen action. This process is repeated\nover a \ufb01nite time-horizon and the goal is to get a cumulative reward as close as possible to the\nreward she could have obtained by committing to the best \ufb01xed action (in hindsight). The agent only\nobserves the rewards corresponding to the actions she chooses, i.e., the bandit setting as opposed to\nthe full-information setting.\nThere are two main variants of the MAB problem\u2014the stochastic and adversarial versions. In this\nwork, our focus is on the former, where each action yields a reward that is drawn from a \ufb01xed unknown\n(but stationary) distribution. 
In the latter version, rewards may be chosen by an adversary who may be aware of the strategy employed by the agent, but does not observe the random choices made by the agent. Optimal algorithms have been developed for both the stochastic and the adversarial versions (cf. [9] for references). The MAB problem epitomizes the exploration-exploitation tradeoff that appears in most online learning settings: in order to maximize the cumulative reward, it is necessary to trade off between the exploration of the hitherto under-explored arms and the exploitation of the seemingly best arm. Variants of the MAB problem are used in a wide variety of applications ranging from online advertising systems to clinical trials, queuing and scheduling.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In several applications, the “agent” solving the MAB problem may itself be a distributed system, e.g., [1, 10, 15, 33, 35, 36]. The reason for using decentralized computation may be an inherent restriction in some cases, or it could be a choice made to improve the total running time: using N units allows N arms to be pulled at each time step. When the agent is a distributed system, restrictions on communication in the system introduce additional tradeoffs between communication cost and regret. Apart from the one considered in this work, there are several formulations of decentralized or distributed MAB problems, some of which are discussed in the related work section below.

Problem Formulation. This work focuses on a decentralized stochastic MAB problem. We consider a network consisting of N agents that play the same MAB problem synchronously for T rounds, and the goal is to obtain regret close to that incurred by an optimal centralized algorithm running for NT rounds (NT is the total number of arm pulls made by the decentralized algorithm).
At each time step, all agents simultaneously pull some arm and obtain a reward drawn from the distribution corresponding to the pulled arm. The rewards are drawn independently across the agents and the time steps. After the rewards have been received, the agents can send messages to their neighbors.

Main Contributions. We solve the decentralized MAB problem using a gossip algorithm.1 Our algorithm incurs regret equal to the optimal regret in the centralized problem plus a term that depends on the spectral gap of the underlying communication graph and the number of agents (see Theorem 3.2 for a precise statement). At the end of each round, each agent sends O(K) values to her neighbors. The amount of communication permitted can be reduced at the expense of incurring greater regret, capturing the communication-regret tradeoff. The algorithm needs to know the total number of agents in the network and a lower bound on the spectral gap of the communication matrix. We assume the former for clarity of exposition, but the number of nodes can be estimated, which is enough for our purposes (cf. Appendix F). The latter assumption is widely made in the decentralized literature [7, 13, 14, 30].

The key contribution of our work is an algorithm for the decentralized setting that exhibits a natural and simple dependence on the spectral gap of the communication matrix. In particular, for our algorithm we have:

• A regret bound that is simpler to interpret, and asymptotically lower compared to other algorithms previously designed for the same setting.
We use delayed estimators of the relevant information that is communicated in order to significantly reduce their variance.

• A graph-independent factor multiplying log T in the regret, as opposed to previous works.

• Our algorithm is fully decentralized and can be implemented on an arbitrary network, unlike some of the other algorithms considered in the literature, which need to use extra global information. This is of interest for decentralization purposes but also from the point of view of the total computational complexity.

• We use accelerated communication, which reduces the regret dependence on the spectral gap; this is important for scalability purposes.

Future work. Decentralized algorithms of this kind are a first step towards solving problems on time-varying graphs or on networks prone to communication errors. We leave for future research an extension to time-varying graphs or graphs with random edge failures. Further future research can include a change in the model to allow asynchronous communication, making some assumptions on the nodes so that they have comparable activation frequencies.

1.1 Related Work

Distributed Algorithms. The development of distributed algorithms for optimization and decision-making problems has been an active area of research, motivated in part by the recent development of large scale distributed systems that enable speeding up computations. In some cases, distributed computation is a necessary restriction that is part of the problem, as is the case in packet routing or sensor networks. Gossip algorithms are a commonly used framework in this area [7, 13, 14, 28, 30, 31]. In gossip algorithms, we have an iterative procedure with processing units at the nodes of a graph and the communication pattern dictated by the edges of the graph. A common sub-problem in these applications is to have a value at each node that we want to average or synchronize across the network. In fact, most solutions reduce to approximate averaging or synchronization. This can be achieved using the following simple and effective method: make each node compute iteratively a weighted average of its own value and the ones communicated by its neighbors, ensuring that the final value at each node converges to the average of the initial values across the network. Formally, this communication can be represented as a multiplication by a matrix P that respects the network structure and satisfies some conditions that guarantee fast averaging. The averaging can be accelerated by the use of Chebyshev polynomials (see Lemma 3.1).

1 A high-level description of some distributed algorithms is given in the related work section. For further details, the reader is referred to the references in that section.

Decentralized Bandits. There are several works that study stochastic and nonstochastic distributed or decentralized multi-armed bandit problems, but the precise models vary considerably.
In the stochastic case, the work of Landgren et al. [24, 25] proposes three algorithms to solve the same problem that we consider in this paper: coop-UCB, coop-UCB2 and coop-UCL. The algorithm coop-UCB follows a variant of the natural approach to solve this problem that is discussed in Section 3. It needs to know more global information about the graph than just the number of nodes and the spectral gap: the algorithm uses a value per node that depends on the whole spectrum and the set of eigenvectors of the communication matrix. The algorithm coop-UCB2 is a modification of coop-UCB, in which the only information used about the graph is the number of nodes, but the regret is significantly greater. Finally, coop-UCL is a Bayesian algorithm that also incurs greater regret than coop-UCB.
Our algorithm obtains lower asymptotic regret than all these algorithms while retaining\nthe same computational complexity (cf. Remark 3.4).\nOur work draws on techniques on gossip acceleration and stochastic bandits with delayed feedback.\nA number of works in the literature consider Chebyshev acceleration applied to gossip algorithms,\ne.g., [2, 30]. There are various works about learning with delayed feedback. The most relevant work\nto our problem is [19] which studies general online learning problems under delayed feedback. Our\nsetting differs in that we not only deal with delayed rewards but with approximations of them.\nSeveral other variants of distributed stochastic MAB problems have been proposed. Chakraborty et al.\n[12] consider the setting where at each time step, the agents can either broadcast the last obtained\nreward to the whole network or pull an arm. Korda et al. [22] study the setting where each agent can\nonly send information to one other agent per round, but this can be any agent in the network (not\nnecessarily a neighbor). Sz\u00f6r\u00e9nyi et al. [34] study the MAB problem in P2P random networks and\nanalyze the regret based on delayed reward estimates. Some other works do not assume independence\nof the reward draws across the network. Liu and Zhao [26] and Kalathil et al. [20] consider a\ndistributed MAB problem with collisions: if two players pull the same arm, the reward is split or no\nreward is obtained at all. Moreover in the latter work and a follow-up [27], the act of communicating\nincreases the regret. Anandkumar et al. [1] also consider a model with collisions and agents have to\nlearn from action collisions rather than by exchanging information. Shahrampour et al. [32] consider\nthe setting where each agent plays a different MAB problem and the total regret is minimized in order\nto identify the best action when averaged across nodes. 
Nodes only send values to their neighbors, but it is not a completely decentralized algorithm, since at each time step the arm played by all the nodes is given by the majority vote of the agents. Xu et al. [38] study a distributed MAB problem with global feedback, i.e., with no communication involved. Kar et al. [21] also consider a different distributed bandit model in which only one agent observes the rewards for the actions she plays, while the others observe nothing and have to rely on the information broadcast by the first agent.

The problem of identifying an ε-optimal arm using a distributed network has also been studied. Hillel et al. [16] provide matching upper and lower bounds in the case that the communication happens only once and when the graph topology is restricted to be the complete graph. They provide an algorithm that achieves a speed-up of N (the number of agents) if log 1/ε communication steps are permitted.

In the adversarial version, the best possible regret bound in the centralized setting is still √(KT) [3]. In the decentralized case, a trivial algorithm that has no communication incurs regret N√(KT); and a lower bound of N√T is known [11]; thus, only the dependence on K can be improved. Awerbuch and Kleinberg [6] study a distributed adversarial MAB problem with some Byzantine users, i.e., users that do not follow the protocol or report fake observations as they wish. In the case in which there are no Byzantine users they obtain a regret of O(T^{2/3}(N + K) log N log T). To the best of our knowledge, this is the first work that considers a decentralized adversarial MAB problem. They allow log(N) communication rounds between decision steps, so it differs from our model in terms of communication. Also in the adversarial case, Cesa-Bianchi et al.
[11] studied an algorithm that achieves regret N(√(K^{1/2}T log K) + √(K log T)) and proved some results that are graph-dependent. The model is the same as ours, but in addition to the rewards she obtained, each agent communicates to her neighbors all the values she received from her neighbors in the last d rounds, which is potentially O(Nd) values. Thus, the size of each message could be more than poly(K) at a given round. They get the aforementioned regret bound by setting d = √K.

2 Model and Problem Formulation

We consider a multi-agent network with N agents. The agents are represented by the nodes of an undirected and connected graph G, and each agent can only communicate with her neighbors. Agents play the same K-armed bandit problem for T time steps, send some values to their neighbors after each play, and receive the information sent by their respective neighbors to use it in the next time step if they so wish. If an agent plays arm k, she receives a reward drawn from a fixed distribution with mean μ_k that is independent of the agent. The draw is independent of actions taken at previous time steps and of actions played by other agents. We assume that rewards come from distributions that are subgaussian with variance proxy σ².

Assume without loss of generality that μ_1 ≥ μ_2 ≥ ··· ≥ μ_K, and let the suboptimality gap be defined as Δ_k := μ_1 − μ_k for any action k. Let I_{t,i} be the random variable that represents the action played by agent i at time t. Let n^k_{t,i} be the number of times arm k is pulled by node i up to time t, and let n^k_t := Σ_{i=1}^N n^k_{t,i} be the number of times arm k is pulled by all the nodes in the network up to time t.
We define the regret of the whole network as

R(T) := TN μ_1 − E[ Σ_{t=1}^T Σ_{i=1}^N μ_{I_{t,i}} ] = Σ_{k=1}^K Δ_k E[n^k_T].

We will use this notion of regret, which is the expected regret, in the entire paper.
The problem is to minimize the regret while allowing each agent to send poly(K) values to her neighbors per iteration. We assume that only a little information about the graph is known: the total number of nodes and a lower bound on the spectral gap of the communication matrix P, i.e. 1 − |λ_2|, where λ_2 is the second greatest eigenvalue of P in absolute value. The communication matrix can be built with little extra information about the graph, like the maximum degree of the nodes of the graph [37]. However, building global structures is not allowed. For instance, a spanning tree to propagate the information with a message passing algorithm is not valid. This is because our focus is on designing a decentralized algorithm. Among other things, finding these kinds of decentralized solutions serves as a first step towards the design of solutions for the same problem in time-varying graphs or in networks prone to communication errors.

3 Algorithm

We propose an algorithm that is an adaptation of UCB to the problem at hand that uses a gossip protocol. We call the algorithm Decentralized Delayed Upper Confidence Bound (DDUCB). UCB is a popular algorithm for the stochastic MAB problem. At each time step, UCB computes an upper bound of a confidence interval for the mean of each arm k, using two values: the empirical mean observed, μ^k_t, and the number of times arm k was pulled, n^k_t.
UCB plays at time t + 1 the arm that maximizes the following upper confidence bound:

μ^k_t + √(4ησ² ln t / n^k_t),

where η > 1 is an exploration parameter.
In our setting, as the pulls are distributed across the network, agents do not have access to these two values, namely the number of times each arm was pulled across the network and the empirical mean reward observed for each arm computed using the total number of pulls. Our algorithm maintains good approximations of these values and it incurs a regret that is no more than the one for a centralized UCB plus a term depending on the spectral gap and the number of nodes, but independent of time. The latter term is a consequence of the approximation of the aforementioned values. Let m^k_t be the sum of rewards coming from all the pulls done to arm k by the entire network up to time t. We can use a gossip protocol, for every k ∈ {1, ..., K}, to obtain at each node a good approximation of m^k_t and of the number of times arm k was pulled, i.e. n^k_t. Let m̂^k_{t,i} and n̂^k_{t,i} be the approximations of m^k_t and n^k_t made by node i with a gossip protocol at time t, respectively. Having this information at hand, agents could compute the ratio m̂^k_{t,i}/n̂^k_{t,i} to get an estimation of the average reward of each arm. But care needs to be taken when computing the foregoing approximations.
A classical and effective way to keep a running approximation of the average of values that are iteratively added at each node is what we will refer to as the running consensus [8]. Let N(i) be the set of neighbors of agent i in graph G.
In this protocol, every agent stores her current approximation and performs communication and computing steps alternately: at each time step each agent computes a weighted average of her neighbors' values and adds to it the new value she has computed. We can represent this operation in the following way. Let P ∈ R^{N×N} be a matrix that respects the structure of the network, which is represented by a graph G, so P_{ij} = 0 if there is no edge in G that connects j to i. We consider P for which the sum of each row and the sum of each column is 1, which implies that 1 is an eigenvalue of P. We further assume all other eigenvalues of P, namely λ_2, ..., λ_N, are real and are less than one in absolute value, i.e., 1 = λ_1 > |λ_2| ≥ ··· ≥ |λ_N| ≥ 0. Note that they are sorted by magnitude. For matrices with real eigenvalues, these three conditions hold if and only if values in the network are averaged, i.e., P^s converges to 11^⊤/N for large s. This defines a so-called gossip matrix. See [37] for a proof and [14, 37] for a discussion on how to choose P. If we denote by x_t ∈ R^N the vector containing the current approximations for all the agents and by y_t ∈ R^N the vector containing the new values added by each node, then the running consensus can be written as

x_{t+1} = P x_t + y_t.   (1)

The conditions imposed on P not only ensure that values are averaged but also that the averaging process is fast. In particular, for any s ∈ N and any v in the N-dimensional simplex,

‖P^s v − 1/N‖_2 ≤ |λ_2|^s,   (2)

see [17], for instance. For a general vector, rescale the inequality by its 1-norm.
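To make these conditions concrete, here is a minimal sketch that builds a gossip matrix for a ring network and checks the contraction (2) numerically. The ring topology, the uniform 1/3 weights and the network size are illustrative assumptions, not choices made in the paper:

```python
import math

def ring_gossip_matrix(n):
    # Doubly stochastic weights on a ring: each node averages itself and
    # its two neighbors with weight 1/3; rows and columns sum to 1.
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i][j % n] = 1 / 3
    return P

def gossip_step(P, v):
    # One communication round: v <- P v.
    n = len(v)
    return [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]

n = 12
P = ring_gossip_matrix(n)
# For this circulant matrix the eigenvalues are (1 + 2 cos(2 pi k / n)) / 3,
# so |lambda_2| is available in closed form.
lam2 = max(abs((1 + 2 * math.cos(2 * math.pi * k / n)) / 3) for k in range(1, n))

v = [1.0] + [0.0] * (n - 1)   # a vector in the N-dimensional simplex
err = 1.0
for s in range(1, 31):
    v = gossip_step(P, v)
    err = math.sqrt(sum((x - 1 / n) ** 2 for x in v))
    assert err <= lam2 ** s + 1e-12   # the contraction in (2)
```

For this symmetric circulant choice of P the spectrum is known exactly; on a general graph one would instead use a separately obtained bound on |λ_2|, which is what the paper assumes.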
A natural approach to the problem is to use 2K running consensus algorithms, computing approximations of m^k_t/N and n^k_t/N, k = 1, ..., K. Landgren et al. [25] follow this approach and use extra global information of the graph, as described in the section on related work, to account for the inaccuracy of the mean estimate. We can estimate average rewards by their ratio, and the number of times each arm was pulled can be estimated by multiplying the quantity n^k_t/N by N. The running consensus protocols would be the following. For k = 1, ..., K, start with m̂^k_1 = 0 ∈ R^N and update m̂^k_{t+1} = P m̂^k_t + π^k_t, where the i-th entry of π^k_t ∈ R^N contains the reward observed by node i at time t if arm k is pulled; else, it is 0. Note that the i-th entry is only computed by the i-th node. Similarly, for k = 1, ..., K, start with n̂^k_1 = 0 ∈ R^N and update n̂^k_{t+1} = P n̂^k_t + p^k_t, where the i-th entry of p^k_t ∈ R^N is 1 if at time t node i pulled arm k and 0 otherwise.

The problem with this approach is that even if the values computed are being mixed at a fast pace, it takes some time for the last added values to be mixed, resulting in poor approximations, especially if N is large. This phenomenon is more intense when the spectral gap is smaller. Indeed, we can rewrite (1) as x_t = Σ_{s=1}^{t−1} P^{t−1−s} y_s, assuming that x_1 = 0. For the values of s that are not too close to t − 1 we have by (2) that P^{t−1−s} y_s is very close to the vector that has as entries the average of the values in y_s, that is, c1, where c = (1/N) Σ_{j=1}^N y_{s,j}. However, for values of s close to t − 1 this is not true, and the values of y_s influence heavily the resulting estimate, which is especially inaccurate as an estimation of the true mean if N is large.
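The naive scheme just described can be sketched as follows, with i.i.d. uniform draws standing in for the per-arm reward vectors π^k_t; the ring network and all constants are illustrative assumptions:

```python
import random

random.seed(0)
n, T = 12, 200

def ring_step(x):
    # One multiplication by a ring gossip matrix P (uniform 1/3 weights).
    return [(x[i] + x[(i - 1) % n] + x[(i + 1) % n]) / 3 for i in range(n)]

# Running consensus (1): x_{t+1} = P x_t + y_t.  Here y_t holds a fresh
# value at every node (i.i.d. uniform draws, a toy stand-in for pi^k_t).
x = [0.0] * n
total = 0.0
for t in range(T):
    y = [random.random() for _ in range(n)]
    total += sum(y)
    x = [a + b for a, b in zip(ring_step(x), y)]

# Each summand P^{t-1-s} y_s with s far from t-1 is well mixed, so each
# node's value approximates the network-wide total divided by N.
estimate_at_node_0 = x[0]
exact = total / n
```

The error of the estimate comes almost entirely from the last few poorly mixed summands y_s, which is exactly the issue the delayed estimates of DDUCB are designed to sidestep.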
The key observations that lead to the algorithm we propose are that the number of these values of s close to t − 1 is small, that we can make it even smaller using accelerated gossip techniques, and that the regret of UCB does not increase much when working with delayed values of rewards, so we can temporarily ignore the recently computed rewards in order to work with much more accurate approximations of m^k_t/N and n^k_t/N. In particular, with C communication steps agents can compute a polynomial q_C of degree C of the communication matrix P applied to a vector, that is, q_C(P)v. The acceleration comes from computing a rescaled Chebyshev polynomial and it is encapsulated in the following lemma. It is the same as the one found in previous works [30]. See the supplementary material for a proof and for the derivation of Algorithm 2, which computes (q_C(P)v)_i iteratively after C calls.

Lemma 3.1. Let P be a communication matrix with real eigenvalues such that 1^⊤P = 1^⊤, P1 = 1, and whose second largest eigenvalue in absolute value is −1 < λ_2 < 1. Let v be in the N-dimensional simplex and let C = ⌈ln(2N/ε)/√(2 ln(1/|λ_2|))⌉. Agents can compute, after C communication steps, a polynomial q_C of degree C which satisfies ‖q_C(P)v − 1/N‖_2 ≤ ε/N.

Given the previous lemma, we consider that any value that has been computed since at least C iterations before the current time step is mixed enough to be used to approximate m^k_t/N and n^k_t/N.

We now describe DDUCB at node i. The pseudocode is given in Algorithm 1. We use Greek letters to denote variables that contain reward estimators, and corresponding Latin letters to denote variables that contain counter estimators. A notation chart can be found in the supplementary material. Agents run an accelerated running consensus in stages of C iterations. Each node maintains three pairs of K-dimensional vectors.
The variable α_i contains rewards that are mixed, β_i contains rewards that are being mixed and γ_i contains rewards obtained in the current stage. The vectors a_i, b_i and c_i store the number of arm pulls associated to the quantities α_i, β_i and γ_i, respectively. At the beginning, agent i pulls each arm once and initializes α_i and a_i with the observed values divided by N. During each stage, for C iterations, agent i uses α_i and a_i, as updated at the end of the previous stage, to decide which arm to pull using an upper confidence bound. Variables β_i and b_i are mixed in an accelerated way, and γ_i and c_i accumulate the new values obtained by the new pulls done in the current stage. After C iterations, values in β_i and b_i are mixed enough, so we add them to α_i and a_i. The only exception is the end of the first stage, in which the values of the latter variables are overwritten by the former ones. Variables δ_i and d_i just serve to make this distinction. The unmixed information about the pulls obtained in the last stage, i.e. γ_i and c_i, is assigned to β_i and b_i so the process can start again. Variables γ_i and c_i are reset with zeroes. There are T iterations in total.

Algorithm 1 DDUCB at node i.
1: ζ_i ← (X^1_i, ..., X^K_i) ; z_i ← (1, ..., 1)
2: C = ⌈ln(2N/ε)/√(2 ln(1/|λ_2|))⌉
3: α_i ← ζ_i/N ; a_i ← z_i/N ; β_i ← ζ_i ; b_i ← z_i
4: γ_i ← 0 ; c_i ← 0 ; δ_i ← 0 ; d_i ← 0
5: t ← K ; s ← K
6: while t ≤ T do
7:   k* ← arg max_k { α^k_i/a^k_i + √(2ησ² ln s/(N a^k_i)) }
8:   for r from 0 to C − 1 do
9:     u ← Play arm k*, return reward
10:    γ^{k*}_i ← γ^{k*}_i + u ; c^{k*}_i ← c^{k*}_i + 1
11:    β_i ← mix(β_i, r, i) ; b_i ← mix(b_i, r, i)
12:    t ← t + 1
13:    if t > T then return end if
14:  end for
15:  s ← (t − C)N
16:  δ_i ← δ_i + β_i ; d_i ← d_i + b_i ; α_i ← δ_i ; a_i ← d_i
17:  β_i ← γ_i ; b_i ← c_i ; γ_i ← 0 ; c_i ← 0
18: end while

Algorithm 2 Accelerated communication and mixing step, mix(y_{r,i}, r, i).
1: if r is 0 then
2:   w_0 ← 1/2 ; w_{−1} ← 0
3:   y_{0,i} ← y_{0,i}/2 ; y_{−1,i} ← (0, ..., 0)
4: end if
5: Send y_{r,i} to neighbors
6: Receive corresponding values y_{r,j}, ∀j ∈ N(i)
7: y'_{r,i} ← Σ_{j∈N(i)} 2P_{ij} y_{r,j}/|λ_2|
8: w_{r+1} ← 2w_r/|λ_2| − w_{r−1}
9: y_{r+1,i} ← (w_r/w_{r+1}) y'_{r,i} − (w_{r−1}/w_{r+1}) y_{r−1,i}
10: if r is 0 then
11:   y_{0,i} ← 2y_{0,i} ; w_0 ← 2w_0
12: end if
13: return y_{r+1,i}

Now we describe some mathematical properties of the variables during the execution of the algorithm. Let t_S be the time at which a stage begins, so it ends at t_S + C − 1. At t = t_S, using the notation above, it is α^k_i = Σ_{s=1}^{t_S−C} (q_C(P)π^k_s)_i and a^k_i = Σ_{s=1}^{t_S−C} (q_C(P)p^k_s)_i, except in the first stage, in which their values are initialized from a local pull. In particular, denote by X^1_i, ..., X^K_i the rewards obtained when pulling all the arms before starting the first stage. Then the initialization is α_i ← (X^1_i/N, ..., X^K_i/N) and a_i ← (1/N, ..., 1/N). The division by N is due to α^k_i and a^k_i being the approximations for m^k_t/N and n^k_t/N. The algorithm does not update α_i and a_i again until t = t_S + C, so they contain information that at the end of the stage is delayed by 2C − 1 iterations. The time step s used to compute the upper confidence bound is (t_S − C)N, since α_i and a_i contain information about that number of rewards and pulls. The variable γ_i is needed because we need to mix β_i for C steps so that the Chebyshev polynomial of degree C is computed. In this way agents compute upper confidence bounds with accurate approximations, with a delay of at most 2C − 1.

As we will see, the regret of UCB does not increase much when working with delayed estimates. In particular, having a delay of d steps increases the regret by at most d Σ_{k=1}^K Δ_k.
We now present the regret which the DDUCB algorithm incurs. We use A ≲ B to denote that there is a constant c > 0 such that A ≤ cB. See Appendix A.1 for a proof.

Theorem 3.2 (Regret of DDUCB). Let P be a communication matrix with real eigenvalues such that 1^⊤P = 1^⊤, P1 = 1, whose second largest eigenvalue in absolute value is λ_2, with |λ_2| < 1. Consider the distributed multi-armed bandit problem with N nodes, K actions and subgaussian rewards with variance proxy σ². The algorithm DDUCB with exploration parameter η = 2 and ε = 1/22 satisfies:

1. The following finite-time bound on the regret, for C = ⌈log(2N/ε)/√(2 ln(1/|λ_2|))⌉:

R(T) < Σ_{k:Δ_k>0} 32(1 + 1/11)σ² ln(TN)/Δ_k + (N(6C + 1) + 4) Σ_{k=1}^K Δ_k.

2. The corresponding asymptotic bound:

R(T) ≲ Σ_{k:Δ_k>0} σ² ln(TN)/Δ_k + (N ln(N)/√(ln(1/|λ_2|))) Σ_{k=1}^K Δ_k.

For simplicity and comparison purposes we set the values of η and ε to specific values.
For a general version of Theorem 3.2, see the supplementary material. Note that the algorithm needs to know λ_2, the second largest eigenvalue of P in absolute value, since it is used to compute C, which is a parameter that indicates when values are close enough to being mixed. However, if we use DDUCB with C set to any upper bound E of C = ⌈log(2N/ε)/√(2 ln(1/|λ_2|))⌉, the inequality of the finite-time analysis above still holds true, substituting C by E. In the asymptotic bound, N ln N/√(ln(1/|λ_2|)) would be substituted by NE. The knowledge of the spectral gap is an assumption that is widely made throughout the decentralized literature [13, 14, 30]. We can use Theorem 3.2 to derive an instance-independent analysis of the regret. See Theorem A.3 in the supplementary material.

Remark 3.3 (Lower bound). In order to interpret the regret obtained in the previous theorem, it is useful to note that running the centralized UCB algorithm for TN steps incurs a regret bounded above by Σ_{k:Δ_k>0} σ² ln(TN)/Δ_k + Σ_{k=1}^K Δ_k, up to a constant. Moreover, running N separate instances of UCB at each node without allowing communication incurs a regret of R(T) ≲ Σ_{k:Δ_k>0} Nσ² ln(T)/Δ_k + N Σ_{k=1}^K Δ_k. On the other hand, the following is an asymptotic lower bound for any consistent centralized policy [23]:

lim inf_{T→∞} R(T)/ln T ≥ Σ_{k:Δ_k>0} 2σ²/Δ_k.

Thus, we see that the regret obtained in Theorem 3.2 improves significantly the dependence on N of the regret with respect to the trivial algorithm that does not involve communication, and that it is asymptotically optimal in terms of T, with N and K fixed.
Since in the first iteration of this problem $N$ arms have to be pulled and there is no prior information on the arms' distributions, any asymptotically optimal algorithm in terms of $N$ and $K$ must pull each arm $\Theta(N/K + 1)$ times, yielding a regret of at least $(N/K + 1)\sum_{k=1}^K \Delta_k$, up to a constant. Hence, by the lower bound above and the latter argument, we can give the following lower bound for the problem we consider. The regret of our problem must be
$$\Omega\Bigl( \sum_{k:\Delta_k>0} \frac{\sigma^2 \ln(TN)}{\Delta_k} + \Bigl(\frac{N}{K} + 1\Bigr) \sum_{k=1}^{K} \Delta_k \Bigr),$$
and the regret obtained in Theorem 3.2 is asymptotically optimal up to at most a factor of $\min(K, N)\ln(N)/\sqrt{\ln(1/|\lambda_2|)}$ in the second summand of the regret.

Remark 3.4 (Comparison with previous work). We note that in [25] the regret bounds were computed applying a concentration inequality that cannot be used, since it does not take into account that the number of times an arm was pulled is a random variable. They claim their analysis follows the one in [4], which does not present this problem. If we change their upper confidence bound to be proportional to $\sqrt{6\ln(tN)}$ instead of $\sqrt{2\ln(t)}$ at time $t$ and follow [4] then, for their best algorithm in terms of regret, named coopUCB, we can get a regret bound very similar to the one they obtained. The regret of their algorithm is bounded by $A + B\sum_{k=1}^K \Delta_k$, where
$$A := \sum_{k:\Delta_k>0} \sum_{j=1}^{N} \frac{16\gamma\sigma^2(1 + \varepsilon^j_c)}{N\Delta_k} \ln(TN), \qquad B := N\frac{\gamma}{\gamma - 1} + N\sum_{j=2}^{N} \frac{|\lambda_j|}{1 - |\lambda_j|}.$$

The difference between this bound and the one presented in [25] is the $N$ inside the logarithm in $A$ and a factor of $2$ in $A$.
Here, $\gamma > 1$ is an exploration parameter that the algorithm receives as input and $\varepsilon^j_c$ is a non-negative graph-dependent value, which is $0$ only when the graph is the complete graph. Thus $A$ is at least $\sum_{k:\Delta_k>0} 16\sigma^2 \ln(TN)/\Delta_k$. Hence, up to a constant, $A$ is always greater than the first summand in the regret of our algorithm in Theorem 3.2. Note that $\gamma/(\gamma-1) \ge 1$ and $1/(1-|\lambda_2|) \ge 1/\ln(|\lambda_2|^{-1})$, so
$$B \ge N\Bigl(1 + \frac{\lambda'_2}{\ln(\sqrt{N}/\lambda'_2)}\Bigr),$$
where $\lambda'_2 := \sqrt{N}|\lambda_2| \in [0, \sqrt{N})$. The factor multiplying $\sum_{k=1}^K \Delta_k$ in the second summand in Theorem 3.2 is $N\ln N/\sqrt{\ln(1/|\lambda_2|)} \le N\ln N/\ln(1/|\lambda_2|) \le 2B$, for $|\lambda_2| \ge 1/e$, since the inequality below holds:
$$2B \ge 2N\Bigl(1 + \frac{\lambda'_2}{\ln(\sqrt{N}/\lambda'_2)}\Bigr) \ge \frac{N\ln N}{\ln(\sqrt{N}/\lambda'_2)} \iff \ln N - 2\ln(\lambda'_2) + 2\lambda'_2 \ge \ln N.$$

See the case $|\lambda_2| < 1/e$ in Appendix D. In the case of a complete graph, the problem reduces to a centralized batched bandit problem, in which $N$ actions are taken at each time step [29]. The communication in this case is trivial (just send the obtained rewards to your neighbors), so not surprisingly our work and [25] incur the same regret in that case. The previous reasoning proves, however, that for every graph our asymptotic regret is never worse, and for many graphs we obtain a substantial improvement. Depending on the graph, $A$ and $B$ can be much greater than the lower bounds we have used for both of them for comparison purposes.
In the supplementary material, for instance, we show that in the case of a cycle graph with a natural communication matrix these two parts are substantially worse in [25], namely $\Theta(N^2)$ versus $\Theta(1)$ and $\Theta(N^{7/2})$ versus $\Theta(N^2\log N)$ for the term multiplying $\sum_{k:\Delta_k>0}\sigma^2\ln(TN)/\Delta_k$ in $A$ and for $B$, respectively. In general, the algorithm we propose presents several improvements. We get a graph-independent value multiplying $\ln(TN)$ in the first summand of the regret, whereas $A$ contains the graph-dependent values $1 + \varepsilon^j_c$. In $B$, just the sum $N\bigl(\frac{\gamma}{\gamma-1} + \frac{|\lambda_2|}{1-|\lambda_2|}\bigr)$ is of greater order than our second summand. Moreover, $B$ contains other terms depending on the eigenvalues $\lambda_j$ for $j \ge 3$. Furthermore, we achieve this while using less global information about the graph, which is of interest for decentralization purposes. It has computational implications as well, since in principle the computation of $\varepsilon^j_c$ needs the entire set of eigenvalues and eigenvectors of $P$. Thus, even if $P$ were given as input to coopUCB, it would need to run an expensive procedure to compute these values before starting to execute the decision process, while our algorithm does not.

Remark 3.5 (Variants of DDUCB). The algorithm can be modified slightly to obtain better estimations of $m^k_t/N$ and $n^k_t/N$, which implies the regret is improved. The easiest (and recommended) modification is the following. While waiting for the vectors $\beta_i$ and $b_i$, $i = 1, \dots, N$, to be mixed, each node $i$ adds to the variables $\alpha_i$ and $a_i$ the information of the pulls that are done, times $1/N$. The variable $s$ accounting for the time step has to be modified accordingly: it contains the number of pulls made to obtain the approximations in $\alpha_i$ and $a_i$, so it needs to be increased by one when adding one extra reward. This corresponds to uncommenting lines 14-15 in Algorithm 5, the pseudocode in the supplementary material.
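The cycle-graph comparison can be illustrated numerically. This is a rough sketch, not the supplementary's computation: we assume the lazy random walk on the cycle as the "natural" communication matrix (the matrix used in the supplementary may differ), and compare only the two eigenvalue-dependent factors discussed above.

```python
import math

def second_summand_factors(N):
    # Eigenvalues of the lazy random walk on an N-cycle (assumed matrix):
    # 0.5 + 0.5*cos(2*pi*j/N) for j = 0, ..., N-1; eigenvalue 1 at j = 0.
    eigs = [0.5 + 0.5 * math.cos(2 * math.pi * j / N) for j in range(N)]
    rest = sorted(eigs, reverse=True)[1:]      # drop the eigenvalue 1
    lam2 = max(abs(e) for e in rest)
    # DDUCB's factor on the second summand: N ln N / sqrt(ln(1/|lambda_2|))
    ours = N * math.log(N) / math.sqrt(math.log(1 / lam2))
    # Eigenvalue-dependent part of B in the coopUCB bound:
    # N * sum_{j>=2} |lambda_j| / (1 - |lambda_j|)
    theirs = N * sum(abs(e) / (1 - abs(e)) for e in rest)
    return ours, theirs

for N in (50, 100, 200):
    ours, theirs = second_summand_factors(N)
    print(N, round(ours), round(theirs))  # gap between the two widens with N
```

For the lazy cycle walk, $1 - \lambda_2 \approx \pi^2/N^2$, so our factor scales like $N^2 \ln N$ while the sum in $B$ picks up a full extra power of $N$ or more, consistent with the asymptotic comparison above.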
Since the values of $\alpha_i$ and $a_i$ are overwritten after the for loop, the assignment of $s$ after the loop remains unchanged. Note that if the lines are not uncommented, then each time the for loop is executed the $C$ pulls that are made in a node are taken with respect to the same arm. Another variant that would provide better estimations, and therefore better regret, while keeping the communication cost $O(K)$, would consist of also sending the information of the new pull, $\pi_i$ and $p_i$, to the neighbors of $i$, receiving the respective values of their new pulls, and adding these values, multiplied by $1/N$, to $\alpha_i$ and $a_i$, respectively. We analyze the algorithm without any modification for the sake of clarity of exposition. The same asymptotic upper bound on the regret in Theorem 3.2 can be computed for these two variations.

We can vary the communication rate with some trade-offs. On the one hand, we can mix the values of $\delta_i$ and $d_i$ at each iteration of the for loop, in an unaccelerated way and with Algorithm 3 (see Algorithm 5, line 17 in the supplementary material), to get even more precise estimations. In such a case, we could use $\delta_i$ and $d_i$ to compute the upper confidence bounds instead of $\alpha_i$ and $a_i$. However, that approach cannot benefit from using the information from local pulls obtained during the stage. On the other hand, if each agent cannot communicate $2K$ values per iteration, corresponding to the mixing in line 11, the algorithm can be slightly modified to account for it at the expense of incurring greater regret. Suppose each agent can only communicate $L$ values to her neighbors per iteration. Let $E$ be $\lceil 2KC/L \rceil$. If each agent runs the algorithm in stages of $E$ iterations, ensuring to send each element of $\beta_i$ and $b_i$ exactly $C$ times and using the mixing step $C$ times, then the bounds in Theorem 3.2, substituting $C$ by $E$, still hold.
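The stage length under a communication budget follows directly from the formula $E = \lceil 2KC/L \rceil$ given above. A small sketch (function name and example values are illustrative):

```python
import math

def stage_length(K, C, L):
    # With a per-iteration budget of L values per agent (instead of the 2K
    # needed to mix the reward-sum and pull-count vectors at once), a stage
    # takes E = ceil(2*K*C / L) iterations; Theorem 3.2 holds with C -> E.
    return math.ceil(2 * K * C / L)

# Example: K = 16 arms, C = 19 mixing steps per stage, budget L = 8
print(stage_length(16, 19, 8))   # 76: each of the 2K entries is sent C times
print(stage_length(16, 19, 32))  # 19: a budget of 2K recovers the original C
```

The price of the smaller budget appears only as a longer delay (the second summand of the regret grows with the stage length), not as a loss in the accuracy of the mixed estimates.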
Again, in the asymptotic bound, $N\ln N/\sqrt{\ln(1/|\lambda_2|)}$ would be substituted by $NE$. In each iteration, agents have to send values corresponding to the same entries of $\beta_i$ or $b_i$. The factor of $C$ in the second summand of the regret accounts for the number of rounds of delay from when a reward is obtained until it is used to compute upper confidence bounds. If we decrease the communication rate and compensate with a greater delay, the approximations in $\alpha_i$ and $a_i$ satisfy the same properties as in the original algorithm; only the second summand in the regret increases, because of the increased delay.

Experiments. We show that the algorithm proposed in this work, DDUCB, not only enjoys a better theoretical regret guarantee but also performs better in practice. In general we have observed that the accelerated method performs well with the recommended values, that is, with no tuning, for the exploration parameter $\eta$ and the parameter $\varepsilon$ that measures the precision of the mixing after a stage; recall these values are $\eta = 2$, $\varepsilon = \frac{1}{22}$. On the other hand, the constant $C$ that results for the unaccelerated method is usually excessively large, so it is convenient to decrease it heuristically, which corresponds to using a different value of $\varepsilon$. We set $\varepsilon$ so that the value of $C$ for the unaccelerated method is the same as the value of $C$ for the accelerated one. We have used the recommended modification of DDUCB, consisting of adding to the variables $\alpha_i$ and $a_i$ the information of the pulls that are done, times $1/N$, while waiting for the vectors $\beta_i$ and $b_i$ to be mixed. This modification adds extra information that is at hand at virtually no computational cost, so it is always convenient to use it. We tuned $\gamma$, the exploration parameter of coopUCB [25], to get the best results for that algorithm, and we plot the executions for the best $\gamma$'s and also for $\gamma = 2$ for comparison purposes.
In the figures one can observe that after a few stages, the DDUCB algorithms learn with high precision which the best arm is, and the regret curve observed afterwards is almost horizontal. After 10000 iterations, coopUCB not only accumulates a greater regret, but the slope indicates that it still has not effectively learned which the best arm is. See Appendix G for a more detailed description of the experiments.

Figure 1: Simulation of DDUCB and coopUCB for cycles (top) and square grids (bottom) for 100 nodes (left), 200 nodes (top right) and 225 nodes (bottom right).

Acknowledgments

The authors thank Raphaël Berthier and Francis Bach for helpful exchanges on the problem of averaging. David Martínez-Rubio was supported in part by EP/N509711/1 from the EPSRC MPLS division, grant No 2053152. Varun Kanade and Patrick Rebeschini were supported in part by the Alan Turing Institute under the EPSRC grant EP/N510129/1. The authors acknowledge support from the AWS Cloud Credits for Research program.

References

[1] Animashree Anandkumar, Nithin Michael, Ao Kevin Tang, and Ananthram Swami. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731–745, 2011.

[2] Mario Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

[3] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

[4] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[5] W. Auzinger and J. Melenk. Iterative solution of large linear systems. Lecture notes, TU Wien, 2011.

[6] Baruch Awerbuch and Robert Kleinberg. Competitive collaborative learning.
Journal of Computer and System Sciences, 74(8):1271–1288, 2008.

[7] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

[8] Paolo Braca, Stefano Marano, and Vincenzo Matta. Enforcing consensus while monitoring the environment in wireless sensor networks. IEEE Transactions on Signal Processing, 56(7):3375–3380, 2008.

[9] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[10] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B. Shroff. Multi-armed bandits in the presence of side observations in social networks. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 7309–7314. IEEE, 2013.

[11] Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 49:605–622, 2016.

[12] Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 164–170, 2017.

[13] Alexandros G. Dimakis, Soummya Kar, José M. F. Moura, Michael G. Rabbat, and Anna Scaglione. Gossip algorithms for distributed signal processing. Proceedings of the IEEE, 98(11):1847–1864, 2010.

[14] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

[15] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain.
Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In Symposium on New Frontiers in Dynamic Spectrum, pages 1–9. IEEE, 2010.

[16] Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 854–862, 2013.

[17] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1990.

[18] Kenneth Ireland and Michael Rosen. A Classical Introduction to Modern Number Theory, volume 84. Springer Science & Business Media, 2013.

[19] Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453–1461, 2013.

[20] Dileep Kalathil, Naumaan Nayyar, and Rahul Jain. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.

[21] Soummya Kar, H. Vincent Poor, and Shuguang Cui. Bandit problems in networks: Asymptotically efficient distributed allocation rules. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 1771–1778. IEEE, 2011.

[22] Nathan Korda, Balázs Szörényi, and Li Shuai. Distributed clustering of linear bandits in peer to peer networks. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 48, pages 1301–1309. International Machine Learning Society, 2016.

[23] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[24] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed cooperative decision-making in multiarmed bandits: frequentist and bayesian algorithms.
In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 167–172. IEEE, September 2016.

[25] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. On distributed cooperative decision-making in multiarmed bandits. In Control Conference (ECC), 2016 European, pages 243–248. IEEE, May 2016.

[26] Keqin Liu and Qing Zhao. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing, 58(11):5667–5681, 2010.

[27] Naumaan Nayyar, Dileep Kalathil, and Rahul Jain. On regret-optimal learning in decentralized multi-player multi-armed bandits. IEEE Transactions on Control of Network Systems, 2016.

[28] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[29] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, Erik Snowberg, et al. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.

[30] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. arXiv preprint arXiv:1702.08704, 2017.

[31] Devavrat Shah et al. Gossip algorithms. Foundations and Trends in Networking, 3(1):1–125, 2009.

[32] Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. Multi-armed bandits in multi-agent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 2786–2790. IEEE, 2017.

[33] Ruben Stranders, Long Tran-Thanh, Francesco M. Delle Fave, Alex Rogers, and Nicholas R. Jennings. DCOPs and bandits: Exploration and exploitation in decentralised coordination. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 289–296.
International Foundation for Autonomous Agents and Multiagent Systems, 2012.

[34] Balázs Szörényi, Róbert Busa-Fekete, István Hegedűs, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossip-based distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 2, pages 1056–1064. International Machine Learning Society, 2013.

[35] Cem Tekin and Mingyan Liu. Online learning in decentralized multi-user spectrum access with synchronized explorations. In Military Communications Conference, pages 1–6. IEEE, 2012.

[36] Long Tran-Thanh, Alex Rogers, and Nicholas R. Jennings. Long-term information collection with energy harvesting wireless sensors: a multi-armed bandit based approach. Autonomous Agents and Multi-Agent Systems, 25(2):352–394, 2012.

[37] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.

[38] Jie Xu, Cem Tekin, Simpson Zhang, and Mihaela van der Schaar. Distributed multi-agent online learning based on global feedback. IEEE Transactions on Signal Processing, 63(9):2225–2238, 2015.