{"title": "Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 3116, "page_last": 3126, "abstract": "We study agents communicating over an underlying network by exchanging messages, in order to optimize their individual regret in a common nonstochastic multi-armed bandit problem. We derive regret minimization algorithms that guarantee for each agent $v$ an individual expected regret of $\\widetilde{O}\\left(\\sqrt{\\left(1+\\frac{K}{\\left|\\mathcal{N}\\left(v\\right)\\right|}\\right)T}\\right)$, where $T$ is the number of time steps, $K$ is the number of actions and $\\mathcal{N}\\left(v\\right)$ is the set of neighbors of agent $v$ in the communication graph. We present algorithms both for the case that the communication graph is known to all the agents, and for the case that the graph is unknown. When the graph is unknown, each agent knows only the set of its neighbors and an upper bound on the total number of agents. The individual regret between the models differs only by a logarithmic factor. Our work resolves an open problem from [Cesa-Bianchi et al., 2019b].", "full_text": "Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits

Yogev Bar-On
Tel Aviv University, Israel
baronyogev@gmail.com

Yishay Mansour
Tel Aviv University, Israel, and Google Research, Israel
mansour.yishay@gmail.com

Abstract

We study agents communicating over an underlying network by exchanging messages, in order to optimize their individual regret in a common nonstochastic multi-armed bandit problem. We derive regret minimization algorithms that guarantee for each agent $v$ an individual expected regret of $\widetilde{O}\left(\sqrt{\left(1+\frac{K}{|\mathcal{N}(v)|}\right)T}\right)$, where $T$ is the number of time steps, $K$ is the number of actions and $\mathcal{N}(v)$ is the set of neighbors of agent $v$ in the communication graph. We present algorithms both for the case that the communication graph is known to all the agents, and for the case that the graph is unknown. When the graph is unknown, each agent knows only the set of its neighbors and an upper bound on the total number of agents. The individual regret between the two models differs only by a logarithmic factor. Our work resolves an open problem from [Cesa-Bianchi et al., 2019b].

1 Introduction

The multi-armed bandit (MAB) problem is one of the most basic models for decision making under uncertainty. It highlights the agent's uncertainty regarding the losses it suffers from selecting various actions. The agent selects actions in an online fashion: at each time step the agent selects a single action and suffers a loss corresponding to that action. The agent's goal is to minimize its cumulative loss over a fixed horizon of time steps. The agent observes only the loss of the action it selected each step. Therefore, the MAB problem captures well the crucial trade-off between exploration and exploitation, where the agent needs to explore various actions in order to gather information about them.

MAB research discusses two main settings: the stochastic setting, where the losses of each action are sampled i.i.d. from an unknown distribution, and the nonstochastic (adversarial) setting, where we make no assumptions about the loss sequences. In this work we consider the nonstochastic setting and the objective of minimizing the regret: the difference between the agent's cumulative loss and the cumulative loss of the best action in hindsight. It is known that a regret of the order of $\Theta(\sqrt{KT})$ is the best that can be guaranteed, where $K$ is the number of actions and $T$ is the time horizon.
In contrast, when the losses of all actions are observed (full-information feedback), the regret can be of the order of $\Theta(\sqrt{T \ln K})$ (see, e.g., [Cesa-Bianchi and Lugosi, 2006, Bubeck et al., 2012]).

The main focus of our work is to consider agents that are connected in a communication graph, and can exchange messages in each step, in order to reduce their individual regret. This is possible since the losses depend only on the action and the time step, but not on the agent.

One extreme case is when the communication graph is a clique, i.e., any pair of agents can communicate directly. In this case, the agents can run the well-known Exp3 algorithm [Auer et al., 2002], and each guarantee a regret of $O(\sqrt{T \ln K})$, assuming there are at least $K$ agents (see [Seldin et al., 2014, Cesa-Bianchi et al., 2019b]).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

However, in many motivating applications, such as distributed learning, or communication tasks such as routing, the communication graph is not a clique.

The work of Cesa-Bianchi et al. [2019b] studies a general communication graph, where the agents can communicate in order to reduce their regret. The paper presents the Exp3-Coop algorithm, which achieves an expected regret, when averaged over all agents, of $\widetilde{O}\left(\sqrt{\left(1 + \frac{K}{N}\alpha(G)\right) T}\right)$, where $\alpha(G)$ is the independence number of the communication graph $G$, and $N$ is the number of agents. The question of whether it is possible to obtain a low individual regret, that holds simultaneously for all agents, was left as an open question. We answer this question affirmatively in this work. Our main contribution is an individual expected regret bound, which holds for each agent $v$, of order

$$\widetilde{O}\left(\sqrt{\left(1 + \frac{K}{|\mathcal{N}(v)|}\right) T}\right),$$

where $\mathcal{N}(v)$ is the set of neighbors of agent $v$ in the communication graph. We remark that our result also implies the previous average regret bound.

The main idea of our algorithm is to artificially partition the graph into disjoint connected components. Each component has a center agent, which is in some sense the leader of the component. The center agent has (almost) the largest degree in the component, and it selects actions using the Exp3-Coop algorithm. By observing the outcomes of its immediate neighboring agents, the center agent can guarantee its own desired individual regret. The main challenge is to create such components with a relatively small diameter, so that the center will be able to broadcast its information in a short time to all the agents in the component. Special care is given to relate the agents' local parameters (degree) to the global component parameters (degree of the center agent and the broadcast time).

We consider both the case that the communication graph is known to all the agents in advance (the informed setting), and the case that the graph is unknown (the uninformed setting). In the uninformed setting, we assume each agent knows its local neighborhood (i.e., the set of its neighbors), and an upper bound on the total number of agents. The regret bound in the uninformed setting is higher by a logarithmic factor and the algorithm is more complex.

In the next section, we formally define our model and review preliminary material. Section 3 shows the center-based policy, given a graph partition. We then present our graph partitioning algorithms in Section 4.
An overview of the analysis is given in Section 5, while all proofs are deferred to the supplementary material. Our work is concluded in Section 6.

1.1 Additional related works

The cooperative nonstochastic MAB setting was introduced by Awerbuch and Kleinberg [2008], where they bound the average regret when some agents might be dishonest and the communication is done through a public channel (clique network). The previously mentioned [Cesa-Bianchi et al., 2019b] also considers the issue of delays, and presents a bound on the average regret for a general graph of order $\widetilde{O}\left(\sqrt{\left(d + \frac{K}{N}\alpha(G)\right) T} + d\right)$, when messages need $d$ steps to arrive. Dist-Hedge, introduced by Sahu and Kar [2017], considers a network of forecasting agents, with delayed and inexact losses, and derives a sub-linear individual regret bound that also depends on spectral properties of the graph. More recently, Cesa-Bianchi et al. [2019a] studied an online learning model where only a subset of the agents play at each time step, and showed matching upper and lower bounds of order $\sqrt{\alpha(G) T}$ on the average regret when the set of agents that play each step is chosen stochastically. When the set of agents is chosen arbitrarily, the lower bound becomes linear in $T$.

In the stochastic setting, Landgren et al. [2016a,b] presented a cooperative variant of the well-known UCB algorithm, which uses a consensus algorithm for estimating the mean losses, to obtain a low average regret. More cooperative variants of the UCB algorithm that yield a low average regret were presented by Kolla et al. [2018]. They also showed a policy where, as in the methods in this work, agents with a low degree follow the actions of agents with a high degree. Stochastic MAB over P2P communication networks was studied by Szörényi et al. [2013], who showed that the probability to select a sub-optimal arm reduces linearly with the number of peers. The case where only one agent can observe losses was investigated by Kar et al. [2011]. This agent needs to broadcast information through the network, and it was shown this is enough to obtain a low average regret.

Another multi-agent research area involves agents that compete over shared resources. The motivation comes from radio channel selection, where multiple devices need to choose a radio channel, and two or more devices that use the same channel simultaneously interfere with each other. In this setting, many papers assume agents cannot communicate with each other, and do not receive a reward upon collision, i.e., when more than one agent tries to choose the same action at the same step. The first to give regret bounds on this variant were Avner and Mannor [2014], who presented an average regret bound of order $O(T^{2/3})$ in the stochastic setting. Also in the stochastic setting, Rosenski et al. [2016] showed an expected average regret bound of order $O\left(\frac{K}{\Delta^2}\ln\left(\frac{K}{\delta}\right) + N\right)$ that holds with probability $1-\delta$, where $\Delta$ is the minimal gap between the mean rewards (notice that this bound is independent of $T$). In the same paper, they also studied the case that the number of agents may change each step, and presented a regret bound of $\widetilde{O}(\sqrt{xT})$, where $x$ is the total number of agents throughout the game. Bistritz and Leshem [2018] consider the case that different agents have different mean rewards, and each agent has a different unique action it should choose to maximize the total reward. They showed an average regret of order $O(\log^{2+\epsilon} T)$ for every $\epsilon > 0$, where the $O$-notation hides the dependency on the mean rewards.

2 Preliminaries

We consider a nonstochastic multi-armed bandit problem over a finite action set $A = \{1, \dots, K\}$, played by $N$ agents. Let $G = \langle V, E \rangle$ be an undirected connected communication graph over the set of agents $V = \{1, \dots, N\}$, and denote by $N(v)$ the neighborhood of $v \in V$, including itself. Namely,

$$N(v) = \{u \in V \mid \langle u, v \rangle \in E\} \cup \{v\}.$$

At each time step $t = 1, 2, \dots, T$, each agent $v \in V$ draws an action $I_t(v) \in A$ from a distribution $p^v_t = \langle p^v_t(1), \dots, p^v_t(K) \rangle$ on $A$. It then suffers a loss $\ell_t(I_t(v)) \in [0, 1]$, which it observes. Notice the loss does not depend on the agent, but only on the time step and the chosen action. Thus, agents that pick the same action at the same step will suffer the same loss. We also assume the adversary is oblivious, i.e., the losses do not depend on the agents' realized actions. At the end of step $t$, each agent sends a message

$$m_t(v) = \langle v, t, I_t(v), \ell_t(I_t(v)), p^v_t \rangle$$

to all the agents in its neighborhood, and also receives messages from its neighbors: $m_t(v')$ for all $v' \in N(v)$. Our goal is to minimize, for each $v \in V$, its expected regret over $T$ steps:

$$R_T(v) = E\left[\sum_{t=1}^{T} \ell_t(I_t(v))\right] - \min_{i \in A} \sum_{t=1}^{T} \ell_t(i).$$

A well-known policy to update $p^v_t$ is the exponential-weights algorithm (Exp3), with weights $w^v_t(i)$ for all $i \in A$, such that $p^v_t(i) = \frac{w^v_t(i)}{W^v_t}$, where $W^v_t = \sum_{i \in A} w^v_t(i)$ (see, e.g., [Cesa-Bianchi and Lugosi, 2006]).
The weights are updated as follows. Let $B^v_t(i)$ be the event that $v$ observed the loss of action $i$ at step $t$; in our case $B^v_t(i) = \mathbb{I}\{\exists v' \in N(v) : I_t(v') = i\}$, where $\mathbb{I}$ is the indicator function. Also, let $\hat{\ell}^v_t(i) = \frac{\ell_t(i)}{E_t[B^v_t(i)]} B^v_t(i)$ be an unbiased estimated loss of action $i$ at step $t$, where $E_t[\cdot]$ is the expectation conditioned on all the agents' choices up to step $t$ (hence, $E_t[\hat{\ell}^v_t(i)] = \ell_t(i)$). Then

$$w^v_{t+1}(i) = w^v_t(i) \exp\left(-\eta(v) \hat{\ell}^v_t(i)\right),$$

where $\eta(v)$ is a positive parameter chosen by $v$, called the learning rate of agent $v$. Exp3 is given explicitly in the supplementary material. Notice that in our setting all agents $v \in V$ have the information needed to compute $\hat{\ell}^v_t(i)$, since

$$E_t[B^v_t(i)] = \Pr[\exists v' \in N(v) : I_t(v') = i] = 1 - \prod_{v' \in N(v)} \left(1 - p^{v'}_t(i)\right),$$

and if agent $v$ does not observe $\ell_t(i)$, then $\hat{\ell}^v_t(i) = 0$.

We proceed with two useful lemmas that will help us later. For completeness, we provide their proofs in the supplementary material as well. The first lemma is the usual analysis of the exponential-weights algorithm:

Lemma 1. Assuming agent $v$ uses the exponential-weights algorithm, its expected regret satisfies

$$R_T(v) \le \frac{\ln K}{\eta(v)} + \frac{\eta(v)}{2} E\left[\sum_{t=1}^{T} \sum_{i=1}^{K} p^v_t(i) \, \hat{\ell}^v_t(i)^2\right].$$

The next lemma is from [Cesa-Bianchi et al., 2019b], and it bounds the change of the action distribution in the exponential-weights algorithm.

Lemma 2. Assuming agent $v$ uses the exponential-weights algorithm with a learning rate $\eta(v) \le \frac{1}{2K}$, then for all $i \in A$:

$$p^v_t(i)\left(1 - \eta(v)\hat{\ell}^v_t(i)\right) \le p^v_{t+1}(i) \le 2 p^v_t(i).$$

Also, the following definition will be needed for our algorithm. We denote by $G^r$ the $r$-th power of $G$, in which $u, v \in V$ are adjacent if and only if $\mathrm{dist}_G(u, v) \le r$; and by $G|_U$ the sub-graph of $G$ induced by $U \subseteq V$.

Definition 3. Let $G = \langle V, E \rangle$ be an undirected connected graph and let $W \subseteq U \subseteq V$. $W$ is called an $r$-independent set of $G$ if it is an independent set of $G^r$. Namely,

$$\forall w, w' \in W : \mathrm{dist}_G(w, w') \ge r + 1.$$

If $W$ is also a maximal independent set of $(G^r)|_U$, it is called a maximal $r$-independent subset ($r$-MIS) of $U$. Namely, there is no $r$-independent set $W' \subseteq U$ such that $W \subset W'$.

3 Center-based cooperative multi-armed bandits

We now present the center-based policy for the cooperative multi-armed bandit setting, which will give us the desired low individual regret. In the center-based cooperative MAB, not all the agents behave similarly. We partition the agents into three different types.

Center agents are the agents that determine the action distribution for all other agents. They work together with their neighbors to minimize their regret.
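As a concrete illustration, the exponential-weights update with the cooperative loss estimate $\hat{\ell}^v_t$ above can be sketched for a single agent as follows (a minimal sketch; the function and variable names are ours, and the message exchange that delivers the neighbors' actions and distributions is abstracted away):

```python
import math

def coop_exp3_update(weights, neighbor_dists, neighbor_actions, losses, eta):
    """One cooperative exponential-weights step for a single agent v.

    weights          -- current weights w^v_t(i), one entry per action
    neighbor_dists   -- action distributions p^{v'}_t of all v' in N(v)
                        (the paper's N(v) includes v itself)
    neighbor_actions -- actions actually played by those neighbors this step
    losses           -- loss vector l_t (only observed entries are used)
    eta              -- learning rate eta(v)
    """
    K = len(weights)
    new_weights = list(weights)
    for i in range(K):
        # B^v_t(i): did some neighbor (or v itself) play action i?
        observed = any(a == i for a in neighbor_actions)
        if not observed:
            continue  # unobserved actions get estimated loss 0
        # E_t[B^v_t(i)] = 1 - prod_{v' in N(v)} (1 - p^{v'}_t(i))
        q = 1.0
        for dist in neighbor_dists:
            q *= 1.0 - dist[i]
        q = 1.0 - q
        est_loss = losses[i] / q  # importance-weighted loss estimate
        new_weights[i] = weights[i] * math.exp(-eta * est_loss)
    return new_weights
```

With two actions, a single neighbor playing action 0 from the uniform distribution, the estimate for action 0 is $\ell_t(0)/0.5$ and action 1 keeps its weight unchanged.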
The neighbors of the center agents in the communication graph, the center-adjacent agents, always copy the action distribution from their neighboring center, and thus the centers gain more information about their own distribution each step.

The other agents (neither centers nor center-adjacent) are simple agents, which simply copy the action distribution from one of the centers. Since they are not center-adjacent, they receive the action distribution with delay, through other agents that copy from the same center.

We artificially partition the graph into connected components, such that each center $c$ has its own component, and all the simple agents in the component of $c$ copy their action distribution from it. To obtain a low individual regret, we require the components to have a relatively small diameter, and the center agents to have a high degree in the communication graph. Namely, center agents have the highest or nearly highest degree in their component.

In more detail, we select a set $C \subseteq V$ of center agents. All center agents $c \in C$ use the exponential-weights algorithm with a learning rate $\eta(c) = \frac{1}{2}\sqrt{\frac{(\ln K)\min\{|N(c)|, K\}}{KT}}$. The agent set $V$ is partitioned into disjoint subsets $\{V_c \subseteq V \mid c \in C\}$, such that $N(c) \subseteq V_c$ for all $c \in C$, and the sub-graph $G_c \equiv G|_{V_c}$ induced by $V_c$ is connected. Notice that since the components are disjoint, the condition $N(c) \subseteq V_c$ implies $C$ is a 2-independent set. For all non-centers $v \in V \setminus C$, we denote by $C(v) \in C$ the center agent such that $v \in V_{C(v)}$, and call it the center of $v$. All non-center agents $v \in V \setminus C$ copy their distribution from their origin neighbor $U(v)$, which is their neighbor in $G_{C(v)}$ closest to $C(v)$, breaking ties arbitrarily. Namely,

$$U(v) = \mathop{\arg\min}_{v' \in N(v) \cap V_{C(v)}} \mathrm{dist}_{G_{C(v)}}(v', C(v)).$$

Thus, agent $v$ receives its center's distribution with a delay of $d(v) = \mathrm{dist}_{G_{C(v)}}(v, C(v))$ steps, so for all $t \ge d(v) + 1$:

$$p^v_t = p^{C(v)}_{t - d(v)}.$$

Notice that if $v \in N(c)$, then $v$ is center-adjacent and it holds that $U(v) = C(v)$ and $d(v) = 1$. For completeness, we define $U(c) = C(c) = c$ and $d(c) = 0$ for all $c \in C$.

To express the regret of the center-based policy, we introduce a new concept:

Definition 4. The mass of a center agent $c \in C$ is defined to be

$$M(c) \equiv \min\{|N(c)|, K\},$$

and the mass of a non-center agent $v \in V \setminus C$ is

$$M(v) \equiv e^{-\frac{1}{6} d(v)} M(C(v)).$$

Notice the mass depends only on how the graph is partitioned, and it satisfies $M(v) = e^{-\frac{1}{6}} M(U(v))$ for all non-centers $v \in V \setminus C$. Intuitively, the mass of agent $v$ captures the idea that the larger the degree of its center, and the closer the agent is to its center, the lower the regret of $v$. We prove that the regret is $\widetilde{O}\left(\sqrt{\frac{K}{M(v)} T}\right)$. Our partitioning algorithms, presented in the next section, show that the mass of agent $v$ satisfies $M(v) = \Omega(\min\{|N(v)|, K\})$, so we obtain an individual regret of the order of $\widetilde{O}\left(\sqrt{\left(1 + \frac{K}{|N(v)|}\right) T}\right)$.

We specify the center-based policy in Algorithms 1 and 2. We emphasize that before the agents use the center-based policy, they must partition the graph with one of the algorithms we present in the next section. While the agents partition the graph, they play arbitrary actions.

Algorithm 1 Center-based cooperative MAB - v is a center agent
Parameters: Number of arms K; Time horizon T.
Initialize: $\eta(v) \leftarrow \frac{1}{2}\sqrt{\frac{(\ln K) M(v)}{KT}}$; $w^v_1(i) \leftarrow \frac{1}{K}$ for all $i \in A$.
1: for $t \le T$ do
2:   Set $p^v_t(i) \leftarrow \frac{w^v_t(i)}{W^v_t}$ for all $i \in A$, where $W^v_t = \sum_{i \in A} w^v_t(i)$.
3:   Play an action $I_t(v)$ drawn from $p^v_t = \langle p^v_t(1), \dots, p^v_t(K) \rangle$.
4:   Observe loss $\ell_t(I_t(v))$.
5:   Send the following message to the set $N(v)$: $m_t(v) = \langle v, t, I_t(v), \ell_t(I_t(v)), p^v_t \rangle$.
6:   Receive all messages $m_t(v')$ from $v' \in N(v)$.
7:   Update for all $i \in A$: $w^v_{t+1}(i) \leftarrow w^v_t(i) \exp\left(-\eta(v) \hat{\ell}^v_t(i)\right)$, where
     $\hat{\ell}^v_t(i) = \frac{\ell_t(i)}{E_t[B^v_t(i)]} B^v_t(i)$, $B^v_t(i) = \mathbb{I}\{\exists v' \in N(v) : I_t(v') = i\}$, $E_t[B^v_t(i)] = 1 - \prod_{v' \in N(v)} \left(1 - p^{v'}_t(i)\right)$.
8: end for

Algorithm 2 Center-based cooperative MAB - v is a non-center agent
Parameters: Number of arms K; Time horizon T; Origin neighbor U(v).
Initialize: $p^v_1(i) \leftarrow \frac{1}{K}$ for all $i \in A$.
1: for $t \le T$ do
2:   Play an action $I_t(v)$ drawn from $p^v_t = \langle p^v_t(1), \dots, p^v_t(K) \rangle$.
3:   Observe loss $\ell_t(I_t(v))$.
4:   Send the following message to the set $N(v)$: $m_t(v) = \langle v, t, I_t(v), \ell_t(I_t(v)), p^v_t \rangle$.
5:   Receive the message $m_t(U(v))$ from $U(v)$.
6:   Update $p^v_{t+1}(i) \leftarrow p^{U(v)}_t(i)$ for all $i \in A$.
7: end for

4 Partitioning the graph

The goal now is to show that we can partition the graph such that the mass is large for every $v \in V$. In particular, we want to show that any graph can be partitioned such that $M(v) = \Omega(\min\{|N(v)|, K\})$.

We consider two cases: the informed and uninformed settings. In the informed setting, all of the agents have access to the graph structure.
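To make Definition 4 concrete, the following small sketch computes the mass of every agent in a single component, assuming the partition is already given (the adjacency representation and helper names are ours, not from the paper):

```python
import math
from collections import deque

def masses(adj, component, center, K):
    """Mass of every agent in one component (Definition 4).

    adj       -- global adjacency: adj[v] = set of neighbors of v (excluding v)
    component -- the set V_c of agents in the component of `center`
    center    -- the center agent c of this component
    K         -- number of actions
    """
    # |N(c)| includes c itself in the paper's convention
    center_mass = min(len(adj[center]) + 1, K)
    # BFS inside the induced sub-graph G_c gives the delays d(v)
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u in component and u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    # M(v) = exp(-d(v)/6) * M(c)
    return {v: math.exp(-dist[v] / 6) * center_mass for v in component}
```

For a star with center 0 and an extra agent hanging off one leaf, the leaves get mass $e^{-1/6} M(c)$ and the agent at distance 2 gets $e^{-1/3} M(c)$, matching the per-hop decay $M(v) = e^{-1/6} M(U(v))$.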
Each agent can partition the graph by itself in advance, to know the role it plays: whether it is a center or not, and which agent is its origin neighbor. In the uninformed setting, the graph structure is not known to the agents, only their neighbors and an upper bound on the total number of agents $\bar{N} \ge N$. The agents partition the graph using a distributed algorithm while playing actions and suffering loss.

The basic structure of the partitioning algorithm in both settings is the same. First, we show an algorithm that computes the connected components given a center set $C$. Then, we show an algorithm that computes a center set $C$. The second algorithm is specifically designed to be used with the first, and together they partition the graph into connected components such that every agent has a large mass.

4.1 Computing graph components given a center set

Given a center set $C$, we show a distributed algorithm called Centers-to-Components, which computes the connected components; we present it in Algorithm 3. Although it is distributed, in the informed setting agents can simply simulate it locally in advance.

Algorithm 3 Centers-to-Components
Parameters: Number of arms K; Center set C.
Initialize: Number of iterations $\Theta_K \leftarrow \lfloor 12 \ln K \rfloor$.
If $v \in C$, initialize: $C_0(v) \leftarrow v$; $U_0(v) \leftarrow v$; $M_0(v) \leftarrow \min\{|N(v)|, K\}$.
Else, initialize: $C_0(v) \leftarrow$ nil; $U_0(v) \leftarrow$ nil; $M_0(v) \leftarrow 0$.
1: for $0 \le t \le \Theta_K$ do
2:   Send the following message to the set $N(v)$: $\mu_t(v) = \langle v, t, C_t(v), M_t(v) \rangle$.
3:   Receive all messages $\mu_t(v')$ from $v' \in N(v)$.
4:   if $U_t(v) \notin C$ then      ▷ The center-based policy requires $N(c) \subseteq V_c$ for all $c \in C$.
5:     Find the best origin neighbor for $v$: $U_{t+1}(v) \leftarrow \mathop{\arg\max}_{v' \in N(v) \setminus \{v\}} M_t(v')$.
6:     Update: $C_{t+1}(v) \leftarrow C_t(U_{t+1}(v))$; $M_{t+1}(v) \leftarrow e^{-\frac{1}{6}} M_t(U_{t+1}(v))$.
7:   else
8:     Keep old values: $C_{t+1}(v) \leftarrow C_t(v)$; $U_{t+1}(v) \leftarrow U_t(v)$; $M_{t+1}(v) \leftarrow M_t(v)$.
9:   end if
10: end for
11: return $C(v) = C_{\Theta_K + 1}(v)$; $U(v) = U_{\Theta_K + 1}(v)$; $M(v) = M_{\Theta_K + 1}(v)$.

Centers-to-Components runs simultaneous distributed BFS graph traversals, originating from every center $c \in C$. When the traversal of center $c$ arrives at a simple agent $v \in V \setminus C$, $v$ decides if $c$ is the best center for it so far, and if so, $v$ switches its component to $V_c$. Notice that each agent needs to know only whether it itself is a center or not.

4.2 Computing centers

To compute the center set $C$, we show two algorithms: one for the informed setting and one for the uninformed setting. The regret bound for the informed setting is slightly better, and the algorithm is simpler.

The informed setting. The algorithm that computes the center set in the informed setting is called Compute-Centers-Informed and is presented in Algorithm 4. The center set is built in a greedy way: at each iteration, all of the agents test if they are "satisfied" with the current center set (i.e., $M(v) \ge \min\{|N(v)|, K\}$).
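Each such satisfaction test re-runs Centers-to-Components on the grown center set. For intuition, here is a centralized, synchronous sketch of that subroutine (the real Algorithm 3 is distributed message-passing; the data-structure and function names are ours):

```python
import math

def centers_to_components(adj, centers, K):
    """Round-based sketch of Centers-to-Components (Algorithm 3).

    adj     -- adjacency: adj[v] = set of neighbors of v (excluding v)
    centers -- the center set C (assumed 2-independent)
    K       -- number of arms (fixes the number of rounds, ~12 ln K)
    """
    center = {}  # C_t(v): current center of v (None = unassigned)
    origin = {}  # U_t(v): current origin neighbor of v
    mass = {}    # M_t(v)
    for v in adj:
        if v in centers:
            center[v], origin[v] = v, v
            mass[v] = min(len(adj[v]) + 1, K)  # |N(v)| includes v itself
        else:
            center[v], origin[v], mass[v] = None, None, 0.0

    for _ in range(int(12 * math.log(K)) + 1):
        new = {}
        for v in adj:
            # agents whose origin is already a center never switch;
            # this keeps N(c) inside the component of c
            if origin[v] in centers:
                continue
            best = max(adj[v], key=lambda u: mass[u])
            if mass[best] > 0:
                new[v] = (center[best], best, math.exp(-1 / 6) * mass[best])
        for v, (c, u, m) in new.items():  # commit all updates synchronously
            center[v], origin[v], mass[v] = c, u, m
    return center, origin, mass
```

On a path 0-1-2 with center set {0}, membership spreads one hop per round: agent 1 adopts center 0 with mass $e^{-1/6} M(0)$, and agent 2 adopts it a round later through origin neighbor 1, with mass $e^{-1/3} M(0)$.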
If there are unsatisfied agents left, the agent with the highest degree among them is added to the center set.

Algorithm 4 Compute-Centers-Informed
Parameters: Undirected connected graph $G = \langle V, E \rangle$; Number of arms K.
Initialize: Center set $C_0 \leftarrow \emptyset$; Unsatisfied agents $S_0 \leftarrow V$; $t \leftarrow 0$.
1: while $S_t \ne \emptyset$ do
2:   Choose the next center: $c_t \leftarrow \mathop{\arg\max}_{v \in S_t} |N(v)|$.
3:   Update $C_{t+1} \leftarrow C_t \cup \{c_t\}$.
4:   Run Centers-to-Components with center set $C_{t+1}$, and obtain mass $M_{t+1}(v)$ for each $v \in V$.
5:   Update
     $$S_{t+1} \leftarrow \left\{v \in V \,\middle|\, M_{t+1}(v) < \min\{|N(v)|, K\} \,\wedge\, \min_{c \in C_{t+1}} \mathrm{dist}_G(v, c) \ge 3\right\}.$$
6:   $t \leftarrow t + 1$.
7: end while
8: return $C = C_t$.

The uninformed setting. At first, it may seem that the uninformed setting can be solved the same way as the informed setting, with some distributed version of Compute-Centers-Informed. However, such an algorithm would require $\Omega(N)$ steps in the worst case, since at each iteration only one agent becomes a center. In the informed setting we do not care about this, since the components are computed in advance. In the uninformed setting, however, at each step of the algorithm the agents suffer a loss, and thus the regret bound would be at least linear in the number of agents, which can be very large.

To avoid this problem, we need to add many centers each iteration, and not just one as in Compute-Centers-Informed. To do this, we exploit the fact that there are only $K$ possible values for a center's mass. In our algorithm, there are $K$ iterations, and in each iteration $t$, as many agents as possible with degree $K - t$ become centers.
To ensure the final center set is 2-independent, only a 2-MIS of the potential center agents is added to the center set each iteration.

To compute a 2-MIS in a distributed manner, we use Luby's algorithm [Luby, 1986, Alon et al., 1986] on the sub-graph of $G^2$ induced by the potential center agents. Briefly, at each iteration of Luby's algorithm, every potential center agent picks a number uniformly from $[0, 1]$. Agents that picked the maximal number among their neighbors of distance 2 join the 2-MIS, and their neighbors of distance 2 stop participating. A 2-MIS is computed after $\left\lceil 3 \ln\left(\frac{N}{\sqrt{\delta}}\right) \right\rceil$ iterations with probability $1 - \delta$. Each iteration requires exchanging 4 messages: 2 for communicating the random numbers and 2 for communicating the new agents in the 2-MIS. Hence, $4\left\lceil 3 \ln\left(\frac{N}{\sqrt{\delta}}\right) \right\rceil$ steps suffice to compute a 2-MIS with probability $1 - \delta$. A more detailed explanation of Luby's algorithm can be found in the supplementary material.

We present Compute-Centers-Uninformed in Algorithm 5. Since this is a distributed algorithm, we have the variables $C(v)$ and $S(v)$ as indicators for whether $v$ is a center or unsatisfied, respectively.

Algorithm 5 Compute-Centers-Uninformed - agent v
Parameters: Number of arms K; Upper bound on the total number of agents $\bar{N}$; Time horizon T.
Initialize: Center indicator $C(v) \leftarrow$ FALSE; Unsatisfied indicator $S(v) \leftarrow$ TRUE.
1: for $0 \le t \le K - 1$ do
2:   Participate for $4\left\lceil 3 \ln\left(\bar{N}\sqrt{KT}\right) \right\rceil$ steps in Luby's algorithm on $(G^2)|_{S_t}$, where
     $$S_t = \{v \in V \mid S(v) = \text{TRUE} \,\wedge\, \min\{|N(v)|, K\} = K - t\},$$
     to compute $W_t$, a 2-MIS of $S_t$, with probability $1 - \frac{1}{TK}$.
3:   If $v \in W_t$, set $C(v) \leftarrow$ TRUE.
4:   Participate in Centers-to-Components with center set $C_t = \{v' \in V \mid C(v') = \text{TRUE}\}$;
     obtain mass $M_t(v)$ and whether $\min_{c \in C_t} \mathrm{dist}_G(v, c) \ge 3$.
     ▷ $\min_{c \in C_t} \mathrm{dist}_G(v, c) \ge 3$ if and only if $C_2(v) =$ nil in Centers-to-Components.
5:   Update
     $$S(v) \leftarrow \mathbb{I}\left[M_t(v) < \min\{|N(v)|, K\} \,\wedge\, \min_{c \in C_t} \mathrm{dist}_G(v, c) \ge 3\right].$$
6: end for
7: return $C = C_{K-1}$.

5 Regret analysis

We will now provide an overview of the analysis of our algorithms. We remind the reader that all proofs are deferred to the supplementary material.

5.1 Individual regret of the center-based policy

We start by bounding the expected regret of the agents when they are using the center-based policy.

Theorem 5. Let $T \ge K^2 \ln K$. Using the center-based policy, the regret of each agent $v \in V$ satisfies

$$R_T(v) \le 7\sqrt{(\ln K)\frac{K}{M(v)} T}.$$

This individual regret bound holds simultaneously for all agents in the graph, and it depends only on the graph structure and components.

5.2 Analyzing Centers-to-Components

We need to show that the results of Centers-to-Components follow their definitions, and that the derived components satisfy all the properties required by the center-based policy. The following lemma shows this under some requirements on the center set $C$.

Lemma 6. Let $C \subseteq V$ be a center set that is 2-independent, such that every $v \in V$ holds $\min_{c \in C} \mathrm{dist}_G(v, c) \le 6 \ln K - 1$.
Let C (v) , U (v) , M (v) be the results of Centers-to-Components.\nFor each c \u2208 C, let Vc be its corresponding component, namely, Vc = {v \u2208 V | C (v) = c}. Then\nthe following properties are satis\ufb01ed:\n\n1. {Vc | c \u2208 C} are pairwise disjoint and V =(cid:83)\n\nc\u2208C Vc.\n\n2. N (c) \u2286 Vc and Gc is connected for all c \u2208 C.\n3. M (v) = e\u2212 1\n\n6 d(v)M (C (v)) and U (v) = arg minv(cid:48)\u2208N (v)\u2229VC(v)\n\nd (v(cid:48)) for all v \u2208 V \\ C.\n\n5.3 Analyzing Compute-Centers-Informed\n\nThe \ufb01rst thing we need to show is that the center set returned by Compute-Centers-Informed satis\ufb01es\nthe conditions of Lemma 6:\nLemma 7. Let C \u2286 V be the center set returned by Compute-Centers-Informed. Then:\n\n1. C is 2-independent.\n2. For all v \u2208 V , minc\u2208C distG (v, c) \u2264 6 ln K \u2212 1.\n\nNow, we can show that by using our informed graph partitioning algorithms, the mass of all agents is\nlarge:\nTheorem 8. Let C \u2286 V be the center set returned by Compute-Centers-Informed, and let\n{Vc \u2286 V | c \u2208 C} be the components resulted from Centers-to-Components. For every v \u2208 V :\n\nM (v) \u2265 e\u22121 min{|N (v)| , K} .\n\nTogether with Theorem 5, we obtain the desired regret bound.\n\n8\n\n\fCorollary 9. Let T \u2265 K 2 ln K. Let C \u2286 V be the center set returned by Compute-Centers-\nInformed, and let {Vc \u2286 V | c \u2208 C} be the components resulted from Centers-to-Components. Using\nthe center-based policy, we obtain for every v \u2208 V :\n\nRT (v) \u2264 12\n\n(ln K)\n\n1 +\n\nK\n\n|N (v)|\n\n1 +\n\nK\n\n|N (v)|\n\n(cid:32)(cid:115)(cid:18)\n\n(cid:19)\n\nT = (cid:101)O\n\n(cid:33)\n\n(cid:19)\n\nT\n\n.\n\n(cid:115)\n\n(cid:18)\n\n5.4 Analyzing Compute-Centers-Uninformed\n\nFirst, we show that Compute-Centers-Uninformed terminates after a relatively small number of steps,\nand thus the loss suffered while running it is insigni\ufb01cant.\n\nLemma 10. 
Compute-Centers-Uninformed runs for less than 12K ln(K²N̄T) steps.

As in the informed setting, we now need to show that the center set resulting from Compute-Centers-Uninformed satisfies the conditions of Lemma 6.
Lemma 11. Let C ⊆ V be the center set resulting from Compute-Centers-Uninformed, and suppose Luby's algorithm succeeded at all iterations of the algorithm. Then:

1. C is 2-independent.

2. For all v ∈ V, min_{c ∈ C} dist_G(v, c) ≤ 6 ln K − 1.

We can now obtain the same result as in the informed setting:
Theorem 12. Let C ⊆ V be the center set resulting from Compute-Centers-Uninformed, suppose Luby's algorithm succeeded at all iterations of the algorithm, and let {V_c ⊆ V | c ∈ C} be the components resulting from Centers-to-Components. For every v ∈ V:

    M(v) ≥ e^{−1} min{|N(v)|, K}.

Again, we can use Theorem 5 to obtain the desired regret bound.
Corollary 13. Let T ≥ K² ln K and N̄ ≥ N. Let C ⊆ V be the center set resulting from Compute-Centers-Uninformed, and let {V_c ⊆ V | c ∈ C} be the components resulting from Centers-to-Components. Using the center-based policy, we obtain for every v ∈ V:

    R_T(v) ≤ 12 √( (ln K) · (1 + K/|N(v)|) · T ) + K ln(K²N̄T) + 1 = Õ( √( (1 + K/|N(v)|) T ) ).

5.5 Average regret of the center-based policy

As mentioned before, we strictly improve the result of Cesa-Bianchi et al. [2019b], and our algorithms imply the same average expected regret bound.
Corollary 14. Let T ≥ K² ln K. Let C ⊆ V be the center set resulting from Compute-Centers-Informed or Compute-Centers-Uninformed, and let {V_c ⊆ V | c ∈ C} be the components resulting from Centers-to-Components. 
Using the center-based policy, we get:

    (1/N) ∑_{v ∈ V} R_T(v) = Õ( √( (1 + (K/N) α(G)) T ) ).

6 Conclusions

We investigated the cooperative nonstochastic multi-armed bandit problem and presented the center-based cooperation policy (Algorithms 1 and 2). We provided partitioning algorithms that provably yield a low individual regret bound holding simultaneously for all agents (Algorithms 3, 4 and 5). We express this bound in terms of the agents' degrees in the communication graph. This bound strictly improves a previous regret bound from [Cesa-Bianchi et al., 2019b] (Corollary 14), and also resolves an open question from that paper.
Note that our regret bound in the informed setting does not depend on the total number of agents N, and in the uninformed setting it depends on N̄ only logarithmically. It is unclear whether, in the uninformed setting, any dependence on N in the individual regret is required.

Acknowledgments

This work was supported in part by the Yandex Initiative in Machine Learning and by a grant from the Israel Science Foundation (ISF).

References

Noga Alon, László Babai, and Alon Itai. A fast and simple randomized parallel algorithm for the maximal independent set problem. Journal of Algorithms, 7(4):567–583, 1986.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Orly Avner and Shie Mannor. Concurrent bandits and cognitive radio networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 66–81. Springer, 2014.

Baruch Awerbuch and Robert Kleinberg. Competitive collaborative learning. Journal of Computer and System Sciences, 74(8):1271–1288, 2008.

Ilai Bistritz and Amir Leshem. 
Distributed multi-player bandits - a game of thrones approach. In Advances in Neural Information Processing Systems, pages 7222–7232, 2018.

Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Nicolò Cesa-Bianchi, Tommaso R. Cesari, and Claire Monteleoni. Cooperative online learning: Keeping your neighbors updated. arXiv preprint arXiv:1901.08082, 2019a.

Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in nonstochastic bandits. The Journal of Machine Learning Research, 20(1):613–650, 2019b.

Soummya Kar, H. Vincent Poor, and Shuguang Cui. Bandit problems in networks: Asymptotically efficient distributed allocation rules. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, pages 1771–1778. IEEE, 2011.

Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking (TON), 26(4):1782–1795, 2018.

Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), pages 243–248. IEEE, 2016a.

Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 167–172. IEEE, 2016b.

Michael Luby. A simple parallel algorithm for the maximal independent set problem. SIAM Journal on Computing, 15(4):1036–1053, 1986.

Jonathan Rosenski, Ohad Shamir, and Liran Szlak. 
Multi-player bandits - a musical chairs approach. In International Conference on Machine Learning, pages 155–163, 2016.

Anit Kumar Sahu and Soummya Kar. Dist-Hedge: A partial information setting based distributed non-stochastic sequence prediction algorithm. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 528–532. IEEE, 2017.

Yevgeny Seldin, Peter L. Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. In ICML, pages 280–287, 2014.

Balázs Szörényi, Róbert Busa-Fekete, István Hegedűs, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossip-based distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 2, pages 1056–1064. International Machine Learning Society, 2013.

V. K. Wei. A lower bound on the stability number of a simple graph. Technical report, Bell Laboratories Technical Memorandum 81-11217-9, Murray Hill, NJ, 1981.
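The partitioning algorithms above repeatedly invoke Luby's randomized maximal-independent-set routine [Luby, 1986] on the square graph G² to obtain a 2-MIS (a set of centers pairwise at distance at least 3 in G, with every node within distance 2 of some center). As an illustration only, the following is a minimal centralized Python sketch of that subroutine; the names `neighbors2` and `luby_2mis` are ours, and the agents' actual message-passing implementation in the paper is distributed and round-limited rather than run to completion.

```python
import random


def neighbors2(adj, v):
    """Neighbors of v in G^2: all nodes at distance 1 or 2 from v in G.

    adj: dict mapping each node to the set of its neighbors in G.
    """
    out = set(adj[v])
    for u in adj[v]:
        out |= adj[u]
    out.discard(v)
    return out


def luby_2mis(adj, seed=0):
    """Centralized sketch of Luby's algorithm run on G^2.

    Returns a maximal independent set of G^2: selected nodes are pairwise
    at distance >= 3 in G, and every node is within distance <= 2 of a
    selected node (the '2-MIS' property used to pick centers).
    """
    rng = random.Random(seed)
    active = set(adj)
    mis = set()
    while active:
        # Each active node draws a random priority; nodes that beat all
        # of their still-active G^2-neighbors join the independent set.
        prio = {v: rng.random() for v in active}
        winners = {
            v for v in active
            if all(prio[v] < prio[u]
                   for u in neighbors2(adj, v) if u in active)
        }
        mis |= winners
        # Winners and their entire G^2-neighborhoods leave the game,
        # preserving independence in later rounds.
        removed = set(winners)
        for v in winners:
            removed |= neighbors2(adj, v)
        active -= removed
    return mis
```

Termination follows because the active node with the globally smallest priority always wins its round, so the active set strictly shrinks; independence and maximality follow from how winners and their G²-neighborhoods are removed.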