{"title": "SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 12071, "page_last": 12080, "abstract": "Motivated by cognitive radio networks, we consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and collisions occur if one of them is pulled by several players at the same stage.  We present a decentralized algorithm that achieves the same performance as a centralized one,  contradicting the existing lower bounds for that problem. This is possible by ``hacking'' the standard model by constructing a communication protocol between players that deliberately enforces collisions, allowing them to share their information at a negligible cost. \nThis motivates the introduction of a more appropriate dynamic setting without sensing, where similar communication protocols are no longer possible. However, we show that the logarithmic growth of the regret is still achievable for this model with a new algorithm.", "full_text": "SIC - MMAB: Synchronisation Involves\n\nCommunication in Multiplayer Multi-Armed Bandits\n\nEtienne Boursier\n\nCMLA, ENS Paris-Saclay\n\netienne.boursier@ens-paris-saclay.fr\n\nVianney Perchet\n\nCMLA, ENS Paris-Saclay\n\nCriteo AI Lab, Paris\n\nvianney.perchet@normalesup.org\n\nAbstract\n\nMotivated by cognitive radio networks, we consider the stochastic multiplayer\nmulti-armed bandit problem, where several players pull arms simultaneously and\ncollisions occur if one of them is pulled by several players at the same stage.\nWe present a decentralized algorithm that achieves the same performance as a\ncentralized one, contradicting the existing lower bounds for that problem. 
This\nis possible by \u201chacking\u201d the standard model by constructing a communication\nprotocol between players that deliberately enforces collisions, allowing them to\nshare their information at a negligible cost. This motivates the introduction of a\nmore appropriate dynamic setting without sensing, where similar communication\nprotocols are no longer possible. However, we show that the logarithmic growth of\nthe regret is still achievable for this model with a new algorithm.\n\n1\n\nIntroduction\n\nIn the stochastic Multi Armed Bandit problem (MAB), a single player sequentially takes a decision\n(or \u201cpulls an arm\u201d) amongst a \ufb01nite set of possibilities [K] := {1, . . . , K}. After pulling arm k \u2208 [K]\nat stage t \u2208 N\u2217, the player receives a random reward Xk(t) \u2208 [0, 1], drawn i.i.d. according to some\nunknown distribution \u03bdk of expectation \u00b5k := E[Xk(t)]. Her objective is to maximize her cumulative\nreward up to stage T \u2208 N\u2217. This sequential decision problem, \ufb01rst introduced for clinical trials\n[27, 25], involves an \u201cexploration/exploitation dilemma\u201d where the player must trade-off acquiring\nvs. using information. The performance of an algorithm is controlled in term of regret, the difference\nof the cumulated reward of an optimal algorithm knowing the distributions (\u03bdk)k\u2208[K] beforehand and\nthe cumulated reward of the player. It is known that any \u201creasonable\u201d algorithm must incur at least a\nlogarithmic regret [19], which is attained by some existing algorithms such as UCB [1, 4].\nMAB has been recently popularized thanks to its applications to online recommendation systems.\nMany different variants of MAB and classes of algorithms have thus emerged in the recent years [see\n11]. 
In particular, they have been considered for cognitive radios [16], where the problem gets more\nintricate as multiple users are involved and they collide if they pull the same arm k at the same time\nt, i.e., they transmit on the same channel. If this happens, they all receive 0 as a reward instead of\nXk(t), meaning that no message is transmitted.\nIf a central agent controls simultaneously all players\u2019 behavior then a tight lower bound is known\n[3, 18]. Yet this centralized problem is not adapted to cognitive radios, as it allows communication\nbetween players at each time step; in practice, this induces signi\ufb01cant costs in both energy and\ntime. As a consequence, most of the current interest lies in the decentralized case [20, 2, 5], which\npresents another complication due to the feedback. Besides the received reward, an additional piece of\ninformation may be observed at each time step. When this extra observation is the collision indicator,\nRosenski et al. [26] provided two algorithms for both a \ufb01xed and a varying number of players. They\nare based on a Musical Chairs procedure that quickly assigns players to different arms. Besson and\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fKaufmann [8] provided an ef\ufb01cient UCB-based algorithm if Xk(t) is observed instead1. Lugosi and\nMehrabian [21] recently proposed an algorithm using no additional information. The performances\nof these algorithms and the underlying model differences are summarized in Table 1, Section 1.1.\nThe \ufb01rst non trivial lower bound for this problem has been recently improved [20, 8]. These lower\nbounds suggest that decentralization adds to the regret a multiplicative factor M, the number of\nplayers, compared to the centralized case [3]. Interestingly, these lower bounds scale linearly with\nthe inverse of the gaps between the \u00b5k whereas this scaling is quadratic for most of the existing\nalgorithms. 
This is due to the fact that although collisions account for most of the regret, lower\nbounds are proved without considering them.\nAlthough it is out of our scope, the heterogeneous model introduced by Kalathil et al. [17] is worth\nmentioning. In this case, the reward distribution depends on each user [6, 7]. An algorithm reaching\nthe optimal allocation without explicit communication between the players was recently proposed\n[9].\nOur main contributions are the following:\nSection 2: When collisions are observed, we introduce a new decentralized algorithm that is\n\u201chacking\u201d the setting and induces communication between players through deliberate collisions. The\nregret of this algorithm reaches asymptotically (up to some universal constant) the lower bound of\nthe centralized problem, meaning that the aforementioned lower bounds are unfortunately incorrect.\nThis algorithm relies on the unrealistic assumption that all users start transmitting at the very same\ntime. It also explains why the current literature fails to provide near optimal results for the multiplayer\nbandits. It therefore appears that the assumption of synchronization has to be removed for practical\napplication of the multiplayer bandits problem. On the other hand, this technique also shows that\nexhibiting lower bounds in multi-player MAB is more complex than in stochastic standard MAB.\n\nSection 3: Without synchronization or collision observations, we propose the \ufb01rst algorithm with a\nlogarithmic regret. The dependencies in the gaps between rewards yet become quadratic.\n\n1.1 Models\n\nIn this section, we introduce different models of multiplayer MAB with a known number of arms K\nbut an unknown number of players M \u2264 K. The horizon T is assumed known to the players (for\nsimplicity of exposure, as the anytime generalization of results is now well understood [14]). 
At each\ntime step t \u2208 [T], given their (private) information, all players j \u2208 [M] simultaneously pull the arms\n$\\pi^j(t)$ and receive the reward $r^j(t) \\in [0, 1]$ such that\n\n$$r^j(t) := X_{\\pi^j(t)}(t)\\,(1 - \\eta_{\\pi^j(t)}(t)),$$\n\nwhere $\\eta_{\\pi^j(t)}(t)$ is the collision indicator defined by\n\n$$\\eta_k(t) := \\mathbb{1}_{\\#C_k(t) > 1} \\quad \\text{with} \\quad C_k(t) := \\{ j \\in [M] \\mid \\pi^j(t) = k \\}.$$\n\nThe problem is centralized if players can communicate any information to each other. In that case,\nthey can easily avoid collisions and share their statistics. In opposition, the problem is decentralized\nwhen players have only access to their own rewards and actions. The crucial concept we introduce is\n(a)synchronization between players. With synchronization, the model is called static.\nAssumption 1 (Synchronization). Player i enters the bandit game at the time \u03c4i = 0 and stays until\nthe final horizon T. This is common knowledge to all players.\nAssumption 2 (Quasi-Asynchronization). Players enter at different times \u03c4i \u2208 {0, . . . , T \u2212 1} and\nstay until the final horizon T. The \u03c4i are unknown to all players (including i).\n\nWith quasi-asynchronicity2, the model is dynamic and several variants already exist [26]. Denote by\nM(t) the set of players in the game at time t (unknown but not random) and by \u00b5(n) the n-th order\nstatistics of \u00b5, i.e., \u00b5(1) \u2265 \u00b5(2) \u2265 . . . \u2265 \u00b5(K). 
The total regret is then defined for both static and\ndynamic models by:\n\n$$R_T := \\sum_{t=1}^{T} \\sum_{k=1}^{\\#\\mathcal{M}(t)} \\mu_{(k)} - E_\\mu\\left[ \\sum_{t=1}^{T} \\sum_{j \\in \\mathcal{M}(t)} r^j(t) \\right].$$\n\n1 We stress that Xk(t) does not necessarily correspond to the received reward in case of collision.\n2 We prefer not to mention asynchronicity as players still use shared discrete time slots.\n\nAs mentioned in the introduction, different observation settings are considered.\n\nCollision Sensing: Player j observes $\\eta_{\\pi^j(t)}(t)$ and $r^j(t)$ at each time step.\nNo sensing: Player j only observes $r^j(t)$, i.e., a reward of 0 can indistinguishably come from a\ncollision with another player or a null statistic $X_{\\pi^j(t)}(t)$.\n\nNotice that as soon as P(Xk = 0) = 0, the No Sensing and Collision Sensing settings are equivalent.\nThe setting where both $X_{\\pi^j(t)}$ and $r^j(t)$ are observed is also considered in the literature and is called\nStatistic Sensing [8]. The No Sensing setting is the most difficult one as there is no extra observation.\nTable 1 below compares the performances of the major algorithms, specifying the precise setting\nconsidered for each of them. The second algorithm of Lugosi and Mehrabian [21] and our algorithms\nalso have problem independent bounds that are not mentioned in Table 1 for the sake of clarity. Due\nto space constraints, ADAPTED SIC-MMAB, SIC-MMAB2 and their related results are presented in\nAppendix C. Note that the two dynamic algorithms in Table 1 rely on different specific assumptions.\n\nModel | Algorithm\u2019s Reference | Prior knowledge | Asymptotic Upper bound (up to constant factor)\nCentralized Multiplayer | Theorem 1 [18] | M | $\\sum_{k > M} \\frac{\\log(T)}{\\mu_{(M)} - \\mu_{(k)}}$\nDecentralized, Stat. Sensing | Theorem 11 [8] | M | $M^3 \\sum_{1 \\leq i < k \\leq K} \\frac{\\log(T)}{(\\mu_{(i)} - \\mu_{(k)})^2}$\nDecentralized, Col. Sensing | Theorem 1 [26] | $\\mu_{(M)} - \\mu_{(M+1)}$ | $\\frac{MK \\log(T)}{(\\mu_{(M)} - \\mu_{(M+1)})^2}$\nDecentralized, Col. Sensing | SIC-MMAB (Thm 1) | - | $\\sum_{k > M} \\frac{\\log(T)}{\\mu_{(M)} - \\mu_{(k)}} + MK \\log(T)$\nDecentralized, No Sensing | Theorem 1.1 [21] | M | $\\frac{MK \\log(T)}{(\\mu_{(M)} - \\mu_{(M+1)})^2}$\nDecentralized, No Sensing | Theorem 1.2 [21] | M, $\\mu_{(M)}$ | $\\frac{MK^2}{\\mu_{(M)}} \\log^2(T) + MK \\frac{\\log(T)}{\\Delta'}$\nDecentralized, No Sensing | ADAPT. SIC-MMAB (Eq (13)) | $\\mu_{(K)}$ | $\\sum_{k > M} \\frac{\\log(T)}{\\mu_{(M)} - \\mu_{(k)}} + \\frac{M^3 K \\log(T)}{\\mu_{(K)}} \\log^2(\\log(T))$\nDecentralized, No Sensing | SIC-MMAB2 (Thm 3) | $\\mu_{(K)}$ | $M \\sum_{k > M} \\frac{\\log(T)}{\\mu_{(M)} - \\mu_{(k)}} + \\frac{MK^2}{\\mu_{(K)}} \\log(T)$\nDec., Col. Sensing, Dynamic | Theorem 2 [26] | $\\bar{\\Delta}_{(M)}$ | $\\frac{M \\sqrt{K \\log(T) T}}{\\bar{\\Delta}_{(M)}^2}$\nDec., No Sensing, Dynamic | DYN-MMAB (Thm 2) | - | $\\frac{MK \\log(T)}{\\bar{\\Delta}_{(M)}^2} + \\frac{M^2 K \\log(T)}{\\mu_{(M)}}$\n\nTable 1: Performances of different algorithms. Our algorithms and results are highlighted in\nred. $\\bar{\\Delta}_{(M)} := \\min_{i=1,\\dots,M} (\\mu_{(i)} - \\mu_{(i+1)})$ is the smallest gap among the top-M + 1 arms and\n$\\Delta' := \\min\\{ \\mu_{(M)} - \\mu_i \\mid \\mu_{(M)} - \\mu_i > 0 \\}$ is the positive sub-optimality gap.\n\n2 Collision Sensing: achieving centralized performances by communicating through collisions\n\nIn this section, we consider the Collision Sensing static model and prove that the decentralized\nproblem is almost as complex, in terms of regret growth, as the centralized one. When players are\nsynchronized, we provide an algorithm with an exploration regret similar to the known centralized\nlower bound [3]. 
This algorithm strongly relies on the synchronization assumption, which we leverage\nto allow communication between players through observed collisions. The communication protocol\nis detailed and explained in Section 2.2.3. This result also implies that the two lower bounds provided\nin the literature [8, 20] are unfortunately not correct. Indeed, the factor M that was supposed to be\nthe cost of the decentralization in the regret should not appear.\nLet us now describe our algorithm SIC-MMAB. It consists of several phases.\n\n1. The initialization phase first estimates the number of players and assigns ranks among them.\n2. Players then alternate between exploration phases and communication phases.\n(a) During the p-th exploration phase, each arm is pulled $2^p$ times and its performance is\nestimated in a Successive Accepts and Rejects fashion [22, 12].\n(b) During the communication phases, players communicate their statistics to each other\nusing collisions. Afterwards, the updated common statistics are known to all players.\n3. The last phase, the exploitation one, is triggered for a player as soon as an arm is detected as\noptimal and assigned to her. This player then pulls this arm until the final horizon T.\n\n2.1 Some preliminary notations\n\nPlayers that are not in the exploitation phase are called active. We denote, with a slight abuse of\nnotation, by [Mp] the set of active players during the p-th phase of exploration-communication\nand by Mp \u2264 M its cardinality. Notice that Mp is non increasing because players never leave the\nexploitation phase.\nAny arm among the top-M ones is called optimal and any other arm is sub-optimal. Arms that\nstill need to be explored (players cannot determine whether they are optimal or sub-optimal yet) are\nactive. We denote, with the same abuse of notation, the set of active arms by [Kp] of cardinality\nKp \u2264 K. 
By construction of our algorithm, this set is common to all active players at each stage.\nOur algorithm is based on a protocol called sequential hopping [15]. It consists of incrementing\nthe index of the arm pulled by a specific player: if she plays arm $\\pi^k_t$ at time t, she will play\n$\\pi^k_{t+1} = \\pi^k_t + 1 \\ (\\mathrm{mod}\\ [K_p])$ at time t + 1 during the p-th exploration phase.\n\n2.2 Description of our protocol\n\nAs mentioned above, the SIC-MMAB algorithm consists of several phases. During the communication\nphase, players communicate with each other. At the end of this phase, each player thus knows\nthe statistics of all players on all arms, so that this decentralized problem becomes similar to the\ncentralized one. After alternating enough times between exploration and communication phases,\nsub-optimal arms are eliminated and players are fixed to different optimal arms and will exploit them\nuntil stage T. The complete pseudocode of SIC-MMAB is given in Algorithm 1, Appendix A.1.\n\n2.2.1 Initialization phase\n\nThe objective of the first phase is to estimate the number of players M and to assign internal\nranks to players. First, players follow the Musical Chairs algorithm [26], described in Pseudocode 4,\nAppendix A.1, during $T_0 := \\lceil K \\log(T) \\rceil$ steps in order to reach an orthogonal setting, i.e., a position\nwhere they are all pulling different arms. The index of the arm pulled by a player at stage T0 will\nthen be her external rank.\nThe second procedure, given by Pseudocode 5 in Appendix A.1, determines M and assigns a unique\ninternal rank in [M] to each player. For example, if there are three players on arms 5, 7 and 2 at\nt = T0, their external ranks are 5, 7 and 2 respectively, while their internal ranks are 2, 3 and 1.\nRoughly speaking, the players follow each other sequentially hopping through all the arms so that\nplayers with external ranks $k$ and $k'$ collide exactly after a time $k + k'$. 
Each player then deduces M\nand her internal rank from observed collisions during this procedure that lasts 2K steps.\nIn the next phases, active players will always know the set of active players [Mp]. This is how\nthe initial symmetry among players is broken and it allows the decentralized algorithm to establish\ncommunication protocols.\n\n2.2.2 Exploration phase\n\nDuring the p-th exploration phase, active players sequentially hop among the active arms for $K_p 2^p$\nsteps. Any active arm is thus pulled $2^p$ times by each active player. Using their internal rank, players\nstart and remain in an orthogonal setting during the exploration phase, which is collision-free.\n\nWe denote by $B_s = 3\\sqrt{\\frac{\\log(T)}{2s}}$ the error bound after s pulls and by $T_k(p)$ (resp. $S_k(p)$) the centralized\nnumber of pulls (resp. sum of rewards) for the arm k during the p first exploration phases, i.e.,\n$T_k(p) = \\sum_{j=1}^{M} T^j_k(p)$ where $T^j_k(p)$ is the number of pulls for the arm k by player j during the p first\nexploration phases. During the communication phase, quantized rewards $\\tilde{S}^j_k(p)$ will be communicated\nbetween active players as described in Section 2.2.3.\nAfter a succession of two phases (exploration and communication), an arm k is accepted if\n\n$$\\#\\{ i \\in [K_p] \\mid \\tilde{\\mu}_k(p) - B_{T_k(p)} \\geq \\tilde{\\mu}_i(p) + B_{T_i(p)} \\} \\geq K_p - M_p,$$\n\nwhere $\\tilde{\\mu}_k(p) = \\frac{\\sum_{j=1}^{M} \\tilde{S}^j_k(p)}{T_k(p)}$ is the centralized quantized empirical mean of the arm k (footnote 3), which is an\napproximation of $\\hat{\\mu}_k(p) = \\frac{S_k(p)}{T_k(p)}$. This inequality implies that k is among the top-Mp active arms\nwith high probability. 
In the same way, k is rejected if\n\n$$\\#\\{ i \\in [K_p] \\mid \\tilde{\\mu}_i(p) - B_{T_i(p)} \\geq \\tilde{\\mu}_k(p) + B_{T_k(p)} \\} \\geq M_p,$$\n\nmeaning that there are at least Mp active arms better than k with high probability. Notice that each\nplayer j uses her own quantized statistics $\\tilde{S}^j_k(p)$ to accept/reject an arm instead of the exact ones\n$S^j_k(p)$. Otherwise, the estimations $\\tilde{\\mu}_k(p)$ would indeed differ between the players as well as the sets\nof accepted and rejected arms. With Bernoulli distributions, the quantization becomes unnecessary\nand the confidence bound can be chosen as $B_s = \\sqrt{2 \\log(T)/s}$.\n\n2.2.3 Communication phase\n\nIn this phase, each active player communicates, one at a time, her statistics of the active arms to all\nother active players. Each player has her own communicating arm, corresponding to her internal\nrank. When the player j is communicating, she sends a bit at a time step to the player l by deciding\nwhich arm to pull: a 1 bit is sent by pulling the communicating arm of player l (a collision occurs)\nand a 0 bit by pulling her own arm. The main originality of SIC-MMAB comes from this trick which\nallows implicit communication through collisions and is used in subsequent papers [13, 10, 24]. In\nan independent work, Tibrewal et al. [28] also proposed an algorithm using similar communication\nprotocols for the heterogeneous case.\nAs an arm is pulled $2^n$ times by a single player during the n-th exploration phase, it has been\npulled $2^{p+1} - 1$ times in total at the end of the p-th phase and the statistic $S^j_k(p)$ is a real number in\n$[0, 2^{p+1} - 1]$. Players then send a quantized integer statistic $\\tilde{S}^j_k(p) \\in [2^{p+1} - 1]$ to each other in p + 1\nbits, i.e., collisions. Let $n = \\lfloor S^j_k(p) \\rfloor$ and $d = S^j_k(p) - n$ be the integer and decimal parts of $S^j_k(p)$,\nthe quantized statistic is then n + 1 with probability d and n otherwise, so that $E[\\tilde{S}^j_k(p)] = S^j_k(p)$.\n\nAn active player can have three possible statuses during the communication phase:\n\n1. either she is receiving some other players\u2019 statistics about the arm k. In that case, she\nproceeds to Receive Protocol (see Pseudocode 1).\n2. Or she is sending her quantized statistics about arm k to player l (who is then receiving). In\nthat case, she proceeds to Send Protocol (see Pseudocode 2) to send them in a time p + 1.\n3. Or she is pulling her communicating arm, while waiting for other players to finish communicating\nstatistics among them.\n\nCommunicated statistics are all of length p + 1, even if they could be sent with shorter messages, in\norder to maintain synchronization among players. Using their internal ranks, the players can communicate\nin turn without interfering with each other. The general protocol for each communication\nphase is described in Pseudocode 3 below.\n\nAt the end of the communication phase, all active players know the statistics $\\tilde{S}^j_k(p)$ and so which\narms to accept or reject. Rejected arms are removed right away from the set of active arms. Thanks\nto the assigned ranks, accepted arms are assigned to one player each. The remaining active players\nthen update both sets of active players and arms as described in Algorithm 1, line 21.\nThis communication protocol uses the fact that a bit can be sent with a single collision. Without\nsensing, this can not be done in a single time step, but communication is still somehow possible. A\nbit can then be sent in $\\log(T)/\\mu_{(K)}$ steps with probability $1 - 1/T$. Using this trick, two different algorithms\nrelying on communication protocols are proposed in Appendix C for the No Sensing setting.\n\n3 For a player j already exploiting since the $p_j$-th phase, we instead use the last statistic $\\tilde{S}^j_k(p) = \\tilde{S}^j_k(p_j)$.\n\nReceive Protocol\nInput: p (phase number), l (own internal rank), [Kp] (set of active arms)\nOutput: s (statistic sent by the sending player)\ns \u2190 0 and \u03c0 \u2190 index of the l-th active arm\nfor n = 0, . . . , p do\n    Pull \u03c0\n    if $\\eta_\\pi(t) = 1$ then    # other player sends 1\n        s \u2190 s + $2^n$\n    end if\nend for\nreturn s    # sent statistics\nPseudocode 1: receive statistics of length p + 1.\n\nSend Protocol\nInput: l (player receiving), s (statistics to send), p (phase number), j (own internal rank), [Kp] (set of active arms)\nm \u2190 binary writing of s of length p + 1, i.e., $s = \\sum_{n=0}^{p} m_n 2^n$\nfor n = 0, . . . , p do\n    if $m_n = 1$ then\n        Pull the l-th active arm    # send 1\n    else\n        Pull the j-th active arm    # send 0\n    end if\nend for\nPseudocode 2: send statistics s of length p + 1 to player l.\n\nCommunication Protocol\nInput: s (personal statistics of previous phases), p (phase number), j (own internal rank), [Kp] (set of active arms), [Mp] (set of active players)\nOutput: $\\tilde{S}$ (quantized statistics of all active players)\nFor all k, sample $\\tilde{s}[k] = \\lfloor s[k] \\rfloor + 1$ with probability $s[k] - \\lfloor s[k] \\rfloor$, and $\\tilde{s}[k] = \\lfloor s[k] \\rfloor$ otherwise    # quantization\nDefine $E_p := \\{ (i, l, k) \\in [M_p] \\times [M_p] \\times [K_p] \\mid i \\neq l \\}$ and set $\\tilde{S}^j \\leftarrow \\tilde{s}$\nfor (i, l, k) \u2208 Ep do    # Player i sends stats of arm k to player l\n    if i = j then\n        Send(l, $\\tilde{s}[k]$, p, j, [Kp])    # player communicating\n    else if l = j then\n        $\\tilde{S}^i[k]$ \u2190 Receive(p, j, [Kp])    # player receiving\n    else\n        for p + 1 time steps do Pull the j-th active arm end for    # wait while others communicate\n    end if\nend for\nreturn $\\tilde{S}$\nPseudocode 3: player with rank j proceeds to the p-th communication phase.\n\n2.2.4 Regret bound of SIC-MMAB\n\nTheorem 1 bounds the expected regret incurred by SIC-MMAB. Due to space constraints, its proof is\ndelayed to Appendix A.2.\nTheorem 1. 
With the choice $T_0 = \\lceil K \\log(T) \\rceil$, for any given set of parameters K, M and \u00b5:\n\n$$E[R_T] \\leq c_1 \\sum_{k > M} \\min\\left\\{ \\frac{\\log(T)}{\\mu_{(M)} - \\mu_{(k)}}, \\sqrt{T \\log(T)} \\right\\} + c_2 K M \\log(T) + c_3 K M^3 \\log^2\\left( \\min\\left\\{ \\frac{\\log(T)}{(\\mu_{(M)} - \\mu_{(M+1)})^2}, T \\right\\} \\right),$$\n\nwhere c1, c2 and c3 are universal constants.\n\nThe first, second and third terms respectively correspond to the regret incurred by the exploration,\ninitialization and communication phases, which dominate the regret due to low probability events of bad\ninitialization or incorrect estimations. Notice that the minimax regret scales with $O(K \\sqrt{T \\log(T)})$.\n\nExperiments on synthetic data are described in Appendix A.3. They empirically confirm that SIC-MMAB\nscales better than MCTopM [8] with the gaps \u2206, besides having a smaller minimax regret.\n\n2.3 In contradiction with existing lower bounds?\n\nTheorem 1 is in contradiction with the two existing lower bounds [8, 20], even though SIC-MMAB\nrespects the conditions required for both. It was thought that the decentralized lower bound\nwas $\\Omega\\left( M \\sum_{k > M} \\frac{\\log(T)}{\\mu_{(M)} - \\mu_{(k)}} \\right)$, while the centralized lower bound was already known to be\n$\\Omega\\left( \\sum_{k > M} \\frac{\\log(T)}{\\mu_{(M)} - \\mu_{(k)}} \\right)$ [3]. However, it appears that the asymptotic regret of the decentralized\ncase is not that much different from the latter, at least if players are synchronized. Indeed, SIC-MMAB\ntakes advantage of this synchronization to establish communication protocols as players are able to\ncommunicate through collisions. 
Subsequent papers [10, 24] recently improved the communication\nprotocols of SIC-MMAB to obtain both initialization and communication costs constant in T , con\ufb01rm-\ning that the lower bound of the centralized case is also tight for the decentralized model considered\nso far.\nLiu and Zhao [20] proved the lower bound \u201cby considering the best case that they do not collide\u201d.\nThis is only true if colliding does not provide valuable information and the policies just maximize\nthe losses at each round, disregarding the information gathered for the future. Our algorithm is built\nupon the idea that the value of the information provided by collisions can exceed in the long run the\nimmediate loss in rewards (which is standard in dynamic programming or reinforcement learning\nfor instance). The mistake of Besson and Kaufmann [8] is found in the proof of Lemma 12 after the\nsentence \u201cWe now show that second term in (25) is zero\u201d. The conditional expectation cannot be\nput inside/outside of the expectation as written and the considered term, which corresponds to the\ndifference of information given by collisions for two different distributions, is therefore not zero.\nThese two lower bounds disregarded the amount of information that can be deduced from collisions,\nwhile SIC-MMAB obviously takes advantage of this information.\nOur exploration regret reaches, up to a constant factor, the lower bound of the centralized problem\n[3]. Although it is sub-logarithmic in time, the communication cost scales with KM 3 and can thus\nbe predominant in practice. Indeed for large networks, M 3 can easily be greater than log(T ) and the\ncommunication cost would then prevail over the other terms. 
This highlights the importance of the\nparameter M in multiplayer MAB and future work should focus on the dependency in both M and T\ninstead of only considering asymptotic results in T .\nSynchronization is not a reasonable assumption for practical purposes and it also leads to undesirable\nalgorithms relying on communication protocols such as SIC-MMAB. We thus claim that this assump-\ntion should be removed in the multiplayer MAB and the dynamic model should be considered instead.\nHowever, this problem seems complex to model formally. Indeed, if players stay in the game only for\na very short period, learning is not possible. The dif\ufb01culty to formalize an interesting and nontrivial\ndynamic model may explain why most of the literature focused on the static model so far.\n3 Without synchronization, the dynamic setting\nFrom now on, we no longer assume that players can communicate using synchronization. In the\nprevious section, it was crucial that all exploration/communication phases start and end at the same\ntime. This assumption is clearly unrealistic and should be alleviated, as radios do not start and end\ntransmitting simultaneously. We also consider the more dif\ufb01cult No Sensing setting in this section.\nWe assume in the following that players do not leave the game once they have started. Yet, we\nmention that our results can also be adapted to the cases when players can leave the game during\nspeci\ufb01c intervals or share an internal synchronized clock [26]. If the time is divided in several\nintervals, DYN-MMAB can be run independently on each of these intervals as suggested by Rosenski\net al. [26]. In some cases, players will be leaving in the middle of these intervals, leading to a large\nregret. But for any other interval, every player stays until its end, thus satisfying Assumption 2.\nIn this section, Assumption 2 holds. At each stage t = tj + \u03c4j, player j does not know t but only tj\n(duration since joining). 
We denote by $T^j = T - \\tau_j$ the (known) time horizon of player j.\n3.1 A logarithmic regret algorithm\n\nAs synchronization no longer holds, we propose the DYN-MMAB algorithm, relying on different tools\nthan SIC-MMAB. The main ideas of DYN-MMAB are given in Section 3.2. Its thorough description as\nwell as the proof of the regret bound are delayed to Appendix B due to space constraints.\nThe regret incurred by DYN-MMAB in the dynamic No Sensing model is given by Theorem 2 and\nits proof is delayed to Appendix B.2. We also mention that DYN-MMAB leads to a Pareto optimal\nconfiguration in the more general problem where users\u2019 reward distributions differ [17, 6, 7, 9].\nTheorem 2. In the dynamic setting, the regret incurred by DYN-MMAB is upper bounded as follows:\n\n$$E[R_T] \\leq O\\left( \\frac{M^2 K \\log(T)}{\\mu_{(M)}} + \\frac{M K \\log(T)}{\\bar{\\Delta}_{(M)}^2} \\right),$$\n\nwhere $M = \\#\\mathcal{M}(T)$ is the total number of players in the game and $\\bar{\\Delta}_{(M)} = \\min_{i=1,\\dots,M} (\\mu_{(i)} - \\mu_{(i+1)})$.\n\n3.2 A communication-less protocol\n\nDYN-MMAB\u2019s ideas are easy to understand but the upper bound proof is quite technical. This section\ngives some intuitions about DYN-MMAB and its performance guarantees stated in Theorem 2.\nA player will only follow two different sampling strategies: either she samples uniformly at random\nin [K] during the exploration phase; or she exploits an arm and pulls it until the final horizon. In the\nfirst case, the exploration of the other players is not too disturbed by collisions as they only change\nthe mean reward of all arms by a common multiplicative term. In the second case, the exploited arm\nwill appear as sub-optimal to the other players, which is actually convenient for them as this arm is\nnow exploited.\nDuring the exploration phase, a player will update a set of arms called Occupied \u2282 [K] and an\nordered list of arms called Preferences \u2208 $[K]^\\star$. 
As soon as an arm is detected as occupied (by\nanother player), it is then added to Occupied (which is the empty set at the beginning). If an arm is\ndiscovered to be the best one amongst those that are neither in Occupied nor in Preferences, it\nis then added to Preferences (at the last position). An arm is active for player j if it was neither\nadded to Occupied nor to Preferences by this player yet.\nTo handle the fact that players can enter the game at any time, we introduce the quantity $\\gamma^j(t)$, the\nexpected multiplicative factor of the means defined by\n\n$$\\gamma^j(t) = \\frac{1}{t} \\sum_{t'=1+\\tau_j}^{t+\\tau_j} E\\left[ \\left(1 - \\frac{1}{K}\\right)^{m_{t'} - 1} \\right],$$\n\nwhere $m_t$ is the number of players in their exploration phase at time t. The value of $\\gamma^j(t)$ is unknown\nto the player and random but it only affects the analysis of DYN-MMAB and not how it runs.\nThe objective of the algorithm is still to form estimates and confidence intervals of the performances\nof arms. However, it might happen that the true mean \u00b5k does not belong to this confidence interval.\nIndeed, this is only true for $\\gamma^j(t) \\mu_k$, if the arm k is still free (not exploited). This is the first point of\nLemma 1 below. Notice that as soon as the confidence interval for the arm i dominates the confidence\ninterval for the arm k, then it must hold that $\\gamma^j(t) \\mu_i \\geq \\gamma^j(t) \\mu_k$ and thus arm i is better than k.\nThe second crucial point is to detect when an arm k is exploited by another player. This detection will\nhappen if a player receives too many 0 rewards successively (so that it is statistically very unlikely\nthat this arm is not occupied). The number of zero rewards needed for player j to disregard arm k is\ndenoted by $L^j_k$, which is sequentially updated during the process (following the rule of Equation (4)\nin Appendix B.1), so that $L^j_k \\geq 2e \\log(T^j)/\\mu_k$. As the probability of observing a 0 reward on a free\narm k is smaller than $1 - \\mu_k/e$, no matter the current number of players, observing $L^j_k$ successive 0\nrewards on an unexploited arm happens with probability smaller than $\\frac{1}{(T^j)^2}$.\n\nThe second point of Lemma 1 then states that an exploited arm will either be quickly detected as\noccupied after observing $L^j_k$ zeros (if $L^j_k$ is small enough) or its average reward will quickly drop\nbecause it now gives zero rewards (and it will be dominated by another arm after a relatively small\nnumber of pulls). The proof of Lemma 1 is delayed to Appendix B.2.\nLemma 1. We denote by $\\hat{r}^j_k(t)$ the empirical average reward of arm k for player j at stage $t + \\tau_j$.\n\n1. For any player j and arm k, if k is still free at stage $t + \\tau_j$, then\n\n$$P\\left[ |\\hat{r}^j_k(t) - \\gamma^j(t) \\mu_k| > 2 \\sqrt{\\frac{6 K \\log(T^j)}{t}} \\right] \\leq \\frac{4}{(T^j)^2}.$$\n\nWe then say that the arm k is correctly estimated by player j if $|\\hat{r}^j_k(t) - \\gamma^j(t) \\mu_k| \\leq 2 \\sqrt{\\frac{6 K \\log(T^j)}{t}}$\nholds as long as k is free.\n\n2. On the other hand, if k is exploited by some player $j' \\neq j$ at stage $t_0 + \\tau_j$, then, conditionally\non the correct estimation of all the arms by player j, with probability $1 - O\\left(\\frac{1}{T^j}\\right)$:\n\u2022 either k is added to Occupied at a stage at most $t_0 + \\tau_j + O\\left(\\frac{K \\log(T)}{\\mu_k}\\right)$\nby player j,\n\u2022 or k is dominated by another unoccupied arm i (for player j) at stage at most\n$O\\left(\\frac{K \\log(T)}{\\mu_i^2}\\right) + \\tau_j$.\n\nIt remains to describe how players start exploiting arms. After some time (upper-bounded by\nLemma 10 in Appendix B.2), an arm which is still free and such that all better arms are occupied\nwill be detected as the best remaining one. 
The player will try to occupy it, and this happens as soon\nas she gets a positive reward from it: either she succeeds and starts exploiting it, or she fails and\nassumes it is occupied by another player (this only takes a small number of steps, see Lemma 1). In the\nlatter case, she resumes exploring until she detects the next available best arm. With high probability,\nthe player will necessarily end up exploiting an arm while all the better arms are already exploited by\nother players.\n4 Conclusion\nWe have presented algorithms for different multiplayer bandit models. The first one illustrates\nwhy the assumption of synchronization between the players is basically equivalent to allowing\ncommunication. Since communication through collisions is possible with other players at a\nsub-logarithmic cost, the decentralized multiplayer bandit problem is almost equivalent to the centralized one for\nthe considered model. However, this communication cost has a large dependency on the number of\nagents in the network. Future work should then focus on considering the dependency on both time\nand the number of players, as well as developing efficient communication protocols.\nOur major claim is that synchronization should not be considered anymore, unless communication is\nallowed. We thus introduced a dynamic model and proposed the first algorithm with a logarithmic\nregret.\n\nAcknowledgments\n\nThis work was supported in part by a public grant as part of the Investissement d\u2019avenir project,\nreference ANR-11-LABX-0056-LMH, LabEx LMH, in a joint call with Gaspard Monge Program for\noptimization, operations research and their interactions with data sciences.\n\nReferences\n[1] R. Agrawal. Sample mean based index policies with o(log n) regret for the multi-armed bandit\nproblem. Advances in Applied Probability, 27(4):1054\u20131078, 1995.\n\n[2] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami. 
Distributed algorithms for learning\nand cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in\nCommunications, 29(4):731\u2013745, 2011.\n\n[3] V. Anantharam, P. Varaiya, and J. Walrand. Asymptotically efficient allocation rules for the\nmultiarmed bandit problem with multiple plays-part i: I.i.d. rewards. IEEE Transactions on\nAutomatic Control, 32(11):968\u2013976, 1987.\n\n[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.\nMachine Learning, 47(2-3):235\u2013256, 2002.\n\n[5] O. Avner and S. Mannor. Concurrent bandits and cognitive radio networks. In Joint European\nConference on Machine Learning and Knowledge Discovery in Databases, pages 66\u201381.\nSpringer, 2014.\n\n[6] O. Avner and S. Mannor. Learning to coordinate without communication in multi-user\nmulti-armed bandit problems. arXiv preprint arXiv:1504.08167, 2015.\n\n[7] O. Avner and S. Mannor. Multi-user communication networks: A coordinated multi-armed\nbandit approach. arXiv preprint arXiv:1808.04875, 2018.\n\n[8] L. Besson and E. Kaufmann. Multi-Player Bandits Revisited. In Algorithmic Learning Theory,\nLanzarote, Spain, 2018.\n\n[9] I. Bistritz and A. Leshem. Distributed multi-player bandits-a game of thrones approach. In\nAdvances in Neural Information Processing Systems, pages 7222\u20137232, 2018.\n\n[10] E. Boursier, E. Kaufmann, A. Mehrabian, and V. Perchet. A practical algorithm for multiplayer\nbandits when arm means vary among players. arXiv preprint arXiv:1902.01239, 2019.\n\n[11] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed\nbandit problems. Foundations and Trends\u00ae in Machine Learning, 5(1):1\u2013122, 2012.\n\n[12] S. Bubeck, T. Wang, and N. Viswanathan. Multiple identifications in multi-armed bandits. In\nInternational Conference on Machine Learning, pages 258\u2013265, 2013.\n\n[13] S. Bubeck, Y. Li, Y. Peres, and M. 
Sellke. Non-stochastic multi-player multi-armed bandits:\nOptimal rate with collision information, sublinear without. arXiv preprint arXiv:1904.12233,\n2019.\n\n[14] R. Degenne and V. Perchet. Anytime optimal algorithms in stochastic multi-armed bandits. In\nInternational Conference on Machine Learning, pages 1587\u20131595, 2016.\n\n[15] H. Joshi, R. Kumar, A. Yadav, and S. J. Darak. Distributed algorithm for dynamic spectrum\naccess in infrastructure-less cognitive radio network. In 2018 IEEE Wireless Communications\nand Networking Conference (WCNC), pages 1\u20136, 2018.\n\n[16] W. Jouini, D. Ernst, C. Moy, and J. Palicot. Multi-armed bandit based policies for cognitive\nradio\u2019s decision making issues. In 2009 3rd International Conference on Signals, Circuits and\nSystems (SCS), 2009.\n\n[17] D. Kalathil, N. Nayyar, and R. Jain. Decentralized learning for multiplayer multiarmed bandits.\nIEEE Transactions on Information Theory, 60(4):2331\u20132345, 2014.\n\n[18] J. Komiyama, J. Honda, and H. Nakagawa. Optimal regret analysis of Thompson sampling in\nstochastic multi-armed bandit problem with multiple plays. In International Conference on\nMachine Learning, pages 1152\u20131161, 2015.\n\n[19] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in\nApplied Mathematics, 6(1):4\u201322, 1985.\n\n[20] K. Liu and Q. Zhao. Distributed learning in multi-armed bandit with multiple players. IEEE\nTransactions on Signal Processing, 58(11):5667\u20135681, 2010.\n\n[21] G. Lugosi and A. Mehrabian. Multiplayer bandits without observing collision information.\narXiv preprint arXiv:1808.08416, 2018.\n\n[22] V. Perchet and P. Rigollet. The multi-armed bandit problem with covariates. The Annals of\nStatistics, 41(2):693\u2013721, 2013.\n\n[23] V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg. Batched bandit problems. In Proceedings\nof The 28th Conference on Learning Theory, pages 1456\u20131456, 2015.\n\n[24] A. 
Proutiere and P. Wang. An optimal algorithm in multiplayer multi-armed bandits, 2019.\n\n[25] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American\nMathematical Society, 58(5):527\u2013535, 1952.\n\n[26] J. Rosenski, O. Shamir, and L. Szlak. Multi-player bandits\u2013a musical chairs approach. In\nInternational Conference on Machine Learning, pages 155\u2013163, 2016.\n\n[27] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of\nthe evidence of two samples. Biometrika, 25(3-4):285\u2013294, 1933.\n\n[28] H. Tibrewal, S. Patchala, M.K. Hanawal, and S.J. Darak. Distributed learning and optimal\nassignment in multiplayer heterogeneous networks. In IEEE INFOCOM, pages 1693\u20131701,\n2019.", "award": [], "sourceid": 6497, "authors": [{"given_name": "Etienne", "family_name": "Boursier", "institution": "ENS Paris Saclay"}, {"given_name": "Vianney", "family_name": "Perchet", "institution": "ENSAE & Criteo AI Lab"}]}