{"title": "No-Regret Learning in Unknown Games with Correlated Payoffs", "book": "Advances in Neural Information Processing Systems", "page_first": 13624, "page_last": 13633, "abstract": "We consider the problem of learning to play a repeated multi-agent game with an unknown reward function. Single player online learning algorithms attain strong regret bounds when provided with full information feedback, which unfortunately is unavailable in many real-world scenarios. Bandit feedback alone, i.e., observing outcomes only for the selected action, yields substantially worse performance.  In this paper, we consider a natural model where, besides a noisy measurement of the obtained reward, the player can also observe the opponents' actions. This feedback model, together with a regularity assumption on the reward function, allows us to exploit the correlations among different game outcomes by means of Gaussian processes (GPs). We propose a novel confidence-bound based bandit algorithm GP-MW, which utilizes the GP model for the reward function and runs a multiplicative weight (MW) method. We obtain novel kernel-dependent regret bounds that are comparable to the known bounds in the full information setting, while substantially improving upon the existing bandit results. We experimentally demonstrate the effectiveness of  GP-MW in random matrix games, as well as real-world problems of traffic routing and movie recommendation. 
In our experiments, GP-MW consistently outperforms several baselines, while its performance is often comparable to methods that have access to full information feedback.", "full_text": "No-Regret Learning in Unknown Games with\n\nCorrelated Payoffs\n\nPier Giuseppe Sessa\n\nETH Z\u00fcrich\n\nsessap@ethz.ch\n\nIlija Bogunovic\n\nETH Z\u00fcrich\n\nilijab@ethz.ch\n\nMaryam Kamgarpour\n\nETH Z\u00fcrich\n\nmaryamk@ethz.ch\n\nAndreas Krause\n\nETH Z\u00fcrich\n\nkrausea@ethz.ch\n\nAbstract\n\nWe consider the problem of learning to play a repeated multi-agent game with an\nunknown reward function. Single player online learning algorithms attain strong\nregret bounds when provided with full information feedback, which unfortunately\nis unavailable in many real-world scenarios. Bandit feedback alone, i.e., observing\noutcomes only for the selected action, yields substantially worse performance. In\nthis paper, we consider a natural model where, besides a noisy measurement of\nthe obtained reward, the player can also observe the opponents\u2019 actions. This\nfeedback model, together with a regularity assumption on the reward function,\nallows us to exploit the correlations among different game outcomes by means of\nGaussian processes (GPs). We propose a novel con\ufb01dence-bound based bandit\nalgorithm GP-MW, which utilizes the GP model for the reward function and runs\na multiplicative weight (MW) method. We obtain novel kernel-dependent regret\nbounds that are comparable to the known bounds in the full information setting,\nwhile substantially improving upon the existing bandit results. We experimentally\ndemonstrate the effectiveness of GP-MW in random matrix games, as well as real-\nworld problems of traf\ufb01c routing and movie recommendation. 
In our experiments,\nGP-MW consistently outperforms several baselines, while its performance is often\ncomparable to methods that have access to full information feedback.\n\n1\n\nIntroduction\n\nMany real-world problems, such as traf\ufb01c routing [14], market prediction [10], and social network\ndynamics [21], involve multiple learning agents that interact and compete with each other. Such\nproblems can be described as repeated games, in which the goal of every agent is to maximize her\ncumulative reward. In most cases, the underlying game is unknown to the agents, and the only way to\nlearn about it is by repeatedly playing and observing the corresponding game outcomes.\nThe performance of an agent in a repeated game is often measured in terms of regret. For example,\nin traf\ufb01c routing, the regret of an agent quanti\ufb01es the reduction in travel time had the agent known\nthe routes chosen by the other agents. No-regret algorithms for playing unknown repeated games\nexist, and their performance depends on the information available at every round. In the case of\nfull information feedback, the agent observes the obtained reward, as well as the rewards of other\nnon-played actions. While these algorithms attain strong regret guarantees, such full information\nfeedback is often unrealistic in real-world applications. In traf\ufb01c routing, for instance, agents only\nobserve the incurred travel times and cannot observe the travel times for the routes not chosen.\nIn this paper, we address this challenge by considering a more realistic feedback model, where at\nevery round of the game, the agent plays an action and observes the noisy reward outcome. In\naddition to this bandit feedback, the agent also observes the actions played by other agents. 
Under\nthis feedback model and further regularity assumptions on the reward function, we present a novel\nno-regret algorithm for playing unknown repeated games. The proposed algorithm alleviates the need\nfor full information feedback while still achieving comparable regret guarantees.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAlgorithm | Feedback | Regret\nHEDGE [11] | rewards for all actions | $O(\\sqrt{T \\log K_i})$\nEXP3 [3] | obtained reward | $O(\\sqrt{T K_i \\log K_i})$\nGP-MW [this paper] | obtained reward + opponents\u2019 actions | $O(\\sqrt{T \\log K_i} + \\gamma_T \\sqrt{T})$\n\nTable 1: Finite action set regret bounds that depend on the available feedback observed by player $i$ at\neach time step. Time horizon is denoted with $T$, and $K_i$ is the number of actions available to player $i$.\nKernel dependent quantity $\\gamma_T$ (Eq. (3)) captures the degrees of freedom in the reward function.\n\nRelated Work. In the full information setting, multiplicative-weights (MW) algorithms [17] such\nas HEDGE [11] attain optimal $O(\\sqrt{T \\log K_i})$ regret, where $K_i$ is the number of actions available to\nagent $i$. In the case of convex action sets in $\\mathbb{R}^{d_i}$, and convex and Lipschitz rewards, online convex\noptimization algorithms attain optimal $O(\\sqrt{T})$ regret [25]. By only assuming Lipschitz rewards\nand bounded action sets, $O(\\sqrt{d_i T \\log T})$ regret follows from [18], while in [13] the authors provide\nefficient gradient-based algorithms with \u2018local\u2019 regret guarantees. Full information feedback requires\nperfect knowledge of the game and is unrealistic in many applications. Our proposed algorithm\novercomes this limitation while achieving comparable regret bounds.\nIn the more challenging bandit setting, existing algorithms have a substantially worse dependence on\nthe size of the action set. For finite actions, EXP3 [3] and its variants ensure optimal $O(\\sqrt{T K_i \\log K_i})$\nregret. 
In the case of convex action sets, and convex and Lipschitz rewards, bandit algorithms attain\n$O(\\mathrm{poly}(d_i) \\sqrt{T})$ regret [6], while in the case of Lipschitz rewards $O(T^{\\frac{d_i+1}{d_i+2}} \\log T)$ regret can be\nobtained [22]. In contrast, our algorithm works in the noisy bandit setting and requires the knowledge\nof the actions played by other agents. This allows us to, under some regularity assumptions, obtain\nsubstantially improved performance. In Table 1, we summarize the regret and feedback model of our\nalgorithm together with the existing no-regret algorithms.\nThe previously mentioned online algorithms reduce the unknown repeated game to a single agent\nproblem against an adversarial and adaptive environment that selects a different reward function at\nevery time step [7]. A fact not exploited by these algorithms is that in a repeated game, the rewards\nobtained at different time steps are correlated through a static unknown reward function. In [24]\nthe authors use this fact to show that, if every agent uses a regularized no-regret algorithm, their\nindividual regret grows at a lower rate of $O(T^{1/4})$, while the sum of their regrets grows only as $O(1)$.\nIn contrast to [24], we focus on the single-player viewpoint, and we do not make any assumption on\nopponents\u2019 strategies\u00b9. Instead, we show that by observing opponents\u2019 actions, the agent can exploit\nthe structure of the reward function to reduce her individual regret.\nContributions. We propose a novel no-regret bandit algorithm GP-MW for playing unknown\nrepeated games. GP-MW combines the ideas of the multiplicative weights update method [17] with\nGP upper confidence bounds, a powerful tool used in GP bandit algorithms (e.g., [23, 5]). 
When a\nfinite number $K_i$ of actions is available to player $i$, we provide a novel high-probability regret bound\n$O(\\sqrt{T \\log K_i} + \\gamma_T \\sqrt{T})$, that depends on a kernel-dependent quantity $\\gamma_T$ [23]. For common kernel\nchoices, this results in a sublinear regret bound, which grows only logarithmically in $K_i$. 
In the case\nof infinite action subsets of $\\mathbb{R}^{d_i}$ and Lipschitz rewards, via a discretization argument, we obtain a\nhigh-probability regret bound of $O(\\sqrt{d_i T \\log(d_i T)} + \\gamma_T \\sqrt{T})$. We experimentally demonstrate that\nGP-MW outperforms existing bandit baselines in random matrix games and traffic routing problems.\nMoreover, we present an application of GP-MW to a novel robust Bayesian optimization setting in\nwhich our algorithm performs favourably in comparison to other baselines.\n\n\u00b9 In fact, they are allowed to be adaptive and adversarial.\n\n2 Problem Formulation\n\nWe consider a repeated static game among $N$ non-cooperative agents, or players. Each player $i$ has\nan action set $\\mathcal{A}^i \\subseteq \\mathbb{R}^{d_i}$ and a reward function $r^i : \\mathcal{A} = \\mathcal{A}^1 \\times \\cdots \\times \\mathcal{A}^N \\to [0, 1]$. We assume that\nthe reward function $r^i$ is unknown to player $i$. At every time $t$, players simultaneously choose actions\n$a_t = (a^1_t, \\ldots, a^N_t)$ and player $i$ obtains a reward $r^i(a^i_t, a^{-i}_t)$, which depends on the played action $a^i_t$\nand the actions $a^{-i}_t := (a^1_t, \\ldots, a^{i-1}_t, a^{i+1}_t, \\ldots, a^N_t)$ of all the other players. The goal of player $i$ is\nto maximize the cumulative reward $\\sum_{t=1}^T r^i(a^i_t, a^{-i}_t)$. After $T$ time steps, the regret of player $i$ is\ndefined as\n\n$$R^i(T) = \\max_{a \\in \\mathcal{A}^i} \\sum_{t=1}^T r^i(a, a^{-i}_t) - \\sum_{t=1}^T r^i(a^i_t, a^{-i}_t) , \\tag{1}$$\n\ni.e., the maximum gain the player could have achieved by playing the single best fixed action in case\nthe sequence of opponents\u2019 actions $\\{a^{-i}_t\\}_{t=1}^T$ and the reward function were known in hindsight. An\nalgorithm is no-regret for player $i$ if $R^i(T)/T \\to 0$ as $T \\to \\infty$ for any sequence $\\{a^{-i}_t\\}_{t=1}^T$.\nFirst, we consider the case of a finite number of available actions $K_i$, i.e., $|\\mathcal{A}^i| = K_i$. 
To achieve\nno-regret, the player should play mixed strategies [7], i.e., probability distributions $w^i_t \\in [0, 1]^{K_i}$\nover $\\mathcal{A}^i$. With full-information feedback, at every time $t$ player $i$ observes the vector of rewards\n$r_t = [r^i(a, a^{-i}_t)]_{a \\in \\mathcal{A}^i} \\in \\mathbb{R}^{K_i}$. With bandit feedback, only the reward $r^i(a^i_t, a^{-i}_t)$ is observed by\nthe player. Existing full information and bandit algorithms [11, 3] reduce the repeated game to a\nsequential decision making problem between player $i$ and an adaptive environment that, at each time\n$t$, selects a reward function $r_t : \\mathcal{A}^i \\to [0, 1]$. In a repeated game, the reward that player $i$ observes at\ntime $t$ is a static fixed function of $(a^i_t, a^{-i}_t)$, i.e., $r_t(a^i_t) = r^i(a^i_t, a^{-i}_t)$, and in many practical settings\nsimilar game outcomes lead to similar rewards (see, e.g., the traffic routing application in Section 4.2).\nIn contrast to existing approaches, we exploit such correlations by considering the feedback and\nreward function models described below.\nFeedback model. We consider a noisy bandit feedback model where, at every time $t$, player $i$\nobserves a noisy measurement of the reward $\\tilde{r}^i_t = r^i(a^i_t, a^{-i}_t) + \\epsilon^i_t$ where $\\epsilon^i_t$ is $\\sigma_i$-sub-Gaussian, i.e.,\n$\\mathbb{E}[\\exp(c \\epsilon^i_t)] \\le \\exp(c^2 \\sigma_i^2 / 2)$ for all $c \\in \\mathbb{R}$, with independence over time. The presence of noise is\ntypical in real-world applications, since perfect measurements are unrealistic, e.g., measured travel\ntimes in traffic routing.\nBesides the standard noisy bandit feedback, we assume player $i$ also observes the played actions $a^{-i}_t$\nof all the other players. In some applications, the reward function $r^i$ depends only indirectly on $a^{-i}_t$\nthrough some aggregative function $\\psi(a^{-i}_t)$. 
For example, in traffic routing [14], $\\psi(a^{-i}_t)$ represents\nthe total occupancy of the network\u2019s edges, while in network games [15], it represents the strategies\nof player $i$\u2019s neighbours. In such cases, it is sufficient for the player to observe $\\psi(a^{-i}_t)$ instead of $a^{-i}_t$.\nRegularity assumption on rewards. In this work, we assume the unknown reward function $r^i : \\mathcal{A} \\to [0, 1]$\nhas a bounded norm in a reproducing kernel Hilbert space (RKHS) associated with a\npositive semi-definite kernel function $k^i(\\cdot,\\cdot)$, that satisfies $k^i(a, a') \\le 1$ for all $a, a' \\in \\mathcal{A}$. 
The\nRKHS norm $\\|r^i\\|_{k^i} = \\sqrt{\\langle r^i, r^i \\rangle_{k^i}}$ measures the smoothness of $r^i$ with respect to the kernel function\n$k^i(\\cdot,\\cdot)$, while the kernel encodes the similarity between two different outcomes of the game $a, a' \\in \\mathcal{A}$.\nTypical kernel choices are polynomial, Squared Exponential, and Mat\u00e9rn:\n\n$$k_{\\mathrm{poly}}(a, a') = \\left(b + \\frac{a^\\top a'}{l}\\right)^n , \\qquad k_{\\mathrm{SE}}(a, a') = \\exp\\left(-\\frac{s^2}{2 l^2}\\right) ,$$\n$$k_{\\text{Mat\u00e9rn}}(a, a') = \\frac{2^{1-\\nu}}{\\Gamma(\\nu)} \\left(\\frac{s \\sqrt{2\\nu}}{l}\\right)^\\nu B_\\nu\\left(\\frac{s \\sqrt{2\\nu}}{l}\\right) ,$$\n\nwhere $s = \\|a - a'\\|_2$, $B_\\nu$ is the modified Bessel function, and $l, n, \\nu > 0$ are kernel hyperparameters\n[20, Section 4]. This is a standard smoothness assumption used in kernelized bandits and Bayesian\noptimization (e.g., [23, 9]). In our context it allows player $i$ to use the observed history of play to\nlearn about $r^i$ and predict unseen game outcomes. Our results are not restricted to any specific kernel\nfunction, and depending on the application at hand, various kernels can be used to model different\ntypes of reward functions. Moreover, composite kernels (see e.g., [16]) can be used to encode the\ndifferences in the structural dependence of $r^i$ on $a^i$ and $a^{-i}$.\nIt is well known that Gaussian Process models can be used to learn functions with bounded RKHS\nnorm [23, 9]. A GP is a probability distribution over functions $f(a) \\sim \\mathcal{GP}(\\mu(a), k(a, a'))$, specified\nby its mean and covariance functions $\\mu(\\cdot)$ and $k(\\cdot,\\cdot)$, respectively. 
Given a history of measurements\n$\\{y_j\\}_{j=1}^t$ at points $\\{a_j\\}_{j=1}^t$ with $y_j = f(a_j) + \\epsilon_j$ and $\\epsilon_j \\sim \\mathcal{N}(0, \\sigma^2)$, the posterior distribution under\na $\\mathcal{GP}(0, k(a, a'))$ prior is also Gaussian, with mean and variance functions:\n\n$$\\mu_t(a) = k_t(a)^\\top (K_t + \\sigma^2 I_t)^{-1} y_t \\tag{2}$$\n$$\\sigma_t^2(a) = k(a, a) - k_t(a)^\\top (K_t + \\sigma^2 I_t)^{-1} k_t(a) , \\tag{3}$$\n\nwhere $k_t(a) = [k(a_j, a)]_{j=1}^t$, $y_t = [y_1, \\ldots, y_t]^\\top$, and $K_t = [k(a_j, a_{j'})]_{j,j'}$ is the kernel matrix.\nAt time $t$, an upper confidence bound on $f$ can be obtained as:\n\n$$\\mathrm{UCB}_t(a) := \\mu_{t-1}(a) + \\beta_t \\sigma_{t-1}(a) , \\tag{4}$$\n\nwhere $\\beta_t$ is a parameter that controls the width of the confidence bound and ensures $\\mathrm{UCB}_t(a) \\ge f(a)$,\nfor all $a \\in \\mathcal{A}$ and $t \\ge 1$, with high probability [23]. 
We make this statement precise in Theorem 1.\nDue to the above regularity assumptions and feedback model, player $i$ can use the history of play\n$\\{(a_1, \\tilde{r}^i_1), \\ldots, (a_{t-1}, \\tilde{r}^i_{t-1})\\}$ to compute an upper confidence bound $\\mathrm{UCB}_t(\\cdot)$ of the unknown reward\nfunction $r^i$ by using (4). In the next section, we present our algorithm that makes use of $\\mathrm{UCB}_t(\\cdot)$ to\nsimulate full information feedback.\n\n3 The GP-MW Algorithm\n\nWe now introduce GP-MW, a novel no-regret bandit algorithm, which can be used by a generic player\n$i$ (see Algorithm 1). GP-MW maintains a probability distribution (or mixed strategy) $w^i_t$ over $\\mathcal{A}^i$ and\nupdates it at every time step using a multiplicative-weight (MW) subroutine (see (6)) that requires\nfull information feedback. Since such feedback is not available, GP-MW builds (in (5)) an optimistic\nestimate of the true reward of every action via the upper confidence bound $\\mathrm{UCB}_t$ of $r^i$. Moreover,\nsince rewards are bounded in $[0, 1]$, the algorithm makes use of $\\min\\{1, \\mathrm{UCB}_t(\\cdot)\\}$. 
At every time\nstep $t$, GP-MW plays an action $a^i_t$ sampled from $w^i_t$, and uses the noisy reward observation $\\tilde{r}^i_t$ and\nactions $a^{-i}_t$ played by other players to compute the updated upper confidence bound $\\mathrm{UCB}_{t+1}(\\cdot)$.\n\nAlgorithm 1 The GP-MW algorithm for player $i$\nInput: Set of actions $\\mathcal{A}^i$, GP prior $(\\mu_0, \\sigma_0, k^i)$, parameters $\\{\\beta_t\\}_{t \\ge 1}$, $\\eta$\n1: Initialize: $w^i_1 = \\frac{1}{K_i}(1, \\ldots, 1) \\in \\mathbb{R}^{K_i}$\n2: for $t = 1, 2, \\ldots, T$ do\n3:   Sample action $a^i_t \\sim w^i_t$\n4:   Observe noisy reward $\\tilde{r}^i_t$ and opponents\u2019 actions $a^{-i}_t$: $\\tilde{r}^i_t = r^i(a^i_t, a^{-i}_t) + \\epsilon^i_t$\n5:   Compute optimistic reward estimates $\\hat{r}_t \\in \\mathbb{R}^{K_i}$:\n       $[\\hat{r}_t]_a = \\min\\{1, \\mathrm{UCB}_t(a, a^{-i}_t)\\}$ for every $a = 1, \\ldots, K_i$   (5)\n6:   Update mixed strategy:\n       $[w^i_{t+1}]_a = \\frac{[w^i_t]_a \\exp(-\\eta (1 - [\\hat{r}_t]_a))}{\\sum_{k=1}^{K_i} [w^i_t]_k \\exp(-\\eta (1 - [\\hat{r}_t]_k))}$ for every $a = 1, \\ldots, K_i$   (6)\n7:   Update $\\mu_t, \\sigma_t$ according to (2)-(3) by appending $(a_t, \\tilde{r}^i_t)$ to the history of play.\n8: end for\n\nIn Theorem 1, we present a high-probability regret bound for GP-MW while all the proofs of this\nsection can be found in the supplementary material. The obtained bound depends on the maximum\ninformation gain, a kernel-dependent quantity defined as:\n\n$$\\gamma_t := \\max_{a_1, \\ldots, a_t} \\frac{1}{2} \\log\\det(I_t + \\sigma^{-2} K_t) .$$\n\nIt quantifies the maximal reduction in uncertainty about $r^i$ after observing outcomes $\\{a_j\\}_{j=1}^t$ and\nthe corresponding noisy rewards. 
The result of [23] shows that this quantity is sublinear in $T$, e.g.,\n$\\gamma_T = O((\\log T)^{d+1})$ in the case of $k_{\\mathrm{SE}}$, and $\\gamma_T = O\\big(T^{\\frac{d^2+d}{2\\nu+d^2+d}} \\log T\\big)$ in the case of $k_{\\text{Mat\u00e9rn}}$,\nwhere $d$ is the total dimension of the outcomes $a \\in \\mathcal{A}$, i.e., $d = \\sum_{i=1}^N d_i$.\n\nTheorem 1. Fix $\\delta \\in (0, 1)$ and assume $\\epsilon^i_t$\u2019s are $\\sigma_i$-sub-Gaussian with independence over time. 
For\nany $r^i$ such that $\\|r^i\\|_{k^i} \\le B$, if player $i$ plays actions from $\\mathcal{A}^i$, $|\\mathcal{A}^i| = K_i$, according to GP-MW\nwith $\\beta_t = B + \\sqrt{2(\\gamma_{t-1} + \\log(2/\\delta))}$ and $\\eta = \\sqrt{(8 \\log K_i)/T}$, then with probability at least $1 - \\delta$,\n\n$$R^i(T) = O\\left(\\sqrt{T \\log K_i} + \\sqrt{T \\log(2/\\delta)} + B \\sqrt{T \\gamma_T} + \\sqrt{T \\gamma_T (\\gamma_T + \\log(2/\\delta))}\\right) .$$\n\nThe proof of this theorem follows by the decomposition of the regret of GP-MW into the sum of two\nterms. The first term corresponds to the regret that player $i$ incurs with respect to the sequence of\ncomputed upper confidence bounds. The second term is due to not knowing the true reward function\n$r^i$. The proof of Theorem 1 then proceeds by bounding the first term using standard results from\nadversarial online learning [7], while the second term is upper bounded by using regret bounding\ntechniques from GP optimization [23, 4].\nTheorem 1 can be made more explicit by substituting bounds on $\\gamma_T$. For instance, in the case of the\nsquared exponential kernel, the regret bound becomes $R^i(T) = O\\big(\\big((\\log K_i)^{1/2} + (\\log T)^{d+1}\\big) \\sqrt{T}\\big)$.\nIn comparison to the standard multi-armed bandit regret bound $O(\\sqrt{T K_i \\log K_i})$ (e.g., [3]), this\nregret bound does not depend on $\\sqrt{K_i}$, similarly to the ideal full information setting.\n\nThe case of continuous action sets\nIn this section, we consider the case when $\\mathcal{A}^i$ is a (continuous) compact subset of $\\mathbb{R}^{d_i}$. In this case,\nfurther assumptions are required on $r^i$ and $\\mathcal{A}^i$ to achieve sublinear regret. Hence, we assume a\nbounded set $\\mathcal{A}^i \\subset \\mathbb{R}^{d_i}$ and $r^i$ to be Lipschitz continuous in $a^i$. 
Under the same assumptions, existing\nregret bounds are $O(\\sqrt{d_i T \\log T})$ and $O(T^{\\frac{d_i+1}{d_i+2}} \\log T)$ in the full information [18] and bandit setting\n[22], respectively. By using a discretization argument, we obtain a high probability regret bound for\nGP-MW.\nCorollary 1. Let $\\delta \\in (0, 1)$ and $\\epsilon^i_t$ be $\\sigma_i$-sub-Gaussian with independence over time. 
Assume\n$\\|r^i\\|_k \\le B$, $\\mathcal{A}^i \\subset [0, b]^{d_i}$, and $r^i$ is $L$-Lipschitz in its first argument, and consider the discretization\n$[\\mathcal{A}^i]_T$ with $|[\\mathcal{A}^i]_T| = (L b \\sqrt{d_i T})^{d_i}$ such that $\\|a - [a]_T\\|_1 \\le \\sqrt{d_i / T} / L$ for every $a \\in \\mathcal{A}^i$, where $[a]_T$\nis the closest point to $a$ in $[\\mathcal{A}^i]_T$. If player $i$ plays actions from $[\\mathcal{A}^i]_T$ according to GP-MW with\n$\\beta_t = B + \\sqrt{2(\\gamma_{t-1} + \\log(2/\\delta))}$ and $\\eta = \\sqrt{8 d_i \\log(L b \\sqrt{d_i T}) / T}$, then with probability at least\n$1 - \\delta$,\n\n$$R^i(T) = O\\left(\\sqrt{d_i T \\log(L b \\sqrt{d_i T})} + \\sqrt{T \\log(2/\\delta)} + B \\sqrt{T \\gamma_T} + \\sqrt{T \\gamma_T (\\gamma_T + \\log(2/\\delta))}\\right) .$$\n\nBy substituting bounds on $\\gamma_T$, our bound becomes $R^i(T) = O(T^{1/2} \\mathrm{polylog}(T))$ in the case of the\nSE kernel (for fixed $d$). Such a bound has a strictly better dependence on $T$ than the existing bandit\nbound $O(T^{\\frac{d_i+1}{d_i+2}} \\log T)$ from [22]. Similarly to [22, 18], the algorithm resulting from Corollary 1\nis not efficient in high dimensional settings, as its computational complexity is exponential in $d_i$.\n\n4 Experiments\n\nIn this section, we consider random matrix games and a traffic routing model and compare GP-MW\nwith the existing algorithms for playing repeated games. Then, we show an application of GP-MW\nto robust BO and compare it with existing baselines on a movie recommendation problem.\n\n4.1 Repeated random matrix games\nWe consider a repeated matrix game between two players with actions $\\mathcal{A}^1 = \\mathcal{A}^2 = \\{0, 1, \\ldots, K - 1\\}$\nand payoff matrices $A^i \\in \\mathbb{R}^{K \\times K}$, $i = 1, 2$. At every time step, each player $i$ receives a payoff\n$r^i(a^1_t, a^2_t) = [A^i]_{a^1_t, a^2_t}$, where $[A^i]_{i,j}$ indicates the $(i, j)$-th entry of matrix $A^i$. 
We select $K = 30$\nand generate 10 random matrices with $r^1 = r^2 \\sim \\mathcal{GP}(0, k(\\cdot,\\cdot))$, where $k = k_{\\mathrm{SE}}$ with $l = 6$. We set\nthe noise to $\\epsilon^i_t \\sim \\mathcal{N}(0, 1)$, and use $T = 200$. For every game, we distinguish between two settings:\nAgainst random opponent. In this setting, player-2 plays actions uniformly at random from $\\mathcal{A}^2$ at\nevery round $t$, while player-1 plays according to a no-regret algorithm. 
In Figure 1a, we compare the\ntime-averaged regret of player-1 when playing according to HEDGE [11], EXP3.P [3], and GP-MW.\nOur algorithm is run with the true function prior while HEDGE receives (unrealistic) noiseless full\ninformation feedback (at every round $t$) and leads to the lowest regret. When only the noisy bandit\nfeedback is available, GP-MW significantly outperforms EXP3.P.\nGP-MW vs EXP3.P. Here, player-1 plays according to GP-MW while player-2 is an adaptive\nadversary and plays using EXP3.P. In Figure 1b, we compare the regret of the two players averaged\nover the game instances. GP-MW outperforms EXP3.P and ensures player-1 a smaller regret.\n\n(a) Against random opponent\n(b) GP-MW vs. EXP3.P.\nFigure 1: GP-MW leads to smaller regret compared to EXP3.P. HEDGE is an idealized benchmark\nwhich upper bounds the achievable performance. Shaded areas represent \u00b1 one standard deviation.\n\n4.2 Repeated traffic routing\n\nWe consider the Sioux-Falls road network [14, 1], a standard benchmark model in the transportation\nliterature. The network is a directed graph with 24 nodes and 76 edges ($e \\in E$). In this experiment,\nwe have $N = 528$ agents and every agent $i$ seeks to send some number of units $u^i$ from a given\norigin to a given destination node. To do so, agent $i$ can choose among $K_i = 5$ possible routes\nconsisting of network edges $E(i) \\subset E$. A route chosen by agent $i$ corresponds to action $a^i \\in \\mathbb{R}^{|E(i)|}$\nwith $[a^i]_e = u^i$ in case $e$ belongs to the route and $[a^i]_e = 0$ otherwise. The goal of each agent $i$ is to\nminimize the travel time weighted by the number of units $u^i$. The travel time of an agent is unknown\nand depends on the total occupancy of the traversed edges within the chosen route. Hence, the travel\ntime increases when more agents use the same edges. The number of units $u^i$ for every agent, as\nwell as travel time functions for each edge, are taken from [14, 1]. 
A more detailed description of\nour experimental setup is provided in Appendix C.\nWe consider a repeated game, where agents choose routes using one of the following algorithms:\n\u2022 HEDGE. To run HEDGE, each agent has to observe the travel time incurred had she chosen any\ndifferent route. This requires knowing the exact travel time functions. Although these assumptions\nare unrealistic, we use HEDGE as an idealized benchmark.\n\u2022 EXP3.P. In the case of EXP3.P, agents only need to observe their incurred travel time. This\ncorresponds to the standard bandit feedback.\n\u2022 GP-MW. Let $\\psi(a^{-i}_t) \\in \\mathbb{R}^{|E(i)|}$ be the total occupancy (by other agents) of edges $E(i)$ at time $t$.\nTo run GP-MW, agent $i$ needs to observe a noisy measurement of the travel time as well as the\ncorresponding $\\psi(a^{-i}_t)$.\n\u2022 Q-BRI (Q-learning Better Replies with Inertia algorithm [8]). This algorithm requires the same\nfeedback as GP-MW and is proven to asymptotically converge to a Nash equilibrium (as the\nconsidered game is a potential game [19]). We use the same set of algorithm parameters as in [8].\nFor every agent $i$ to run GP-MW, we use a composite kernel $k^i$ such that for every $a_1, a_2 \\in \\mathcal{A}$,\n$k^i((a^i_1, a^{-i}_1), (a^i_2, a^{-i}_2)) = k^i_1(a^i_1, a^i_2) \\cdot k^i_2(a^i_1 + \\psi(a^{-i}_1), a^i_2 + \\psi(a^{-i}_2))$, where $k^i_1$ is a linear kernel\nand $k^i_2$ is a polynomial kernel of degree $n \\in \\{2, 4, 6\\}$.\n\nFigure 2: GP-MW leads to a significantly smaller average regret compared to EXP3.P and Q-BRI\nand improves the overall congestion in the network. HEDGE represents an idealized full information\nbenchmark which upper bounds the achievable performance.\n\nFirst, we consider a random subset of 100 agents that we refer to as learning agents. These agents\nchoose actions (routes) according to the aforementioned no-regret algorithms for $T = 100$ game\nrounds. 
The remaining non-learning agents simply choose the shortest route, ignoring the presence\nof the other agents. In Figure 2 (top plots), we compare the average regret (expressed in hours) of\nthe learning agents when they use the different no-regret algorithms. We also show the associated\naverage congestion in the network (see (13) in Appendix C for a formal definition). When playing\naccording to GP-MW, agents incur significantly smaller regret and the overall congestion is reduced\nin comparison to EXP3.P and Q-BRI.\nIn our second experiment, we consider the same setup as before, but we vary the number of learning\nagents. In Figure 2 (bottom plots), we show the final (when $T = 100$) average regret and congestion\nas a function of the number of learning agents. We observe that GP-MW systematically leads to\na smaller regret and reduced congestion in comparison to EXP3.P and Q-BRI. Moreover, as the\nnumber of learning agents increases, both HEDGE and GP-MW reduce the congestion in the network,\nwhile this is not the case with EXP3.P or Q-BRI (due to a slower convergence).\n\n4.3 GP-MW and robust Bayesian Optimization\nIn this section, we apply GP-MW to a novel robust Bayesian Optimization (BO) setting, similar to\nthe one considered in [4]. The goal is to optimize an unknown function $f$ (under the same regularity\nassumptions as in Section 2) from a sequence of queries and corresponding noisy observations. Very\noften, the actual queried points may differ from the selected ones due to various input perturbations,\nor the function may depend on external parameters that cannot be controlled (see [4] for examples).\nThis scenario can be modelled via a two player repeated game, where a player is competing against\nan adversary. The unknown reward function is given by $f : \\mathcal{X} \\times \\Delta \\to \\mathbb{R}$. At every round $t$ of the\ngame, the player selects a point $x_t \\in \\mathcal{X}$, and the adversary chooses $\\delta_t \\in \\Delta$. 
The player then observes\nthe parameter $\\delta_t$ and a noisy estimate of the reward: $f(x_t, \\delta_t) + \\epsilon_t$. After $T$ time steps, the player\nincurs the regret\n\n$$R(T) = \\max_{x \\in \\mathcal{X}} \\sum_{t=1}^T f(x, \\delta_t) - \\sum_{t=1}^T f(x_t, \\delta_t) .$$\n\nNote that both the regret definition and feedback model are the same as in Section 2.\n\n(a) Users chosen at random.\n(b) Users chosen by adaptive adversary.\nFigure 3: GP-MW ensures no-regret against both randomly and adaptively chosen users, while\nGP-UCB and STABLEOPT attain constant average regret.\n\nIn the standard (non-adversarial) Bayesian optimization setting, the GP-UCB algorithm [23] ensures\nno-regret. On the other hand, the STABLEOPT algorithm [4] attains strong regret guarantees against\nthe worst-case adversary which perturbs the final reported point $x_T$. Here instead, we consider\nthe case where the adversary is adaptive at every time $t$, i.e., it can adapt to past selected points\n$x_1, \\ldots, x_{t-1}$. We note that both GP-UCB and STABLEOPT fail to achieve no-regret in this setting,\nas both algorithms are deterministic conditioned on the history of play. On the other hand, GP-MW\nis a no-regret algorithm in this setting according to Theorem 1 (and Corollary 1).\nNext, we demonstrate these observations experimentally in a movie recommendation problem.\nMovie recommendation. We seek to recommend movies to users according to their preferences.\nA priori it is unknown which user will see the recommendation at any time $t$. We assume that such a\nuser is chosen arbitrarily (possibly adversarially), simultaneously to our recommendation.\nWe use the MovieLens-100K dataset [12] which provides a matrix of ratings for 1682 movies\nrated by 943 users. We apply non-negative matrix factorization with $p = 15$ latent factors on the\nincomplete rating matrix and obtain feature vectors $m_i, u_j \\in \\mathbb{R}^p$ for movies and users, respectively.\nHence, $m_i^\\top u_j$ represents the rating of movie $i$ by user $j$. 
At every round $t$, the player selects $m_t \\in \\{m_1, \\ldots, m_{1682}\\}$,\nthe adversary chooses (without observing $m_t$) a user index $i_t \\in \\{1, \\ldots, 943\\}$,\nand the player receives reward $f(m_t, i_t) = m_t^\\top u_{i_t}$. We model $f$ via a GP with composite kernel\n$k((m, i), (m', i')) = k_1(m, m') \\cdot k_2(i, i')$ where $k_1$ is a linear kernel and $k_2$ is a diagonal kernel.\nWe compare the performance of GP-MW against that of GP-UCB and STABLEOPT\nwhen sequentially recommending movies.\nIn this experiment, we let GP-UCB select $m_t = \\arg\\max_m \\max_i \\mathrm{UCB}_t(m, i)$, while STABLEOPT chooses $m_t = \\arg\\max_m \\min_i \\mathrm{UCB}_t(m, i)$\nat every round $t$. Both algorithms update their posteriors with measurements at $(m_t, \\hat{i}_t)$ with\n$\\hat{i}_t = \\arg\\max_i \\mathrm{UCB}_t(m_t, i)$ in the case of GP-UCB and $\\hat{i}_t = \\arg\\min_i \\mathrm{LCB}_t(m_t, i)$ for STABLEOPT.\nHere, $\\mathrm{LCB}_t$ represents a lower confidence bound on $f$ (see [4] for details).\nIn Figure 3a, we show the average regret of the algorithms when the adversary chooses users uniformly\nat random at every $t$. In our second experiment (Figure 3b), we show their performance when the adversary\nis adaptive and selects $i_t$ according to the HEDGE algorithm. We observe that in both experiments\nGP-MW is no-regret, while the average regrets of both GP-UCB and STABLEOPT do not vanish.\n\n5 Conclusions\n\nWe have proposed GP-MW, a no-regret bandit algorithm for playing unknown repeated games. In\naddition to the standard bandit feedback, the algorithm requires observing the actions of other players\nafter every round of the game. By exploiting the correlation among different game outcomes, it computes\nupper confidence bounds on the rewards and uses them to simulate unavailable full information\nfeedback. Our algorithm attains high probability regret bounds that can substantially improve upon\nthe existing bandit regret bounds. 
In our experiments, we have demonstrated the effectiveness of\nGP-MW on synthetic games, and real-world problems of traf\ufb01c routing and movie recommendation.\n\n8\n\n\fAcknowledgments\nThis work was gratefully supported by Swiss National Science Foundation, under the grant SNSF\n200021_172781, and by the European Union\u2019s Horizon 2020 ERC grant 815943.\n\nReferences\n[1] Transportation network test problems. http://www.bgu.ac.il/ bargera/tntp/.\n\n[2] Yasin Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. PhD\n\nthesis, Edmonton, Alta., Canada, 2012.\n\n[3] Peter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic\n\nmultiarmed bandit problem. SIAM J. Comput., 32(1):48\u201377, January 2003.\n\n[4] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust\noptimization with gaussian processes. In Neural Information Processing Systems (NeurIPS),\n2018.\n\n[5] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance\nreduction: A uni\ufb01ed approach to bayesian optimization and level-set estimation. In Neural\nInformation Processing Systems (NeurIPS), 2016.\n\n[6] S\u00e9bastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex\nIn Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of\n\noptimization.\nComputing, STOC 2017, pages 72\u201385, 2017.\n\n[7] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge\n\nUniversity Press, New York, NY, USA, 2006.\n\n[8] Archie C. Chapman, David S. Leslie, Alex Rogers, and Nicholas R. Jennings. Convergent\nlearning algorithms for unknown reward games. SIAM J. Control and Optimization, 51(4):3154\u2013\n3180, 2013.\n\n[9] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International\n\nConference on Machine Learning (ICML), 2017.\n\n[10] Itay P. Fainmesser. 
Community structure and market outcomes: A repeated games-in-networks approach. American Economic Journal: Microeconomics, 4(1):32\u201369, February 2012.\n\n[11] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119\u2013139, 1997.\n\n[12] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1\u201319:19, December 2015.\n\n[13] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In International Conference on Machine Learning (ICML), 2017.\n\n[14] Larry J. Leblanc. An algorithm for the discrete network design problem. Transportation Science, 9:183\u2013199, 08 1975.\n\n[15] Matthew O. Jackson and Yves Zenou. Games on networks. In Handbook of Game Theory with Economic Applications, volume 4, chapter 3, pages 95\u2013163. Elsevier, 2015.\n\n[16] Andreas Krause and Cheng S. Ong. Contextual Gaussian process bandit optimization. In Neural Information Processing Systems (NeurIPS), 2011.\n\n[17] N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212\u2013261, 1994.\n\n[18] Odalric-Ambrym Maillard and R\u00e9mi Munos. Online learning in adversarial Lipschitz environments. In Machine Learning and Knowledge Discovery in Databases, pages 305\u2013320, 2010.\n\n[19] Dov Monderer and Lloyd S. Shapley. Potential games. Games and Economic Behavior, 14(1):124\u2013143, 1996.\n\n[20] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.\n\n[21] Brian Skyrms and Robin Pemantle. A dynamic model of social network formation. Adaptive Networks: Theory, Models and Applications, pages 231\u2013251, 2009.\n\n[22] Aleksandrs Slivkins. 
Contextual bandits with similarity information. Journal of Machine Learning Research, 15:2533\u20132568, 2014.\n\n[23] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.\n\n[24] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Neural Information Processing Systems (NeurIPS), 2015.\n\n[25] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), 2003.\n", "award": [], "sourceid": 7573, "authors": [{"given_name": "Pier Giuseppe", "family_name": "Sessa", "institution": "ETH Z\u00fcrich"}, {"given_name": "Ilija", "family_name": "Bogunovic", "institution": "ETH Zurich"}, {"given_name": "Maryam", "family_name": "Kamgarpour", "institution": "ETH Z\u00fcrich"}, {"given_name": "Andreas", "family_name": "Krause", "institution": "ETH Zurich"}]}