{"title": "Regret of Queueing Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1669, "page_last": 1677, "abstract": "We consider a variant of the multiarmed bandit problem where jobs queue for service, and service rates of different servers may be unknown.  We study algorithms that minimize queue-regret: the (expected) difference between the queue-lengths obtained by the algorithm, and those obtained by a genie-aided matching algorithm that knows exact service rates.  A naive view of this problem would suggest that queue-regret should grow logarithmically: since queue-regret cannot be larger than classical regret, results for the standard MAB problem give algorithms that ensure queue-regret increases no more than logarithmically in time. Our paper shows surprisingly more complex behavior.  In particular, the naive intuition is correct as long as the bandit algorithm's queues have relatively long regenerative cycles: in this case queue-regret is similar to cumulative regret, and scales (essentially) logarithmically.  However, we show that this \"early stage\" of the queueing bandit eventually gives way to a \"late stage\", where the optimal queue-regret scaling is O(1/t).  We demonstrate an algorithm that (order-wise) achieves this asymptotic queue-regret, and also exhibits close to optimal switching time from the early stage to the late stage.", "full_text": "Regret of Queueing Bandits\n\nSubhashini Krishnasamy\nUniversity of Texas at Austin\n\nRajat Sen\n\nUniversity of Texas at Austin\n\nRamesh Johari\n\nStanford University\n\nSanjay Shakkottai\n\nUniversity of Texas at Austin\n\nAbstract\n\nWe consider a variant of the multiarmed bandit problem where jobs queue for ser-\nvice, and service rates of different servers may be unknown. 
We study algorithms that minimize queue-regret: the (expected) difference between the queue-lengths obtained by the algorithm, and those obtained by a "genie"-aided matching algorithm that knows exact service rates. A naive view of this problem would suggest that queue-regret should grow logarithmically: since queue-regret cannot be larger than classical regret, results for the standard MAB problem give algorithms that ensure queue-regret increases no more than logarithmically in time. Our paper shows surprisingly more complex behavior. In particular, the naive intuition is correct as long as the bandit algorithm's queues have relatively long regenerative cycles: in this case queue-regret is similar to cumulative regret, and scales (essentially) logarithmically. However, we show that this "early stage" of the queueing bandit eventually gives way to a "late stage", where the optimal queue-regret scaling is O(1/t). We demonstrate an algorithm that (order-wise) achieves this asymptotic queue-regret, and also exhibits close to optimal switching time from the early stage to the late stage.

1 Introduction

Stochastic multi-armed bandits (MAB) have a rich history in sequential decision making [1, 2, 3]. In its simplest form, a collection of K arms is present, each having a binary reward (Bernoulli random variable over {0, 1}) with an unknown success probability¹ (different across arms). At each (discrete) time, a single arm is chosen by the bandit algorithm, and a (binary-valued) reward is accrued. The MAB problem is to determine which arm to choose at each time in order to minimize the cumulative expected regret, namely, the cumulative loss of reward when compared to a genie that has knowledge of the arm success probabilities.
In this paper, we consider the variant of this problem motivated by queueing applications.
Formally, suppose that arms are pulled upon arrivals of jobs; each arm is now a server that can serve the arriving job. In this model, the stochastic reward described above is equivalent to service. In other words, if the arm (server) that is chosen results in positive reward, the job is successfully completed and departs the system. However, this basic model fails to capture an essential feature of service in many settings: in a queueing system, jobs wait until they complete service. Such systems are stateful: when the chosen arm results in zero reward, the job being served remains in the queue, and over time the model must track the remaining jobs waiting to be served. The difference between the cumulative number of arrivals and departures, or the queue length, is the most common measure of the quality of the service strategy being employed.

¹Here, the success probability of an arm is the probability that the reward equals '1'.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Queueing is employed in modeling a vast range of service systems, including supply and demand in online platforms (e.g., Uber, Lyft, Airbnb, Upwork, etc.); order flow in financial markets (e.g., limit order books); packet flow in communication networks; and supply chains. In all of these systems, queueing is an essential part of the model: e.g., in online platforms, the available supply (e.g., available drivers in Uber or Lyft, or available rentals in Airbnb) queues until it is "served" by arriving demand (ride requests in Uber or Lyft, booking requests in Airbnb).
Since MAB models are a natural way to capture learning in this entire range of systems, incorporating queueing behavior into the MAB model is an essential challenge.
This problem clearly has the explore-exploit tradeoff inherent in the standard MAB problem: since the success probabilities across different servers are unknown, there is a tradeoff between learning (exploring) the different servers and exploiting the most promising server from past observations. We refer to this problem as the queueing bandit. Since the queue length is simply the difference between the cumulative number of arrivals and departures (cumulative actual reward; here the reward is 1 if a job is served), the natural notion of regret is to compare the expected queue length under a bandit algorithm with the corresponding one under a genie policy (with identical arrivals) that always chooses the arm with the highest expected reward.
Queueing System: To capture this trade-off, we consider a discrete-time queueing system with a single queue and K servers. Arrivals to the queue and service offered by the links follow a product Bernoulli distribution, i.i.d. across time slots. The statistical parameters of the service distributions are unknown. In any time slot, the queue can be served by at most one server, and the problem is to schedule a server in every time slot. Service is pre-emptive, and a job returns to the queue if not served. At least one server has a service rate higher than the arrival rate, which ensures that the "genie" policy is stable.
Let Q(t) be the queue length at time t under a given bandit algorithm, and let Q*(t) be the corresponding queue length under the "genie" policy that always schedules the optimal server (i.e., always plays the arm with the highest mean). We define the queue-regret as the difference in expected queue lengths for the two policies.
That is, the regret is given by:

    Ψ(t) := E[Q(t) − Q*(t)].    (1)

Here Ψ(t) has the interpretation of the traditional MAB regret, with the caveat that rewards are accumulated only if there is a job that can benefit from this reward. We refer to Ψ(t) as the queue-regret; formally, our goal is to develop bandit algorithms that minimize the queue-regret at a finite time t.
To develop some intuition, we compare this to the standard stochastic MAB problem. For the standard problem, well-known algorithms such as UCB, KL-UCB, and Thompson sampling achieve a cumulative regret of O((K − 1) log t) at time t [4, 5, 6], and this result is essentially tight [7]. In the queueing bandit, we can obtain a simple bound on the queue-regret by noting that it cannot be any higher than the traditional regret (where a reward is accrued at each time whether a job is present or not). This leads to an upper bound of O((K − 1) log t) for the queue-regret.
However, this upper bound does not tell the whole story for the queueing bandit: we show that there are two "stages" to the queueing bandit. In the early stage, the bandit algorithm is unable to even stabilize the queue; i.e., on average, the queue length increases over time and is continuously backlogged, so the queue-regret grows with time, similar to the cumulative regret. Once the algorithm is able to stabilize the queue (the late stage), a dramatic shift occurs in the behavior of the queue-regret. A stochastically stable queue goes through regenerative cycles: a random cyclical behavior in which the queue builds up over time, then empties, and the cycle repeats. The associated recurring "zero-queue-length" epochs mean that the sample-path queue-regret essentially "resets" at (stochastically) regular intervals; i.e., the sample-path queue-regret becomes non-positive at these time instants.
Thus the queue-regret should fall over time, as the algorithm learns.
Our main results provide lower bounds on queue-regret for both the early and late stages, as well as algorithms that essentially match these lower bounds. We first describe the late stage, and then describe the early stage for a heavily loaded system.
1. The late stage. We first consider what happens to the queue-regret as t → ∞. As noted above, a reasonable intuition for this regime comes from considering a standard bandit algorithm, but where the sample-path queue-regret "resets" at time points of regeneration.² In this case, the queue-regret is

²This is inexact since the optimal queueing system and bandit queueing system may not regenerate at the same time point; but the intuition holds.

approximately a (discrete) derivative of the cumulative regret. Since the optimal cumulative regret scales like log t, asymptotically the optimal queue-regret should scale like 1/t. Indeed, we show that the queue-regret for α-consistent policies is at least C/t infinitely often, where C is a constant independent of t. Further, we introduce an algorithm called Q-ThS for the queueing bandit (a variant of Thompson sampling with explicit structured exploration), and show an asymptotic regret upper bound of O(poly(log t)/t) for Q-ThS, thus matching the lower bound up to poly-logarithmic factors in t. Q-ThS exploits structured exploration: we use the fact that the queue regenerates regularly to explore more systematically and aggressively.
2. The early stage. The preceding discussion might suggest that an algorithm that explores aggressively would dominate any algorithm that balances exploration and exploitation. However, an overly aggressive exploration policy will preclude the queueing system from ever stabilizing, which is necessary to induce the regenerative cycles that lead the system to the late stage.
To even enter the late stage, therefore, we need an algorithm that exploits enough to actually stabilize the queue (i.e., choose good arms sufficiently often so that the mean service rate exceeds the expected arrival rate).
We refer to the early stage of the system, as noted above, as the period before the algorithm has learned to stabilize the queues. For a heavily loaded system, where the arrival rate approaches the service rate of the optimal server, we show a lower bound of Ω(log t / log log t) on the queue-regret in the early stage. Thus, up to a log log t factor, the early-stage regret behaves similarly to the cumulative regret (which scales like log t). The heavily loaded regime is a natural asymptotic regime in which to study queueing systems, and has been extensively employed in the literature; see, e.g., [9, 10] for surveys.
Perhaps more importantly, our analysis shows that the time to switch from the early stage to the late stage scales at least as t = Ω(K/ε), where ε is the gap between the arrival rate and the service rate of the optimal server; thus ε → 0 in the heavy-load setting. In particular, we show that the early-stage lower bound of Ω(log t / log log t) is valid up to t = O(K/ε); on the other hand, we also show that, in the heavy-load limit, depending on the relative scaling between K and ε, the regret of Q-ThS scales like O(poly(log t)/(ε²t)) for times that are arbitrarily close to Ω(K/ε). In other words, Q-ThS is nearly optimal in the time it takes to "switch" from the early stage to the late stage.
Our results constitute the first insight into the behavior of regret in this queueing setting; as emphasized, it is quite different from that seen for minimization of cumulative regret in the standard MAB problem. The preceding discussion highlights why minimization of queue-regret presents a subtle learning problem.
On one hand, if the queue has been stabilized, the presence of regenerative cycles allows us to establish that queue-regret must eventually decay to zero at rate 1/t under an optimal algorithm (the late stage). On the other hand, to actually have regenerative cycles in the first place, a learning algorithm needs to exploit enough to stabilize the queue (the early stage). Our analysis not only characterizes regret in both regimes, but also essentially exactly characterizes the transition point between the two regimes. In this way the queueing bandit is a remarkable new example of the tradeoff between exploration and exploitation.

2 Related work

MAB algorithms. Stochastic MAB models have been widely used in the past as a paradigm for various sequential decision-making problems in industrial manufacturing, communication networks, clinical trials, online advertising and webpage optimization, and other domains requiring resource allocation and scheduling; see, e.g., [1, 2, 3]. The MAB problem has been studied in two variants, based on different notions of optimality. One considers the mean accumulated loss of rewards, often called regret, as compared to a genie policy that always chooses the best arm. Most effort in this direction is focused on obtaining the best regret bounds possible at any finite time, in addition to designing computationally feasible algorithms [3]. The other line of research models the bandit problem as a Markov decision process (MDP), with the goal of optimizing infinite-horizon discounted or average reward. The aim is to characterize the structure of the optimal policy [2]. Since these policies deal with optimality with respect to infinite-horizon costs, unlike the former body of research, they give steady-state rather than finite-time guarantees. Our work uses the regret minimization framework to study the queueing bandit problem.
Bandits for queues.
There is a body of literature on the application of bandit models to queueing and scheduling systems [2, 11, 12, 13, 14, 15, 16, 17]. These queueing studies focus on infinite-horizon costs (i.e., statistically steady-state behavior, where the focus typically is on conditions for optimality of index policies); further, the models do not typically consider user-dependent server statistics. Our focus here is different: algorithms and analysis to optimize finite-time regret.

3 Problem Setting

We consider a discrete-time queueing system with a single queue and K servers. The servers are indexed by k = 1, . . . , K. Arrivals to the queue and service offered by the links follow a product Bernoulli distribution, i.i.d. across time slots. The mean arrival rate is given by λ and the mean service rates by the vector μ = [μ_k]_{k∈[K]}, with λ < max_{k∈[K]} μ_k. In any time slot, the queue can be served by at most one server, and the problem is to schedule a server in every time slot. The scheduling decision at any time t is based on past observations of the services obtained from the scheduled servers up to time t − 1. The statistical parameters of the service distributions are unknown. The queueing system evolution can be described as follows. Let κ(t) denote the server that is scheduled at time t. Also, let R_k(t) ∈ {0, 1} be the service offered by server k, and let S(t) denote the service offered by server κ(t) at time t, i.e., S(t) = R_{κ(t)}(t). If A(t) is the number of arrivals at time t, then the queue length at time t is given by: Q(t) = (Q(t − 1) + A(t) − S(t))⁺.
Our goal in this paper is to focus attention on how queueing behavior impacts regret minimization in bandit algorithms.
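The recursion Q(t) = (Q(t − 1) + A(t) − S(t))⁺ can be exercised in a few lines of simulation. The sketch below is illustrative only: the function name, the rates used, and the genie policy shown are our own choices for the example, not the paper's.

```python
import random

def simulate_queue(T, lam, mu, policy, seed=0):
    """Simulate the discrete-time queueing bandit of Section 3 (sketch).

    lam: Bernoulli arrival rate lambda; mu: list of K Bernoulli service rates;
    policy(t) returns the index of the server scheduled at slot t.
    Returns the queue-length trajectory [Q(0), ..., Q(T)].
    """
    rng = random.Random(seed)
    Q = [0]
    for t in range(1, T + 1):
        k = policy(t)
        A = 1 if rng.random() < lam else 0      # arrival A(t)
        S = 1 if rng.random() < mu[k] else 0    # offered service S(t) = R_k(t)
        Q.append(max(Q[-1] + A - S, 0))         # Q(t) = (Q(t-1) + A(t) - S(t))^+
    return Q

# Genie policy: always schedule the server with the highest mean rate.
mu = [0.3, 0.5, 0.7]
genie = lambda t: max(range(len(mu)), key=lambda k: mu[k])
trajectory = simulate_queue(2000, lam=0.6, mu=mu, policy=genie)
```

Since λ = 0.6 < max_k μ_k = 0.7 here, the genie-scheduled queue is stable and the trajectory keeps returning to zero, in line with the regenerative behavior discussed above.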
We evaluate the performance of scheduling policies against the policy that schedules the (unique) optimal server in every time slot, i.e., the server k* := arg max_{k∈[K]} μ_k with the maximum mean rate μ* := max_{k∈[K]} μ_k. Let Q(t) be the queue length at time t under our specified algorithm, and let Q*(t) be the corresponding queue length under the optimal policy. We define regret as the difference in mean queue lengths for the two policies. That is, the regret is given by: Ψ(t) := E[Q(t) − Q*(t)]. We use the terms queue-regret or simply regret to refer to Ψ(t). Throughout, when we evaluate queue-regret, we do so under the assumption that the queueing system starts in the steady-state distribution of the system induced by the optimal policy, as follows.
Assumption 1 (Initial State). Both Q(0) and Q*(0) have the same initial state distribution, and this is chosen to be the stationary distribution of Q*(t); this distribution is denoted π(λ, μ*).

4 The Late Stage

We analyze the performance of a scheduling algorithm with respect to queue-regret as a function of time and of system parameters such as: (a) the load on the system, ε := μ* − λ, and (b) the minimum difference between the rates of the best and the next-best servers, Δ := μ* − max_{k≠k*} μ_k.
As a preview of the theoretical results, Figure 1 shows the evolution of queue-regret with time in a system with 5 servers under a scheduling policy inspired by Thompson sampling. Exact details of the scheduling algorithm can be found in Section 4.2. It is observed that the regret goes through a phase transition. In the initial stage, when the algorithm has not estimated the service rates well enough to stabilize the queue, the regret grows poly-logarithmically as in the classical MAB setting.
After a critical point, when the algorithm has learned the system parameters well enough to stabilize the queue, the queue length goes through regenerative cycles as the queue becomes empty. In other words, instead of the queue length being continuously backlogged, the queueing system has a stochastic cyclical behavior where the queue builds up, becomes empty, and this cycle recurs. Thus at the beginning of every regenerative cycle, there is no accumulation of past errors and the sample-path queue-regret is at most zero. As the algorithm estimates the parameters better with time, the length of the regenerative cycles decreases and the queue-regret decays to zero.

[Figure 1: Queue-regret Ψ(t) under Q-ThS in a system with K = 5, ε = 0.1 and Δ = 0.17. The plot annotates O(log³ t) growth against the Ω(log t / log log t) lower bound in the early stage, and O(log³ t / t) decay against the Ω(1/t) lower bound in the late stage.]

Notation: For the results in Section 4, the notation f(t) = O(g(K, ε, t)) for all t ∈ h(K, ε) (here, h(K, ε) is an interval that depends on K, ε) means that there exist constants C and t₀ independent of K and ε such that f(t) ≤ C g(K, ε, t) for all t ∈ (t₀, ∞) ∩ h(K, ε).

4.1 An Asymptotic Lower Bound

We establish an asymptotic lower bound on regret for the class of α-consistent policies; this class for the queueing bandit is a generalization of the α-consistent class used in the literature for the traditional stochastic MAB problem [7, 18, 19]. The precise definition is given below (1{·} is the indicator function).
Definition 1.
A scheduling policy is said to be α-consistent (for some α ∈ (0, 1)) if, given any problem instance specified by (λ, μ), E[Σ_{s=1}^{t} 1{κ(s) = k}] = O(t^α) for all k ≠ k*.

Theorem 1 below gives an asymptotic lower bound on the average queue-regret and per-queue regret for an arbitrary α-consistent policy.
Theorem 1. For any problem instance (λ, μ) and any α-consistent policy, the regret Ψ(t) satisfies

    Ψ(t) ≥ (λ/4) D(μ) (1 − α)(K − 1) (1/t)

for infinitely many t, where

    D(μ) = Δ / KL(μ_min, (μ* + 1)/2).    (2)

Outline for Theorem 1. The proof of the lower bound consists of three main steps. First, in Lemma 21, we show that the regret at any time slot is lower bounded by the probability of a sub-optimal schedule in that time slot (up to a constant factor that depends on the problem instance). The key idea in this lemma is to show the equivalence of any two systems with the same marginal service distributions under bandit feedback. This is achieved through a carefully constructed coupling argument that maps the original system, with service independent across links, to another system whose service process is dependent across links but has the same marginal distributions.
As a second step, the lower bound on the regret in terms of the probability of a sub-optimal schedule enables us to obtain a lower bound on the cumulative queue-regret in terms of the number of sub-optimal schedules.
We then use a lower bound on the number of sub-optimal schedules for α-consistent policies (Lemma 19 and Corollary 20) to obtain a lower bound on the cumulative regret. In the final step, we use the lower bound on the cumulative queue-regret to obtain an infinitely-often lower bound on the queue-regret.

4.2 Achieving the Asymptotic Bound

We next focus on algorithms that can (up to a poly-log factor) achieve a scaling of O(1/t). A key challenge is that we need high-probability bounds on the number of times the correct arm is scheduled, and these bounds must hold over the late-stage regenerative cycles of the queue. Recall that these regenerative cycles are random time intervals with Θ(1) expected length under the optimal policy, and that their lengths are correlated with the bandit algorithm's decisions (the queue-length evolution depends on the past history of bandit arm schedules). To address this, we propose a slightly modified version of the Thompson sampling algorithm. The algorithm, which we call Q-ThS, has an explicit structured exploration component similar to ε-greedy algorithms. This structured exploration provides sufficiently good estimates for all arms (including sub-optimal ones) in the late stage.
We now describe the algorithm in detail. Let T_k(t) be the number of times server k is assigned in the first t time slots, and let μ̂(t) be the vector of empirical mean service rates at time slot t computed from past observations (until t − 1). At time slot t, Q-ThS decides to explore with probability min{1, 3K log² t / t}; otherwise it exploits.
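A minimal sketch of one Q-ThS scheduling decision is below; the bookkeeping is our own (per-server success/failure counts stand in for μ̂_k(t)T_k(t − 1) and (1 − μ̂_k(t))T_k(t − 1)), and the paper's authoritative description is its Algorithm 1.

```python
import math
import random

def q_ths_choose(t, successes, failures, rng):
    """One scheduling decision of Q-ThS (sketch).

    With probability min{1, 3K log^2(t)/t}, explore a server uniformly at
    random; otherwise exploit via Bernoulli Thompson sampling, drawing
    theta_k ~ Beta(successes[k] + 1, failures[k] + 1) and scheduling the
    server with the largest sample.
    """
    K = len(successes)
    explore_prob = min(1.0, 3 * K * math.log(t) ** 2 / t) if t > 1 else 1.0
    if rng.random() < explore_prob:
        return rng.randrange(K)                      # structured exploration
    samples = [rng.betavariate(successes[k] + 1, failures[k] + 1)
               for k in range(K)]
    return max(range(K), key=lambda k: samples[k])   # Thompson exploit step
```

Note that the exploration probability decays like log² t / t, so exploration dominates early on but becomes rare in the late stage, while still sampling every server often enough for the concentration arguments described next.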
When exploring, it chooses a server uniformly at random. The chosen exploration rate ensures that we are able to obtain concentration results for the number of times any link is sampled.³ When exploiting, for each k ∈ [K] we draw a sample θ̂_k(t) from the distribution Beta(μ̂_k(t)T_k(t − 1) + 1, (1 − μ̂_k(t))T_k(t − 1) + 1), and schedule the arm with the largest sample (standard Thompson sampling for Bernoulli arms [20]). Details of the algorithm are given in Algorithm 1 in the Appendix.
We now show that, for a given problem instance (λ, μ) (and therefore fixed ε), the regret under Q-ThS scales as O(poly(log t)/t). We state the most general form of the asymptotic upper bound in Theorem 2. A slightly weaker version of the result is given in Corollary 3. This corollary is useful for understanding the dependence of the upper bound on the load ε and the number of servers K.

³The exploration rate could scale like log t/t if we knew Δ in advance; however, without this knowledge, additional exploration is needed.

Notation: For the following results, the notation f(t) = O(g(K, ε, t)) for all t ∈ h(K, ε) (here, h(K, ε) is an interval that depends on K, ε) means that there exist constants C and t₀ independent of K and ε such that f(t) ≤ C g(K, ε, t) for all t ∈ (t₀, ∞) ∩ h(K, ε).

Theorem 2. Consider any problem instance (λ, μ). Let w(t) = exp((2 log t / Δ)^{2/3}), v₀(t) = (6K/ε) w(t), and v(t) = (24/ε²) log t + 60K/ε. Then, under Q-ThS the regret Ψ(t) satisfies

    Ψ(t) = O(K v(t) log² t / t)

for all t such that w(t)/log t ≥ 2/ε, t ≥ exp(6/Δ²) and v(t) + v₀(t) ≤ t/2.

Corollary 3. Let w(t) be as defined in Theorem 2. Then,

    Ψ(t) = O(K log³ t / (ε² t))

for all t such that w(t)/log t ≥ 2/ε, w(t) ≥ max{24K/ε, 15K² log t}, t ≥ exp(6/Δ²) and t/log t ≥ 198/ε².

Outline for Theorem 2.
As mentioned earlier, the central idea in the proof is that the sample-path queue-regret is at most zero at the beginning of regenerative cycles, i.e., at instants at which the queue becomes empty. The proof consists of two main parts: one gives a high-probability bound on the number of sub-optimal schedules in the exploit phase in the late stage, and the other shows that, at any time, the beginning of the current regenerative cycle is not very far in the past.
The former part is proved in Lemma 9, where we make use of the structured exploration component of Q-ThS to show that all the links, including the sub-optimal ones, are sampled a sufficiently large number of times to give good estimates of the link rates. This in turn ensures that the algorithm schedules the correct link in the exploit phase in the late stage with high probability.
For the latter part, we prove a high-probability bound on the last time instant when the queue was zero (which is the beginning of the current regenerative cycle) in Lemma 15. Here, we make use of a recursive argument to obtain a tight bound. More specifically, we first use a coarse high-probability upper bound on the queue length (Lemma 11) to get a first-cut bound on the beginning of the regenerative cycle (Lemma 12).
This bound on the regenerative cycle length is then recursively used to obtain tighter bounds on the queue length and, in turn, on the start of the current regenerative cycle (Lemmas 14 and 15, respectively).
The proof of the theorem proceeds by combining the two parts above to show that the main contribution to the queue-regret comes from the structured exploration component in the current regenerative cycle, which gives the stated result.

5 The Early Stage in the Heavily Loaded Regime

In order to study the performance of α-consistent policies in the early stage, we consider the heavily loaded system, where the arrival rate λ is close to the optimal service rate μ*, i.e., ε = μ* − λ → 0. This is a well-studied asymptotic regime for queueing systems, as it leads to fundamental insight into their structure. See, e.g., [9, 10] for extensive surveys. Analyzing queue-regret in the early stage in the heavily loaded regime has the effect that the optimal server is the only one that stabilizes the queue. As a result, in the heavily loaded regime, effective learning and scheduling of the optimal server play a crucial role in determining the transition point from the early stage to the late stage.
For this reason, the heavily loaded regime reveals the behavior of regret in the early stage.
Notation: For all the results in this section, the notation f(t) = O(g(K, ε, t)) for all t ∈ h(K, ε) (h(K, ε) is an interval that depends on K, ε) means that there exist numbers C and ε₀ that depend on λ such that for all ε ≤ ε₀, f(t) ≤ C g(K, ε, t) for all t ∈ h(K, ε).
Theorem 4 gives a lower bound on the regret in the heavily loaded regime, roughly in the time interval (K^{1/(1−α)}, O(K/ε)), for any α-consistent policy.

Theorem 4. Given any problem instance (λ, μ), any α-consistent policy, and any β > 1/(1 − α), the regret Ψ(t) satisfies

    Ψ(t) ≥ (D(μ)/2) (K − 1) log t / log log t

for t ∈ [max{C₁K^β, τ}, (K − 1) D(μ)/(2ε)], where D(μ) is given by equation (2), and τ and C₁ are constants that depend on α, β and the policy.

Outline for Theorem 4. The crucial idea in the proof is to show a lower bound on the queue-regret in terms of the number of sub-optimal schedules (Lemma 22). As in Theorem 1, we then use a lower bound on the number of sub-optimal schedules for α-consistent policies (given by Corollary 20) to obtain a lower bound on the queue-regret.

Theorem 4 shows that, for any α-consistent policy, it takes at least Ω(K/ε) time for the queue-regret to transition from the early stage to the late stage. In this region, the O(log t / log log t) scaling reflects the fact that queue-regret is dominated by the cumulative regret, which grows like O(log t).
A reasonable question then arises: after time Ω(K/ε), should we expect the regret to transition into the late-stage regime analyzed in the preceding section?
We answer this question by studying when Q-ThS achieves its late-stage regret scaling of O(poly(log t)/(ε²t)); as we will see, in an appropriate sense, Q-ThS is close to optimal in its transition from early stage to late stage, when compared to the bound in Theorem 4. Formally, we have Corollary 5, which is an analog of Corollary 3 in the heavily loaded regime.

Corollary 5. For any problem instance (λ, μ), any δ ∈ (0, 1) and any γ ∈ (0, min(δ, 1 − δ)), the regret under Q-ThS satisfies

    Ψ(t) = O(K log³ t / (ε² t))

for all t ≥ C₂ max{(1/ε²)^{1/(1−δ)}, (1/ε)^{1/γ}, (K/ε)^{1/(1−γ)}, (K²)^{1/δ}}, where C₂ is a constant independent of ε (but depends on λ, δ and γ).

By combining the result in Corollary 5 with Theorem 4, we can infer that, in the heavily loaded regime, the time taken by Q-ThS to achieve the O(poly(log t)/(ε²t)) scaling is, in some sense, order-wise close to optimal within the α-consistent class. Specifically, for any δ ∈ (0, 1), there exists a scaling of K with ε such that the queue-regret under Q-ThS scales as O(poly(log t)/(ε²t)) for all t > (K/ε)^{1+δ}, while the regret under any α-consistent policy scales as Ω(K log t / log log t) for t < K/ε.
We conclude by noting that while the transition point from the early stage to the late stage for Q-ThS is near optimal in the heavily loaded regime, it does not yield optimal regret performance in the early stage in general. In particular, recall that at any time t, the structured exploration component in Q-ThS is invoked with probability 3K log² t / t. As a result, in the early stage, queue-regret under Q-ThS could be a log² t factor worse than the Ω(log t / log log t) lower bound shown in Theorem 4 for the α-consistent class.
This intuition can be formalized: it is straightforward to show an upper bound of 2K log³ t for any t > max{C₃, U}, where C₃ is a constant that depends on Δ but is independent of K and ε; we omit the details.

(a) Queue-regret under Q-ThS for a system with 5 servers, with ε ∈ {0.05, 0.1, 0.15}. (b) Queue-regret under Q-ThS for a system with 7 servers, with ε ∈ {0.05, 0.1, 0.15}.
Figure 2: Variation of queue-regret Ψ(t) with K and ε under Q-ThS. The phase-transition point shifts towards the right as ε decreases. The efficiency of learning decreases as the size of the system increases.

6 Simulation Results

In this section we present simulation results for various queueing bandit systems with K servers. These results corroborate our theoretical analysis in Sections 4 and 5. In particular, a phase transition from unstable to stable behavior can be observed in all our simulations, as predicted by our analysis. In the remainder of the section we demonstrate the performance of Algorithm 1 under variations of system parameters such as the traffic (ε), the gap between the optimal and suboptimal servers (Δ), and the size of the system (K). We also compare the performance of our algorithm with versions of UCB-1 [4] and Thompson Sampling [20] without structured exploration (Figure 3 in the appendix).

Variation with ε and K. In Figure 2 we see the evolution of Ψ(t) in systems of size 5 and 7.
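Experiments of this kind can be reproduced with a short simulation. The sketch below is a minimal, illustrative version of the setup, not the paper's exact protocol: it assumes a single queue with coupled Bernoulli arrivals at rate λ = µ* − ε, one of K Bernoulli servers scheduled per slot via Thompson sampling with Beta posteriors, structured exploration invoked with probability min(1, 3K log²t/t) as described above, and (a simplification) that the chosen server's outcome is observed every slot. The genie system always schedules the best server.

```python
import math
import random

def queue_regret_curve(mu, eps, horizon, runs=50, seed=0):
    """Average queue-regret Psi(t): queue length under a Q-ThS-style
    scheduler minus queue length under a genie that knows the best
    server. Arrivals are shared between the two systems; service
    draws are independent."""
    rng = random.Random(seed)
    K, best = len(mu), max(mu)
    lam = best - eps                 # heavily loaded as eps -> 0
    regret = [0.0] * horizon
    for _ in range(runs):
        succ = [1] * K               # Beta(1, 1) posterior per server
        fail = [1] * K
        q_alg = q_gen = 0
        for t in range(1, horizon + 1):
            arrival = 1 if rng.random() < lam else 0
            q_alg += arrival
            q_gen += arrival
            # structured exploration w.p. min(1, 3K log^2 t / t)
            if rng.random() < min(1.0, 3 * K * math.log(t + 1) ** 2 / (t + 1)):
                k = rng.randrange(K)
            else:                    # plain Thompson sampling step
                draws = [rng.betavariate(succ[i], fail[i]) for i in range(K)]
                k = draws.index(max(draws))
            served = rng.random() < mu[k]
            if served:
                succ[k] += 1
                if q_alg > 0:
                    q_alg -= 1
            else:
                fail[k] += 1
            if q_gen > 0 and rng.random() < best:
                q_gen -= 1
            regret[t - 1] += q_alg - q_gen
    return [r / runs for r in regret]
```

Plotting the returned curve for several values of ε should qualitatively reproduce the shape in Figure 2: an initial rise followed by a phase transition into decay, with the transition point moving right as ε decreases.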
It can be observed that the regret decays faster in the smaller system, as predicted by Theorem 2 in the late stage and Corollary 5 in the early stage. The performance of the system under different traffic settings can also be observed in Figure 2: the regret of the queueing system grows as ε decreases, in agreement with our analytical results (Corollaries 3 and 5). In Figure 2 we can also observe that the time at which the phase transition occurs shifts towards the right with decreasing ε, as predicted by Corollaries 3 and 5.

7 Discussion and Conclusion

This paper provides the first regret analysis of the queueing bandit problem, including a characterization of regret in both the early and late stages, together with an analysis of the switching time, and an algorithm (Q-ThS) that is asymptotically optimal (to within poly-logarithmic factors) and essentially exhibits the correct switching behavior between the early and late stages. There remain substantial open directions for future work.

First, is there a single algorithm that gives optimal performance in both the early and late stages, as well as the optimal switching time between them? The price paid for structured exploration by Q-ThS is an inflation of regret in the early stage. An important open question is to find a single, adaptive algorithm that gives good performance over all time. As we note in the appendix, classic (unstructured) Thompson sampling is an intriguing candidate from this perspective.

Second, the most significant technical hurdle in finding a single optimal algorithm is the difficulty of establishing concentration results for the number of suboptimal arm pulls within a regenerative cycle whose length is dependent on the bandit strategy.
Such concentration results would be needed in two different limits: first, as the start time of the regenerative cycle approaches infinity (for the asymptotic analysis of late-stage regret); and second, as the load of the system increases (for the analysis of early-stage regret in the heavily loaded regime). Any progress on the open directions described above would likely require substantial progress on these technical questions as well.

Acknowledgement: This work is partially supported by NSF Grants CNS-1161868, CNS-1343383, CNS-1320175, ARO Grants W911NF-16-1-0377, W911NF-15-1-0227, W911NF-14-1-0387, and the US DoT supported D-STOP Tier 1 University Transportation Center.

References

[1] J. C. Gittins, "Bandit processes and dynamic allocation indices," Journal of the Royal Statistical Society, Series B (Methodological), pp. 148-177, 1979.

[2] A. Mahajan and D. Teneketzis, "Multi-armed bandit problems," in Foundations and Applications of Sensor Management. Springer, 2008, pp. 121-151.

[3] S. Bubeck and N. Cesa-Bianchi, "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1-122, 2012.

[4] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2-3, pp. 235-256, 2002.

[5] A. Garivier and O. Cappé, "The KL-UCB algorithm for bounded stochastic bandits and beyond," arXiv preprint arXiv:1102.2490, 2011.

[6] S. Agrawal and N. Goyal, "Analysis of Thompson sampling for the multi-armed bandit problem," arXiv preprint arXiv:1111.1797, 2011.

[7] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, vol. 6, no. 1, pp. 4-22, 1985.

[8] J.-Y. Audibert and S.
Bubeck, "Best arm identification in multi-armed bandits," in COLT 2010 - 23rd Conference on Learning Theory, 2010.

[9] W. Whitt, "Heavy traffic limit theorems for queues: a survey," in Mathematical Methods in Queueing Theory. Springer, 1974, pp. 307-350.

[10] H. Kushner, Heavy Traffic Analysis of Controlled Queueing and Communication Networks. Springer Science & Business Media, 2013, vol. 47.

[11] J. Niño-Mora, "Dynamic priority allocation via restless bandit marginal productivity indices," TOP, vol. 15, no. 2, pp. 161-198, 2007.

[12] P. Jacko, "Restless bandits approach to the job scheduling problem and its extensions," Modern Trends in Controlled Stochastic Processes: Theory and Applications, pp. 248-267, 2010.

[13] D. Cox and W. Smith, Queues. Wiley, 1961.

[14] C. Buyukkoc, P. Varaiya, and J. Walrand, "The cµ rule revisited," Advances in Applied Probability, vol. 17, no. 1, pp. 237-238, 1985.

[15] J. A. Van Mieghem, "Dynamic scheduling with convex delay costs: The generalized cµ rule," The Annals of Applied Probability, pp. 809-833, 1995.

[16] J. Niño-Mora, "Marginal productivity index policies for scheduling a multiclass delay-/loss-sensitive queue," Queueing Systems, vol. 54, no. 4, pp. 281-312, 2006.

[17] C. Lott and D. Teneketzis, "On the optimality of an index rule in multichannel allocation for single-hop mobile networks with multiple service classes," Probability in the Engineering and Informational Sciences, vol. 14, pp. 259-297, 2000.

[18] A. Salomon, J.-Y. Audibert, and I. El Alaoui, "Lower bounds and selectivity of weak-consistent policies in stochastic multi-armed bandit problem," The Journal of Machine Learning Research, vol. 14, no. 1, pp. 187-207, 2013.

[19] R. Combes, C. Jiang, and R.
Srikant, "Bandits with budgets: Regret lower bounds and optimal algorithms," in Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM, 2015, pp. 245-257.

[20] W. R. Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, pp. 285-294, 1933.

[21] S. Bubeck, V. Perchet, and P. Rigollet, "Bounded regret in stochastic multi-armed bandits," arXiv preprint arXiv:1302.1611, 2013.

[22] V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg, "Batched bandit problems," arXiv preprint arXiv:1505.00369, 2015.

[23] A. B. Tsybakov, Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

[24] O. Chapelle and L. Li, "An empirical evaluation of Thompson sampling," in Advances in Neural Information Processing Systems, 2011, pp. 2249-2257.

[25] S. L. Scott, "A modern Bayesian look at the multi-armed bandit," Applied Stochastic Models in Business and Industry, vol. 26, no. 6, pp. 639-658, 2010.

[26] E. Kaufmann, N. Korda, and R. Munos, "Thompson sampling: An asymptotically optimal finite-time analysis," in Algorithmic Learning Theory. Springer, 2012, pp. 199-213.

[27] D. Russo and B. Van Roy, "Learning to optimize via posterior sampling," Mathematics of Operations Research, vol. 39, no. 4, pp. 1221-1243, 2014.