{"title": "Blocking Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 4784, "page_last": 4793, "abstract": "We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g. making the same product recommendation repeatedly) or infeasible (e.g. compute job scheduling on machines). We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless randomized exponential time hypothesis is false, by mapping to the PINWHEEL scheduling problem. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1-1/e)$ optimal. When the rewards are unknown, we design a UCB based algorithm which is shown to have $c \\log T + o(\\log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms due to the unavailability. Finally, when all the delays are equal the problem reduces to Combinatorial Semi-bandits providing us with a lower bound of $c' \\log T+ \\omega(\\log T)$.", "full_text": "Blocking Bandits\n\nSoumya Basu\n\nUT Austin\n\nRajat Sen\nAmazon\n\nSujay Sanghavi\n\nUT Austin, Amazon\n\nSanjay Shakkottai\n\nUT Austin\n\nAbstract\n\nWe consider a novel stochastic multi-armed bandit setting, where playing an arm\nmakes it unavailable for a \ufb01xed number of time slots thereafter. This models\nsituations where reusing an arm too often is undesirable (e.g. making the same\nproduct recommendation repeatedly) or infeasible (e.g. compute job scheduling\non machines). 
We show that with prior knowledge of the rewards and delays of\nall the arms, the problem of optimizing cumulative reward does not admit any\npseudo-polynomial time algorithm (in the number of arms) unless the randomized\nexponential time hypothesis is false, by mapping to the PINWHEEL scheduling\nproblem. Subsequently, we show that a simple greedy algorithm that plays the\navailable arm with the highest reward is asymptotically $(1-1/e)$ optimal. When\nthe rewards are unknown, we design a UCB based algorithm which is shown to\nhave $c \\log T + o(\\log T)$ cumulative regret against the greedy algorithm, leveraging\nthe free exploration of arms due to the unavailability. Finally, when all the delays\nare equal the problem reduces to Combinatorial Semi-bandits, providing us with a\nlower bound of $c' \\log T + \\omega(\\log T)$.\n\n1 Introduction\n\nWe propose Blocking Bandits, a novel stochastic multi-armed bandit (MAB) problem where there are\nmultiple arms with i.i.d. stochastic rewards and, additionally, each arm is blocked for a deterministic\nnumber of rounds after it is played. In online systems, such blocking constraints arise naturally when repeating an\naction within a time frame may be detrimental, or even infeasible. In data processing systems, a\nresource (e.g. a compute node, a GPU) may become unavailable for a certain amount of time when a\njob is allocated to it. The detrimental effect is evident in recommendation systems, where incessant\nrecommendations of a product (e.g. book, movie or song) are highly unlikely to attract an individual\nto it. 
A resting time between recommendations of identical products can\nbe effective as it maintains diversity.\nSurprisingly, this simple yet powerful extension of the stochastic MAB problem remains unexplored\ndespite the plethora of research in the bandits literature [7, 1, 4, 8, 10] since its onset in [25].\nGiven the extensive research in this field, it is no surprise that there are multiple existing ways to\nmodel this phenomenon. However, as we discuss such connections next, we observe that none of\nthese approaches is direct, resulting in either large regret bounds or huge time complexity or both.\nWe briefly present the problem. There are $K$ arms, where $\\mu_i$ is the mean reward and $D_i$ is the\ndelay of arm $i$, for each $i = 1$ to $K$. When arm $i$ is played it is blocked for $(D_i - 1)$ time slots and\nbecomes available on the $D_i$-th time slot after its most recent play. The objective is to collect the\nmaximum reward in a given time horizon $T$.\nIllustrative Example: Consider three arms: arm 1 with delay 1 and mean reward 1/2, arm 2 with\ndelay 4 and mean reward 1, and arm 3 with delay 4 and mean reward 1. The reward maximization\nobjective is met when the arms are played cyclically as 31213121 . . . . There are two observations:\nFirst, due to blocking constraints we are forced to play multiple arms over time. Second, we note that\nthe order in which arms are played is crucial. To illustrate, an alternate schedule 321-321- . . .\n(\u2018-\u2019 represents no arm is played) results in strictly less reward compared to the previous one as every\nfourth time slot no arm is available.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Main Contributions\n\nWe now present the main contributions of this paper.\n1. 
Formulation: We formulate the Blocking Bandits problem where each time an arm is played, it is\nblocked for a deterministic amount of time, which provides an abstraction for applications such as\nrecommendations or job scheduling.\n2. Computational Hardness: We prove that when the rewards and the delays are known, the\nproblem of choosing a sequence of available arms to optimize the reward over a time horizon $T$\nis computationally hard (see Theorem 3.1). Specifically, we prove the offline optimization is as\nhard as PINWHEEL SCHEDULING on dense instances [18, 12, 20, 3], which does not permit any\npseudo-polynomial time algorithm (in the number of arms) unless the randomized exponential time\nhypothesis [5] is false.\n3. Approximation Algorithm: On the positive side, we prove that the Oracle Greedy algorithm,\nwhich knows the mean rewards of the arms and simply plays the available arm with the highest mean\nreward, is $(1 - 1/e - O(1/T))$-optimal (see Theorem 3.3). The approximation guarantee does not\nfollow from standard techniques (e.g. sub-modular optimization bounds); instead it is proved by\nrelating a novel lower bound of the Oracle Greedy algorithm to the LP relaxation based upper bound\non MAXREWARD.\n4. Regret Upper Bound for UCB Greedy: We propose the natural UCB Greedy algorithm which\nplays the available arm with the highest upper confidence bound. We provide regret upper bounds for\nUCB Greedy as compared to Oracle Greedy in Theorem 4.1.\nOur proof technique is novel in two ways.\n(i) In each time slot, the Oracle Greedy and the UCB Greedy algorithms have different sets of available\narms (sample-path wise), as the set of available arms is correlated with the past decisions. 
We\nconstruct a coupling between the Oracle Greedy and the UCB Greedy algorithm, which enables\nus to capture the effect of learning error in UCB Greedy locally in time for each arm, despite the\ncorrelation with past decisions.\n(ii) We prove that due to the blocking constraint, there is free exploration in the UCB Greedy algorithm.\nAs the UCB Greedy algorithm plays the current best arm, it gets blocked, enforcing the play of the next\nsuboptimal arm, a phenomenon we call free exploration. Free exploration ensures that up to a time\nhorizon $t$, a certain number of arms, namely $K^*$ (defined below), are played $ct$ times each,\nfor some $c > 0$, w.h.p. Suppose the $\\mu_i$ are non-increasing with $i$. Let $K^* = \\min\\{i : \\sum_{j=1}^{i} 1/D_j \\ge 1\\}$, and\n$\\Delta(u, l) = \\min\\{\\mu_i - \\mu_{i+1} : l \\le i < u\\}$. Then the regret is upper bounded by $O\\left(\\frac{K(K-K^*)}{\\Delta(K,K^*)} \\log T\\right)$.\nIgnoring free exploration the regret bound is $O\\left(\\frac{K^2}{\\Delta(K,1)} \\log T\\right)$, where $\\Delta(K, 1) \\le \\Delta(K, K^*)$.\n5. Regret Lower Bound: We provide regret lower bounds for instances where the Oracle Greedy\nalgorithm is optimal, and the regret is accumulated only due to learning errors. We consider the\ninstances where all the delays are equal to $K^* < K$. We show that under this setting the Oracle\nGreedy algorithm is optimal and the feedback structure of any online algorithm coincides with the\ncombinatorial semi-bandit feedback [17, 13]. We show that for specific instances the regret admits a\nlower bound $\\Omega\\left(\\frac{K-K^*}{\\Delta(K,K^*)} \\log T\\right)$ in Theorem 4.3.\n\n1.2 Connections to Existing Bandit Frameworks\n\nWe now briefly review related work in bandits, highlighting its shortcomings in solving the\nstochastic blocking bandits problem.\n1. Combinatorial Semi-bandits: The blocking bandit problem is combinatorial in nature as the\ndecision to play one arm changes the set of available arms in the future. 
Instead of viewing this\nproblem on a per-time-slot basis, we can group a large block of time slots together to determine a\nschedule of arm pulls and repeat this schedule, thus giving us an asymptotically optimal policy. We can\nnow use ideas from stochastic combinatorial semi-bandits [13, 24] to learn the rewards by observing\nall the rewards attained in each block. This approach, however, has two shortcomings. First, we might\nneed to consider extremely large blocks of time, specifically of size $O(\\exp(\\mathrm{lcm}(D_i : i \\in [K]) \\log K))$\n(lcm stands for the least common multiple), as an optimal policy may have periodic cycles of that\nlength. This will require a large computational time as in the online algorithm the schedule will\nchange depending on the reward estimates. Second, as the set of actions with large blocks is huge,\nthe regret guarantees of such an approach may scale as $O(\\exp(\\mathrm{lcm}(D_i : i \\in [K]) \\log K) \\log T)$.\n2. Budgeted Combinatorial Bandits: There are extensions to the above combinatorial semi-bandit\nsetting where additional global budget constraints are imposed, such as Knapsack constraints [26],\nwhere an arm can only be played for a pre-specified number of times, and Budget constraints [28],\nwhere each play of an arm has an associated cost and the total expenditure has a budget. However,\nthese settings cannot handle blocking constraints that are local (per arm) in nature. Further, in [19] the authors\nconsider adaptive adversaries, which can model our problem. But their approach will lead to an\napproximation guarantee of $O(1/\\log(T))$ over $T$ time slots.\nAn interesting recent work, Recharging Bandits [22], studies a system where the reward of each arm\nis a concave and weakly increasing function of the time since the arm was last played (i.e. a recharging\ntime). However, the results therein do not apply as we focus on hard blocking constraints. 
Another\nwork on bandits with delay-dependent payoffs [6] is not applicable as the results therein give no\napproximation guarantee for our setting.\n3. Sleeping Bandits: Yet another bandit setting where the set of available actions changes across time\nslots is Sleeping Bandits [23]. In this setting, the available action set is the same for all the competing\npolicies including the optimal one in each time slot. However, in our scenario the set of available\nactions in a particular time slot depends on the actions taken in the past time slots. Therefore,\ndifferent policies may have different available actions in each time slot. This precludes the application\nof ideas presented in Sleeping Bandits, and in sleeping combinatorial bandits [21], to our problem.\n4. Online Markov Decision Processes: Finally, we can view this as a general Markov decision\nprocess on the state space $S = [D_1] \\times [D_2] \\times \\dots \\times [D_K]$, and the action space of arms $A = [K]$, with\nmean reward $\\mu_i$ for action $i$. The state space is again exponential in $K$, leading to a huge computational\nbottleneck ($O(\\exp(K))$) and regret ($O(\\mathrm{poly}(|S|) \\log T)$) for standard approaches in online Markov\ndecision processes [2, 27, 15].\n\n2 Problem Definition\n\nWe consider a multi-armed bandit problem with blocking of arms. We have $K$ arms. For each $i \\in [K]$,\nthe $i$-th arm provides a reward $X_i(t)$ in time slot $t \\ge 1$, where $X_i(t)$ are i.i.d. random variables\nwith mean $\\mu_i$ and support $[0, 1]$. Let us order the arms from highest to lowest mean reward w.l.o.g., s.t.\n$\\mu_1 \\ge \\mu_2 \\ge \\dots \\ge \\mu_K$. Let $\\Delta_{ij} = \\mu_i - \\mu_j$ for all $1 \\le i < j \\le K$.\nBlocking: For all $i \\in [K]$, each arm $i$ is deterministically blocked for $(D_i - 1) \\ge 0$ time\nslots once it is played. The actions of a player now decide the set of available arms due to blocking.\nIn the $t$-th time slot, let us denote the set of available arms as $A_t$ and the arm pulled by the player as\n$I_t \\in A_t$. 
For each $i \\in [K]$ and $t \\ge 1$, let the number of time slots, after and including $t$, for which arm $i$ is\nblocked be $\\tau_{i,t} = (D_i + \\max_{t' \\le t}\\{I_{t'} = i\\} - t)$. The set of available arms at each time $t$ is given\nas $A_t := A_t(i_1, \\dots, i_{t-1}) = \\{i : i \\in [K], \\tau_{i,t} \\le 0\\}$. For a fixed time horizon $T \\ge 1$, the set of all\nvalid actions is given as $\\mathcal{I}_T = \\{i_t \\in A_t(i_1, \\dots, i_{t-1}) : t \\in [T]\\}$.\nOptimization: Our objective is to attain the maximum expected cumulative reward. The expected\ncumulative reward of a policy $I_T \\in \\mathcal{I}_T$ is given as $r(I_T) = E[\\sum_{t=1}^{T} X_{i_t}(t)] = \\sum_{i_t \\in I_T} \\mu_{i_t}$. The\noffline optimization problem, with the knowledge of delays and mean rewards, is stated below.\n\nMAXREWARD: Solve $OPT = \\max_{I' \\in \\mathcal{I}_T} r(I')$.\n\n$\\alpha$-Regret: We now define the $\\alpha$-regret of a policy, which is identical to the $(\\alpha, 1)$-regret defined in\nthe combinatorial bandits literature [9]. For any $\\alpha \\in [0, 1]$, the $\\alpha$-regret of a policy is the difference\nbetween the expected cumulative reward of an $\\alpha$-optimal policy and the expected cumulative reward of that\npolicy, $R^{\\alpha}_T = \\alpha \\cdot OPT - E[\\sum_{t=1}^{T} X_{i_t}(t)]$.\n\n3 Scheduling with Known Rewards\n\n3.1 Hardness of MAXREWARD\n\nThe offline problem is a periodic scheduling problem with the objective of reward maximization.\nIn this section, we first prove (Corollary 3.2) that the offline problem does not admit any\npseudo-polynomial time algorithm in the number of arms, unless the randomized exponential time hypothesis\nis false. We show hardness of the MAXREWARD problem by mapping it to the PINWHEEL\nSCHEDULING problem [18] as defined below.\n\nPINWHEEL SCHEDULING: Given $K$ arms with delays $\\{a_i : i \\in [K]\\}$, the PINWHEEL\nSCHEDULING problem is to decide if there exists a schedule (i.e. a mapping $\\Sigma : [T] \\to [K]$ for any\n$T \\ge 1$) such that for each $i \\in [K]$, arm $i$ appears at least once in every $a_i$ consecutive time slots.\nWe call such a schedule, if it exists, a valid schedule. 
A PINWHEEL SCHEDULING instance with\na valid schedule is a YES instance, otherwise it is a NO instance. A PINWHEEL SCHEDULING\ninstance is called dense if $\\sum_{i=1}^{K} 1/a_i = 1$. Note that this problem is also known as the Single\nMachine Windows Scheduling Problem with Inexact Periods [20].\nTheorem 3.1. MAXREWARD is at least as hard as PINWHEEL SCHEDULING on dense instances.\n\nIn the proof, which is presented in the supplementary material, we show that given a dense instance\nof PINWHEEL SCHEDULING there is an instance of MAXREWARD where the optimal value is\nstrictly larger if the dense instance is a YES instance as compared to a NO instance. The following\ncorollary provides hardness of MAXREWARD.\nCorollary 3.2. The problem MAXREWARD does not admit any pseudo-polynomial algorithm unless\nthe randomized exponential time hypothesis is false.\n\nProof. The proof follows from Theorem 3.1 and Theorem 24 in [20]. In [20], the authors show that\nPINWHEEL SCHEDULING on dense instances does not admit any pseudo-polynomial algorithm\nunless the randomized exponential time hypothesis [5] is false.\n\n3.2 $(1-1/e)$-Approximation of MAXREWARD\n\nWe study the Oracle Greedy algorithm where in each time slot the policy picks the best arm (i.e. the\narm with highest mean reward $\\mu_i$) in the set of available arms. We show in Theorem 3.3 that the\ngreedy algorithm is $(1 - 1/e - O(1/T))$ optimal (footnote 1) for the problem for any time horizon $T$ and any\nnumber of arms $K$.\nTheorem 3.3. The greedy algorithm is asymptotically $(1-1/e)$ optimal for MAXREWARD.\nProof Sketch: The proof is presented in the supplementary material. It relies on three steps. First,\nwe show that using a linear program (LP) relaxation it is possible to obtain an upper bound on OPT\nin closed form as a function $f_{upper}(T, \\mu_i, D_i, \\forall i)$ of $\\mu_i, D_i$ for all $i \\in [K]$. 
In the next step, we show\nthat the reward of the Greedy algorithm can be lower bounded by another function $f_{lower}(T, \\mu_i, D_i, \\forall i)$ of $\\mu_i, D_i$\nfor all $i \\in [K]$. The final step is to lower bound the ratio\n$\\min_{\\mu_i \\in [0,1], D_i \\ge 1, \\forall i} \\frac{f_{lower}(T, \\mu_i, D_i, \\forall i)}{f_{upper}(T, \\mu_i, D_i, \\forall i)}$. Our\napproach for the final step is to break this non-convex optimization into two steps: first, an optimization\nover the $\\mu_i$, which takes the form of a linear fractional program with a closed form lower bound as a\nfunction of $D_i, \\forall i$. Second, we show that this value can be further lower bounded universally\nacross all $D_i \\ge 1, \\forall i$, as $(1 - 1/e - O(1/T))$.\n\n3.3 Optimality Gap\n\nWe now show that greedy is suboptimal by constructing instances where greedy attains a cumulative\nreward of $(3/4 + \\delta)$ times the optimal reward, for any $\\delta > 0$. Finally, the greedy algorithm that plays\nthe available arm with maximum $\\mu_i/D_i$ is shown to attain $1/K$ times the optimal reward in certain\ninstances. We call this algorithm greedy-per-round.\n\nFootnote 1: An algorithm is $\\alpha$ optimal for the offline problem if the expected cumulative reward is $\\alpha$ times the optimal\nexpected cumulative reward.\n\nProposition 3.4. For any $\\epsilon > 0$, there exists an instance with 4 arms where the greedy algorithm\nachieves a $\\frac{3-\\epsilon}{4-2\\epsilon}$ fraction of the optimal reward.\n\nProof. Consider the instance where arms 1 and 2 have reward 1 and delay 3, arm 3 has reward $1-\\epsilon$ and\ndelay 1, and arm 4 has reward 0 and delay 0. Also, each arm has only one copy. For any time horizon\n$T$ which is a multiple of 4, the greedy algorithm has the repeated schedule \u20181, 2, 3, 4, 1, 2, 3, 4, . . . \u2019.\nTherefore, the reward for greedy is $(3 - \\epsilon)T/4$, whereas the optimal reward of $(4 - 2\\epsilon)T/4$ is\nattained by the schedule \u20181, 3, 2, 3, 1, 3, 2, 3, . . . \u2019. Therefore, greedy achieves a reward $\\frac{3-\\epsilon}{4-2\\epsilon}$ times\nthe optimal.\nProposition 3.5. 
For any $\\epsilon > 0$, there exists an instance with $K$ arms where the greedy-per-round\nalgorithm achieves a $\\frac{1+\\epsilon}{K-2}$ fraction of the optimal reward.\n\nProof. Consider the instance where the arms 1 to $(K-1)$ each have reward 1 and delay $(K-2)$. The\n$K$-th arm has reward $(1 + \\epsilon)/(K-2)$ for $\\epsilon > 0$ and delay 0. The greedy-per-round algorithm will always\nplay the $K$-th arm, as $(1 + \\epsilon)/(K-2) > 1/(K-2)$, attaining a reward of $(1 + \\epsilon)T/(K-2)$\nin $T$ time slots, whereas the optimal algorithm will play the arms $1, 2, \\dots, (K-1)$ in a round\nrobin manner, attaining a reward of $T$ in $T$ time slots. Therefore, greedy-per-round can only attain\na $(1 + \\epsilon)/(K-2)$ fraction of the optimal reward.\n\n4 Greedy Scheduling with Unknown Rewards\n\n4.1 UCB Greedy Algorithm\nIn this section, we present the Upper Confidence Bound Greedy algorithm that operates without the\nknowledge of the mean rewards and the delays. The algorithm maintains the upper confidence bound\nfor the mean reward of each arm, and in each time slot plays the available arm with the highest upper\nconfidence bound, $\\left(\\hat{\\mu}_i + \\sqrt{\\frac{8 \\log t}{n_i}}\\right)$, where for arm $i$, $\\hat{\\mu}_i$ is the estimate of the mean reward and $n_i$\nis the total number of times arm $i$ has been played (footnote 2).\n\nAlgorithm 1 Upper Confidence Bound Greedy\n1: Initialize: Mean estimate $\\hat{\\mu}_i = 0$ and count $n_i = 0$, for all $i \\in [K]$\n2: for all $t = 1$ to $T$ do\n3:   Play arm $i_t = t$ if $t \\le K$; otherwise $i_t = \\arg\\max_{i \\in A_t} \\left(\\hat{\\mu}_i + \\sqrt{\\frac{8 \\log t}{n_i}}\\right)$\n4:   if $i_t \\ne \\emptyset$ then\n5:     $n_{i_t} \\leftarrow n_{i_t} + 1$\n6:     $\\hat{\\mu}_{i_t} \\leftarrow \\left(1 - \\frac{1}{n_{i_t}}\\right) \\hat{\\mu}_{i_t} + \\frac{1}{n_{i_t}} X_{i_t}(t)$\n\n4.2 Analysis of UCB Greedy\nWe now provide an upper bound on the regret of the UCB Greedy algorithm as compared to the Oracle\nGreedy algorithm that uses the knowledge of the rewards. Let us recall that the rewards are sorted\n(i.e. $\\mu_i$ is non-increasing with $i$).\nQuantities used in Regret Bound. 
Kg is the worst arm with mean reward strictly greater than 0\n\nplayed by the Oracle Greedy algorithm. Let H(m) =P1n=1 1/nm, m > 1 (Reimann zeta function).\nWe de\ufb01ne K\u21e4\u270f = min(K [{ k :P(k1)\nFor each 1 \uf8ff k < k0 \uf8ff K, let (k, k0) := min{\u00b5i \u00b5j : i \uf8ff k, j k0 + 1, i < j}.\nj(j+1)\u25c6. We note\nFurther for all i = 1 to Kg, and j = (i + 1) to K\u21e4\u270f , we de\ufb01ne cij =\u2713 Dj\n\n1/Dk 1 \u270f}) for any \u270f 0; and K\u21e4 := K\u21e40 .\n\n+ K\n\n2\nij\n\ni=1\n\n2\n\n8\u270f 0, (1 \u270f) min\n\ni\n\nDi \uf8ff K\u21e4\u270f \uf8ff min(K, (1 \u270f) max\n\ni\n\nDi + 1), Dmin \uf8ff Kg \uf8ff min(K, Dmax).\n\n2We believe with some increased complexity in the proof, the constant 8 in UCB can be improved to 2.\n\n5\n\n\fTheorem 4.1. The (1 1/e)-Regret of UCB Greedy for a time horizon T is upper bounded, for any\n\u270f> 0, as\nKgXi=1\n\nK\u21e4\u270fXj=(i+1)\n\nKXj=1+\n\n\u270f 1A +\n\u270f log cij\n\n0@2H(4) \u00b5i\u00b5K\n\nKgXi=1\n\n+ H(3)K\n\nmax(i,K\u21e4\u270f )\n\n\u00b5i\u00b5K\u21e4\u270f\n\nij\nDi\n\n32 log t\n\nD4\ni\n\nD3\ni\n\nij\n\n+\n\ncij\n\n.\n\nSimpli\ufb01ed Regret Bound. The regret admits the simpli\ufb01ed upper bound for any \u270f> 0,\n\nlog\u2713 1\n\n\u270f\u25c6\u25c6 +\n\n32Kg(K K\u21e4\u270f )\n\nmini2[K\u21e4\u270f ,...,Kg] i,i+1\n\nlog(T ).\n\nT\n\nR(11/e)\n\n\uf8ffO \u2713 1\ni=1PK\nupper bound the regret asPKg\n\n\u270f\n\nRole of Free Exploration in Regret Bound. Ignoring the free exploration in the system, we can\n. Therefore, by capturing\nthe free exploration, we are able to signi\ufb01cantly improve the regret bound of the UCB Greedy\nalgorithm when min\ni 0\n\n; at = j\u23181 2\n\nt4 ,8i, j 2 [K],\n\n(2)\n\n(3)\n\n(4)\n\n8t 2T i,\n8j \uf8ff K\u21e4\u270f ,\n\nThe \ufb01rst constraint is standard, whereas the second constraint represent the free exploration in the\nsystem. 
If any arm $i$ is played $n_i(t)$ times up to time $t$ then it is available for $(t - n_i(t) D_i)$ time slots.\nAmong these time slots where arm $i$ is available, UCBG can play\n1) the arms $1 \\le j \\le (i-1)$ at most $\\sum_{j=1}^{i-1} \\left(\\frac{t}{D_j} + 1\\right)$ times in total, w.p. 1, due to the blocking constraints;\nand\n2) the arms $(i+1) \\le j \\le K$ at most $\\sum_{j=i+1}^{K} \\frac{32 \\log t}{\\Delta_{ij}^2}$ times in total, w.p. at least\n$(1 - K/t^3)$, due to the UCB property and a union bound over all arms and time slots up to $t$.\nTherefore, for all $i \\le K$ we have, w.p. at least $(1 - K/t^3)$,\n\n$n_i(t) \\ge \\frac{t}{D_i}\\left(1 - \\sum_{j=1}^{i-1} \\frac{1}{D_j}\\right) - \\frac{1}{D_i}\\left(\\sum_{j=i+1}^{K} \\frac{32 \\log t}{\\Delta_{ij}^2} + (i-1)\\right)$. (5)\n\nMore importantly, w.h.p. for all $i \\le K^*_\\epsilon$ we see that $n_i(t)$ grows linearly with time $t$. This provides us with\nthe required upper bound after using the lower bounds for $n_j(t)$ for $j = 1$ to $K^*_\\epsilon$, appropriately.\n\n4.3 Easy Instances and Regret Lower Bound\nIn this section, we show that there is a class of instances where Oracle Greedy is optimal and provide\nregret lower bounds for such a setting.\nDefinition 4.2. An instance of the blocking bandit is an easy instance if the Oracle Greedy is an\noffline optimal algorithm for that instance.\n\nExamples: 1) A class of examples of such easy instances is blocking bandits where all the arms have\nequal delay $D < K$.\n2) When the sequences $seq_i := \\{i + kD_i : k \\in \\mathbb{N}\\}$ for $i = 1$ to $K_g$ do not collide in any location\n($seq_i \\cap seq_j = \\emptyset, \\forall i \\ne j$) and cover the integers ($\\forall T \\ge 1$, $[T] \\subseteq \\cup_{i=1}^{K_g} seq_i$), a.k.a. exact covering\nsystems [16], then Oracle Greedy is asymptotically optimal.\n\nLower Bound: We now provide a lower bound on the regret for easy instances. An algorithm is\nconsistent iff for any instance of stochastic blocking bandit, the regret up to time $T$ satisfies $R^1_T = o(T^{\\delta})$ for\nall $\\delta > 0$. 
We prove the regret lower bound over the class of consistent algorithms for easy instances\nof stochastic blocking bandits.\nWe consider an instance with equal delay $D < K$, which is an easy instance. In this instance,\nthe reward of each arm $i = 1$ to $K^*$ has a Bernoulli distribution with mean 1/2, whereas arms\n$i = (K^* + 1)$ to $K$ have mean reward $(1/2 - \\Delta)$. We call this instance $K^*$-Set and prove the following.\nTheorem 4.3. For any $K$ and $K^* < K$ and $\\Delta \\in (0, 1/2)$ the regret of any consistent algorithm on\nthe $K^*$-Set instance is lower bounded as $\\lim_{T \\to \\infty} \\frac{R^1_T}{\\log T} \\ge \\frac{K - K^*}{\\Delta}$.\n\nThe proof of the above theorem makes use of the following lemma which shows that the blocking\nbandit instance is equivalent to a combinatorial semi-bandit problem [11] on m-sets, for\nwhich regret lower bounds were established in [1].\nLemma 4.4. For any Blocking Bandit instance where $D_i = D \\le K$ for all arms $i \\in [K]$, time\nhorizon $T$, and any online algorithm $A_O$, there exists an online algorithm $A_B$ which chooses arms\nfor blocks of $D$ time slots and obtains the same distribution of the cumulative reward as $A_O$.\nThe proof of the lemma is deferred to the supplementary material.\n\n5 Experimental Evaluation\n\nSynthetic Experiments: We first validate our results on synthetic experiments, where we use\n$K = 20$ arms. The gaps in mean rewards of the arms are fixed with $\\Delta_{i(i+1)}$ chosen uniformly at\nrandom (u.a.r.) from [0.01, 0.05] for all $i = 1$ to 19. We also fix $\\mu_K = 0$. The rewards are distributed\nas Bernoulli random variables with mean $\\mu_i$. The delays are fixed either 1) by sampling all delays\nu.a.r. 
from [11, 20] (large delay instances), or 3) by\nfixing all the delays to a single value.\n\n[Figure 1 panels, each plotting Regret against Time: (a) $K^* = 6$, $K_g = 9$. (b) $K^* = 20$, $K_g = 20$. (c) $K^* = 6$, $K_g = 8$. (d) Regret vs $K^*$.]\n\nFigure 1: Cumulative regrets scale as logarithmic, constant, and negative linear regret with randomly\ninitialized delays, in Fig. 1a, Fig. 1b, and Fig. 1c, resp. Fig. 1d: Regret vs $K^*$ with identical delays.\n\nOnce the rewards and the delays are fixed, we run both the Oracle Greedy and the UCB Greedy\nalgorithm 250 times to obtain the expected regret (i.e. Reward of Oracle Greedy - Reward of UCB\nGreedy) trajectory, each with 10k time slots. For each setting, we repeat this process 50 times\nto obtain 50 such trajectories. We then plot the median, 75% and 25% points in each\ntime slot across all these 50 trajectories in Figure 1.\nScaling with Time: We observe three different behaviors. In most of the cases, we observe that the regret\nscales logarithmically with $T$ (see Fig. 1a). In the second situation, when $K^* = K_g$, the typical\nbehavior is depicted in Fig. 1b where we observe constant regret (for $K^* = K$ the logarithmic part\nvanishes in our regret bounds). Finally, there are instances, as shown in Fig. 1c, where the regret is\nnegative and scales linearly with time. Note that as the Oracle Greedy is suboptimal, UCB Greedy can\npotentially outperform it and have negative regret. As an example consider the illustrative example in\nSection 1. In this example, if due to learning error the UCB Greedy plays the sequence \u2018121\u2019 then\nthe UCB Greedy gets latched to the sequence \u201812131213 . . . \u2019, which is optimal. 
Such events can\nhappen with constant probability, resulting in a reward linearly larger than the Oracle Greedy which\nplays \u2018321-321- . . . \u2019. This example explains the instances with linear negative regret.\nScaling with $K^*$: In Fig. 1d (where only the median is plotted) we consider the instances with\nidentical delay equal to $K^* = 7, 11, 16, 20$. We observe that the regret decreases with increasing $K^*$,\nwhich is similar to the proved lower bound.\nJokes Recommendation Experiment: We perform a jokes recommendation\nexperiment using the Jester joke dataset [14]. In particular, we consider\n70 jokes from the dataset, each joke with at least 15k valid ratings in the\nrange $[-10, 10]$. We rescale the ratings to $[0, 1]$ using $x \\to (x + 10)/20$.\nIn our experiments, when a specific joke is recommended a rating out of\nthe more than 15k ratings is selected uniformly at random with repetition\nand this rating acts as the instantaneous reward. The task is to recommend\njokes to maximize the rating over a time horizon, with blocking\nconstraints for each joke. The delays are chosen randomly similar to\nthe synthetic experiments. For each experiment, we plot the expected\nregret trajectory for 15k time slots, taking expectation over 500 simulated\nsample paths. We observe the expected scaling behavior, where the regret scales logarithmically in\ntime and for larger $K^*$ we observe smaller regret.\n\nFigure 2: Regret vs $K^*$ in jokes recommendation with blocking. [Plot of Regret against Time; legend: No Blocking; K = 10, Kg = 13; K = 18, Kg = 24; K = 50, Kg = 58.]\n\n6 Conclusion\n\nWe propose blocking bandits, a novel stochastic multi-armed bandit problem, where each arm is\nblocked for a specific number of time slots once it is played. 
We provide hardness results and\napproximation guarantees for the offline version of the problem, showing an online greedy algorithm\nprovides a $(1-1/e)$ approximation. We propose UCB Greedy and analyze the regret upper bound\nthrough novel techniques, such as free exploration. For instances on which Oracle Greedy is optimal\nwe provide lower bounds on regret. Improving regret bounds using the knowledge of the delays of\nthe arms is an interesting future direction which we intend to explore. In another direction, providing\nbetter lower bounds through novel constructions (e.g. exact covering systems) can be investigated.\nAcknowledgements: This research was partially supported by NSF Grant 1826320, ARO grant\nW911NF-17-1-0359, the Wireless Networking and Communications Group Industrial Affiliates\nProgram, and the US DoT supported D-STOP Tier 1 University Transportation Center.\n\nReferences\n[1] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient\nallocation rules for the multiarmed bandit problem with multiple plays-part i: Iid rewards. IEEE\nTransactions on Automatic Control, 32(11):968\u2013976, 1987.\n\n[2] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement\nlearning. In Advances in Neural Information Processing Systems, pages 49\u201356, 2007.\n\n[3] Thomas Bosman, Martijn Van Ee, Yang Jiao, Alberto Marchetti-Spaccamela, R Ravi, and Leen\nStougie. Approximation algorithms for replenishment problems with fixed turnover times. In\nLatin American Symposium on Theoretical Informatics, pages 217\u2013230. Springer, 2018.\n\n[4] S\u00e9bastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic\nmulti-armed bandit problems. Machine Learning, 5(1):1\u2013122, 2012.\n\n[5] Chris Calabro, Russell Impagliazzo, Valentine Kabanets, and Ramamohan Paturi. The complexity\nof unique k-sat: An isolation lemma for k-cnfs. 
Journal of Computer and System Sciences,\n74(3):386\u2013393, 2008.\n\n[6] Leonardo Cella and Nicol\u00f2 Cesa-Bianchi. Stochastic bandits with delay-dependent payoffs,\n2019.\n\n[7] Nicol\u00f2 Cesa-Bianchi and Paul Fischer. Finite-time regret bounds for the multiarmed bandit\nproblem. In ICML, pages 100\u2013108. Citeseer, 1998.\n\n[8] Nicolo Cesa-Bianchi and G\u00e1bor Lugosi. Combinatorial bandits. Journal of Computer and\nSystem Sciences, 78(5):1404\u20131422, 2012.\n\n[9] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework\nand applications. In International Conference on Machine Learning, pages 151\u2013159, 2013.\n\n[10] Richard Combes, Chong Jiang, and Rayadurgam Srikant. Bandits with budgets: Regret lower\nbounds and optimal algorithms. In Proceedings of the 2015 ACM SIGMETRICS International\nConference on Measurement and Modeling of Computer Systems, pages 245\u2013257. ACM, 2015.\n\n[11] Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al.\nCombinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages\n2116\u20132124, 2015.\n\n[12] Peter C Fishburn and Jeffrey C Lagarias. Pinwheel scheduling: Achievable densities.\nAlgorithmica, 34(1):14\u201338, 2002.\n\n[13] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with\nunknown variables: Multi-armed bandits with linear rewards and individual observations.\nIEEE/ACM Transactions on Networking (TON), 20(5):1466\u20131478, 2012.\n\n[14] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time\ncollaborative filtering algorithm. Information Retrieval, 4(2):133\u2013151, 2001.\n\n[15] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized markov\ndecision processes. In Conference on Learning Theory, pages 861\u2013898, 2015.\n\n[16] IP Goulden, Andrew Granville, L Bruce Richmond, and Jeffrey Shallit. 
Natural exact covering\nsystems and the reversion of the M\u00f6bius series. The Ramanujan Journal, pages 1\u201325, 2018.\n\n[17] Andr\u00e1s Gy\u00f6rgy, Tam\u00e1s Linder, G\u00e1bor Lugosi, and Gy\u00f6rgy Ottucs\u00e1k. The on-line shortest path\nproblem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369\u20132403,\n2007.\n\n[18] Robert Holte, Al Mok, Louis Rosier, Igor Tulchinsky, and Donald Varvel. The pinwheel: A\nreal-time scheduling problem. In Proceedings of the Twenty-Second Annual Hawaii\nInternational Conference on System Sciences. Volume II: Software Track, volume 2, pages\n693\u2013702. IEEE, 1989.\n\n[19] Nicole Immorlica, Karthik Abinav Sankararaman, Robert Schapire, and Aleksandrs Slivkins.\nAdversarial bandits with knapsacks. In IEEE 60th Annual Symposium on Foundations of\nComputer Science (FOCS). IEEE, 2019.\n\n[20] Tobias Jacobs and Salvatore Longo. A new perspective on the windows scheduling problem.\narXiv preprint arXiv:1410.7237, 2014.\n\n[21] Satyen Kale, Chansoo Lee, and D\u00e1vid P\u00e1l. Hardness of online sleeping combinatorial optimization\nproblems. In Advances in Neural Information Processing Systems, pages 2181\u20132189,\n2016.\n\n[22] Robert Kleinberg and Nicole Immorlica. Recharging bandits. In 2018 IEEE 59th Annual\nSymposium on Foundations of Computer Science (FOCS), pages 309\u2013319. IEEE, 2018.\n\n[23] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for\nsleeping experts and bandits. Machine Learning, 80(2-3):245\u2013272, 2010.\n\n[24] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for\nstochastic combinatorial semi-bandits. arXiv preprint arXiv:1410.0949, 2014.\n\n[25] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances\nin Applied Mathematics, 6(1):4\u201322, 1985.\n\n[26] Karthik Abinav Sankararaman and Aleksandrs Slivkins. 
Combinatorial semi-bandits with\nknapsacks. In International Conference on Artificial Intelligence and Statistics, pages 1760\u20131770,\n2018.\n\n[27] Ambuj Tewari and Peter L Bartlett. Optimistic linear programming gives logarithmic regret for\nirreducible MDPs. In Advances in Neural Information Processing Systems, pages 1505\u20131512,\n2008.\n\n[28] Datong P Zhou and Claire J Tomlin. Budget-constrained multi-armed bandits with multiple\nplays. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.\n", "award": [], "sourceid": 2673, "authors": [{"given_name": "Soumya", "family_name": "Basu", "institution": "University of Texas at Austin"}, {"given_name": "Rajat", "family_name": "Sen", "institution": "Amazon"}, {"given_name": "Sujay", "family_name": "Sanghavi", "institution": "UT-Austin"}, {"given_name": "Sanjay", "family_name": "Shakkottai", "institution": "University of Texas at Austin"}]}