{"title": "Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 9007, "page_last": 9016, "abstract": "Restless bandit problems are instances of non-stationary multi-armed bandits. These problems have been studied well from the optimization perspective, where the goal is to efficiently find a near-optimal policy when system parameters are known. However, very few papers adopt a learning perspective, where the parameters are unknown. In this paper, we analyze the performance of Thompson sampling in episodic restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an $\\tilde{O}(\\sqrt{T})$ Bayesian regret bound. Our competitor is flexible enough to represent various benchmarks including the best fixed action policy, the optimal policy, the Whittle index policy, or the myopic policy. We also present empirical results that support our theoretical findings.", "full_text": "Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems\n\nYoung Hun Jung\nDepartment of Statistics\nUniversity of Michigan\nyhjung@umich.edu\n\nAmbuj Tewari\nDepartment of Statistics\nUniversity of Michigan\ntewaria@umich.edu\n\nAbstract\n\nRestless bandit problems are instances of non-stationary multi-armed bandits. These problems have been studied well from the optimization perspective, where the goal is to efficiently find a near-optimal policy when system parameters are known. However, very few papers adopt a learning perspective, where the parameters are unknown. In this paper, we analyze the performance of Thompson sampling in episodic restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an Õ(√T) Bayesian regret bound. 
Our competitor is flexible enough to represent various benchmarks including the best fixed action policy, the optimal policy, the Whittle index policy, or the myopic policy. We also present empirical results that support our theoretical findings.\n\n1 Introduction\n\nRestless bandits [Whittle, 1988] are variants of multi-armed bandit (MAB) problems [Robbins, 1952]. Unlike the classical MABs, the arms have non-stationary reward distributions. Specifically, we will focus on the class of restless bandits whose arms change their states based on Markov chains. Restless bandits are also distinguished from rested bandits, where only the active arms evolve and the passive arms remain frozen. We will assume that each arm changes according to two different Markov chains depending on whether it is played or not. Because of their extra flexibility in modeling non-stationarity, restless bandits have been applied to practical problems such as dynamic channel access problems [Liu et al., 2011, 2013] and online recommendation systems [Meshram et al., 2017].\n\nDue to the arms' non-stationary nature, playing the same set of arms every round usually does not produce the optimal performance. This makes the optimal policy highly non-trivial, and Papadimitriou and Tsitsiklis [1999] show that it is generally PSPACE hard to identify the optimal policy for restless bandits. As a consequence, many researchers have devoted themselves to finding efficient ways to approximate the optimal policy [Liu and Zhao, 2010, Meshram et al., 2018]. This line of work primarily focuses on the optimization perspective in that the system parameters are already known.\n\nSince the true system parameters are unavailable in many cases, it becomes important to examine restless bandits from a learning perspective. Due to the learner's additional uncertainty, however, analyzing a learning algorithm in restless bandits is significantly more challenging. Liu et al. 
[2011, 2013] and Tekin and Liu [2012] prove O(log T) bounds for confidence bound based algorithms, but their competitor always selects a fixed set of actions, which is known to be weak (see Section 5 for an empirical example of the weakness of the best fixed action competitor). Dai et al. [2011, 2014] show O(log T) bounds against the optimal policy, but their assumptions on the underlying model are very limited. Ortner et al. [2012] prove an Õ(√T) bound in general restless bandits, but their algorithm is intractable in general.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn a different line of work, Osband et al. [2013] study Thompson sampling in the setting of a fully observable Markov decision process (MDP) and show a Bayesian regret bound of Õ(√T) (hiding dependence on system parameters like state and action space size). Unfortunately, this result is not applicable in our setting as ours is partially observable due to bandit feedback. Following Ortner et al. [2012], it is possible to transform our setting to the fully observable case, but then we end up having exponentially many states, which restricts the practical utility of existing results.\n\nIn this work, we analyze Thompson sampling in restless bandits where the system resets at the end of every fixed-length episode and the rewards are binary. We emphasize that this episodic assumption simplifies our analysis as the problem boils down to a finite time horizon problem. This assumption is arguably restrictive, but there are applications such as dynamic channel access problems where the channel provider might reset their system every night for a maintenance-related reason, and the episodic assumption becomes natural. 
We directly tackle the partial observability and achieve a meaningful regret bound, which, when restricted to the classical MABs, matches the Thompson sampling result in that setting. We are not the first to analyze Thompson sampling in restless bandits; Meshram et al. [2016] study this type of algorithm as well, but their regret analysis remains in the one-armed case with a fixed reward for not pulling the arm. They explicitly mention that a regret analysis of Thompson sampling in the multi-armed case is an interesting open question.\n\n2 Problem setting\n\nWe begin by introducing our setting. There are K arms indexed by k = 1, ···, K, and the algorithm selects N arms every round. We denote the learner's action at time t by a binary vector At ∈ {0, 1}^K where ||At||_1 = N. We call the selected arms active and the rest passive. We assume each arm k has binary states, {0, 1}, which evolve as a Markov chain with transition matrix either P^active_k or P^passive_k, depending on whether the learner pulled the arm or not.\n\nAt round t, pulling an arm k incurs a binary reward Xt,k, which is the arm's current state. As we are in the bandit setting, the learner only observes the rewards of active arms, which we denote by Xt,At, and observes neither the passive arms' rewards nor their states. This feature makes our setting a partially observable Markov decision process, or POMDP. We denote the history of the learner's actions and rewards up to time t by Ht = (A1, X1,A1, ···, At, Xt,At).\n\nWe assume the system resets every episode of length L, which is also known to the learner. This means that at the beginning of each episode, the states of the arms are drawn from an initial distribution. 
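As an illustration of this setting, here is a minimal simulator sketch (our own illustration, not the authors' code; the class and variable names are ours): K binary-state arms, each evolving by its active or passive chain, with only the active arms' states revealed as rewards.

```python
import numpy as np

class EpisodicRestlessBandit:
    """K arms with binary states; arm k evolves by p_active[k] or p_passive[k].

    P[k][s, s'] is the probability of moving from state s to state s'.
    Pulling an arm returns its current state as a binary reward.
    """

    def __init__(self, p_active, p_passive, init_dist, rng=None):
        self.p_active = np.asarray(p_active, dtype=float)    # shape (K, 2, 2)
        self.p_passive = np.asarray(p_passive, dtype=float)  # shape (K, 2, 2)
        self.init_dist = np.asarray(init_dist, dtype=float)  # P(state = 1) per arm
        self.K = self.init_dist.shape[0]
        self.rng = rng or np.random.default_rng(0)
        self.reset()

    def reset(self):
        # System reset at the start of an episode: redraw every arm's state.
        self.states = (self.rng.random(self.K) < self.init_dist).astype(int)

    def step(self, action):
        """action: binary vector with ||action||_1 = N; returns only the
        active arms' rewards (bandit feedback)."""
        rewards = {k: int(self.states[k]) for k in range(self.K) if action[k] == 1}
        for k in range(self.K):
            P = self.p_active[k] if action[k] == 1 else self.p_passive[k]
            self.states[k] = self.rng.choice(2, p=P[self.states[k]])
        return rewards
```

With deterministic chains this is easy to sanity-check: a "flip" chain always toggles the state, while an identity chain freezes it.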
The entire time horizon is denoted by T, and for simplicity, we assume it is a multiple of L, so that T = mL.\n\n2.1 Bayesian regret and competitor policy\n\nLet θ ∈ Θ denote the full set of system parameters. It includes the transition matrices P^active and P^passive and the initial distribution of each arm's state. The learner only knows the prior distribution of this parameter at the beginning and does not have access to its exact value.\n\nIn order to define a regret, we need a competitor policy, or a benchmark. We first define a class of deterministic policies and policy mappings.\n\nDefinition 1. A deterministic policy π takes a time index and history (t, Ht−1) as input and outputs a fixed action At = π(t, Ht−1). A deterministic policy mapping µ takes a system parameter θ as input and outputs a deterministic policy π = µ(θ).\n\nWe fix a deterministic policy mapping µ and let our algorithm compete against a deterministic policy π* = µ(θ*), where θ* represents the true system parameter, which is unknown to the learner.\n\nWe keep our competitor policy abstract mainly because we are in the non-stationary setting. Unlike the classical (stationary) MABs, pulling the same set of arms with the largest expected rewards is not necessarily optimal. Moreover, it is in general PSPACE hard to compute the optimal policy even when θ* is given. Regarding these statements, we refer the readers to the book by Gittins et al. [1989]. 
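Definition 1 can be made concrete with a small typed sketch (our own formulation, not from the paper): a deterministic policy is a function of (t, history), and a policy mapping µ turns a parameter θ into such a policy. As a hypothetical example of a mapping, we use the "best fixed action" benchmark that always pulls the N arms with the largest means under θ.

```python
from typing import Callable, Dict, List, Tuple

# A history is a list of (active arms, their observed binary rewards).
History = List[Tuple[List[int], Dict[int, int]]]
# A deterministic policy maps (t, H_{t-1}) to the indices of N active arms.
Policy = Callable[[int, History], List[int]]

def make_best_fixed_action_mu(N: int) -> Callable[[List[float]], Policy]:
    """A hypothetical policy mapping mu: given theta (interpreted here as
    per-arm mean rewards), return the policy that always pulls the N arms
    with the largest means, ignoring the history."""
    def mu(theta: List[float]) -> Policy:
        ranked = sorted(range(len(theta)), key=lambda k: -theta[k])
        fixed_action = sorted(ranked[:N])
        return lambda t, history: fixed_action
    return mu
```

The same interface accommodates history-dependent mappings (myopic, Whittle index), since the returned policy receives H_{t−1} at every round.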
As a consequence, researchers have identified conditions under which the (efficient) myopic policy is optimal [Ahmad et al., 2009] or proven that a tractable index-based policy has a reasonable performance against the optimal policy [Liu and Zhao, 2010].\n\nAlgorithm 1 Thompson sampling in restless bandits\n1: Input prior Q, episode length L, policy mapping µ\n2: Initialize posterior Q1 = Q, history H = ∅\n3: for episodes l = 1, ···, m do\n4: Draw a parameter θl ∼ Ql and compute the policy πl = µ(θl)\n5: Set H0 = ∅\n6: for t = 1, ···, L do\n7: Select N active arms At = πl(t, Ht−1)\n8: Observe rewards Xt,At and update Ht\n9: end for\n10: Append HL to H and update posterior distribution Ql+1 using H\n11: end for\n\nWe observe that most proposed policies, including the optimal policy, the myopic policy, and the index-based policy, are deterministic. Therefore, researchers can plug in whatever competitor policy they choose, and our regret bound will apply as long as the chosen policy mapping is deterministic.\n\nBefore defining the regret, we introduce a value function\n\nV^θ_{π,i}(H) = E_{θ,π}[ Σ_{j=i}^L Aj · Xj | H ].   (1)\n\nThis is the expected reward of running a policy π from round i to L where the system parameter is θ and the starting history is H. Note that the benchmark policy π* obtains V^{θ*}_{π*,1}(∅) rewards per episode in expectation. 
Thus, we can define the regret as\n\nR(T; θ*) = m V^{θ*}_{π*,1}(∅) − E_{θ*} Σ_{t=1}^T At · Xt.   (2)\n\nIf an algorithm chooses to fix a policy πl for the entire episode l, which is the case for our algorithm, then the regret can be written as\n\nR(T; θ*) = m V^{θ*}_{π*,1}(∅) − E_{θ*} Σ_{l=1}^m V^{θ*}_{πl,1}(∅) = E_{θ*} Σ_{l=1}^m [ V^{θ*}_{π*,1}(∅) − V^{θ*}_{πl,1}(∅) ].\n\nWe particularly focus on the case where θ* is random and bound the following Bayesian regret,\n\nBR(T) = E_{θ*∼Q} R(T; θ*),\n\nwhere Q is a prior distribution over the set of system parameters Θ. We assume that the prior is known to the learner. We caution our readers that there is at least one other regret definition in the literature, which is called either frequentist regret or worst-case regret. For this type of regret, one views θ* as a fixed unknown object and directly bounds R(T; θ*). Even though our primary interest is to bound the Bayesian regret, we can establish a connection to the frequentist regret in the special case where the prior Q has a finite support and the benchmark is the optimal policy (see Corollary 6).\n\n3 Algorithm\n\nOur algorithm is an instance of Thompson sampling, or posterior sampling, first proposed by Thompson [1933]. At the beginning of episode l, the algorithm draws a system parameter θl from the posterior and plays πl = µ(θl) throughout the episode. Once an episode is over, it updates the posterior based on the additional observations. Algorithm 1 describes the steps.\n\nWe want to point out that the history H fulfills two different purposes. One is to update the posterior Ql, and the other is as an input to a policy πl. 
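Both roles of the history can be seen in a stylized sketch of the Algorithm 1 loop (our illustration, not the authors' code): for brevity the arms are i.i.d. Bernoulli given θ (the stationary special case) and the prior is discrete, so the posterior update in step 10 is an exact Bayes update over finitely many candidates.

```python
import numpy as np

def thompson_sampling(candidates, prior, mu, env_sample, m, L, rng):
    """candidates: list of theta vectors (per-arm success probabilities).
    prior: probability mass over candidates; mu(theta) -> policy pi(t, H).
    env_sample(arms) -> {arm: binary reward} under the true parameter."""
    posterior = np.asarray(prior, dtype=float)
    total_reward = 0
    for _ in range(m):                                   # episodes (step 3)
        # Step 4: draw theta_l from the posterior and compute pi_l = mu(theta_l).
        theta_l = candidates[rng.choice(len(candidates), p=posterior)]
        pi = mu(theta_l)
        history = []                                     # step 5: H_0 = empty
        for t in range(1, L + 1):                        # steps 6-9
            arms = pi(t, history)                        # policy reads H_{t-1}
            rewards = env_sample(arms)
            history.append((arms, rewards))
            total_reward += sum(rewards.values())
        # Step 10: multiply in the likelihood of this episode's rewards.
        for i, theta in enumerate(candidates):
            like = 1.0
            for arms, rewards in history:
                for k in arms:
                    like *= theta[k] if rewards[k] == 1 else 1.0 - theta[k]
            posterior[i] *= like
        posterior /= posterior.sum()
    return posterior, total_reward
```

Run against a deterministic environment, the posterior concentrates on the candidate that explains the observed rewards.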
For the latter, however, we do not need the entire history as the arms reset every episode. That is why we set H0 = ∅ (step 5) and feed Ht−1 to πl (step 7). Furthermore, as we assume that the arms evolve based on Markov chains, the history Ht−1 can be summarized as\n\n(r1, n1, ···, rK, nK),   (3)\n\nwhich means that arm k was played nk rounds ago and rk is the reward observed in that round. If an arm k has never been played in the episode, then nk becomes t, and rk becomes the expected reward under the initial distribution based on θl. As we assume the episode length is fixed to be L, there are L possible values for nk. Due to the binary reward assumption, rk can take three values, including the case where the arm k has never been played. From these, we can infer that there are (3L)^K possible tuples of (r1, n1, ···, rK, nK). By considering these tuples as states and following the reasoning of Ortner et al. [2012], one can view our POMDP as a fully observable MDP. Then one can use the existing algorithms for fully observable MDPs (e.g., Osband et al. [2013]), but the regret bounds easily become vacuous since the number of states depends exponentially on the number of arms K.\n\nAdditionally, as we assumed a policy mapping, one might argue for using existing expert learning or classical MAB algorithms, considering potential policies as experts or arms. This is possible, but the number of potential policies corresponds to the size of Θ, which can be very large or even infinite. For this reason, existing algorithms are not efficient and/or their regret bounds become too loose.\n\nDue to its generality, it is hard to analyze the time and space complexity of Algorithm 1. Two major steps are computing the policy (step 4) and updating the posterior (step 10). Computing the policy depends on our choice of competitor mapping µ. 
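The summary in (3) can be maintained with a constant-time update per round; a small sketch (helper names and the t = 1 convention are ours):

```python
def init_summary(K, rho):
    """Start of an episode: no arm has been played yet, so at the first
    decision point n_k = 1 and r_k = rho[k], the arm's expected initial
    reward under the sampled parameter (our convention for t = 1)."""
    return {k: (rho[k], 1) for k in range(K)}

def update_summary(summary, active_rewards):
    """Advance one round: a played arm restarts at n_k = 1 with its observed
    binary reward; every other arm's n_k ages by one round."""
    return {k: (active_rewards[k], 1) if k in active_rewards else (r, n + 1)
            for k, (r, n) in summary.items()}
```

Each arm carries only its last observation and its age, so the policy never needs the full history within an episode.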
If the competitor policy has better performance but is harder to compute, then our regret bound becomes more meaningful as the benchmark is stronger, but the running time gets longer. Regarding the posterior update, the computational burden depends on the choice of the prior Q and its support. If there is a closed-form update, then this step is computationally cheap, but otherwise the burden increases with the size of the support.\n\n4 Regret bound\n\nIn this section, we bound the Bayesian regret of Algorithm 1 by Õ(√T). A key idea in our analysis of Thompson sampling is that the distributions of θ* and θl are identical given the history up to the end of episode l − 1 (e.g., see Lattimore and Szepesvári, Chp. 36). To state it more formally, let σ(H) be the σ-algebra generated by the history H. Then we say a random variable X is σ(H)-measurable, or simply H-measurable, if its value is deterministically known given the information σ(H). Similarly, we say a random function f is H-measurable if its mapping is deterministically known given σ(H). We record as a lemma an observation made by Russo and Van Roy [2014].\n\nLemma 2. (Expectation identity) Suppose θ* and θl have the same distribution given H. For any H-measurable function f, we have\n\nE[f(θ*) | H] = E[f(θl) | H].\n\nRecall that we assume the competitor mapping µ is deterministic. Furthermore, the value function V^θ_{π,i}(∅) in (1) is deterministic given θ and π. This implies E[V^{θ*}_{π*,i}(∅) | H] = E[V^{θl}_{πl,i}(∅) | H], where H is the history up to the end of episode l − 1. This observation leads to the following regret decomposition.\n\nLemma 3. 
(Regret decomposition) The Bayesian regret of Algorithm 1 can be decomposed as\n\nBR(T) = E_{θ*∼Q} Σ_{l=1}^m E_{θl∼Ql}[ V^{θ*}_{π*,1}(∅) − V^{θ*}_{πl,1}(∅) ] = E_{θ*∼Q} Σ_{l=1}^m E_{θl∼Ql}[ V^{θl}_{πl,1}(∅) − V^{θ*}_{πl,1}(∅) ].\n\nProof. The first equality is a simple rewriting of (2) because Algorithm 1 fixes a policy πl for the entire episode l. Then we apply Lemma 2 along with the tower rule to get\n\nE_{θ*∼Q} Σ_{l=1}^m E_{θl∼Ql} V^{θ*}_{π*,1}(∅) = E_{θ*∼Q} Σ_{l=1}^m E_{θl∼Ql} V^{θl}_{πl,1}(∅).\n\nNote that we can compute V^{θl}_{πl,1}(∅) as we know θl and πl. We can also infer the value of V^{θ*}_{πl,1}(∅) from the algorithm's observations. The main point of Lemma 3 is to rewrite the Bayesian regret using terms that are relatively easy to analyze.\n\nNext, we define the Bellman operator\n\nT^θ_π V(Ht−1) = E_{θ,π}[ At · Xt + V(Ht) | Ht−1 ].\n\nIt is not hard to check that V^θ_{π,i} = T^θ_π V^θ_{π,i+1}. The next lemma further decomposes the regret.\n\nLemma 4. (Per-episode regret decomposition) Fix θ* and θl, and let H0 = ∅. Then we have\n\nV^{θl}_{πl,1}(H0) − V^{θ*}_{πl,1}(H0) = E_{θ*,πl} Σ_{t=1}^L (T^{θl}_{πl} − T^{θ*}_{πl}) V^{θl}_{πl,t+1}(Ht−1).\n\nProof. 
Using the relation V^θ_{π,i} = T^θ_π V^θ_{π,i+1}, we may write\n\nV^{θl}_{πl,1}(H0) − V^{θ*}_{πl,1}(H0) = (T^{θl}_{πl} V^{θl}_{πl,2} − T^{θ*}_{πl} V^{θ*}_{πl,2})(H0) = (T^{θl}_{πl} − T^{θ*}_{πl}) V^{θl}_{πl,2}(H0) + T^{θ*}_{πl}(V^{θl}_{πl,2} − V^{θ*}_{πl,2})(H0).\n\nThe second term can be written as E_{θ*,πl}[(V^{θl}_{πl,2} − V^{θ*}_{πl,2})(H1) | H0], and we can repeat this L times to obtain the desired equation.\n\nNow we are ready to prove our main theorem. A complete proof can be found in Appendix A.\n\nTheorem 5. (Bayesian regret bound of Thompson sampling) The Bayesian regret of Algorithm 1 satisfies the following bound:\n\nBR(T) = O(√(K L^3 N^3 T log T)) = O(√(m K L^4 N^3 log(mL))).\n\nRemark. If the system is the classical stationary MAB, which corresponds to the case L = 1, N = 1, our result reproduces the bound O(√(K T log T)) [Lattimore and Szepesvári, Chp. 36]. This suggests our bound is optimal in K and T up to a logarithmic factor. Further, when N > K/2, we can think of the problem as choosing the passive arms, and the smaller bound with N replaced by K − N would apply. When L = 1, the problem becomes a combinatorial bandit of choosing N active arms out of K. Cesa-Bianchi and Lugosi [2012] propose an algorithm with a regret bound O(√(K N T log K)) under the assumption that the loss is always bounded by 1. Since our reward can be as big as N, our bound has the same dependence on N as theirs, suggesting tight dependence of our bound on N.\n\nProof Sketch. We fix an episode l and analyze the regret in this episode. 
Let tl = (l − 1)L, so that the episode starts at time tl + 1. Define Nl(k, r, n) = Σ_{t=1}^{tl} 1{At,k = 1, rk = r, nk = n}, which counts the number of rounds where the arm k was chosen by the learner with history rk = r and nk = n (see (3) for the definition). Note that k ∈ [K], r ∈ {0, 1, ρ(k)}, and n ∈ [L], where ρ(k) is the initial success rate of the arm k. This implies there are 3KL tuples of (k, r, n).\n\nLet ω_θ(k, r, n) denote the conditional probability of Xk = 1 given a history (r, n) and a system parameter θ. Also let ω̂(k, r, n) denote the empirical mean of this quantity (using the Nl(k, r, n) past observations, with the estimate set to 0 if Nl(k, r, n) = 0). Then define\n\nΘl = { θ | ∀(k, r, n), |(ω̂ − ω_θ)(k, r, n)| < √( 2 log(1/δ) / (1 ∨ Nl(k, r, n)) ) }.\n\nSince ω̂(k, r, n) is Htl-measurable, so is the set Θl. Using the Hoeffding inequality, one can show P(θ* ∉ Θl) = P(θl ∉ Θl) ≤ 3δKL. In other words, we can claim that with high probability, |ω_{θl}(k, r, n) − ω_{θ*}(k, r, n)| is small for all (k, r, n).\n\nWe now turn our attention to the following Bellman operator:\n\nT^θ_{πl} V^{θl}_{πl,t}(Ht−1) = E_{θ,πl}[ A_{tl+t} · X_{tl+t} + V^{θl}_{πl,t}(Ht) | Ht−1 ].\n\nSince πl is a deterministic policy, A_{tl+t} is also deterministic given Ht−1 and πl. Let (k1, . . . , kN) be the active arms at time tl + t and write ω_θ(ki, r_{ki}, n_{ki}) = ω_{θ,i}. 
Then we can rewrite\n\nT^θ_{πl} V^{θl}_{πl,t}(Ht−1) = Σ_{i=1}^N ω_{θ,i} + Σ_{x∈{0,1}^N} P^θ_x V^{θl}_{πl,t}(Ht−1 ∪ (A_{tl+t}, x)),\n\nwhere P^θ_x = Π_{i=1}^N ω_{θ,i}^{xi} (1 − ω_{θ,i})^{1−xi}. Under the event that θ*, θl ∈ Θl, we have\n\n|ω_{θl,i} − ω_{θ*,i}| < 1 ∧ √( 8 log(1/δ) / (1 ∨ Nl(ki, r_{ki}, n_{ki})) ) =: Δi(tl + t),\n\nwhere the dependence on tl + t comes from the mapping from i to ki. When ω_{θl,i} and ω_{θ*,i} are close for all (k, r, n), we can actually bound the difference between the Bellman operators as\n\n|(T^{θ*}_{πl} − T^{θl}_{πl}) V^{θl}_{πl,t}(Ht−1)| ≤ 3LN Σ_{i=1}^N Δi(tl + t).\n\nThen by applying Lemma 4, we get |V^{θl}_{πl,1}(∅) − V^{θ*}_{πl,1}(∅)| ≤ 3LN E_{θ*,πl} Σ_{t=1}^L Σ_{i=1}^N Δi(tl + t), which holds whenever θ*, θl ∈ Θl. When θ* ∉ Θl or θl ∉ Θl, which happens with probability less than 6δKL, we have the trivial bound |V^{θl}_{πl,1}(∅) − V^{θ*}_{πl,1}(∅)| ≤ LN. 
We can deduce\n\n|V^{θl}_{πl,1}(∅) − V^{θ*}_{πl,1}(∅)| ≤ 3LN 1(θ*, θl ∈ Θl) E_{θ*,πl} Σ_{t=1}^L Σ_{i=1}^N Δi(tl + t) + 6δKL^2 N.\n\nCombining this with Lemma 3, we can show\n\nBR(T) ≤ 6δmKL^2 N + E_{θ*∼Q} 3LN Σ_{l=1}^m 1(θ*, θl ∈ Θl) E_{θ*,πl} Σ_{t=1}^L Σ_{i=1}^N Δi(tl + t).   (4)\n\nAfter some algebra, bounding sums of finite differences by integrals, and applying the Cauchy-Schwarz inequality, we can bound the second summation by\n\n18KL^3 N + 24 √(3 K L^3 N^3 T log(1/δ)).   (5)\n\nCombining (4), (5), and our assumption that T = mL, we obtain\n\nBR(T) = O(δKLNT + KL^3 N + √(K L^3 N^3 T log(1/δ))).\n\nSince NT is a trivial upper bound on BR(T), we may ignore the KL^3 N term. Setting δ = 1/T completes the proof.\n\nAs discussed in Section 2, researchers sometimes pay more attention to the case where the true parameter θ* is deterministically fixed in advance, in which case the frequentist regret becomes more relevant. It is not easy to directly extend our analysis to the frequentist regret in general, but we can achieve a meaningful bound with extra assumptions. Suppose our prior Q is discrete and the competitor is the optimal policy. Then we know R(T; θ*) is always non-negative due to the optimality of the benchmark and can deduce q R(T; θ*) ≤ BR(T), where q is the probability mass on θ*. This leads to the following corollary.\n\nCorollary 6. (Frequentist regret bound of Thompson sampling) Suppose the prior Q is discrete and puts a non-zero mass on the parameter θ*. Additionally, assume that the competitor policy is the optimal policy. 
Then Algorithm 1 satisfies the following bound:\n\nR(T; θ*) = O(√(K L^3 N^3 T log T)) = O(√(m K L^4 N^3 log(mL))).\n\n5 Experiments\n\nWe empirically investigate the Gilbert-Elliott channel model, which is studied by Liu and Zhao [2010] from a restless bandit perspective.1 This model can be broadly used in communication systems such as cognitive radio networks, downlink scheduling in cellular systems, opportunistic transmission over fading channels, and resource-constrained jamming and anti-jamming.\n\nEach arm k has two parameters p^k_01 and p^k_11, which determine the transition matrix. We assume P^active = P^passive, so each arm's transition matrix is independent of the learner's action. There are only two states, good and bad, and the reward of playing an arm is 1 if its state is good and 0 otherwise. Figure 1 summarizes this model. We assume the initial distribution of an arm k follows the stationary distribution. In other words, its initial state is good with probability ω_k = p^k_01 / (p^k_01 + 1 − p^k_11).\n\nWe fix L = 50 and m = 30. We use Monte Carlo simulation with size 100 or greater to approximate expectations. As each arm has two parameters, there are 2K parameters in total. For these, we set the prior distribution to be uniform over the finite support {0.1, 0.2, ···, 0.9}.\n\n1Our code is available at https://github.com/yhjung88/ThompsonSamplinginRestlessBandits\n\nFigure 1: The Gilbert-Elliott channel model\n\n5.1 Competitors\n\nAs mentioned earlier, one important strength of our result is that various policy mappings can be used as benchmarks. Here we test three different policies: the best fixed arm policy, the myopic policy, and the Whittle index policy. 
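The quantities behind the first two benchmarks are simple to compute from (p^k_01, p^k_11); a small sketch in our notation. The belief update below is the standard one-step propagation for a two-state chain; the paper does not spell it out, so treat it as our assumption.

```python
def stationary_good_prob(p01, p11):
    """omega_k = p01 / (p01 + 1 - p11): the stationary probability that the
    Gilbert-Elliott chain is in the good state. The best fixed arm policy
    ranks arms by this value."""
    return p01 / (p01 + 1.0 - p11)

def belief_update(omega, p01, p11):
    """One-step belief propagation for an unobserved arm (our assumption of
    the myopic policy's update): P(good next round) given current belief
    omega is omega * p11 + (1 - omega) * p01."""
    return omega * p11 + (1.0 - omega) * p01
```

Note that the stationary distribution is a fixed point of the belief update, which is why all three policies coincide in the stationary case.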
We want to emphasize again that these competitor policies know the system parameters while our algorithm does not.\n\nThe best fixed arm policy computes the stationary distribution ω_k = p^k_01 / (p^k_01 + 1 − p^k_11) for all k and pulls the arms with the top N values. The myopic policy keeps updating the belief ω_k(t) that the arm k is in the good state and pulls the top N arms. Finally, the Whittle index policy computes the Whittle index of each arm and uses it to rank the arms. The Whittle index was proposed by Whittle [1988], and Liu and Zhao [2010] find a closed-form formula for the Whittle index in this particular setting. The Whittle index policy is very popular in the optimization literature as it decouples the optimization process into K independent problems, one for each arm, which significantly reduces the computational complexity while maintaining a reasonable performance against the optimal policy.\n\nOne observation is that all three policies reduce to the best fixed arm policy in the stationary case. However, the first two policies are known to be sub-optimal in general [Gittins et al., 1989]. Liu and Zhao [2010] justify both theoretically and empirically the performance of the Whittle index policy for the Gilbert-Elliott channel model.\n\n5.2 Results\n\nWe first analyze the Bayesian regret. For this, we use K = 8 and N = 3. The value functions V^θ_{π,1}(∅) of the best fixed arm policy, the myopic policy, and the Whittle index policy are 105.4, 110.3, and 111.4, respectively. If a competitor policy has a weak performance, then Thompson sampling also uses this weak policy mapping to get a policy πl for the episode l. This implies that the regret does not necessarily become negative when the benchmark policy is weak. Figure 2 shows the trend of the Bayesian regret as a function of episode indices. 
Regardless of the choice of policy mapping, the regret is sub-linear, and the slope of the log-log plot is less than 1/2, which agrees with Theorem 5.\n\nFigure 2: Bayesian regret of Thompson sampling versus episode (left) and its log-log plot (right)\n\nNext we fix the true parameters and investigate the model's behavior more closely. For this, we choose K = 4, N = 2, and {(p^k_01, p^k_11)}_{k=1,2,3,4} = {(0.3, 0.7), (0.4, 0.6), (0.5, 0.5), (0.6, 0.4)}. This choice results in ω_k = 0.5 for all k, and the best fixed arm policy becomes indifferent. Therefore, achieving zero regret against the best fixed arm policy becomes trivial. We use the same uniform prior as in the previous experiment. Figure 3 presents the trend of the value functions and how Thompson sampling puts more posterior weight on the correct parameters as it proceeds. The three horizontal lines in the left figure represent the values of the competitor policies. The values of the best fixed arm policy, the myopic policy, and the Whittle index policy are 50.2, 54.6, and 55.6, respectively. This is a good example of why one should not pull the same arms all the time in restless bandits. The value function of Thompson sampling successfully converges to the competitor value for every benchmark, while the one with the myopic policy needs more episodes to fully converge. This supports Corollary 6 in that our model can be used even in the non-Bayesian setting as long as the prior puts a non-zero weight on the true parameters. Also, the posterior weights on the correct parameters monotonically increase (Figure 3, right), which again confirms our model's performance. 
We measure these weights when the competitor map is the Whittle index policy.\n\nFigure 3: Average per-episode value versus episode and the benchmark values (left); the posterior weights of the correct parameters versus episode in the case of the Whittle index policy (right)\n\n6 Discussion and future directions\n\nIn this paper, we have analyzed Thompson sampling in restless bandits with binary rewards. The Bayesian regret can be theoretically bounded as Õ(√T), which naturally extends the results in the stationary MAB. One primary strength of our analysis is that the bound applies to arbitrary deterministic competitor policy mappings, which include the optimal policy and many other practical policies. Experiments with the simulated Gilbert-Elliott channel models support the theoretical results. In the special case where the prior has a discrete support and the benchmark is the optimal policy, our result extends to the frequentist regret, which is also supported by empirical results.\n\nThere are at least two interesting directions to be explored.\n\n1. Our setting is episodic with known length L. The system resets periodically, which makes the analysis of the regret simpler. However, it is sometimes unrealistic to assume this periodic reset (e.g., the online recommendation system studied by Meshram et al. [2017]). Analyzing a learning algorithm in the non-episodic setting will be useful.\n\n2. Corollary 6 is not directly applicable in the case of a continuous prior. In stationary MABs, it has been shown that Thompson sampling enjoys a frequentist regret bound of Õ(√T) under additional assumptions [Lattimore and Szepesvári, Chp. 36]. Extending this to the restless bandit setting will be an interesting problem.\n\nAcknowledgments\n\nWe acknowledge the support of NSF CAREER grant IIS-1452099. AT was also supported by a Sloan Research Fellowship. 
AT visited Criteo AI Lab, Paris, and had discussions with Criteo researchers – Marc Abeille, Clément Calauzènes, and Jérémie Mary – regarding non-stationarity in bandit problems. These discussions were very helpful in drawing our attention to the regret analysis of restless bandit problems and the need to consider a variety of benchmark competitors when defining regret.

References

Sahand Haji Ali Ahmad, Mingyan Liu, Tara Javidi, Qing Zhao, and Bhaskar Krishnamachari. Optimality of myopic sensing in multichannel opportunistic access. IEEE Transactions on Information Theory, 55(9):4040–4050, 2009.

Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

Wenhan Dai, Yi Gai, Bhaskar Krishnamachari, and Qing Zhao. The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2940–2943. IEEE, 2011.

Wenhan Dai, Yi Gai, and Bhaskar Krishnamachari. Online learning for multi-channel opportunistic access over unknown Markovian channels. In IEEE International Conference on Sensing, Communication, and Networking (SECON), pages 64–71. IEEE, 2014.

John C Gittins, Kevin D Glazebrook, and Richard Weber. Multi-armed bandit allocation indices. Wiley, 1989.

Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, forthcoming.

Haoyang Liu, Keqin Liu, and Qing Zhao. Logarithmic weak regret of non-Bayesian restless multi-armed bandit. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1968–1971. IEEE, 2011.

Haoyang Liu, Keqin Liu, and Qing Zhao. Learning in a changing world: Restless multiarmed bandit with unknown dynamics.
IEEE Transactions on Information Theory, 59(3):1902–1916, 2013.

Keqin Liu and Qing Zhao. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11):5547–5567, 2010.

Rahul Meshram, Aditya Gopalan, and D Manjunath. Optimal recommendation to users that react: Online learning for a class of POMDPs. In IEEE 55th Conference on Decision and Control (CDC), pages 7210–7215. IEEE, 2016.

Rahul Meshram, Aditya Gopalan, and D Manjunath. Restless bandits that hide their hand and recommendation systems. In IEEE International Conference on Communication Systems and Networks (COMSNETS), pages 206–213. IEEE, 2017.

Rahul Meshram, D Manjunath, and Aditya Gopalan. On the Whittle index for restless multiarmed hidden Markov bandits. IEEE Transactions on Automatic Control, 63(9):3046–3053, 2018.

Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for restless Markov bandits. In International Conference on Algorithmic Learning Theory, pages 214–228. Springer, 2012.

Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

Christos H Papadimitriou and John N Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Cem Tekin and Mingyan Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, 2012.

William R Thompson.
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.