{"title": "Recovering Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 14122, "page_last": 14131, "abstract": "We study the recovering bandits problem, a variant of the stochastic multi-armed bandit problem where the expected reward of each arm varies according to some unknown function of the time since the arm was last played. While being a natural extension of the classical bandit problem that arises in many real-world settings, this variation is accompanied by significant difficulties. In particular, methods need to plan ahead and estimate many more quantities than in the classical bandit setting. In this work, we explore the use of Gaussian processes to tackle the estimation and planing problem. We also discuss different regret definitions that let us quantify the performance of the methods. To improve computational efficiency of the methods, we provide an optimistic planning approximation. We complement these discussions with regret bounds and empirical studies.", "full_text": "Recovering Bandits\n\nCiara Pike-Burke\u2217\n\nUniversitat Pompeu Fabra\n\nBarcelona, Spain\n\nSteffen Gr\u00fcnew\u00e4lder\nLancaster University\n\nLancaster, UK\n\nc.pikeburke@gmail.com\n\ns.grunewalder@lancaster.ac.uk\n\nAbstract\n\nWe study the recovering bandits problem, a variant of the stochastic multi-armed\nbandit problem where the expected reward of each arm varies according to some\nunknown function of the time since the arm was last played. While being a natural\nextension of the classical bandit problem that arises in many real-world settings,\nthis variation is accompanied by signi\ufb01cant dif\ufb01culties. In particular, methods\nneed to plan ahead and estimate many more quantities than in the classical bandit\nsetting. In this work, we explore the use of Gaussian processes to tackle the\nestimation and planing problem. We also discuss different regret de\ufb01nitions that let\nus quantify the performance of the methods. 
To improve computational ef\ufb01ciency\nof the methods, we provide an optimistic planning approximation. We complement\nthese discussions with regret bounds and empirical studies.\n\n1\n\nIntroduction\n\nThe multi-armed bandit problem [2, 29] is a sequential decision making problem, where, in each\nround t, we play an arm Jt and receive a reward Yt,Jt generated from the unknown reward distribution\nof the arm. The aim is to maximize the total reward over T rounds. Bandit algorithms have become\nubiquitous in many settings such as web advertising and product recommendation. Consider, for\nexample, suggesting items to a user on an internet shopping platform. This is modeled as a bandit\nproblem where each product (or group of products) is an arm. Over time, the bandit algorithm will\nlearn to suggest only good products. In particular, once the algorithm learns that a product (eg. a\ntelevision) has high reward, it will continue to suggest it. However, if the user buys a television, the\nbene\ufb01t of continuing to show it immediately diminishes, but may increase again as the television\nreaches the end of its lifetime. To improve customer experience (and pro\ufb01t), it would be bene\ufb01cial for\nthe recommendation algorithm to learn not to recommend the same product again immediately, but\nto wait an appropriate amount of time until the reward from that product has \u2018recovered\u2019. Similarly,\nin \ufb01lm and TV recommendation, a user may wish to wait before re-watching their favorite \ufb01lm,\nor conversely, may want to continue watching a series but will lose interest in it if they haven\u2019t\nseen it recently. It would be advantageous for the recommendation algorithm to learn the different\nreward dynamics and suggest content based on the time since it was last seen. 
The recovering bandits\nframework presented here extends the stochastic bandit problem to capture these phenomena.\nIn the recovering bandits problem, the expected reward of each arm is given by an unknown function\nof the number of rounds since it was last played. In particular, for each arm j, there is a function\nfj(z) that speci\ufb01es the expected reward from playing arm j when it has not been played for z rounds.\nWe take a Bayesian approach and assume that the fj\u2019s are sampled from a Gaussian process (GP)\n(see Figure 1a). Using GPs allows us to capture a wide variety of functions and deal appropriately\nwith uncertainty. For any round t, let Zj,t be the number of rounds since arm j was last played. This\nchanges for both the played arm (it resets to 0) and also for the unplayed arms (it increases by 1) in\n\n\u2217This work was carried out while CPB was a PhD student at STOR-i, Lancaster University, UK.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fevery round. Thus, the expected reward of every arm changes in every round, and this change depends\non whether it was played. This problem is therefore related to both restless and rested bandits [30].\nIn recovering bandits, the reward of each arm depends on the entire sequence of past actions. This\nmeans that, even if the recovery functions were known, selecting the best sequence of T arms is\nintractable (since, in particular, an MDP representation would be unacceptably large). One alternative\nis to select the action that maximizes the instantaneous reward, without considering future decisions.\nThis is still quite a challenge compared to the K-armed bandit problem, as instead of just learning the\nreward of each arm, we must learn recovery functions. In some cases, maximizing the instantaneous\nreward is not optimal. 
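A toy numeric illustration of this point (the recovery functions below are hypothetical, not from the paper): suppose arm 1\u2019s reward doubles if it is left to recover one more round while arm 2\u2019s reward is flat. Greedy play is indifferent between the arms at first, yet the two orderings differ in total reward:

```python
# Hypothetical recovery functions: arm 1 recovers (reward doubles with waiting,
# capped), arm 2 is flat.
f = {1: lambda z: float(min(2 ** z, 4)), 2: lambda z: 1.0}

def total_reward(order, delays):
    """Play `order` from the given delays; the played arm resets to 0, others age."""
    total, Z = 0.0, dict(delays)
    for j in order:
        total += f[j](Z[j])
        Z = {a: (0 if a == j else z + 1) for a, z in Z.items()}
    return total

greedy_first = total_reward([1, 2], {1: 0, 2: 0})  # play arm 1 immediately
wait_for_1 = total_reward([2, 1], {1: 0, 2: 0})    # let arm 1 recover one round
```

Here both arms pay 1 at delay 0, so a greedy rule cannot distinguish the orders, but delaying arm 1 by one round earns 3 rather than 2 in total.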
In particular, using knowledge of the reward dynamics, it is often possible to find a sequence of arms whose total reward is greater than that gained by playing the instantaneous greedy arms. Thus, our interest lies in selecting sequences of arms to maximize the reward.\nIn this work, we present and analyze an Upper Confidence Bound (UCB) [2] and Thompson Sampling [29] algorithm for recovering bandits. By exploiting properties of Gaussian processes, both of these accurately estimate the recovery functions and uncertainty, and use these to look ahead and select sequences of actions. This leads to strong theoretical and empirical performance. The paper proceeds as follows. In Section 2 we discuss related work. We formally define our problem in Section 3 and the regret in Section 4. In Section 5, we present our algorithms and bound their regret. We use optimistic planning in Section 6 to improve computational complexity and show empirical results in Section 7 before concluding.\n\n2 Related Work\n\nIn the restless bandits problem, the reward distribution of any arm changes at any time, regardless of whether it is played. This problem has been studied by [30, 27, 10, 23, 4] and others. In the rested bandits problem, the reward distribution of an arm only changes when it is played. [17, 8, 6, 12, 26] study rested bandits problems with rewards that vary mainly with the number of plays of an arm. In recovering bandits, the rewards depend on the time since the arm was last played. [14] consider concave and increasing recovery functions and [31] study recommendation algorithms with known step recovery functions. The closest work to ours is [18] where the expected reward of each arm depends on a state (which could be the time since the arm was played) via a parametric function. They use maximum likelihood estimation (although there are no guarantees of convergence) in a KL-UCB algorithm [7]. The expected frequentist regret of their algorithm is O(\u2211j log(T)/\u03b4j^2) where \u03b4j depends on the random number of plays of arm j and the minimum difference in the rewards of any arms at any time (which can be very small). By the standard worst case analysis, the frequentist problem independent regret is O\u2217(T^{2/3} K^{1/3}), where we use the notation O\u2217 to suppress log factors. Our algorithms achieve O\u2217(\u221a(KT)) Bayesian regret and require less knowledge of the recovery functions. [18] also provide an algorithm with no theoretical guarantees but improved experimental performance. In Section 7, we show that our algorithms outperform this algorithm experimentally.\nIn Gaussian process bandits, there is a function, f, sampled from a GP and the aim is to minimize the (Bayesian) regret with respect to the maximum of f. The GP-UCB algorithm of [28] has Bayesian regret O\u2217(\u221a(T \u03b3T)) where \u03b3T is the maximal information gain (see Section 5). By [25], Thompson sampling has the same Bayesian regret. [5] consider GP bandits with a slowly drifting reward function and [16] study contextual GP bandits. These contexts and drifts do not depend on previous actions.\nIt is important to note that all of the above approaches only look at instantaneous regret whereas in recovering bandits, it is more appropriate to consider lookahead regret (see Section 4). We will also consider Bayesian regret. Many naive approaches will not perform well in this problem. For example, treating each (j, z) combination as an arm and using UCB [2] with K|Z| arms leads to regret O\u2217(\u221a(KT|Z|)) (see Appendix F). Our algorithms exhibit only \u221a(log|Z|) dependence on |Z|. Adversarial bandit algorithms will not do well in this setting either since they aim to minimize the regret with respect to the best constant arm, which is clearly suboptimal in recovering bandits.\n\n(a) Examples of the recovery functions. (b) An example of a d-step lookahead tree.\n\nFigure 1: Illustration of recovery functions and lookahead trees.\n\n3 Problem Definition\n\nWe have K independent arms and play for T rounds (T is not known). For each arm 1 \u2264 j \u2264 K and round 1 \u2264 t \u2264 T, denote by Zj,t the number of rounds since arm j was last played, where Zj,t \u2208 Z = {0, . . . , zmax} for finite zmax \u2208 N. If we play arm Jt at time t then, at time t + 1,\n\nZj,t+1 = 0 if Jt = j, and Zj,t+1 = min{zmax, Zj,t + 1} if Jt \u2260 j. (1)\n\nHence, if arm j has not been played for more than zmax steps, Zj,t will stay at zmax. The Zj,t are random variables since they depend on our past actions. We will assume that T \u2265 K|Z|.\nThe expected reward for arm j is modeled by an (unknown) recovery function, fj. We assume that the fj\u2019s are sampled independently from a Gaussian process with mean 0 and known kernel (see Figure 1a). Let Zt = (Z1,t, . . . , ZK,t) be the vector of covariates for each arm at time t. At round t, we observe Zt and use this and past observations to select an arm Jt to play. 
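The delay dynamics (1) can be sketched directly (a minimal sketch; the names are illustrative, not the paper\u2019s code):

```python
def update_delays(Z, played, z_max):
    """One step of (1): the played arm's delay resets to 0; every other arm's
    delay increases by 1, capped at z_max."""
    return {j: 0 if j == played else min(z_max, z + 1) for j, z in Z.items()}
```

For example, with z_max = 30, playing arm 0 from delays {0: 3, 1: 5} gives {0: 0, 1: 6}, and an arm already at z_max stays there until it is played.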
We then receive a noisy observation YJt,t = fJt(ZJt,t) + \u03b5t where \u03b5t are i.i.d. N(0, \u03c32) random variables and \u03c3 is known.\n[24] give an introduction to Gaussian processes (GPs). A Gaussian process gives a distribution over functions such that, for every finite set z1, . . . , zN of covariates, the joint distribution of f(z1), . . . , f(zN) is Gaussian. A GP is defined by its mean function, \u00b5(z) = E[f(z)], and kernel function, k(z, z\u2032) = E[(f(z) \u2212 \u00b5(z))(f(z\u2032) \u2212 \u00b5(z\u2032))]. If we place a GP prior on f and observe YN = (Y1, . . . , YN)^T at zN = (z1, . . . , zN)^T where Yn = f(zn) + \u03b5n for \u03b5n i.i.d. N(0, \u03c32) noise, then the posterior distribution after N observations is GP(\u00b5(z; N), k(z, z\u2032; N)). Here, for kN(z) = (k(z1, z), . . . , k(zN, z))^T and positive semi-definite kernel matrix KN = [k(zi, zj)] for i, j = 1, . . . , N, the posterior mean and covariance are,\n\n\u00b5(z; N) = kN(z)^T (KN + \u03c32 I)^{\u22121} yN, k(z, z\u2032; N) = k(z, z\u2032) \u2212 kN(z)^T (KN + \u03c32 I)^{\u22121} kN(z\u2032),\n\nso \u03c32(z; N) = k(z, z; N). For z \u2208 Z, the posterior distribution of f(z) is N(\u00b5(z; N), \u03c32(z; N)). We consider the posterior distribution of fj for each arm at every round, when it has been played some (random) number of times. For each arm j, denote the posterior mean and variance of fj at z after n plays of the arm by \u00b5j(z; n) and \u03c3j^2(z; n). Let Nj(t) be the (random) number of times arm j has been played up to time t. We denote the posterior mean and variance of arm j at round t by,\n\n\u00b5t(j) = \u00b5j(Zj,t; Nj(t \u2212 1)), and \u03c3t^2(j) = \u03c3j^2(Zj,t; Nj(t \u2212 1)).\n\n4 Defining the Regret\n\nThe regret is commonly used to measure the performance of an algorithm and is defined as the difference in the cumulative expected reward of an algorithm and an oracle. 
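Before continuing, the posterior formulas above can be checked with a short numerical sketch (the kernel choice and noise level here are illustrative; this is not the paper\u2019s implementation):

```python
import numpy as np

def sq_exp(z1, z2, lengthscale=2.0):
    # Squared exponential kernel, one of the kernel families used in Section 7.
    return np.exp(-0.5 * ((z1 - z2) / lengthscale) ** 2)

def gp_posterior(z_obs, y_obs, z_query, sigma=0.1, k=sq_exp):
    """mu(z; N) and sigma^2(z; N) after N noisy observations (z_obs, y_obs)."""
    z_obs = np.asarray(z_obs, dtype=float)
    KN = k(z_obs[:, None], z_obs[None, :]) + sigma ** 2 * np.eye(len(z_obs))
    kq = k(z_obs, float(z_query))                              # vector k_N(z)
    mu = kq @ np.linalg.solve(KN, np.asarray(y_obs, dtype=float))
    var = k(float(z_query), float(z_query)) - kq @ np.linalg.solve(KN, kq)
    return mu, var
```

With a single observation y = 1 at z = 0 and \u03c3 = 0.1, the posterior mean at z = 0 is 1/(1 + \u03c3\u00b2) \u2248 0.990 and the variance shrinks from the prior value 1 to about 0.0099; adding observations shrinks it further.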
We will use the Bayesian\nregret, where the expectation is taken over the recovery curves and the actions. In recovering bandits,\nthere are various choices for the oracle. We discuss some of these here.\n\nFull Horizon Regret. One candidate for the oracle is the deterministic policy which knows the\nrecovery functions and T , and using this selects the best sequence of T arms. This policy can be\nhorizon dependent. Anytime algorithms, which are horizon independent, lead to policies that are\nstationary and do not change over time. In various settings, these stationary deterministic policies\nachieve the best possible regret [22]. In the following, we focus on the stationary deterministic\n(SD) oracle. Note that it is computationally intractable to calculate this oracle in all but the easiest\nproblems. This can be seen by formulating the problem as an MDP, with natural state-space of size\nK|Z|. Techniques such as dynamic programming cannot be used unless K and |Z| are very small.\n\n3\n\n051015202530z3210123fj(z)\fInstantaneous Regret. Another candidate for the oracle is the policy which in each round t,\ngreedily plays the arm with the highest immediate reward at Zt. These Zt depend on the previous\nactions of the oracle. Consider a policy which plays this oracle up to time s\u22121, then selects a different\naction at time s, and continues to play greedily. The cumulative reward of this policy could be vastly\ndifferent to that of the oracle since they may have very different Z values. Therefore, de\ufb01ning regret\nin relation to this oracle may penalize us severely for early mistakes. Instead, one can de\ufb01ne the\nregret of a policy \u03c0 with respect to an oracle which selects the best arm at the Zt\u2019s generated by \u03c0.\nWe call this the instantaneous regret. This regret is commonly used in restless bandits and in [18].\n\nd-step Lookahead Regret. 
A policy with low instantaneous regret may miss out on additional reward by not considering the impact of its actions on future Zt\u2019s. Looking ahead and considering the evolution of the Zj,t\u2019s can lead to choosing sequences of arms which are collectively better than individual greedy arms. For example, if two arms j1, j2 have similar fj(Zj,t) but the reward of j1 doubles if we do not play it, while the reward of j2 stays the same, it is better to play j2 then j1. We will consider oracles which take the Zt generated by our algorithm and select the best sequence of d \u2265 1 arms. We call this regret the d-step lookahead regret and will use this throughout the paper.\nTo define this regret, we use decision trees. Nodes are Z values and edges represent playing arms and updating Z (see Figure 1b). Each sequence of d arms is a leaf of the tree. Let Ld(Z) be the set of leaves of a d-step lookahead tree with root Z. For any i \u2208 Ld(Z), denote by Mi(Z) the expected reward at that leaf, that is the sum of the fj\u2019s along the path to i at the relevant Zj\u2019s (see Section 5). The d-step lookahead oracle selects the leaf with highest Mi(Zt) from a given root node Zt; denote this value by M\u2217(Zt). This leaf is the best sequence of d arms from Zt. If we select leaf It at time t, we play the arms to It for d steps, so select a leaf every d rounds. The d-step lookahead regret is,\n\nE[R(d)\nT ] = E[ \u2211_{h=0}^{\u230aT/d\u230b} ( M\u2217(Zhd+1) \u2212 MIhd+1(Zhd+1) ) ],\n\nwith expectation over Ihd+1 and fj. If d = T or d = 1, we get the full horizon or instantaneous regret. We study the single play regret, E[R(d,s)\nT ], where arms can only be played once in a d-step lookahead, and the multiple play regret, E[R(d,m)\nT ], which allows multiple plays of an arm in a lookahead. This regret is related to that in episodic reinforcement learning (ERL) [15, 21, 3]. 
A key difference is that\nin ERL, the initial state is reset or re-sampled every d steps independent of the actions taken. Note\nthat the d-step lookahead regret can be calculated for any policy, regardless of whether the policy is\ndesigned to look ahead and select sequences of d actions.\nFor large d, the total reward from the optimal d-step lookahead policy will be similar to that of the\noptimal full horizon stationary deterministic policy. Let VT (\u03c0) be the total reward of policy \u03c0 up to\nhorizon T and note that the optimal SD policy will be periodic by Lemma 16 (Appendix D). Then,\nProposition 1 Let p\u2217 be the period of the optimal SD policy \u03c0\u2217. For any l = 1, . . . ,(cid:98) T\u2212zmax\n(cid:99), the\noptimal (zmax + lp\u2217)-lookahead policy, \u03c0\u2217\nVT (\u03c0\u2217).\nHence, any algorithm with low (zmax + lp\u2217)-step lookahead regret will also have high total reward.\nIn practice, we may not know the periodicity of \u03c0\u2217. Moreover, if p\u2217 is too large, then looking\n(zmax + lp\u2217) steps ahead may be computationally challenging, and prohibit learning. Hence, we may\nwish to consider smaller values of d. One option is to look far enough ahead that we consider a local\nmaximum of each recovery function. For a GP kernel with lengthscale l (e.g. squared exponential or\nMat\u00e9rn), this requires looking 2l steps ahead [20, 24]. This should still give large reward while being\ncomputationally more ef\ufb01cient and allowing for learning.\n\nl ) \u2265(cid:0)1\u2212 (l+1)p\u2217+zmax\n\nl , satis\ufb01es, VT (\u03c0\u2217\n\np\u2217\nlp\u2217+zmax\n\nlp\u2217\n\nT +p\u2217\n\n(cid:1)\n\n5 Gaussian Processes for Recovering Bandits\n\nIn Algorithm 1 we present a UCB (dRGP-UCB) and Thompson Sampling (dRGP-TS) algorithm for\nthe d-step lookahead recovering bandits problem, for both the single and multiple play case. 
Our\nalgorithms use Gaussian processes to model the recovery curves, allowing for ef\ufb01cient estimation\nand facilitating the lookahead. For each arm j we place a GP prior on fj and initialize Zj,1 (often\nthis initial value is known, otherwise we set it to 0). Every d steps we construct the d-step lookahead\n\n4\n\n\fAlgorithm 1 d-step lookahead UCB and Thompson Sampling\n\nInput: \u03b1t from (3) (for UCB).\nInitialization: De\ufb01ne Td = {1, d + 1, 2d + 1, . . .}. For all arms j \u2208 A, set Zj,1 = 0 (optional).\nfor t \u2208 Td do\n\nIf t \u2265 T break. Else, construct the d-step lookahead tree. Then,\nIf UCB:\n\n(cid:26)\n\n(cid:27)\n\n(ii) \u2200i \u2208 Ld(Zt), \u02dc\u03b7t(i) =(cid:80)d\u22121\n\nIf TS: (i) \u2200j \u2208 A, sample \u02dcfj from the posterior at Z(d)\nj,t .\n\u02dcfJt+(cid:96)(ZJt+(cid:96),t+(cid:96))\n\n(iii) It = argmaxi\u2208Ld(Zt){\u02dc\u03b7t(i)}\n\nl=0\n\nIt = argmax\ni\u2208Ld(Zt)\n\n\u03b7t(i) + \u03b1t\u03c2t(i)\n\n,\n\nfor (cid:96) = 0, . . . , d \u2212 1 do\n\nPlay (cid:96)th arm to It, J(cid:96), and get reward YJ(cid:96),t+(cid:96).\nSet ZJ(cid:96),t+(cid:96)+1 = 0. For all j (cid:54)= J(cid:96), set Zj,t+(cid:96)+1 = min{Zj,t+(cid:96) + 1, zmax}.\n\nend for\nUpdate the posterior distributions of the played arms.\n\nend for\n\nMi(Zt) =\n\n(cid:96)=0\n\nd\u22121(cid:88)\n\nd\u22121(cid:88)\n\n(cid:96)=0\n\ntree as in Figure 1b. At time t, we select a sequence of arms by choosing a leaf It of the tree with\nroot Zt. For a leaf i \u2208 Ld(Zt), let {Jt+(cid:96)}d\u22121\n(cid:96)=0 be the sequences of arms and z\nvalues (which are updated using (1)) on the path to leaf i. Then de\ufb01ne the total reward at i as,\n\n(cid:96)=0 and{ZJt+(cid:96),t+(cid:96)}d\u22121\nd\u22121(cid:88)\n\nfJt+(cid:96)(ZJt+(cid:96),t+(cid:96)).\n\nSince the posterior distribution of fj(z) is Gaussian, the posterior distribution of the leaves of the\nlookahead tree will also be Gaussian. 
In particular, \u2200i \u2208 Ld(Zt), Mi(Zt) \u223c N(\u03b7t(i), \u03c2t^2(i)) where,\n\n\u03b7t(i) = \u2211_{\u2113=0}^{d\u22121} \u00b5t(Jt+\u2113), \u03c2t^2(i) = \u2211_{\u2113,q=0}^{d\u22121} covt(fJt+\u2113(ZJt+\u2113,t+\u2113), fJt+q(ZJt+q,t+q)), and, (2)\n\ncovt(fJt+\u2113(ZJt+\u2113,t+\u2113), fJt+q(ZJt+q,t+q)) = I{Jt+\u2113 = Jt+q} kJt+\u2113(ZJt+\u2113,t+\u2113, ZJt+q,t+q; NJt+\u2113(t)).\n\nHence, using GPs enables us to accurately estimate the reward and uncertainty at the leaves.\nFor dRGP-UCB, we construct upper confidence bounds on each Mi(Zt) using Gaussianity. We then select the leaf It with largest upper confidence bound at time t. That is,\n\nIt = argmax_{1\u2264i\u2264K^d} {\u03b7t(i) + \u03b1t \u03c2t(i)} where \u03b1t = \u221a(2 log((K|Z|)^d (t + d \u2212 1)^2)). (3)\n\nIn dRGP-TS, we select a sequence of d arms by sampling the recovery function of each arm j at Z(d)\nj,t = (Zj,t, . . . , Zj,t + d \u2212 1, 0, . . . , d \u2212 1)^T and then calculating the \u2018reward\u2019 of each node using these sampled values. Denote the sampled reward of node i by \u02dc\u03b7t(i). We choose the leaf It with highest \u02dc\u03b7t(i). In both dRGP-UCB and dRGP-TS, by the lookahead property, we will only play an arm at a large Zj,t value if it has high reward, or high uncertainty there. We play the sequence of d arms indicated by It over the next d rounds. We then update the posteriors and repeat the process.\nWe analyze the regret in the single and multiple play cases separately. Studying the single play case first allows us to gain more insights about the difficulty of the problem. Indeed, from our analysis we observe that the multiple play case is more difficult since we may lose information from not updating the posterior between plays of the same arm. 
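To make the selection rule concrete, here is a minimal sketch of the UCB choice over leaves (illustrative names; brute-force enumeration of the K^d leaves, and, for simplicity, the independent-arm case of Section 5.1 where \u03c2t^2(i) is the sum of per-step posterior variances):

```python
import math
from itertools import product

def leaf_stats(arm_seq, Z, post_mean, post_var, z_max=30):
    # eta_t(i) and (independent-arm) varsigma_t^2(i) for the leaf reached by
    # playing arm_seq from root delays Z, with delays updated as in (1).
    Z, eta, var = dict(Z), 0.0, 0.0
    for j in arm_seq:
        eta += post_mean[j](Z[j])
        var += post_var[j](Z[j])
        Z = {a: (0 if a == j else min(z_max, z + 1)) for a, z in Z.items()}
    return eta, var

def drgp_ucb_select(d, Z, post_mean, post_var, arms, t, z_size):
    # alpha_t from (3), then the leaf maximizing eta + alpha * varsigma.
    alpha = math.sqrt(2 * math.log((len(arms) * z_size) ** d * (t + d - 1) ** 2))
    def ucb(seq):
        eta, var = leaf_stats(seq, Z, post_mean, post_var)
        return eta + alpha * math.sqrt(var)
    return max(product(arms, repeat=d), key=ucb)
```

With zero posterior variance this reduces to the greedy-over-sequences oracle; inflating one arm\u2019s variance makes the rule explore sequences containing that arm.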
All proofs are in the appendix.\nThe regret of our algorithms will depend on the GP kernel through the maximal information gain. For a set S of covariates and observations YS = [f(z) + \u03b5z] for z \u2208 S, we define the information gain, I(YS; f) = H(YS) \u2212 H(YS|f) where H(\u00b7) is the entropy. As in [28], we consider the maximal information gain from N samples, \u03b3N. If zt \u2208 S is played at time t, then,\n\nI(YS; f) = (1/2) \u2211_{t=1}^{|S|} log(1 + \u03c3^{\u22122} \u03c32(zt; t \u2212 1)), and, \u03b3N = max_{S \u2282 Z^N : |S|=N} I(YS; f). (4)\n\nTheorem 5 of [28] gives bounds on \u03b3T for some kernels. We apply these results using the fact that the dimension, D, of the input space is 1. For any lengthscale, \u03b3T = O(log(T)) for linear kernels, \u03b3T = O(log^2(T)) for squared exponential kernels, and \u03b3T = O(T^{2/(2\u03bd+2)} log(T)) for Mat\u00e9rn(\u03bd).\n\n5.1 Single Play Lookahead\n\nIn the single play case, each arm can only be played once in the d-step lookahead. This simplifies the variance in (2) since the arms are independent. For any leaf i corresponding to playing arms Jt, . . . , Jt+d\u22121 (at the corresponding z values), \u03c2t^2(i) = \u2211_{\u2113=0}^{d\u22121} \u03c3t^2(Jt+\u2113). This involves the posterior variances at time t. However, as we cannot repeat arms, if we play arm j at time t+\u2113 for 0 \u2264 \u2113 \u2264 d\u22121, it cannot have been played since time t, so its posterior is unchanged. Using this and (4), we relate the variance of MIt(Zt) to the information gain about the fj\u2019s. We get the following regret bounds.\n\nTheorem 2 The d-step single play lookahead regret of dRGP-UCB satisfies,\n\nE[R(d,s)\nT ] \u2264 O(\u221a(KT \u03b3T log(T K|Z|))).\n\nTheorem 3 The d-step single play lookahead regret of dRGP-TS satisfies,\n\nE[R(d,s)\nT ] \u2264 O(\u221a(KT \u03b3T log(T K|Z|))).\n\n5.2 Multiple Play Lookahead\n\nWhen arms can be played multiple times in the d-step lookahead, the problem is more difficult since we cannot use feedback from plays within the same lookahead to inform decisions. It is also harder to relate \u03c2t^2(It) to the information gain about each fj. In particular, \u03c2t^2(It) contains covariance terms and is defined using the posteriors at time t. On the other hand, \u03b3T is defined in terms of the posterior variances when each arm is played. These may be different to those at time t if an arm is played multiple times in the lookahead. However, using the fact that the posterior covariance matrix is positive semi-definite, 2kj(z1, z2; n) \u2264 \u03c3j^2(z1; n) + \u03c3j^2(z2; n), so we can bound \u03c2t^2(It) \u2264 3 \u2211_{\u2113=0}^{d\u22121} \u03c3t^2(Jt+\u2113). Then, the change in the posterior variance of a repeated arm can be bounded by the following lemma.\n\nLemma 4 For any z \u2208 Z, arm j and n \u2208 N, n \u2265 1, let Z(n)\nj be the z value at the nth play of arm j. Then, \u03c3j^2(z; n \u2212 1) \u2212 \u03c3j^2(z; n) \u2264 \u03c3^{\u22122} \u03c3j^2(Z(n)\nj ; n \u2212 1).\n\nWe get the following regret bounds for dRGP-UCB and dRGP-TS. 
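For intuition about why information accumulates slowly at a repeatedly sampled point, the information gain (4) can be accumulated one observation at a time. The sketch below is an assumption-laden toy: all n noisy observations are taken at a single z, where the posterior variance has the conjugate-Gaussian closed-form recursion, so the sum telescopes:

```python
import math

def info_gain_single_point(n, prior_var=1.0, sigma=0.1):
    """Accumulate (1/2) sum_t log(1 + sigma^-2 * var_{t-1}) over n observations
    of f(z) at one fixed z, shrinking the posterior variance after each one."""
    gain, var = 0.0, prior_var
    for _ in range(n):
        gain += 0.5 * math.log(1.0 + var / sigma ** 2)
        var = 1.0 / (1.0 / var + 1.0 / sigma ** 2)  # conjugate Gaussian update
    return gain
```

The sum telescopes to (1/2) log(1 + n\u00b7prior_var/\u03c3\u00b2): the gain grows only logarithmically in n, so repeated plays of the same point yield diminishing information.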
Due to not updating the posterior between repeated plays of an arm, they increase by a factor of \u221ad compared to the single play case. Thus, although by Proposition 1 larger d leads to higher reward, it makes the learning problem harder.\n\nTheorem 5 The d-step multiple play lookahead regret of dRGP-UCB satisfies,\n\nE[R(d,m)\nT ] \u2264 O(\u221a(KT \u03b3T log((K|Z|)^d T))).\n\nTheorem 6 The d-step multiple play lookahead regret of dRGP-TS satisfies,\n\nE[R(d,m)\nT ] \u2264 O(\u221a(KT \u03b3T log((K|Z|)^d T))).\n\n5.3 Instantaneous Algorithm\n\nIf we set d = 1 in Algorithm 1, we obtain algorithms for minimizing the instantaneous regret. In this case, Td = {1, . . . , T} and there are K leaves of the 1-step lookahead tree, so each Mi(Zt) corresponds to one arm. One arm is selected and played each time step so \u03b7t(i) = \u00b5t(j), \u03c2t^2(i) = \u03c3t^2(j) for some j. For the UCB, we define \u03b1t as in (3) with d = 1. We get the following regret,\n\nCorollary 7 The instantaneous regret of 1RGP-UCB and 1RGP-TS up to horizon T satisfies\n\nE[R(1)\nT ] \u2264 O(\u221a(KT \u03b3T log(T K|Z|))).\n\nThe instantaneous regret of both algorithms is O\u2217(\u221a(KT \u03b3T)). Hence, we reduced the dependency on |Z| from \u221a|Z| to \u221a(log|Z|) compared to a naive application of UCB (see Appendix F). The single play lookahead regret is of the same order as this instantaneous regret. This shows that, in the single play case, since we still update the posterior after every play of an arm, we do not lose any information by looking ahead.\n\n6 Improving Computational Efficiency via Optimistic Planning\n\nFor large values of K and d, Algorithm 1 may not be computationally efficient since it searches K^d leaves. We can improve this by optimistic planning [13, 19]. 
This was developed by [13] for deterministic MDPs with discount factors and rewards in [0, 1]. We adapt this to undiscounted rewards in [minj,z \u02dcfj(z), maxj,z \u02dcfj(z)]. We focus on this in the multiple play Thompson sampling algorithm.\nAs in Algorithm 1, at time t, we sample \u02dcfj(z) from the posterior of fj at Z(d)\nj,t = (Zj,t, . . . , Zj,t + d \u2212 1, 0, . . . , d \u2212 1)^T for all arms j. Then, instead of searching the entire tree to find the leaf with largest total \u02dcfj(z), we use optimistic planning (OP) to iteratively build the tree. We start from an initial tree of one node, i0 = Zt. At step n of the OP procedure, let Tn be the expanded tree and let Sn be the nodes not in Tn but whose parents are in Tn. We select a node in Sn and move it to Tn, adding its children to Sn. If we select a node in of depth d, we stop and output node in. Otherwise we continue until n = N for a predefined budget N. Let dN be the maximal depth of nodes in TN. We output the node at depth dN with largest upper bound on the value of its continuation (i.e. with largest bN(i) in (5)).\nNodes are selected using upper bounds on the total value of a continuation of the path to the node. For node i \u2208 Sn \u222a Tn, let u(i) be the sum of the \u02dcfj(z)\u2019s on the path to i, and l(i) the depth of i. The value, v(i), of node i is the reward to i, u(i), plus the maximal reward of a path from i to depth d. We upper bound v(i) by,\n\nbn(i) = u(i) + \u03a8(z(i), d \u2212 l(i)) for i \u2208 Sn \u222a Tn, (5)\n\nwhere z(i) is the vector of zj\u2019s at node i, and the function \u03a8(z(i), d \u2212 l(i)) is an upper bound on the maximal reward from node i to a leaf. Let gj(z, l) = max{\u02dcfj(z), . . . , \u02dcfj(z + l), \u02dcfj(0), . . . , \u02dcfj(l)} be the maximal reward that can be gained from playing arm j in the next 1 \u2264 l \u2264 d steps from z \u2208 Z(d)\nj,t. Then, \u03a8(z(i), d \u2212 l(i)) = (d \u2212 l(i)) max_{1\u2264j\u2264K} gj(zj(i), d \u2212 l(i)), and \u03a8(z(i), 0) = 0 for any z(i).\nWe can often bound the error from this procedure. Let v\u2217 = max_{i\u2208Ld(Z)} v(i) be the value of the maximal node. A node i is \u03b5-optimal if v\u2217 \u2212 v(i) \u2264 \u03b5, and let pl(\u03b5) be the proportion of \u03b5-optimal nodes at depth l. Let \u2206 = maxj,z \u02dcfj(z) \u2212 min{minj,z \u02dcfj(z), 0} and define \u03a8\u2217(l) = max_{z\u2208Z} \u03a8(z, l) for any l = 0, 1, . . . , d. We get the following bound (whose proof is in Appendix E).\n\nProposition 8 For the optimistic planning procedure with budget N, if the procedure stops at step n < N because a node in of depth d is selected, then v\u2217 \u2212 v(in) = 0. Otherwise, if there exist \u03bb \u2208 (1/K, 1] and 1 \u2264 d0 \u2264 d such that \u2200l \u2265 d0, pl((d \u2212 l)\u2206) \u2264 \u03bb^l, then for N > n0 = (K^{d0+1} \u2212 1)/(K \u2212 1),\n\nv\u2217 \u2212 v(iN) \u2264 ( d \u2212 log(N \u2212 n0)/log(\u03bbK) \u2212 log(\u03bbK \u2212 1)/log(\u03bbK) + 1 ) \u2206. (6)\n\nHence, if we stop the procedure at n < N, the node in of depth d we return will be optimal. In many cases, especially when the proportion of \u03b5-optimal nodes, \u03bb, is small, this will occur. Otherwise, the error will depend on \u03bb, and the budget, N. By (6), for N \u2248 (\u03bbK)^d, the error will be near zero.\n\n7 Experimental Results\n\nWe tested our algorithms in experiments with zmax = 30, noise standard deviation \u03c3 = 0.1, and horizon T = 1000. We used GPy [11] to fit the GPs. We first aimed to check that our algorithms were playing arms at good z values (i.e. play arm j when fj(z) is high). 
We used K = 10 and sampled the\nrecovery functions from a squared exponential kernel and ran the algorithms once. Figure 2 shows\nthat, for lengthscale l = 2, 1RGP-UCB and 3RGP-UCB accurately estimate the recovery functions\nand learn to play each arm in the regions of Z where the reward is high. Although, as expected,\n3RGP-UCB has more samples on the peaks, it is reassuring that the instantaneous algorithm also\nselects good z\u2019s. The same is true for dRGP-TS and different values of d and l (see Appendix G.1).\nNext, we tested the performance of using optimistic planning (OP) in dRGP-TS. We averaged all\nresults over 100 replications and used a squared exponential kernel with l = 4. In the \ufb01rst case,\nK = 10 and d = 4, so direct tree search may have been possible. Figure 3a shows that, when N,\nthe budget of policies the OP procedure can evaluate per lookahead, increases above 500, the total\n\n7\n\n\f(a) Instantaneous: d = 1\n\n(b) Lookahead: d = 3\n\nFigure 2: The posterior mean (blue) of RGP-UCB with density shaded in blue for a squared\nexponential kernel (l = 2). The true recovery curve is in red and the crosses are the observed samples.\n\n(a) d = 4, K = 10\n\n(b) d = 4, K = 30\n\n(c) d = 8, K = 10\n\nFigure 3: The total reward and \ufb01nal depth of the lookahead tree, dN , as the budget, N, increases.\n\nreward plateaus and the average depth of the returned policy is 4. By Proposition 8, this means that\nwe found an optimal leaf of the lookahead tree while evaluating signi\ufb01cantly fewer policies. We\nthen increased the number of arms to K = 30. Here, searching the whole lookahead tree would be\ncomputationally challenging. Figure 3b shows that we found the optimal policy after about 5,000\npolicies (since here dN = d), which is less than 0.1% of the total number of policies. In Figure 3c,\nwe increased the depth of the lookahead to d = 8. In this case, we had to search more policies to \ufb01nd\noptimal leaves. 
Even with the deeper lookahead, the search remained below 0.1% of the total number of policies. From Figure 3c, we also see that when dN < d, increasing N leads to higher total reward.

Lastly, we compared our algorithms to RogueUCB-Tuned [18] and UCB-Z, the basic UCB algorithm with K|Z| arms (see Appendix F), in two parametric settings. Details of the implementation of RogueUCB-Tuned are given in Appendix G.2. As in [18], we only considered d = 1. We used squared exponential kernels in 1RGP-UCB and 1RGP-TS with lengthscale l = 5 (results for other lengthscales are in Appendix G.3). The recovery functions were logistic, f(z) = θ0(1 + exp{−θ1(z − θ2)})^(−1), which increases in z, and modified gamma, f(z) = θ0 C exp{−θ1 z} z^θ2 (with normalizer C), which increases until a point and then decreases. The values of θ were sampled uniformly and are given in Appendix G.3. We averaged the results over 500 replications. The cumulative regret (and confidence regions) is shown in Figure 4 and the cumulative reward (and confidence bounds) in Table 1. Our algorithms achieve lower regret and higher reward than RogueUCB-Tuned. UCB-Z performs poorly since the time required to play each (arm, z) pair during initialization is large.

7.1 Practical Considerations

There are several issues to consider when applying our algorithms in a practical recommendation scenario. The first is the choice of zmax. Throughout, we assumed that this is known and constant across arms. Our work can be easily extended to the case where there is a different value, zmax,j, for each arm j, by defining zmax = maxj zmax,j and extending fj to zmax by setting fj(z) = fj(zmax,j) for all z = zmax,j + 1, . . . , zmax. A similar approach can be used if we only know an upper bound on zmax. Additionally, in practice, the recovery curves may not be sampled from Gaussian processes, or the kernels may not be known.
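For example, the two parametric families used in the comparison above are not GP samples. A short sketch of these curves follows; the θ values in the usage lines are illustrative rather than those sampled in Appendix G.3, and C is taken here to normalize the peak over z = 0, . . . , zmax to one, which is one plausible reading of the paper's normalizer.

```python
import numpy as np

def logistic_recovery(z, theta0, theta1, theta2):
    """f(z) = theta0 * (1 + exp(-theta1 (z - theta2)))^(-1): increasing in z."""
    return theta0 / (1.0 + np.exp(-theta1 * (np.asarray(z, dtype=float) - theta2)))

def gamma_recovery(z, theta0, theta1, theta2, zmax=30):
    """f(z) = theta0 * C * exp(-theta1 z) * z**theta2: rises then decays.

    C normalizes the unscaled peak over z = 0..zmax to one (an assumption).
    """
    grid = np.arange(zmax + 1, dtype=float)
    C = 1.0 / np.max(np.exp(-theta1 * grid) * grid ** theta2)
    z = np.asarray(z, dtype=float)
    return theta0 * C * np.exp(-theta1 * z) * z ** theta2

z = np.arange(31)
logi = logistic_recovery(z, 1.0, 0.5, 10.0)   # increasing, half its maximum at z = 10
gam = gamma_recovery(z, 1.0, 0.3, 3.0)        # peaks at z = theta2 / theta1 = 10
```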
As demonstrated experimentally, our algorithms can still perform well in this case. Indeed, kernels can be chosen to (approximately) represent a wide variety of recovery curves, ranging from uncorrelated rewards to constant functions. In practice, we can also use adaptive algorithms for selecting a kernel function out of a large class of kernel functions (see e.g. Chapter 5 of [24] for details).

Table 1: Total reward at T = 1000 for single-step experiments with parametric functions

Setting    1RGP-UCB (l = 5)        1RGP-TS (l = 5)         RogueUCB-Tuned          UCB-Z
Logistic   461.7 (454.3, 468.9)    462.6 (455.7, 469.3)    446.2 (438.2, 453.5)    242.6 (229.6, 256.0)
Gamma      145.6 (139.6, 151.7)    156.5 (149.6, 163.0)    132.7 (111.0, 144.5)    116.8 (108.4, 125.5)

(a) Logistic setup    (b) Gamma setup
Figure 4: Cumulative instantaneous regret with parametric recovery functions.

8 Conclusion

In this work, we used Gaussian processes to model the recovering bandits problem and incorporated this into UCB and Thompson sampling algorithms. These algorithms use properties of GPs to look ahead and find good sequences of arms. They achieve d-step lookahead Bayesian regret of O∗(√(KdT)) for linear and squared exponential kernels, and perform well experimentally. We also improved the computational efficiency of the algorithms using optimistic planning.
Future work includes considering the frequentist setting, analyzing online methods for choosing zmax and the kernel, and investigating the use of GPs in other non-stationary bandit settings.

Acknowledgments

CPB would like to thank the EPSRC funded EP/L015692/1 STOR-i centre for doctoral training and Sparx. We would also like to thank the reviewers for their helpful comments.

References

[1] M. Abeille and A. Lazaric. Linear Thompson sampling revisited. In International Conference on Artificial Intelligence and Statistics, 2017.

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[3] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.

[4] O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207, 2014.

[5] I. Bogunovic, J. Scarlett, and V. Cevher. Time-varying Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics, pages 314–323, 2016.

[6] D. Bouneffouf and R. Feraud. Multi-armed bandit problem with known trend. Neurocomputing, 205:16–21, 2016.

[7] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, G. Stoltz, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

[8] C. Cortes, G. DeSalvo, V. Kuznetsov, M. Mohri, and S. Yang. Discrepancy-based algorithms for non-stationary rested bandits.
arXiv preprint arXiv:1710.10657, 2017.

[9] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.

[10] A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188, 2011.

[11] GPy. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, 2012–.

[12] H. Heidari, M. Kearns, and A. Roth. Tight policy regret bounds for improving and decaying bandits. In International Joint Conference on Artificial Intelligence, pages 1562–1570, 2016.

[13] J.-F. Hren and R. Munos. Optimistic planning of deterministic systems. In European Workshop on Reinforcement Learning, pages 151–164. Springer, 2008.

[14] N. Immorlica and R. D. Kleinberg. Recharging bandits. In IEEE 59th Annual Symposium on Foundations of Computer Science, 2018.

[15] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

[16] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.

[17] N. Levine, K. Crammer, and S. Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, pages 3077–3086, 2017.

[18] Y. Mintz, A. Aswani, P. Kaminsky, E. Flowers, and Y. Fukuoka. Non-stationary bandits with habituation and recovery dynamics. arXiv preprint arXiv:1707.08423, 2017.

[19] R. Munos et al. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–129, 2014.

[20] I. Murray.
Gaussian processes and kernels. https://www.inf.ed.ac.uk/teaching/courses/mlpr/2016/notes/w7c_gaussian_process_kernels.pdf, 2016.

[21] I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

[22] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.

[23] V. Raj and S. Kalyani. Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727, 2017.

[24] C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[25] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

[26] J. Seznec, A. Locatelli, A. Carpentier, A. Lazaric, and M. Valko. Rotting bandits are no harder than stochastic ones. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2564–2572, 2019.

[27] A. Slivkins and E. Upfal. Adapting to a changing environment: The Brownian restless bandits. In COLT, pages 343–354, 2008.

[28] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.

[29] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[30] P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.

[31] J. Yi, C.-J. Hsieh, K. R. Varshney, L. Zhang, and Y. Li. Scalable demand-aware recommendation.
In Advances in Neural Information Processing Systems, pages 2409–2418, 2017.