{"title": "Learning in Generalized Linear Contextual Bandits with Stochastic Delays", "book": "Advances in Neural Information Processing Systems", "page_first": 5197, "page_last": 5208, "abstract": "In this paper, we consider online learning in generalized linear contextual bandits where rewards are not immediately observed. Instead, rewards are available to the decision maker only after some delay, which is unknown and stochastic, even though a decision must be made at each time step for an incoming set of contexts. We study the performance of upper confidence bound (UCB) based algorithms adapted to this delayed setting. In particular, we design a delay-adaptive algorithm, which we call Delayed UCB, for generalized linear contextual bandits using UCB-style exploration and establish regret bounds under various delay assumptions. In the important special case of linear contextual bandits, we further modify this algorithm and establish a tighter regret bound under the same delay assumptions. Our results contribute to the broad landscape of contextual bandits literature by establishing that UCB algorithms, which are widely deployed in modern recommendation engines, can be made robust to delays.", "full_text": "Learning in Generalized Linear Contextual Bandits with Stochastic Delays

Zhengyuan Zhou1,2*, Renyuan Xu3* and Jose Blanchet4
1 Department of Electrical Engineering, Stanford University
2 Bytedance Inc.
3 Department of Industrial Engineering and Operations Research, UC Berkeley
4 Department of Management Science and Engineering, Stanford University

Abstract

In this paper, we consider online learning in generalized linear contextual bandits where rewards are not immediately observed. Instead, rewards are available to the decision maker only after some delay, which is unknown and stochastic, even though a decision must be made at each time step for an incoming set of contexts.
We study the performance of upper confidence bound (UCB) based algorithms adapted to this delayed setting. In particular, we design a delay-adaptive algorithm, which we call Delayed UCB, for generalized linear contextual bandits using UCB-style exploration and establish regret bounds under various delay assumptions. In the important special case of linear contextual bandits, we further modify this algorithm and establish a tighter regret bound under the same delay assumptions. Our results contribute to the broad landscape of contextual bandits literature by establishing that UCB algorithms, which are widely deployed in modern recommendation engines, can be made robust to delays.

1 Introduction

The growing availability of user-specific data has welcomed the exciting era of personalized recommendation, a paradigm that uncovers the heterogeneity across individuals and provides tailored service decisions that lead to improved outcomes. Such heterogeneity is ubiquitous across a variety of application domains (including online advertising, medical treatment assignment, product/news recommendation (Li et al. (2010), Bubeck et al. (2012), Chapelle (2014), Bastani and Bayati (2015), Schwartz et al. (2017))) and manifests itself as different individuals responding differently to the recommended items. Rising to this opportunity, contextual bandits have emerged as the predominant mathematical formalism that provides an elegant and powerful formulation: their three core components, the features (representing individual characteristics), the actions (representing the recommendation), and the rewards (representing the observed feedback), capture the salient aspects of the problem and provide fertile ground for developing algorithms that balance exploring and exploiting users' heterogeneity.
As such, the last decade has witnessed extensive research efforts in developing effective and efficient contextual bandits algorithms.
In particular, two types of algorithms, upper confidence bound (UCB) based algorithms (Li et al. (2010); Filippi et al. (2010); Chu et al. (2011); Jun et al. (2017); Li et al. (2017)) and Thompson sampling (TS) based algorithms (Agrawal and Goyal (2013a,b); Russo and Van Roy (2014, 2016); Abeille et al. (2017)), stand out from this flourishing and fruitful line of work: their theoretical guarantees have been analyzed in many settings, often yielding (near-)optimal regret bounds; their empirical performance has been thoroughly validated, often providing insights into

*These two authors contributed equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

their practical efficacy (including the consensus understanding that TS-based algorithms often suffer from intensive computation for posterior updates but can leverage a correctly specified prior and have superior empirical performance, while UCB-based algorithms can often achieve tight theoretical regret bounds but are often sensitive to hyper-parameter tuning in empirical performance). To a large extent, these two families of algorithms have been widely deployed in many modern recommendation engines.
However, a key assumption therein, in both the algorithm design and the analyses, is that the reward is immediately available after an action is taken. Although useful as a first-step abstraction, this is a stringent requirement that is rarely satisfied in practice, particularly in large-scale systems where the time-scale of a single recommendation is significantly smaller than the time-scale of a user's feedback. For instance, in E-commerce, a recommendation is typically made by the engine in milliseconds, whereas a user's response time (e.g., to buy a product or convert) is typically much larger, ranging from hours to days, sometimes even to weeks.
Similarly, in clinical trials, it is infeasible to immediately observe, and hence take into account, the medical outcome after applying a treatment to a patient: collecting medical feedback can be a time-consuming and often random process, and in general it is common to have applied trial treatments to a large number of patients, with individual medical outcomes only available much later at different, random points in time. In both the E-commerce (Kannan et al. (2001); Chapelle (2014); Vernade et al. (2017)) and the clinical trials cases (Chow and Chang (2011)), a random and often significantly delayed reward is present, thereby requiring adjustments in classical formulations to understand the impact of delays.

1.1 Related Work

The problem of learning on bandits with delays has recently been studied in different settings in the existing literature, where most of the efforts have concentrated on the multi-armed bandits setting, including both the stochastic and the adversarial multi-armed bandits. For stochastic multi-armed bandits with delays, Joulani et al. (2013) show a regret bound O(log T + E[τ] + √(log T · E[τ])), where E[τ] is the mean of the iid delays. Desautels et al. (2014) consider Gaussian process bandits with a bounded stochastic delay. Mandel et al. (2015) follow the work of Joulani et al. (2013) and propose a queue-based multi-armed bandit algorithm to handle delays. Pike-Burke et al. (2017) match the same regret bound as in Joulani et al. (2013) when feedback is not only delayed but also anonymous.
For adversarial multi-armed bandits with delays, Neu et al. (2010) establish the regret bound E[R_T] ≤ O(τ_const) × E[R'_T(T/τ_const)] for Markov decision processes, where τ_const is the constant delay and R'_T is the regret without delays. Cesa-Bianchi et al.
(2019) consider adversarial bandits with fixed constant delays on the network graph, with a minimax regret of the order Õ(√((K + τ_const)T)), where K is the number of arms. Another related line of work is adversarial learning with full information (all arms' rewards are observed), where different variants of the delayed setting have been studied by Weinberger and Ordentlich (2002), Mesterharm (2005), Quanrud and Khashabi (2015) and Garrabrant et al. (2016). Very recently, Bistritz et al. (2019) studied adversarial bandits learning under arbitrary delays using Exp3 and established finite-sample delay-adaptive regret bounds.
On the other hand, learning in contextual bandits with delays is much less explored. Joulani et al. (2013) consider learning on adversarial contextual bandits with delays and establish an expected regret bound E[R_T] ≤ (1 + E[M*_T]) × E[R'_T(T/(1 + E[M*_T]))] by using a black-box algorithm, where M*_T is the running maximum number of delays up to round T. Dudik et al. (2011) consider stochastic contextual bandits with a fixed constant delay. The reward model they consider is general (i.e. not necessarily parametric); however, they require the policy class to be finite. In particular, they obtain the regret bound O(√(K log N)(τ_const + √T)), where N is the number of policies and τ_const is again the fixed constant delay. On a related front, Grover et al. (2018b) studied the problem of best-arm identification under delayed feedback. There, the objective is to identify the best arm using as few samples as possible, without taking into account the cost incurred along the way (i.e. a different objective from regret). In closing, we also mention that there is a growing literature on offline contextual bandits learning (Swaminathan and Joachims (2015); Kitagawa and Tetenov (2018); Zhou et al. (2018a)).
In this domain, delay is typically not a concern as all the data has already been collected in a single batch before any learning/decision-making takes place.

1.2 Our Contributions

In this paper, we consider learning on generalized linear (stochastic) contextual bandits with stochastic delays. More specifically, we design a delay-adaptive algorithm for generalized linear contextual bandits using UCB-style exploration, which we call Delayed UCB (DUCB, as given in Algorithm 1). DUCB requires a carefully designed delay-adaptive confidence parameter, which depends on how many rewards are missing up to the current time step. Next, we give regret characterizations of DUCB under independent stochastic, unbounded delays. In particular, as a special case of our results, when the delays are iid with mean µ_D, we establish a high-probability regret bound of Õ(√(µ_D dT) + √(σ_G dT) + d√T) on DUCB, where σ_G is a parameter characterizing the tail bound of the delays and d is the feature/context dimension. For comparison, the state-of-the-art regret bound of UCB on generalized linear contextual bandits without delays is Õ(d√T) (Filippi et al. (2010); Li et al. (2017)). Regret bounds for more general delays are also given. Note that our analysis here does not assume the number of actions to be finite, and hence these regret bounds apply to the infinite-action setting as well.
Finally, we consider the important special case of linear contextual bandits with finitely many actions. In this setting, we provide a different UCB-based algorithm that estimates the underlying parameters using a biased estimator (as opposed to the unbiased estimator employed in the generalized linear contextual bandits setting) and provide a more refined analysis that achieves regret bounds which are a factor of O(√d) tighter.
More specifically in this setting, as a direct comparison, when the delays are again iid with mean µ_D, we establish a high-probability regret bound² of Õ((1 + µ_D + σ_G)√(dT)).
To the best of our knowledge, these regret bounds provide the first theoretical characterizations of learning in (generalized) linear contextual bandits with large delays and contribute to the broad landscape of contextual bandits literature by delineating the impact of delays on performance.

2 Problem Setup

In this section, we describe the formulation for learning in generalized linear contextual bandits (GLCB) in the presence of delays. We start by reviewing the basics of generalized linear contextual bandits, followed by a description of the delay model. Before proceeding, we first fix some notation. For a vector x ∈ R^d, we use ‖x‖ to denote its l2-norm and x′ its transpose. B^d := {x ∈ R^d : ‖x‖ ≤ 1} is the unit ball centered at the origin. The weighted l2-norm associated with a positive-definite matrix A is defined by ‖x‖_A := √(x′Ax). The minimum and maximum singular values of a matrix A are written as λ_min(A) and ‖A‖ respectively. For two symmetric matrices A and B of the same dimensions, A ⪰ B means that A − B is positive semi-definite. For a real-valued function f, we use ḟ and f̈ to denote its first and second derivatives. Finally, [n] := {1, 2, ..., n}.

2.1 Generalized Linear Contextual Bandits

Decision procedure. We consider the generalized linear contextual bandits problem with K arms. At each round t, the agent observes a context consisting of a set of K feature vectors x_t := {x_{t,a} ∈ R^d | a ∈ [K]}, which is drawn iid from an unknown distribution with ‖x_{t,a}‖ ≤ 1. Each feature vector x_{t,a} is associated with an unknown stochastic reward y_{t,a} ∈ [0, 1].
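The decision procedure above is easy to simulate. The following is a minimal sketch (all helper names are ours, not from the paper): contexts are drawn and scaled into the unit ball, an arm is chosen (a uniformly random policy stands in for the learner), and a Bernoulli draw serves as one concrete instance of a stochastic reward in [0, 1].

```python
import numpy as np

def sample_context(K, d, rng):
    # Draw K feature vectors and scale them into the unit ball (||x_{t,a}|| <= 1).
    x = rng.normal(size=(K, d))
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1.0)

def play_round(x, mean_reward, rng):
    # Pick an arm uniformly at random (a stand-in for the learner's policy)
    # and draw a Bernoulli reward whose mean lies in [0, 1].
    K = x.shape[0]
    a = int(rng.integers(K))
    p = float(np.clip(mean_reward(x[a]), 0.0, 1.0))
    y = float(rng.random() < p)
    return a, y
```

With mean_reward(x) = g(x′θ*) for an inverse link function g, this reproduces the reward model of Section 2.1; the delayed revelation of y is what Section 2.2 then models.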
If the agent selects one action a_t, there is a reward y_{t,a_t} ∈ [0, 1] associated with the selected arm a_t and the associated x_{t,a_t}. Under the classic setting, the reward is immediately observed after the decision, and the information can be utilized to make decisions in the next round.

Relationship between reward Y and context X. In terms of the relationship between y_{t,a_t} and x_{t,a_t} (t ≥ 1), we follow the standard generalized linear contextual bandits literature (Filippi et al. (2010); Li et al. (2017)). Define H^0_t = {(s, x_s, a_s, y_{s,a_s}), s ≤ t − 1} ∪ {x_t} as the information at the beginning of round t. The agent maximizes the cumulative expected rewards over T rounds with information H^0_t at each round t (t ≥ 1). Suppose the agent takes action a_t at round t. Denote X_t = x_{t,a_t} and Y_t = y_{t,a_t}, and we assume the conditional distribution of Y_t given X_t is from the exponential family. Therefore its density is given by

P_{θ*}(Y_t | X_t) = exp( (Y_t X′_t θ* − m(X′_t θ*)) / h(η) + A(Y_t, η) ).   (1)

Here, θ* is an unknown parameter under the frequentist setting; η ∈ R+ is a given parameter; A, m and h are three normalization functions mapping from R to R.
For exponential families, m is infinitely differentiable, ṁ(X′θ*) = E[Y|X], and m̈(X′θ*) = V(Y|X). Denote g(X′θ*) = E[Y|X]; one can easily verify that g(x′θ) = x′θ for the linear model, g(x′θ) = exp(x′θ)/(1 + exp(x′θ)) for the logistic model, and g(x′θ) = exp(x′θ) for the Poisson model.

²In this case, the number of actions being finite is important. In particular, the regret bound has a O(log K) dependence. Consequently, strictly speaking, if K is not viewed as a constant, we would also need K to not be too large compared to d in order to retain the same regret bound of Õ((µ_D + σ_G + 1)√(dT)). A common (and rather weak) assumption is that K is polynomial in d.
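The three inverse link functions just listed can be written down directly. A small sketch for concreteness (the function names are ours); each of these maps is nondecreasing, consistent with the requirement ġ > 0 imposed in Assumption 1:

```python
import math

def g_linear(z):
    # Linear model: g(z) = z.
    return z

def g_logistic(z):
    # Logistic model: g(z) = exp(z) / (1 + exp(z)).
    return math.exp(z) / (1.0 + math.exp(z))

def g_poisson(z):
    # Poisson model: g(z) = exp(z).
    return math.exp(z)
```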
In the generalized linear model (GLM) literature (Nelder and Wedderburn (1972); McCullagh (2018)), g is often referred to as the inverse link function. Note that (1) can be rewritten in the GLCB form,

Y_t = g(X′_t θ*) + ε_t,   (2)

where {ε_t, t ∈ [T]} are independent zero-mean noise terms, H^0_t-measurable with E[ε_t | H^0_t] = 0. Data generated from (1) automatically satisfies the sub-Gaussian condition:

E[exp(λ ε_t) | H^0_t] ≤ exp(λ²σ²/2) for all λ ∈ R.   (3)

Throughout the paper, we denote by σ > 0 the sub-Gaussian parameter of the noise ε_t.
Remark 1. In this paper, we focus on the GLM with exponential family (1). In general, one can work with model (2) under the sub-Gaussian assumption (3). Our analysis will still hold by considering the maximum quasi-likelihood estimator for (2). See more explanations in the appendix.

2.2 The Delay Model

Unlike the traditional setting where each reward is immediately observed, here we consider the case where stochastic and unbounded delays are present in revealing the rewards. Let T be the number of total rounds. At round t, after the agent takes action a_t, the reward y_{t,a_t} may not be available immediately. Instead, it will be observed at the end of round t + D_t, where D_t is the delay at time t. We assume D_t is a non-negative random variable which is independent of {D_s}_{s≤t−1} and {x_s, y_{s,a_s}, a_s}_{s≤t}. First, we define the available information for the agent at each round.
Information structure under delays. At any round t, if D_s + s ≤ t − 1 (the reward generated in round s is available at the beginning of round t), then we call (s, x_s, y_{s,a_s}, a_s) the complete information tuple at round t. If D_s + s ≥ t, we call (s, x_s, a_s) the incomplete information tuple at the beginning of round t.
Define

H_t = {(s, x_s, y_{s,a_s}, a_s) | s + D_s ≤ t − 1} ∪ {(s, x_s, a_s) | s ≤ t − 1, s + D_s ≥ t} ∪ {x_t};

then H_t is the information (filtration) available at the beginning of round t for the agent to choose action a_t. In other words, H_t contains all the incomplete and complete information tuples up to round t − 1 and the context vector x_t at round t.
Moreover, define

F_t = {(s, x_s, a_s, y_{s,a_s}) | s + D_s ≤ t}.   (4)

Then F_t contains all the complete information tuples (s, x_s, a_s, y_{s,a_s}) up to the end of round t. Denote I_t = F_t \ F_{t−1}; I_t is the new complete information revealed at the end of round t.
Performance criterion. Under the frequentist setting, assume there exists an unknown true parameter θ* ∈ R^d. The agent's strategy can be evaluated by comparing her rewards to the best reward. To do so, define the optimal action at round t by a*_t = argmax_{a∈[K]} g(x′_{t,a} θ*). Then, the agent's total regret of following strategy π can be expressed as follows:

R_T(π) := Σ_{t=1}^T ( g(x′_{t,a*_t} θ*) − g(x′_{t,a_t} θ*) ),

where a_t ∼ π_t and the policy π_t maps H_t to the probability simplex ∆_K := {(p_1, ..., p_K) | Σ_{i=1}^K p_i = 1, p_i ≥ 0}. Note that R_T(π) is in general a random variable due to the possible randomness in π.

Assumptions. Throughout the paper, we make the following assumption on the context distribution and the function g, which is standard in the generalized linear bandit literature (Filippi et al. (2010); Li et al. (2017); Jun et al. (2017)).
Assumption 1 (GLCB).
• λ_min(E[(1/K) Σ_{a∈[K]} x_{t,a} x′_{t,a}]) ≥ σ_0² for all t ∈ [T].
• κ := inf_{‖x‖≤1, ‖θ−θ*‖≤1} ġ(x′θ) > 0.
• g is twice differentiable.
ġ and g̈ are upper bounded by L_g and M_g, respectively.

In addition, we assume the delay sequence {D_t}_{t=1}^T satisfies the following assumption.
Assumption 2 (Delay). Assume {D_t}_{t=1}^T are independent non-negative random variables with tail-envelope distribution (ξ_D, µ_D, M_D). That is, there exists a constant M_D > 0 and a distribution ξ_D with mean µ_D < ∞ such that for any m ≥ M_D and t ∈ [T],

P(D_t ≥ m) ≤ P(D ≥ m),

where D ∼ ξ_D and E[D] = µ_D. Furthermore, assume there exists q > 0 such that

P(D − µ_D ≥ x) ≤ exp( −x^{1+q} / (2σ_D²) ).

Note that when q = 1, D is sub-Gaussian with parameter σ_D. When q ∈ (0, 1), D has a near-heavy-tailed distribution. When the D_i are iid, the following condition guarantees Assumption 2:

P(D_i − E[D_i] ≥ x) ≤ exp( −x^{1+q} / (2σ̃_D²) ),

for some σ̃_D > 0 and q > 0.
For ease of reference (as there are many floating parameters in this paper), we summarize all the parameter definitions in Table 1.

K: number of arms
d: feature dimension
κ: inf_{‖x‖≤1, ‖θ−θ*‖≤1} ġ(x′θ)
θ*: unknown true parameter
σ: sub-Gaussian parameter for ε_t
L_g: upper bound on ġ
M_g: upper bound on g̈
σ_0²: lower bound on λ_min(E[(1/K) Σ_{a∈[K]} x_{t,a} x′_{t,a}])
ξ_D: tail-envelope distribution for the delays
q: parameter of ξ_D
µ_D: expectation of ξ_D
M_D: parameter of ξ_D
σ_D: parameter of ξ_D
σ_G: sub-Gaussian parameter of G_t
µ′_D: expectation of iid delays
D_max: upper bound on bounded delays

Table 1: Parameters in the GLCB model with delays.

3 Delayed Upper Confidence Bound (DUCB) for GLCB

In this section, we propose a UCB type of algorithm for GLCB, adapting the delay information in an online fashion.
Let us first define some variables and state the main algorithm.

3.1 Algorithm: DUCB-GLCB

Denote G_t = Σ_{s=1}^{t−1} 1{s + D_s ≥ t} as the number of missing rewards when the agent is making a prediction at round t. Denote T_t = {s : s ≤ t − 1, D_s + s ≤ t − 1} as the set containing timestamps with complete information tuples at the beginning of round t. Further denote W_t = Σ_{s∈T_t} X_s X′_s as the matrix consisting of feature information with timestamps in T_t, and V_t = Σ_{s=1}^{t−1} X_s X′_s as the matrix consisting of all available features at the end of round t − 1. The main algorithm is given below.

Algorithm 1 DUCB-GLCB
1: Input: the total rounds T, model parameters d and κ, and tuning parameters τ and δ.
2: Initialization: randomly choose a_t ∈ [K] for t ∈ [τ], set V_{τ+1} = Σ_{s=1}^{τ} X_s X′_s, T_{τ+1} := {s : s ≤ τ, s + D_s ≤ τ}, G_{τ+1} = τ − |T_{τ+1}| and W_{τ+1} = Σ_{s∈T_{τ+1}} X_s X′_s
3: for t = τ + 1, τ + 2, ..., T do
4:   Update Statistics: calculate the MLE θ̂_t by solving Σ_{s∈T_t} (Y_s − g(X′_s θ)) X_s = 0
5:   Update Parameter: β_t = (σ/κ) √( (d/2) log(1 + 2(t − G_t)/d) + log(1/δ) ) + √G_t
6:   Select Action: choose a_t = argmax_{a∈[K]} ( x′_{t,a} θ̂_t + β_t ‖x_{t,a}‖_{V_t^{−1}} )
7:   Update Observations: X_t ← x_{t,a_t}, V_{t+1} ← V_t + X_t X′_t, T_{t+1} ← T_t ∪ {s : s + D_s = t}, G_{t+1} = t − |T_{t+1}|, and W_{t+1} ← W_t + Σ_{s : s+D_s=t} X_s X′_s
8: end for

Remark 2. In step 4, we use maximum likelihood estimators (MLEs) for the parameter estimation step at each round t. For more details on the derivation and explanation, we refer to the appendix.
Remark 3 (Comparison to UCB-GLM Algorithm in Li et al. (2017)). We make several adjustments to the UCB-GLM algorithm in Li et al. (2017). First, in step 4 (statistics update), we only use data with timestamps in T_t to calculate the estimator using MLE.
In this step, using data without rewards would cause bias in the estimation. Second, the parameter β_t used when selecting the action is updated adaptively at each round, whereas in Li et al. (2017) the corresponding parameter is constant over time. Moreover, in the action-selection step, we choose to use V_t to normalize the context vector x_{t,a} instead of W_t.

3.2 Preliminary Analysis

Denote G*_t = max_{1≤s≤t} G_s as the running maximum number of missing rewards up to round t. The properties of G_t and G*_t are the key to analyzing the regret bound for the UCB algorithm. We next characterize the tail behavior of G_t and G*_t.
Proposition 1 (Properties of G_t and G*_t). Assume Assumption 2. Denote σ_G := √( I/4 + 2σ_D²/(1+q) ), where I is an explicit constant depending only on σ_D and q (its exact form is given in the appendix). Then:
1. G_t is sub-Gaussian. Moreover, for all t ≥ 1,

P( G_t ≥ 2(µ_D + M_D) + x ) ≤ exp( −x² / (2σ_G²) ).   (5)

2. With probability 1 − δ,

G*_T ≤ 2(µ_D + M_D) + σ_G √(2 log T) + σ_G √(2 log(1/δ)),   (6)

where G*_T = max_{1≤s≤T} G_s.
3. Define W_t = Σ_{s∈T_t} X_s X′_s, where the X_t are drawn iid from some distribution with support in the unit ball B^d. Furthermore, let Σ := E[X_t X′_t] be the second moment matrix, and let B and δ > 0 be two positive constants. Then there exist positive, universal constants C_1 and C_2 such that λ_min(W_t) ≥ B with probability at least 1 − 2δ, as long as

t ≥ ( (C_1 √d + C_2 √(log(1/δ))) / λ_min(Σ) )² + 2B/λ_min(Σ) + 2(µ_D + M_D) + σ_G √(2 log(1/δ)).   (7)

The proof of Proposition 1 is deferred to the appendix. Note that G_t is sub-Gaussian even when D has a near-heavy-tailed distribution, i.e., when q ∈ (0, 1).

3.3 Regret Bounds

Theorem 2. Assume Assumptions 1-2, and fix any δ > 0.
There exists a universal constant C := C(C_1, C_2, M_D, µ_D, σ_0, σ_G, σ, κ) > 0 such that, if we run DUCB-GLCB with τ := C(d + log(1/δ)) and β_t = (σ/κ) √( (d/2) log(1 + 2(t − G_t)/d) + log(1/δ) ) + √G_t, then, with probability at least 1 − 5δ, the regret of the algorithm is upper bounded by

R_T ≤ τ + L_g [ 4√(µ_D + M_D) √(Td log(T/d)) + 2^{7/4} √σ_G (log(1/δ))^{1/4} √(Td log(T/d)) + 2^{7/4} √σ_G (log T)^{1/4} √(Td log(T/d)) + (2d/κ) log(T/d) √T ].   (8)

For the parameter definitions, we refer to Table 1. The proof of Theorem 2 is deferred to the appendix.
Corollary 3 (Expected regret). Assume Assumptions 1-2. The expected regret is bounded by

E[R_T] = O( d√T log T + √(µ_D + M_D) √(Td) log T + √σ_G √(Td) (log T)^{3/4} ).   (9)

Given the result in (8), (9) holds by choosing δ = 1/T and using the fact that R_T ≤ T. The highest-order term O(d√T log T) does not depend on the delays. Delay impacts the expected regret bound in two ways. First, the sub-Gaussian parameter σ_G appears in the second-highest-order term. Second, the mean-related parameter µ_D + M_D appears in the third-order term. Note that here we include the log factors in deciding the highest-order term, the second-highest-order term and so on. If we exclude the log terms, then both delay parameters impact the regret bound multiplicatively.

3.4 Tighter Regret Bounds for Special Cases

When the sequence {D_s}_{s=1}^T satisfies some specific assumptions, we are able to provide tighter high-probability bounds on the regret.
Proposition 4. Under Assumption 1, we have:
1. Suppose there exists a constant D_max > 0 such that P(D_s ≤ D_max) = 1 for all s ∈ [T], and fix δ > 0. There exists a universal constant C > 0 such that by taking τ = D_max + C(d + log(1/δ)), with probability 1 − 3δ, the regret of the algorithm is upper bounded by

R_T ≤ τ + L_g ( 2√D_max √(2Td log(T/d)) + (2d/κ) log(T/d) √T ).   (10)

Therefore, E[R_T] = O( √(D_max dT log T) + d√T log T ).
2. Assume D_1, ..., D_T are iid non-negative random variables with mean µ′_D that satisfy Assumption 2. There exists C > 0 such that by taking τ := C(d + log(1/δ)), with probability 1 − 5δ, the regret of the algorithm is upper bounded by

R_T ≤ τ + L_g [ 4√(µ′_D) √(Td log(T/d)) + 2^{7/4} √σ_G (log(1/δ))^{1/4} √(Td log(T/d)) + 2^{7/4} √σ_G (log T)^{1/4} √(Td log(T/d)) + (2d/κ) log(T/d) √T ].

Therefore, E[R_T] = O( (√(µ′_D) + √σ_G (log T)^{3/4}) √(Td) + d log(T) √T ).

4 Tighter Regret Bounds on Linear Contextual Bandits with Finite Actions

We now consider the important special case of linear contextual bandits, and tighten the O(d) dependence from the previous bounds to O(√d). This requires two new elements that we incorporate into DUCB-GLCB in Algorithm 1. First, instead of the MLE, which is unbiased, here we use a biased estimator that incorporates all the contexts (including those contexts for which the rewards have not been received). In the linear contextual bandits setting, one can obtain analytical formulas for the estimation procedure. Second, we extend the Sup-Base UCB decomposition framework (first devised in Auer (2002) and subsequently adapted in Chu et al. (2011); Li et al. (2017)) to the current setting in order to resolve the reward dependency issue. This framework is commonly used in the literature to deal with the dependency issue, and provides an O(√(dT)) regret bound instead of an O(d√T) regret bound. Here we adapt this framework to the delayed reward setting.
In summary, the algorithm has two components, Delayed BaseLinUCB (Algorithm 2) and Delayed SupLinUCB (Algorithm 3).
Delayed BaseLinUCB performs the estimation and the confidence bound computation, using a subset Ψ_t of the past time steps as opposed to the set of all past time steps (note that when t = 1, the chosen subset Ψ_t is necessarily the empty set). This subset is carefully chosen in Delayed SupLinUCB to make sure rewards are independent when conditioned on the past selected contexts. Delayed SupLinUCB is further responsible for selecting an action at each time step.

Algorithm 2 Delayed BaseLinUCB at Step t
1: Input: Ψ_t ⊆ {1, 2, ..., t − 1}
2: A_t = I_d + Σ_{τ∈Ψ_t} x_{τ,a_τ} x′_{τ,a_τ}
3: c_t = Σ_{τ∈Ψ_t} 1(D_τ + τ ≤ t − 1) y_{τ,a_τ} x_{τ,a_τ}
4: θ_t = A_t^{−1} c_t
5: Observe K arm features, x_{t,1}, x_{t,2}, ..., x_{t,K} ∈ R^d
6: for a ∈ [K] do
7:   w_{t,a} = α_t √( x′_{t,a} A_t^{−1} x_{t,a} )
8:   ŷ_{t,a} = θ′_t x_{t,a}
9: end for

Algorithm 3 Delayed SupLinUCB
1: Input: T ∈ N, S = log(T)
2: Ψ^s_1 ← ∅ for all s ∈ [S]
3: for t = 1, 2, ..., T do
4:   s ← 1 and Â_1 ← [K]
5:   repeat
6:     Use Delayed BaseLinUCB with Ψ^s_t to calculate the width, w^s_{t,a}, and upper confidence bound, ŷ^s_{t,a} + w^s_{t,a}, for all a ∈ Â_s
7:     if w^s_{t,a} ≤ 1/√T for all a ∈ Â_s then
8:       Choose a_t = argmax_{a∈Â_s} ( ŷ^s_{t,a} + w^s_{t,a} ), and update Ψ^{s′}_{t+1} ← Ψ^{s′}_t for all s′ ∈ [S]
9:     else if w^s_{t,a} ≤ 2^{−s} for all a ∈ Â_s then
10:      Â_{s+1} ← { a ∈ Â_s | ŷ^s_{t,a} + w^s_{t,a} ≥ max_{a′∈Â_s} ( ŷ^s_{t,a′} + w^s_{t,a′} ) − 2^{1−s} }, and s ← s + 1
11:    else
12:      Choose a_t ∈ Â_s such that w^s_{t,a_t} > 2^{−s}, and update Ψ^{s′}_{t+1} ← Ψ^{s′}_t ∪ {t} if s′ = s, and Ψ^{s′}_{t+1} ← Ψ^{s′}_t otherwise
13:    end if
14:  until an action a_t is found
15: end for

Remark 4. There are two modifications compared to Algorithm 2 in Chu et al. (2011). First, the estimator θ_t (in step 4) is a biased estimator.
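To make the role of the indicator concrete, here is a sketch (in our own notation, not the authors' code) of steps 2-8 of Delayed BaseLinUCB: A_t aggregates every past chosen context in Ψ_t, while c_t keeps only the rewards that have already been revealed, which is exactly what makes θ_t biased.

```python
import numpy as np

def delayed_base_linucb(contexts, rewards, observed, alpha_t, x_arms):
    """Sketch of Delayed BaseLinUCB at step t (our reading of Algorithm 2).

    contexts: (n, d) array of chosen features x_{tau, a_tau}, tau in Psi_t
    rewards:  (n,) rewards y_{tau, a_tau}; entries with observed=False are unused
    observed: (n,) bool, True iff D_tau + tau <= t - 1
    x_arms:   (K, d) array of the K arm features at round t
    """
    d = contexts.shape[1]
    A = np.eye(d) + contexts.T @ contexts          # step 2: all past contexts
    c = contexts.T @ (rewards * observed)          # step 3: only revealed rewards
    theta = np.linalg.solve(A, c)                  # step 4: theta_t = A_t^{-1} c_t
    A_inv = np.linalg.inv(A)
    widths = alpha_t * np.sqrt(np.sum((x_arms @ A_inv) * x_arms, axis=1))
    ucbs = x_arms @ theta + widths                 # y_hat_{t,a} + w_{t,a}, per arm
    return theta, widths, ucbs
```

Missing rewards shrink c_t toward zero while A_t still grows, so the estimate is pulled toward the origin; the delay-adaptive width α_t = ᾱ + G_t + 1 used in Theorem 5 compensates for exactly this effect.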
We use all the features in the matrix A_t but only the features with observed rewards in the vector c_t. In particular, when the indicator 1(D_τ + τ ≤ t − 1) evaluates to 1, the reward corresponding to the action taken at time step τ has been received by the end of (and possibly prior to) round t − 1 (and hence is available at the beginning of round t); all the other rewards (i.e. those that have not been received by t − 1) are excluded. In comparison, Chu et al. (2011) construct an unbiased estimator in each time step. Second, the width parameter α_t (in step 7) is time-dependent and adapts to new information (based on the delays) in each round. In comparison, the width parameter in Chu et al. (2011) is a constant that only depends on the horizon T.
Theorem 5 (Regret on Delayed SupLinUCB-BaseLinUCB). If Delayed SupLinUCB is run with α_t = ᾱ + G_t + 1, where ᾱ = √( (1/2) ln( 2TK log(T)/δ ) ), then with probability at least 1 − 2δ, the regret of the algorithm is

O( √(Td) (σ_G + 1) log^{3/2}( TK log(T)/δ ) ( 1 + µ_D + M_D + σ_G √( log( TK log(T)/δ ) ) ) ).   (11)

The proof of Theorem 5 requires modifications of two lemmas in Chu et al. (2011): Lemma 6 is a modification of (Chu et al., 2011, Lemma 1) and Lemma 7 is a modification of (Chu et al., 2011, Lemma 6). We defer the detailed proofs of Lemmas 6-7 to the appendix. The proof of Theorem 5 is also given in the appendix.
In the regret bound (11), the delay parameters (µ_D, M_D, σ_D) appear in the highest-order term √(Td). Although the delays do not enter the highest-order term in (8), a delay contribution on the order of O(√(Td)) is essential, and this is also true for (8).
Lemma 6. Suppose the input index set Ψ_t in Delayed BaseLinUCB is constructed so that, for fixed x_{τ,a_τ} with τ ∈ Ψ_t, the rewards y_{τ,a_τ} are independent random variables with means E[y_{τ,a_τ}] = x′_{τ,a_τ} θ*. Suppose {G_t} is fixed and given.
Then, with probability at least 1 − δ/T, we have for all a ∈ [K] that

|ŷ_{t,a} − x_{t,a}' θ*| ≤ ( 1 + √( (1/2) ln( 2TK/δ ) ) + G_t ) s_{t,a}.

Lemma 7. Assume G*_T is fixed and given. For all s ∈ [S],

|Ψ^s_{T+1}| ≤ 5 · 2^s ( √2 ᾱ (G*_T + ᾱ) ) √( d |Ψ^s_{T+1}| ).

Remark 5 (Why Assumption 1 can be dropped in Theorem 5). There are essentially two methods to guarantee a positive lower bound on λ_min( Σ_{s=1}^t X_s X_s' ). One method is to randomly sample actions for τ rounds. In this way, (Li et al., 2017, Proposition 1) guarantees a positive lower bound on λ_min( Σ_{s=1}^t X_s X_s' ). This is the method adopted in Algorithm 1 and Theorem 2. The other method adds a regularization term. This is adopted in the definition of A_t (see Algorithm 2 and Theorem 5), and corresponds to ridge regression when estimating the parameter θ_t.

5 Conclusion

Beyond contextual bandits and looking at the broader landscape of data-driven decision making, delays have emerged as an important phenomenon in several domains, including, among others, distributed stochastic optimization (Bertsekas and Tsitsiklis (1997); Zhou et al. (2018b)), multi-agent game-theoretic and reinforcement learning (Zhou et al. (2017); Grover et al. (2018a); Guo et al. (2019); Mertikopoulos and Zhou (2019)), and real-time scheduling in large-scale systems (Pinedo; Mehdian et al. (2017); Mahdian et al. (2018)). Data-driven decision making with imperfect information is an emerging research paradigm, and much remains to be understood with regard to how decision making needs to be adapted in the presence of delays.

References

Abeille, M., Lazaric, A., et al. (2017). Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197.

Agrawal, S. and Goyal, N. (2013a). Further optimal regret bounds for Thompson sampling.
In Artificial Intelligence and Statistics, pages 99–107.

Agrawal, S. and Goyal, N. (2013b). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135.

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.

Bastani, H. and Bayati, M. (2015). Online decision-making with high-dimensional covariates.

Bertsekas, D. P. and Tsitsiklis, J. N. (1997). Parallel and distributed computation: Numerical methods.

Bistritz, I., Zhou, Z., Chen, X., Bambos, N., and Blanchet, J. (2019). Online EXP3 learning in adversarial bandits with delayed feedback. In Advances in Neural Information Processing Systems.

Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.

Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2019). Delay and cooperation in nonstochastic bandits. The Journal of Machine Learning Research, 20(1):613–650.

Chapelle, O. (2014). Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1097–1105. ACM.

Chen, K., Hu, I., Ying, Z., et al. (1999). Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. The Annals of Statistics, 27(4):1155–1163.

Chow, S.-C. and Chang, M. (2011). Adaptive design methods in clinical trials. Chapman and Hall/CRC.

Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214.

Desautels, T., Krause, A., and Burdick, J. W. (2014).
Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923.

Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. (2011). Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.

Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594.

Garrabrant, S., Soares, N., and Taylor, J. (2016). Asymptotic convergence in online learning with unbounded delays. arXiv preprint arXiv:1604.05280.

Grover, A., Al-Shedivat, M., Gupta, J. K., Burda, Y., and Edwards, H. (2018a). Learning policy representations in multiagent systems. arXiv preprint arXiv:1806.06464.

Grover, A., Markov, T., Attia, P., Jin, N., Perkins, N., Cheong, B., Chen, M., Yang, Z., Harris, S., Chueh, W., et al. (2018b). Best arm identification in multi-armed bandits with delayed feedback. arXiv preprint arXiv:1803.10937.

Guo, X., Hu, A., Xu, R., and Zhang, J. (2019). Learning mean-field games. In Advances in Neural Information Processing Systems.

Joulani, P., Gyorgy, A., and Szepesvári, C. (2013). Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453–1461.

Jun, K.-S., Bhargava, A., Nowak, R., and Willett, R. (2017). Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems, pages 99–109.

Kannan, P., Chang, A.-M., and Whinston, A. B. (2001). Wireless commerce: Marketing issues and possibilities. In Proceedings of the 34th Annual Hawaii International Conference on System Sciences, pages 6–pp. IEEE.

Kitagawa, T. and Tetenov, A. (2018). Who should be treated? Empirical welfare maximization methods for treatment choice.
Econometrica, 86(2):591–616.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM.

Li, L., Lu, Y., and Zhou, D. (2017). Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 2071–2080. JMLR.org.

Mahdian, S., Zhou, Z., and Bambos, N. (2018). Robustness of join-the-shortest-queue scheduling to communication delay. In 2018 Annual American Control Conference (ACC), pages 3708–3713. IEEE.

Mandel, T., Liu, Y.-E., Brunskill, E., and Popović, Z. (2015). The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

McCullagh, P. (2018). Generalized linear models. Routledge.

Mehdian, S., Zhou, Z., and Bambos, N. (2017). Join-the-shortest-queue scheduling with delay. In 2017 American Control Conference (ACC), pages 1747–1752. IEEE.

Mertikopoulos, P. and Zhou, Z. (2019). Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465–507.

Mesterharm, C. (2005). On-line learning with delayed label feedback. In International Conference on Algorithmic Learning Theory, pages 399–413. Springer.

Nelder, J. A. and Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370–384.

Neu, G., Antos, A., György, A., and Szepesvári, C. (2010). Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804–1812.

Ostrovsky, E. and Sirota, L. (2014). Exact value for subgaussian norm of centered indicator random variable.
arXiv preprint arXiv:1405.6749.

Pike-Burke, C., Agrawal, S., Szepesvari, C., and Grunewalder, S. (2017). Bandits with delayed, aggregated anonymous feedback. arXiv preprint arXiv:1709.06853.

Pinedo, M. Scheduling, volume 29. Springer.

Quanrud, K. and Khashabi, D. (2015). Online learning with adversarial delays. In Advances in Neural Information Processing Systems, pages 1270–1278.

Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243.

Russo, D. and Van Roy, B. (2016). An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471.

Schwartz, E. M., Bradlow, E. T., and Fader, P. S. (2017). Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522.

Swaminathan, A. and Joachims, T. (2015). Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16(52):1731–1755.

Vernade, C., Cappé, O., and Perchet, V. (2017). Stochastic bandit models for delayed conversions. arXiv preprint arXiv:1706.09186.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.

Weinberger, M. J. and Ordentlich, E. (2002). On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959–1976.

Zhou, Z., Athey, S., and Wager, S. (2018a). Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778.

Zhou, Z., Mertikopoulos, P., Bambos, N., Glynn, P., Ye, Y., Li, L.-J., and Fei-Fei, L.
(2018b). Distributed asynchronous optimization with unbounded delays: How slow can you go? In International Conference on Machine Learning, pages 5965–5974.

Zhou, Z., Mertikopoulos, P., Bambos, N., Glynn, P. W., and Tomlin, C. (2017). Countering feedback delays in multi-agent learning. In Advances in Neural Information Processing Systems, pages 6171–6181.