{"title": "Nonstochastic Multiarmed Bandits with Unrestricted Delays", "book": "Advances in Neural Information Processing Systems", "page_first": 6541, "page_last": 6550, "abstract": "We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that \"delayed\" Exp3 achieves the $O(\\sqrt{(KT + D)\\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. \nThe new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. \nFor this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. \nThe resulting oracle regret bound is of order $\\min_\\beta (|S_\\beta|+\\beta \\ln K + (KT + D_\\beta)/\\beta)$, where $|S_\\beta|$ is the number of observations with delay exceeding $\\beta$, and $D_\\beta$ is the total delay of observations with delay below $\\beta$. The bound relaxes to $O(\\sqrt{(KT + D)\\ln K})$, but we also provide examples where $D_\\beta \\ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.", "full_text": "Nonstochastic Multiarmed Bandits\n\nwith Unrestricted Delays\n\nTobias Sommer Thune\u2217\nUniversity of Copenhagen\n\nCopenhagen, Denmark\n\ntobias.thune@di.ku.dk\n\nNicol\u00f2 Cesa-Bianchi\n\nDSRC & Univ. 
degli Studi di Milano\n\nMilan, Italy\n\nnicolo.cesa-bianchi@unimi.it\n\nYevgeny Seldin\n\nUniversity of Copenhagen\n\nCopenhagen, Denmark\n\nseldin@di.ku.dk\n\nAbstract\n\nWe investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that \"delayed\" Exp3 achieves the $O(\\sqrt{(KT + D) \\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2019] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\\min_\\beta (|S_\\beta| + \\beta \\ln K + (KT + D_\\beta)/\\beta)$, where $|S_\\beta|$ is the number of observations with delay exceeding $\\beta$, and $D_\\beta$ is the total delay of observations with delay below $\\beta$. The bound relaxes to $O(\\sqrt{(KT + D) \\ln K})$, but we also provide examples where $D_\\beta \\ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.\n\n1 Introduction\n\nMultiarmed bandits is an algorithmic paradigm for sequential decision making with a growing range of industrial applications, including content recommendation, computational advertising, and many more. 
In the multiarmed bandit framework an algorithm repeatedly takes actions (e.g., recommendation of content to a user) and observes outcomes of these actions (e.g., whether the user engaged with the content), whereas the outcome of alternative actions (e.g., alternative content that could have been recommended) remains unobserved. In many real-life situations the algorithm experiences a delay between execution of an action and observation of its outcome. Within the delay period the algorithm may be forced to make a series of other actions (e.g., interact with new users) before observing the outcomes of all the previous actions. This setup falls outside of the classical multiarmed bandit paradigm, where observations happen instantaneously after the actions, and motivates the study of bandit algorithms that are provably robust in the presence of delays.\n\n\u2217Part of this work was done while visiting Universit\u00e0 degli Studi di Milano, Milan, Italy\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWe focus on the nonstochastic (a.k.a. oblivious adversarial) bandit setting, where the losses faced by the algorithm are generated by an unspecified deterministic mechanism. Though it might be of adversarial intent, the mechanism is oblivious to the internal randomization of the algorithm. In the delayed version, the loss of an action executed at time $t$ is observed at time $t + d_t$, where the delay $d_t$ is also chosen deterministically and obliviously. Thus, at time step $t$ the algorithm receives observations from time steps $s \\le t$ for which $s + d_s = t$. This delay is independent of the action chosen.\n\nThe algorithm's performance is evaluated by the regret, which is the difference between the algorithm's cumulative loss and the cumulative loss of the best static action in hindsight. The regret definition is the same as in the ordinary setting without delays. When all the delays are constant ($d_t = d$ for all $t$),
When all the delays are constant (dt = d for all t),\n\nthe number of actions [Cesa-Bianchi et al., 2019]. Remarkably, this bound is achieved by \u201cdelayed\u201d\nExp3, which is a minor modi\ufb01cation of the standard Exp3 algorithm performing updates as soon as\nthe losses become available.\nThe case of variable delays has previously been studied in the full information setting by Joulani et al.\nt=1 dt is the total delay.\nTheir proof is based on a generic reduction from delayed full information feedback to full information\nwith no delay. The applicability of this technique to the bandit setting is unclear (see Appendix A).\n\nthe optimal regret is known to scale as O(cid:0)(cid:112)(K + d)T ln K(cid:1), where T is the time horizon and K is\n[2016]. They prove a regret bound of order(cid:112)(D + T ) ln K, where D =(cid:80)T\nCesa-Bianchi et al. [2019] conjecture an upper bound of order(cid:112)(KT + D) ln K for the bandit\nthe lower bound \u2126(cid:0)(cid:112)(K + d)T(cid:1), which holds for any d. In a recent paper, Li et al. [2019] study\nand dmax, Li et al. [2019] prove a regret bound of (cid:101)O(cid:0)(cid:112)dmaxK(T + D)(cid:1). Cesa-Bianchi et al. [2018]\nin the last dmax rounds. In this setting Cesa-Bianchi et al. [2018] obtain an O(cid:0)\u221a\ndmaxKT ln K(cid:1)\nder(cid:112)(KT + D)/(ln K), \"delayed\" Exp3 achieves the conjectured bound of O(cid:0)(cid:112)(KT + D) ln K(cid:1).\n\u03b2 that attains the desired O(cid:0)(cid:112)(KT + D) ln K(cid:1) regret bound with \"delayed\" Exp3 wrapped within\n\nregret bound (which is tight to within the ln K factor, and in fact tighter than the bound of Li et al.\n[2019] for an easier problem).\nOur paper is structured in the following way. We start by investigating the regret of Exp3 in the vari-\nable delay setting. 
We prove that for known T , D, and dmax, and assuming that dmax is at most of or-\n\nIn order to remove the restriction on dmax and eliminate the need of its knowledge we introduce a\nwrapper algorithm, Skipper. Skipper prevents the wrapped bandit algorithm from making updates\nusing observations with delay exceeding a given threshold \u03b2. This threshold acts as a tunable upper\nbound on the delays observed by the underlying algorithm, so if T and D are known we can choose\n\nsetting with variable delays. Note that this bound cannot be improved in the general case because of\n\na harder variant of bandits, where the delays dt remain unknown. As a consequence, if an action\nis played at time s and then more times in between time steps s and s + ds, the learner cannot tell\nwhich speci\ufb01c round the loss observed at time s + ds refers to. In this harder setting, for known T , D,\n\nfurther study an even harder setting of bandits with anonymous composite feedback. In this setting at\ntime step t the learner observes feedback, which is a composition of partial losses of the actions taken\n\nSkipper.\nTo dispense of the need for knowing T and D, the \ufb01rst approach coming to mind is the doubling\ntrick. However, applying the standard doubling to D is problematic, because the event that the actual\ntotal delay d1 + \u00b7\u00b7\u00b7 + dt exceeds an estimate D is observed at time t + dt rather than at time t. In\norder to address this issue, we consider a setting in which the algorithm observes the delay dt at time\nt rather than at time t + dt. To distinguish between this setting and the previous one we say that \"the\ndelay is observed at action time\" if it is observed at time t and \"the delay is observed at observation\ntime\" if it is observed at time t + dt. 
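The timing convention above (the loss of round $s$ becomes visible only at the end of round $s + d_s$, possibly after many other actions) can be made concrete with a short sketch. The helper below, `delayed_feedback_rounds`, is our own illustration and not code from the paper; its name and signature are assumptions made for the example.

```python
def delayed_feedback_rounds(delays):
    """For each round t (1-indexed), list the rounds s whose feedback
    arrives at the end of round t, i.e. the s with s + d_s = t."""
    T = len(delays)
    arrivals = {t: [] for t in range(1, T + 1)}
    for s, d in enumerate(delays, start=1):
        if s + d <= T:  # feedback that would arrive after round T is never seen
            arrivals[s + d].append(s)
    return arrivals

# Example: delays d = (2, 0, 1, 0) over T = 4 rounds.  Round 2's loss arrives
# immediately, round 1's at the end of round 3, rounds 3 and 4 both at round 4.
print(delayed_feedback_rounds([2, 0, 1, 0]))
# {1: [], 2: [2], 3: [1], 4: [3, 4]}
```

Note how several observations can arrive in a single round while other rounds bring none, which is exactly the situation the algorithms below must handle.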
Observing the delay at action time is motivated by scenarios in which a learning agent depends on feedback from a third party, for instance a server or laboratory that processes the action in order to evaluate it. In such cases, the third party might partially control the delay, and provide the agent with a delay estimate based on contingent and possibly private information. In the server example the delay could depend on the workload, while the laboratory might have processing times and an order backlog. Other examples include medical imaging, where the availability of annotations depends on the medical professionals' work schedules. Common to these examples is that the third party knows the delay before the action is taken.\n\nWithin the \"delay at action time\" setting we achieve a much stronger regret bound. We show that Skipper wrapping delayed Exp3 and combined with a carefully designed doubling trick enjoys an implicit regret bound of order $\\min_\\beta (|S_\\beta| + \\beta \\ln K + (KT + D_\\beta)/\\beta)$, where $|S_\\beta|$ is the number of observations with delay exceeding $\\beta$ and $D_\\beta$ is the total delay of observations with delay below $\\beta$. This bound is attained without any assumptions on the sequence of delays $d_t$ and with no need for prior knowledge of $T$ and $D$. The implicit bound can be relaxed to an explicit bound of $O(\\sqrt{(KT + D) \\ln K})$, however if $D_\\beta \\ll D$ it can be much tighter. We provide an instance of such a problem in Example 8, where we get a polynomially tighter bound.\n\nTable 1: Spectrum of delayed feedback settings and the corresponding regret bounds, progressing from easier to harder settings. Results marked by (*) have matching lower bounds up to the $\\sqrt{\\ln K}$ factor. If all the delays are identical, then $D = dT$ and (**) has a lower bound following from Cesa-Bianchi et al. [2019] and matching up to the $\\sqrt{\\ln K}$ factor. However, for non-identical delays the regret can be much smaller, as we show in Example 8.\n\nSetting | Regret Bound | Reference\nFixed delay | $O(\\sqrt{(K + d)T \\ln K})$ (*) | Cesa-Bianchi et al. [2019]\nDelay at action time | $O(\\min_\\beta (|S_\\beta| + \\beta \\ln K + (KT + D_\\beta)/\\beta))$ | This paper\nDelay at observation time with known $T, D$ | $O(\\sqrt{(KT + D) \\ln K})$ (**) | This paper\nAnonymous, composite with known $d_{\\max}$ | $O(\\sqrt{d_{\\max} KT \\ln K})$ (*) | Cesa-Bianchi et al. [2018]\n\nTable 1 summarizes the spectrum of delayed feedback models in the bandit case and places our results in the context of prior work.\n\n1.1 Additional related work\n\nOnline learning with delays was pioneered by Mesterharm [2005] \u2014 see also [Mesterharm, 2007, Chapter 8]. More recent work in the full information setting includes [Zinkevich et al., 2009, Quanrud and Khashabi, 2015, Ghosh and Ramchandran, 2018]. The theme of large or unbounded delays in the full information setting was also investigated by Mann et al. [2018] and Garrabrant et al. [2016]. Other related approaches are the works by Shamir and Szlak [2017], who use a semi-adversarial model, and Chapelle [2014], who studies the role of delays in the context of online advertising. Chapelle and Li [2011] perform an empirical study of the impact of delay in bandit models. This is extended in [Mandel et al., 2015]. The analysis of Exp3 in a delayed setting was initiated by Neu et al. [2014]. 
In the stochastic case, bandit learning with delayed feedback was studied in [Dud\u00edk et al., 2011, Vernade et al., 2017]. The results were extended to the anonymous setting by Pike-Burke et al. [2018] and by Garg and Akash [2019], and to the contextual setting by Arya and Yang [2019].\n\n2 Setting and notation\n\nWe consider an oblivious adversarial multiarmed bandit setting, where $K$ sequences of losses are generated in an arbitrary way prior to the start of the game. The losses are denoted by $\\ell^a_t$, where $t$ indexes the game rounds and $a \\in \\{1, \\dots, K\\}$ indexes the sequences. We assume that all losses are in the $[0, 1]$ interval. We use the notation $[K] = \\{1, \\dots, K\\}$ for brevity. At each round of the game the learner picks an action $A_t$ and suffers the loss of that action. The loss $\\ell^{A_t}_t$ is observed by the learner after $d_t$ rounds, where the sequence of delays $d_1, d_2, \\dots$ is determined in an arbitrary way before the game starts. Thus, at round $t$ the learner observes the losses of prior actions $A_s$ for which $s + d_s = t$. We assume that the losses are observed \"at the end of round $t$\", after the action $A_t$ has been selected. We consider two different settings for receiving information about the delays $d_t$:\n\nDelay available at observation time The delay $d_t$ is observed when the feedback $\\ell^{A_t}_t$ arrives at the end of round $t + d_t$. This corresponds to the feedback being timestamped.\n\nDelay available at action time The delay $d_t$ is observed at the beginning of round $t$, prior to selecting the action $A_t$.\n\nThe following learning protocol provides a formal description of our setting.\n\nProtocol for bandits with delayed feedback\nFor $t = 1, 2, \\dots$\n1. If delay is available at action time, then $d_t \\ge 0$ is revealed to the learner\n2. The learner picks an action $A_t \\in \\{1, \\dots, K\\}$ and suffers the loss $\\ell^{A_t}_t \\in [0, 1]$\n3. Pairs $(s, \\ell^{A_s}_s)$ for all $s \\le t$ such that $s + d_s = t$ are observed\n\nWe measure the performance of the learner by her expected regret $\\bar R_T$, which is defined as the difference between the expected cumulative loss of the learner and the loss of the best static strategy in hindsight:\n\n$$\\bar R_T = E\\left[\\sum_{t=1}^T \\ell^{A_t}_t\\right] - \\min_a \\sum_{t=1}^T \\ell^a_t.$$\n\nThis regret definition is the same as the one used in the standard multiarmed bandit setting without delay.\n\n3 Delay available at observation time: Algorithms and results\n\nThis section deals with the first of our two settings, namely when delays are observed together with the losses. We first introduce a modified version of \"delayed\" Exp3, which we name Delayed Exponential Weights (DEW) and which is capable of handling variable delays. We then introduce a wrapper algorithm, Skipper, which filters out excessively large delays. The two algorithms also serve as the basis for the next section, where we provide yet another wrapper for tuning the parameters of Skipper.\n\n3.1 Delayed Exponential Weights (DEW)\n\nDEW is an extension of the standard exponential weights approach to handle delayed feedback. The algorithm, laid out in Algorithm 1, performs an exponential update using every individual feedback as it arrives, which means that between each prediction either zero, one, or multiple updates might occur. The algorithm assumes that the delays are bounded and that an upper bound $d_{\\max} \\ge \\max_t d_t$ on the delays is known.\n\nAlgorithm 1: Delayed exponential weights (DEW)\nInput: Learning rate $\\eta$; upper bound on the delays $d_{\\max}$\nTruncate the learning rate: $\\eta' = \\min\\{\\eta, (4e d_{\\max})^{-1}\\}$;\nInitialize $w^a_0 = 1$ for all $a \\in [K]$;\nfor $t = 1, 2, \\dots$ do\n    Let $p^a_t = w^a_{t-1} / \\sum_b w^b_{t-1}$ for $a \\in [K]$;\n    Draw an action $A_t \\in [K]$ according to the distribution $p_t$ and play it;\n    Observe feedback $(s, \\ell^{A_s}_s)$ for all $\\{s : s + d_s = t\\}$ and construct estimators $\\hat\\ell^a_s = \\ell^a_s \\mathbb{1}(a = A_s) / p^a_s$;\n    Update $w^a_t = w^a_{t-1} \\exp(-\\eta' \\sum_{s : s + d_s = t} \\hat\\ell^a_s)$;\nend\n\nThe following theorem provides a regret bound for Algorithm 1. The bound is a generalization of a similar bound in Cesa-Bianchi et al. [2019].\n\nTheorem 1. Under the assumption that an upper bound on the delays $d_{\\max}$ is known, the regret of Algorithm 1 with a learning rate $\\eta$ against an oblivious adversary satisfies\n\n$$\\bar R_T \\le \\max\\left\\{\\frac{\\ln K}{\\eta},\\, 4e d_{\\max} \\ln K\\right\\} + \\eta \\left(\\frac{KTe}{2} + D\\right),$$\n\nwhere $D = \\sum_{t=1}^T d_t$. In particular, if $T$ and $D$ are known and $\\eta = \\sqrt{\\frac{\\ln K}{KTe/2 + D}} \\le \\frac{1}{4e d_{\\max}}$, we have\n\n$$\\bar R_T \\le 2\\sqrt{\\left(\\frac{KTe}{2} + D\\right) \\ln K}. \\quad (1)$$\n\nThe proof of Theorem 1 is based on proving the stability of the algorithm across rounds. The proof is sketched out in Section 5. As Theorem 1 shows, Algorithm 1 performs well if $d_{\\max}$ is small and we also have preliminary knowledge of $d_{\\max}$, $T$, and $D$. However, a single delay of order $T$ increases $d_{\\max}$ up to order $T$, which leads to a linear regret bound in Theorem 1. This is an undesired property, which we address with the skipping scheme presented next.\n\n3.2 Skipping scheme\n\nWe introduce a wrapper for Algorithm 1, called Skipper, which disregards feedback from rounds with excessively large delays. 
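Before turning to the skipping scheme, the DEW update rule can be sketched in code. The following is a minimal illustration under our own simplifying assumptions (an oblivious loss function `loss_fn` and a flat list of delays); the function name and interface are ours, not the authors' reference implementation.

```python
import math
import random

def dew(loss_fn, delays, K, T, eta, d_max):
    """Minimal sketch of Delayed Exponential Weights (Algorithm 1).

    loss_fn(a, t): loss in [0, 1] of action a at round t (oblivious sequence).
    delays[t - 1]: delay d_t of the feedback for round t.
    Returns the realized cumulative loss and the final distribution over arms.
    """
    eta_p = min(eta, 1.0 / (4 * math.e * d_max))    # truncated learning rate
    w = [1.0] * K                                   # weights w^a_0 = 1
    pending = {}                                    # round s -> (A_s, p^{A_s}_s)
    total_loss = 0.0
    for t in range(1, T + 1):
        norm = sum(w)
        p = [x / norm for x in w]
        a = random.choices(range(K), weights=p)[0]  # draw A_t ~ p_t and play it
        total_loss += loss_fn(a, t)
        pending[t] = (a, p[a])
        for s in list(pending):                     # feedback with s + d_s = t
            if s + delays[s - 1] == t:
                a_s, p_s = pending.pop(s)
                ell_hat = loss_fn(a_s, s) / p_s     # importance-weighted estimate
                w[a_s] *= math.exp(-eta_p * ell_hat)
    norm = sum(w)
    return total_loss, [x / norm for x in w]
```

Note that only the played arm's weight is updated, since the importance-weighted estimator is zero for all other arms, and that several pending rounds may be resolved within a single iteration.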
The regret in the skipped rounds is trivially bounded by 1 (because the losses are assumed to be in $[0, 1]$) and the rounds are taken out of the analysis of the regret of DEW. Skipper operates with an externally provided threshold $\\beta$ and skips all rounds where $d_t \\ge \\beta$. The advantage of skipping is that it provides a natural upper bound on the delays for the subset of rounds processed by DEW, $d_{\\max} = \\beta$. Thus, we eliminate the need of knowledge of the maximal delay in the original problem. The cost of skipping is the number of skipped rounds, denoted by $|S_\\beta|$, as captured in Lemma 2. Below we provide a regret bound for the combination of Skipper and DEW.\n\nAlgorithm 2: Skipper\nInput: Threshold $\\beta$; Algorithm $\\mathcal{A}$.\nfor $t = 1, 2, \\dots$ do\n    Get prediction $A_t$ from $\\mathcal{A}$ and play it;\n    Observe feedback $(s, \\ell^{A_s}_s)$ for all $\\{s : s + d_s = t\\}$, and feed it to $\\mathcal{A}$ for each $s$ with $d_s < \\beta$;\nend\n\nLemma 2. The expected regret of Skipper with base algorithm $\\mathcal{A}$ and threshold parameter $\\beta$ satisfies\n\n$$\\bar R_T \\le |S_\\beta| + \\bar R_{T \\setminus S_\\beta}, \\quad (2)$$\n\nwhere $|S_\\beta|$ is the number of skipped rounds (those for which $d_t \\ge \\beta$) and $\\bar R_{T \\setminus S_\\beta}$ is a regret bound for running $\\mathcal{A}$ on the subset of rounds $[T] \\setminus S_\\beta$ (those for which $d_t < \\beta$).\n\nA proof of the lemma is found in Appendix C. When combined with the previous analysis for DEW, Lemma 2 gives us the following regret bound.\n\nTheorem 3. The expected regret of Skipper($\\beta$, DEW($\\eta, \\beta$)) against an oblivious adversary satisfies\n\n$$\\bar R_T \\le |S_\\beta| + \\max\\left\\{\\frac{\\ln K}{\\eta},\\, 4e\\beta \\ln K\\right\\} + \\eta \\left(\\frac{KTe}{2} + D_\\beta\\right), \\quad (3)$$\n\nwhere $D_\\beta = \\sum_{t \\notin S_\\beta} d_t$ is the cumulative delay experienced by DEW.\n\nProof. Theorem 1 holds for parameters $(\\eta, \\beta)$ for DEW run under Skipper. We then apply Lemma 2.\n\nCorollary 4. Assume that $T$ and $D$ are known and take\n\n$$\\eta = \\frac{1}{4e\\beta}, \\qquad \\beta = \\sqrt{\\frac{(eKT/2 + D)/(4e) + D}{4e \\ln K}}.$$\n\nThen the expected regret of Skipper($\\beta$, DEW($\\eta, \\beta$)) against an oblivious adversary satisfies\n\n$$\\bar R_T \\le 2\\sqrt{\\left(\\frac{KTe}{2} + (1 + 4e)D\\right) \\ln K}.$$\n\nProof. Note that $D \\ge \\beta |S_\\beta| \\Rightarrow |S_\\beta| \\le D/\\beta$. By substituting this into (3), observing that $D_\\beta \\le D$, and substituting the values of $\\eta$ and $\\beta$ we obtain the result.\n\nNote that Corollary 4 recovers the regret scaling in Theorem 1, equation (1), within constant factors in front of $D$ and without the need of knowledge of $d_{\\max}$. Similar to Theorem 1, Corollary 4 is tight in the worst case. The tuning of $\\beta$ still requires the knowledge of $T$ and $D$. In the next section we get rid of this requirement.\n\n4 Delay available at action time: Oracle tuning and results\n\nThis section deals with the second setting, where the delays are observed before taking an action. The combined algorithm introduced in the previous section relies on prior knowledge of $T$ and $D$ for tuning the parameters. In this section we eliminate this requirement by leveraging the added information about the delays at the time of action. The information is used in an implicit doubling scheme for tuning Skipper's threshold parameter $\\beta$. Additionally, the new bound scales with the experienced delay $D_\\beta$ rather than the full delay $D$ and is significantly tighter when $D_\\beta \\ll D$. This is achieved through direct optimization of the regret bound in terms of $|S_\\beta|$ and $D_\\beta$, as opposed to Corollary 4, which tunes $\\beta$ using the potentially loose inequality $|S_\\beta| \\le D/\\beta$.\n\n4.1 Setup\n\nLet $m$ index the epochs of the doubling scheme. In each epoch we restart the algorithm with new parameters and continually monitor the termination condition in equation (6). 
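As a rough illustration of the bookkeeping involved, the sketch below walks through the rounds, tracks the number of skipped rounds, the experienced delay, and the epoch length, and restarts with a doubled budget whenever accepting the next round would violate the termination condition. The function `epoch_schedule`, its interface, and the exact constants follow our reading of equations (5) and (6); they are our own illustrative assumptions, not the authors' code.

```python
import math

def epoch_schedule(delays, K):
    """Illustrative bookkeeping for the Doubling scheme of Section 4.1.

    Walks through the rounds, assuming each delay d_t is revealed before the
    action is taken, and returns (start_round, beta_m) for every epoch.
    Requires K >= 2 so that ln K > 0.
    """
    ln_k = math.log(K)
    T = len(delays)
    epochs = []
    m, t = 1, 0                  # t is a 0-based index of the next round
    while t < T:
        omega = 2.0 ** m         # epoch budget omega_m = 2^m
        beta = math.sqrt(omega) / (4 * math.e * ln_k)   # threshold, cf. eq. (5)
        epochs.append((t + 1, beta))
        skipped, d_exp, sigma = 0, 0.0, 0
        while t < T:
            d = delays[t]        # delay observed at action time
            n_skipped = skipped + 1 if d >= beta else skipped
            n_d_exp = d_exp if d >= beta else d_exp + d
            n_sigma = sigma + 1
            # termination condition, cf. eq. (6): stay while the max <= omega
            if max(n_skipped ** 2,
                   (math.e * K * n_sigma / 2 + n_d_exp) * ln_k) > omega:
                m += 1           # double the budget; re-handle this round
                break
            skipped, d_exp, sigma = n_skipped, n_d_exp, n_sigma
            t += 1
        else:
            break                # all rounds consumed
    return epochs
```

Because the condition is checked before the action of the violating round is selected, that round is simply re-handled by the next epoch, which is what removes the need for a separate treatment of transition points.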
The learning rate within epoch $m$ is set to $\\eta_m = \\frac{1}{4e\\beta_m}$, where $\\beta_m$ is the threshold parameter of the epoch. Theorem 3 provides a regret bound for epoch $m$ denoted by\n\n$$\\mathrm{Bound}_m(\\beta_m) := |S^m_{\\beta_m}| + 4e\\beta_m \\ln K + \\frac{\\sigma(m)eK/2 + D^m_{\\beta_m}}{4e\\beta_m}, \\quad (4)$$\n\nwhere $\\sigma(m)$ denotes the length of epoch $m$, and $|S^m_{\\beta_m}|$ and $D^m_{\\beta_m}$ are, respectively, the number of skipped rounds and the experienced delay within epoch $m$.\n\nLet $\\omega_m = 2^m$. In epoch $m$ we set\n\n$$\\beta_m = \\frac{\\sqrt{\\omega_m}}{4e \\ln K}, \\quad (5)$$\n\nand we stay in epoch $m$ as long as the following condition holds:\n\n$$\\max\\left\\{|S^m_{\\beta_m}|^2,\\, \\left(\\frac{eK\\sigma(m)}{2} + D^m_{\\beta_m}\\right) \\ln K\\right\\} \\le \\omega_m. \\quad (6)$$\n\nSince $d_t$ is observed at the beginning of round $t$, we are able to evaluate condition (6) and start a new epoch before making the selection of $A_t$. This provides the desired tuning of $\\beta_m$ for all rounds without the need of a separate treatment of epoch transition points.\n\nWhile being more elaborate, this doubling scheme maintains the intuition of standard approaches. First of all, the condition for doubling (6) ensures that the regret bound in each period is optimized by explicitly balancing the contribution of each term in equation (4). Secondly, the geometric progression of the tuning (5) \u2014 and thus of the resulting regret bounds \u2014 means that the total regret bound summed over the epochs can be bounded in relation to the bound in the final completed epoch. In the following we refer to the doubling scheme defined by (5) and (6) as Doubling.\n\n4.2 Results\n\nThe following results show that the proposed doubling scheme works as well as oracle tuning of $\\beta$ when the learning rate is fixed at $\\eta = 1/(4e\\beta)$. We first compare our performance to the optimal tuning in a single epoch, where we let\n\n$$\\beta^*_m = \\arg\\min_{\\beta_m} \\mathrm{Bound}_m(\\beta_m) \\quad (7)$$\n\nbe the minimizer of (4).\n\nLemma 5. The regret bound (4) for any non-final epoch $m$, with the epochs and $\\beta_m$ controlled by Doubling, satisfies\n\n$$\\mathrm{Bound}_m(\\beta_m) \\le 3\\sqrt{\\omega_m} \\le 3\\,\\mathrm{Bound}_m(\\beta^*_m) + 2e^2 K \\ln K + 1. \\quad (8)$$\n\nThe lemma is the main machinery of the analysis of Doubling and its proof is provided in Appendix C. Applying it to Skipper($\\beta$, DEW($\\eta, \\beta$)) leads to the following main result.\n\nTheorem 6. The expected regret of Skipper($\\beta$, DEW($\\eta, \\beta$)) tuned by Doubling satisfies for any $T$\n\n$$\\bar R_T \\le 15 \\min_\\beta \\left\\{|S_\\beta| + 4e\\beta \\ln K + \\frac{KT + D_\\beta}{4e\\beta}\\right\\} + 10e^2 K \\ln K + 5.$$\n\nThe proof of Theorem 6 is based on Lemma 5 and is provided in Appendix C.\n\nCorollary 7. The expected regret of Skipper($\\beta$, DEW($\\eta, \\beta$)) tuned by Doubling can be relaxed for any $T$ to\n\n$$\\bar R_T \\le 30\\sqrt{\\left(\\frac{KTe}{2} + (1 + 4e)D\\right) \\ln K} + 10e^2 K \\ln K + 5. \\quad (9)$$\n\nProof. The first term in the bound of Theorem 6 can be directly bounded using Corollary 4.\n\nNote that both Theorem 6 and Corollary 7 require no knowledge of $T$ and $D$.\n\n4.3 Comparison of the oracle and explicit bounds\n\nWe finish the section with a comparison of the oracle bound in Theorem 6 and the explicit bound in Corollary 7. Ignoring the constant and additive terms, the bounds are\n\n$$\\text{explicit: } O\\left(\\sqrt{(KT + D) \\ln K}\\right), \\qquad \\text{oracle: } O\\left(\\min_\\beta \\left\\{|S_\\beta| + \\beta \\ln K + \\frac{KT + D_\\beta}{\\beta}\\right\\}\\right).$$\n\nNote that the oracle bound is always as strong as the explicit bound. There are, however, cases where it is much tighter. Consider the following example.\n\nExample 8. 
For $t < \\sqrt{KT/\\ln K}$ let $d_t = T - t$ and for $t \\ge \\sqrt{KT/\\ln K}$ let $d_t = 0$. Take $\\beta = \\sqrt{KT/\\ln K}$. Then $D = \\Theta(T\\sqrt{KT/\\ln K})$, but $D_\\beta = 0$ (assuming that $T \\ge K \\ln K$) and $|S_\\beta| < \\sqrt{KT/\\ln K}$. The corresponding regret bounds are\n\n$$\\text{explicit: } O\\left(\\sqrt{KT \\ln K + T\\sqrt{KT \\ln K}}\\right) = O(T^{3/4}), \\qquad \\text{oracle: } O\\left(\\sqrt{KT \\ln K}\\right) = O(T^{1/2}).$$\n\n5 Analysis of Algorithm 1\n\nThis section contains the main points of the analysis of Algorithm 1 leading to the proof of Theorem 1, which were postponed from Section 3. Full proofs are found in Appendix B. The analysis is a generalization of the analysis of delayed Exp3 in Cesa-Bianchi et al. [2019], and consists of a general regret analysis and two stability lemmas.\n\n5.1 Additional notation\n\nWe let $N_t = |\\{s : s + d_s \\in [t, t + d_t)\\}|$ denote the stability-span of $t$, which is the amount of feedback that arrives between playing action $A_t$ and observing its feedback. Note that letting $N = \\max_t N_t$ we have $N \\le 2\\max_t d_t \\le 2d_{\\max}$, since this may include feedback from up to $\\max_s d_s$ rounds prior to round $t$ and up to $d_t$ rounds after round $t$.\n\nWe introduce $Z = (z_1, \\dots, z_T)$ to be a permutation of $[T] = \\{1, \\dots, T\\}$ sorted in ascending order according to the value of $z + d_z$ with ties broken randomly, and let $\\Psi_i = (z_1, \\dots, z_i)$ be its first $i$ elements. Similarly, we also introduce $Z'_t = (z'_1, \\dots, z'_{N_t})$ as an enumeration of $\\{s : s + d_s \\in [t, t + d_t)\\}$. For a subset $C$ of the integers, corresponding to time steps, we also introduce\n\n$$q^a(C) = \\frac{\\exp(-\\eta' \\sum_{s \\in C} \\hat\\ell^a_s)}{\\sum_b \\exp(-\\eta' \\sum_{s \\in C} \\hat\\ell^b_s)}. \\quad (10)$$\n\nThe numerator and denominator in the above expression will also be denoted by $w^a(C)$ and $W(C)$, corresponding to the definition of $p^a_t$. By finally letting $C_{t-1} = \\{s : s + d_s < t\\}$ we have $p^a_t = q^a(C_{t-1})$.\n\n5.2 Analysis of delayed exponential weights\n\nThe starting point is the following modification of the basic lemma within the Exp3 analysis that takes care of delayed updates of the weights.\n\nLemma 9. Algorithm 1 satisfies\n\n$$\\sum_{t=1}^T \\sum_{a=1}^K p^a_{t+d_t} \\hat\\ell^a_t - \\min_{a \\in [K]} \\sum_t \\hat\\ell^a_t \\le \\frac{\\ln K}{\\eta'} + \\frac{\\eta'}{2} \\sum_{t=1}^T \\sum_{a=1}^K p^a_{t+d_t} \\left(\\hat\\ell^a_t\\right)^2. \\quad (11)$$\n\nTo make use of Lemma 9, we need to figure out the relationship between $p^a_{t+d_t}$ and $p^a_t$. This is achieved by the following two lemmas, which are generalizations and refinements of Lemmas 1 and 2 in Cesa-Bianchi et al. [2019].\n\nLemma 10. When using Algorithm 1 the resulting probabilities fulfil for every $t$ and $a$\n\n$$p^a_{t+d_t} - p^a_t \\ge -\\eta' \\sum_{i=1}^{N_t} q^a\\left(C_{t-1} \\cup \\{z'_j : j < i\\}\\right) \\hat\\ell^a_{z'_i}, \\quad (12)$$\n\nwhere $z'_j$ is an enumeration of $\\{s : s + d_s \\in [t, t + d_t)\\}$.\n\nThe above lemma allows us to bound $p^a_{t+d_t}$ from below in terms of $p^a_t$. We similarly need to be able to upper bound the probability, which is captured in the second probability drift lemma.\n\nLemma 11. The probabilities defined by (10) satisfy for any $i$\n\n$$q^a(\\Psi_i) \\le \\left(1 + \\frac{1}{2N - 1}\\right) q^a(\\Psi_{i-1}). \\quad (13)$$\n\n5.3 Proof sketch of Theorem 1\n\nBy using Lemma 10 to bound the left hand side of (11) we have\n\n$$\\sum_t \\sum_a p^a_t \\hat\\ell^a_t - \\min_a \\sum_t \\hat\\ell^a_t \\le \\frac{\\ln K}{\\eta'} + \\frac{\\eta'}{2} \\sum_{t=1}^T \\sum_{a=1}^K p^a_{t+d_t} \\left(\\hat\\ell^a_t\\right)^2 + \\eta' \\sum_t \\sum_a \\hat\\ell^a_t \\sum_{i=1}^{N_t} q^a\\left(C_{t-1} \\cup \\{z'_j : j < i\\}\\right) \\hat\\ell^a_{z'_i}.$$\n\nRepeated use of Lemma 11 bounds the second term on the right hand side by $\\eta' TKe/2$ in expectation. The third term on the right hand side can be bounded by $D$. Taking the maximum over the two possible values of the truncated learning rate finishes the proof.\n\n6 Discussion\n\nWe have presented an algorithm for multiarmed bandits with variably delayed feedback, which achieves the $O(\\sqrt{(KT + D) \\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2019]. The algorithm is based on a procedure for skipping rounds with excessively large delays and a refined analysis of the exponential weights algorithm with delayed observations. At the moment the skipping procedure requires prior knowledge of $T$ and $D$ for tuning the skipping threshold. However, if the delay information is available \"at action time\", as in the examples described in the introduction, we provide a sophisticated doubling scheme for tuning the skipping threshold that requires no prior knowledge of $T$ and $D$. Furthermore, the refined tuning also leads to a refined regret bound of order $O(\\min_\\beta (|S_\\beta| + \\beta \\ln K + (KT + D_\\beta)/\\beta))$, which is polynomially tighter when $D_\\beta \\ll D$. We provide an example of such a problem in the paper.\n\nOur work leads to a number of interesting research questions. The main one is whether the two regret bounds are achievable when the delays are available \"at observation time\" without prior knowledge of $D$ and $T$. Alternatively, is it possible to derive lower bounds demonstrating the impossibility of further relaxation of the assumptions? More generally, it would be interesting to have refined lower bounds for problems with variably delayed feedback. Another interesting direction is a design of anytime algorithms, which do not rely on the doubling trick. Such algorithms can be used, for example, for achieving simultaneous optimality in stochastic and adversarial setups [Zimmert and Seldin, 2019a]. While a variety of anytime algorithms is available for non-delayed bandits, the extension to delayed feedback does not seem trivial. Some of these questions are addressed in a follow-up work by Zimmert and Seldin [2019b].\n\nAcknowledgments\n\nNicol\u00f2 Cesa-Bianchi gratefully acknowledges partial support by the Google Focused Award Algorithms and Learning for AI (ALL4AI) and by the MIUR PRIN grant Algorithms, Games, and Digital Markets (ALGADIMAR). Yevgeny Seldin acknowledges partial support by the Independent Research Fund Denmark, grant number 9040-00361B.\n\nReferences\n\nSakshi Arya and Yuhong Yang. Randomized allocation with nonparametric estimation for contextual multi-armed bandits with delayed rewards. arXiv preprint, arXiv:1902.00819, 2019.\n\nNicol\u00f2 Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Nonstochastic bandits with composite anonymous feedback. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2018.\n\nNicol\u00f2 Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in nonstochastic bandits. 
The Journal of Machine Learning Research, 20(1):613–650, 2019.

Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2014.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NeurIPS), 2011.

Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2011.

Siddhant Garg and Aditya Kumar Akash. Stochastic bandits with delayed composite anonymous feedback. arXiv preprint, arXiv:1910.01161, 2019.

Scott Garrabrant, Nate Soares, and Jessica Taylor. Asymptotic convergence in online learning with unbounded delays. arXiv preprint, arXiv:1604.05280, 2016.

Avishek Ghosh and Kannan Ramchandran. Online scoring with delayed information: A convex optimization viewpoint. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2018.

Pooria Joulani, András György, and Csaba Szepesvári. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Bingcong Li, Tianyi Chen, and Georgios B. Giannakis. Bandit online learning with unknown delays. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Travis Mandel, Yun-En Liu, Emma Brunskill, and Zoran Popović. The queue method: Handling delay, heuristics, prior data, and evaluation in bandits.
In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

Timothy A. Mann, Sven Gowal, Ray Jiang, Huiyi Hu, Balaji Lakshminarayanan, and András György. Learning from delayed outcomes with intermediate observations. arXiv preprint, arXiv:1807.09387, 2018.

Chris Mesterharm. On-line learning with delayed label feedback. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2005.

Chris Mesterharm. Improving Online Learning. PhD thesis, Department of Computer Science, Rutgers University, 2007.

Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3), 2014.

Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvári, and Steffen Grünewälder. Bandits with delayed, aggregated anonymous feedback. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

Ohad Shamir and Liran Szlak. Online learning with local permutations and delayed feedback. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

Claire Vernade, Olivier Cappé, and Vianney Perchet. Stochastic bandit models for delayed conversions. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2017.

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019a.

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. arXiv preprint, arXiv:1910.06054, 2019b.

Martin Zinkevich, John Langford, and Alex J. Smola. Slow learners are fast.
In Advances in Neural Information Processing Systems (NeurIPS), 2009.