{"title": "Equipping Experts/Bandits with Long-term Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 5929, "page_last": 5939, "abstract": "We propose the first black-box approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth, 2002, by reducing the problem to achieving typical switching regret. Specifically, for the classical expert problem with $K$ actions and $T$ rounds, using our general framework we develop various algorithms with a regret bound of order $\\order(\\sqrt{T(S\\ln T + n \\ln K)})$ compared to any sequence of experts with $S-1$ switches among $n \\leq \\min\\{S, K\\}$ distinct experts. In addition, by plugging specific adaptive algorithms into our framework we also achieve the best of both stochastic and adversarial environments simultaneously, which resolves an open problem of Warmuth and Koolen 2014. Furthermore, we extend our results to the sparse multi-armed bandit setting and show both negative and positive results for long-term memory guarantees. As a side result, our lower bound also implies that sparse losses do not help improve the worst-case regret for contextual bandit, a sharp contrast with the non-contextual case.", "full_text": "Equipping Experts/Bandits with Long-term Memory\n\nKai Zheng1,2\n\nzhengk92@pku.edu.cn\n\nHaipeng Luo3\n\nhaipengl@usc.edu\n\nIlias Diakonikolas4\n\nilias.diakonikolas@gmail.com\n\nLiwei Wang1,2\n\nwanglw@cis.pku.edu.cn\n\nAbstract\n\ndevelop various algorithms with a regret bound of order O((cid:112)T (S ln T + n ln K))\n\nWe propose the \ufb01rst reduction-based approach to obtaining long-term memory\nguarantees for online learning in the sense of Bousquet and Warmuth [8], by\nreducing the problem to achieving typical switching regret. 
Speci\ufb01cally, for the\nclassical expert problem with K actions and T rounds, using our framework we\ncompared to any sequence of experts with S \u2212 1 switches among n \u2264 min{S, K}\ndistinct experts. In addition, by plugging speci\ufb01c adaptive algorithms into our\nframework we also achieve the best of both stochastic and adversarial environments\nsimultaneously. This resolves an open problem of Warmuth and Koolen [35].\nFurthermore, we extend our results to the sparse multi-armed bandit setting and\nshow both negative and positive results for long-term memory guarantees. As a\nside result, our lower bound also implies that sparse losses do not help improve the\nworst-case regret for contextual bandits, a sharp contrast with the non-contextual\ncase.\n\n1\n\nIntroduction\n\nIn this work, we propose a black-box reduction for obtaining long-term memory guarantees for\ntwo fundamental problems in online learning: the expert problem [17] and the multi-armed bandit\n(MAB) problem [6]. In both problems, a learner interacts with the environment for T rounds, with\nK \ufb01xed available actions. At each round, the environment decides the loss for each action while\nsimultaneously the learner selects one of the actions and suffers the loss of this action. In the expert\nproblem, the learner observes the loss of every action at the end of each round (a.k.a. full-information\nfeedback), while in MAB, the learner only observes the loss of the selected action (a.k.a. bandit\nfeedback).\nFor both problems, the classical performance measure is the learner\u2019s (static) regret, de\ufb01ned as\n\u221a\nthe difference between the learner\u2019s total loss and the loss of the best \ufb01xed action.\nIt is well-\nknown that the minimax optimal regret is \u0398(\nT K) [6, 4] for the expert\nproblem and MAB respectively. 
Comparing against a \ufb01xed action, however, does not always lead\nto meaningful guarantees, especially when the environment is non-stationary and no single \ufb01xed\naction performs well. To address this issue, prior work has considered a stronger measure called\nswitching/tracking/shifting regret, which is the difference between the learner\u2019s total loss and the loss\n\n\u221a\nT ln K) [17] and \u0398(\n\n1 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University\n2 Center for Data Science, Peking University\n3 University of Southern California\n4 University of Wisconsin-Madison\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fof a sequence of actions with at most S \u2212 1 switches. Various existing algorithms (including some\nblack-box approaches) achieve the following switching regret\n\nfor the expert problem [23, 21, 1, 27, 24],\nfor multi-armed bandits [6, 28].\n\n(1)\n(2)\n\n(cid:40) O((cid:112)T S ln(T K))\nO((cid:112)T KS ln(T K))\n\n(cid:17)\n\nWe call these typical switching regret bounds. Such bounds essentially imply that the learner pays\nthe worst-case static regret for each switch in the benchmark sequence. While this makes sense in\nthe worst case, intuitively one would hope to perform better if the benchmark sequence frequently\nswitches back to previous actions, as long as the algorithm remembers which actions have performed\nwell previously.\nIndeed, for the expert problem, algorithms with long-term memory were developed that guarantee\n, where n \u2264 min{S, K} is the number of\ndistinct actions in the benchmark sequence [8, 2, 13].1 Although there is no known lower bound,\nthis regret bound essentially matches the one achieved by a computationally inef\ufb01cient approach\nof running Hedge over all benchmark sequences with S switches among n experts, an approach\nthat usually leads to the information-theoretically optimal regret guarantee. 
Compared to the typical\n\nswitching regret of order O(cid:16)(cid:113)\nswitching regret bound of form (1) (which can be written as O((cid:112)T (S ln T + S ln K))), this long-\n\nS + n ln K\nn )\n\nT (S ln nT\n\nterm memory guarantee implies that the learner pays the worst-case static regret only for each distinct\naction encountered in the benchmark sequence, and pays less for each switch, especially when n is\nvery small. Algorithms with long-term memory guarantees have been found to have better empirical\nperformance [8], and applied to practical applications such as TCP round-trip time estimation [30],\nintrusion detection system [29], and multi-agent systems [31]. We are not aware of any similar studies\nfor the bandit setting.\n\nOverview of our contributions. The main contribution of this work is to propose a simple black-\nbox approach to equip expert or MAB algorithms with long-term memory and to achieve switching\nregret guarantees of similar \ufb02avor to those of [8, 2, 13]. The key idea of our approach is to utilize a\nvariant of the con\ufb01dence-rated expert framework of [7], and to use a sub-routine to learn the con\ufb01-\ndence/importance of each action for each time. Importantly this sub-routine itself is an expert/bandit\nalgorithm over only two actions and needs to enjoy some typical switching regret guarantee (for\nexample of form (1) for the expert problem). In other words, our approach reduces the problem\nof obtaining long-term memory to the well-studied problem of achieving typical switching regret.\nCompared to existing methods [8, 2, 13], the advantages of our approach are the following:\n1. While existing methods are all restricted to variants of the classical Hedge algorithm [17], our\napproach allows one to plug in a variety of existing algorithms and to obtain a range of different\n\nalgorithms with switching regret O((cid:112)T (S ln T + n ln K)). 
(Section 3.1)\nalgorithm whose switching regret is simultaneously O((cid:112)T (S ln T + n ln K)) in the worst-case\n\n2. Due to this \ufb02exibility, by plugging in speci\ufb01c adaptive algorithms, we develop a parameter-free\nand O(S ln T + n ln(K ln T )) if the losses are piece-wise stochastic (see Section 2 for the formal\nde\ufb01nition). This is a generalization of previous best-of-both-worlds results for static or switching\nregret [19, 27], and resolves an open problem of Warmuth and Koolen [35]. The best previous bound\nfor the stochastic case is O(S ln(T K ln T )) [27]. (Section 3.2)\n3. Our framework allows us to derive the \ufb01rst nontrivial long-term memory guarantees for the bandit\nsetting, while existing approaches fail to do so (more discussion to follow). For example, when n is a\nconstant and the losses are sparse, our algorithm achieves switching regret O(S1/3T 2/3 + K 3 ln T )\nfor MAB, which is better than the typical bound (2) when S and K are large. For example, when\nS = \u0398(T 7\n10 ln T ) while bound (2) becomes\nvacuous (linear in T ), demonstrating a strict separation in learnability. (Section 4)\nTo motivate our results on long-term memory guarantees for MAB, a few remarks are in order. It is\n\nnot hard to verify that existing approaches achieve switching regret O((cid:112)T K(S ln T + n ln K)) for\n\n10 ), our bound is of order O(T 9\n\n10 ) and K = \u0398(T 3\n\nMAB. However, the polynomial dependence on the number of actions K makes the improvement of\n\n1The setting considered in [8, 2] is in fact slightly different from, yet closely related to, the expert problem.\n\nOne can easily translate their regret bounds into the bounds we present here.\n\n2\n\n\f\u221a\n\nthis bound over the typical bound (2) negligible. It is well-known that such polynomial dependence on\nK is unavoidable in the worst-case due to the bandit feedback. This motivates us to consider situations\nwhere the necessary dependence on K is much smaller. 
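To make the gap between the two regimes discussed above concrete, here is a quick numerical comparison (our own illustration, not from the paper; all constants are dropped, so only the asymptotic shapes are compared):

```python
import math

def typical(T, K, S):
    """Typical switching regret, Eq. (1): O(sqrt(T S ln(TK))), constants dropped."""
    return math.sqrt(T * S * math.log(T * K))

def long_term(T, K, S, n):
    """Long-term memory bound of [8, 2, 13]:
    O(sqrt(T (S ln(nT/S) + n ln(K/n)))), constants dropped."""
    return math.sqrt(T * (S * math.log(n * T / S) + n * math.log(K / n)))

# many switches (S = 1000) among only n = 5 distinct experts
T, K, S, n = 10**6, 10**4, 10**3, 5
assert long_term(T, K, S, n) < typical(T, K, S)
```

With frequent switches among few distinct experts, the long-term memory bound is markedly smaller, which is exactly the regime this paper targets.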
In particular, Bubeck et al. [10] recently showed that if the loss vectors are ρ-sparse, then a static regret bound of order O(√(Tρ ln K) + K ln T) is achievable, exhibiting a much more favorable dependence on K. We therefore focus on this sparse MAB problem and study what nontrivial switching regret bounds are achievable.\n\nWe first show that a bound of order O(√(TρS ln(KT)) + KS ln T), a natural generalization of the typical switching regret bound of (2) to the sparse setting, is impossible. In fact, we show that for any S the worst-case switching regret is at least Ω(√(TKS)), even when ρ = 2. Since achieving switching regret for MAB can be seen as a special case of contextual bandits [6, 26], this negative result also implies that, surprisingly, sparse losses do not help improve the worst-case regret for contextual bandits, which is a sharp contrast with the non-contextual case studied in [10] (see Theorem 6 and Corollary 7). Despite this negative result, however, as mentioned we are able to utilize our general framework to still obtain improvements over bound (2) when n is small. Our construction is fairly sophisticated, requiring a special sub-routine that uses a novel one-sided log-barrier regularizer and admits a new kind of “local-norm” guarantee, which may be of independent interest.\n\n2 Preliminaries\n\nThroughout the paper, we use [m] to denote the set {1, . . . , m} for some integer m. The learning protocol for the expert problem and MAB with K actions and T rounds is as follows: For each time t = 1, . . .
, T, (1) the learner first randomly selects an action I_t ∈ [K] according to a distribution p_t ∈ ∆_K (the (K − 1)-dimensional simplex); (2) simultaneously the environment decides the loss vector ℓ_t ∈ [−1, 1]^K; (3) the learner suffers loss ℓ_t(I_t) and observes either ℓ_t in the expert problem (full-information feedback) or only ℓ_t(I_t) in MAB (bandit feedback). For any sequence of T actions i_1, . . . , i_T ∈ [K], the expected regret of the learner against this sequence is defined as\n\nR(i_{1:T}) = E[∑_{t=1}^T ℓ_t(I_t) − ∑_{t=1}^T ℓ_t(i_t)] = E[∑_{t=1}^T r_t(i_t)],\n\nwhere the expectation is with respect to both the learner and the environment, and r_t(i), the instantaneous regret (against action i), is defined as p_t^⊤ℓ_t − ℓ_t(i). When i_1 = · · · = i_T, this becomes the traditional static regret against a fixed action. Most existing works on switching regret impose a constraint on the number of switches for the benchmark sequence: ∑_{t=2}^T 1{i_t ≠ i_{t−1}} ≤ S − 1. In other words, the sequence can be decomposed into S disjoint intervals, each with a fixed comparator as in static regret. Typical switching regret bounds hold for any sequence with this constraint and are in terms of T, K and S, such as Eq. (1) and Eq. (2).\n\nThe number of switches, however, does not fully characterize the difficulty of the problem. Intuitively, a sequence that frequently switches back to previous actions should be an easier benchmark for an algorithm with long-term memory that remembers which actions performed well in the past. To encode this intuition, prior works [8, 2, 13] introduced another parameter n = |{i_1, . . .
, i_T}|, the number of distinct actions in the sequence, to quantify the difficulty of the problem, and developed switching regret bounds in terms of T, K, S and n. Clearly one has n ≤ min{S, K}, and we are especially interested in the case when n ≪ min{S, K}, which is natural if the data exhibits some periodic pattern. Our goal is to understand what improvements are achievable in this case and how to design algorithms that can leverage this property via a unified framework.\n\nStochastic setting. In general, we do not make any assumptions on how the losses are generated by the environment, which is known as the adversarial setting in the literature. We do, however, develop an algorithm (for the expert problem) that enjoys the best of both worlds — it not only enjoys some robust worst-case guarantee in the adversarial setting, but also achieves much smaller logarithmic regret in a stochastic setting. Specifically, in this stochastic setting, without loss of generality, we assume the n distinct actions in {i_1, . . . , i_T} are 1, . . . , n. It is further assumed that for each i ∈ [n], there exists a constant gap α_i > 0 such that E_t[ℓ_t(j) − ℓ_t(i)] ≥ α_i for all j ≠ i and all t such that i_t = i, where the expectation is with respect to the randomness of the environment conditioned on the history up to the beginning of round t. In other words, at every time step the algorithm is compared to the best action, whose expected loss is a constant gap away from those of the other actions. This is a natural generalization of the stochastic setting studied for static regret or typical switching regret [19, 27].\n\nAlgorithm 1: A Simple Reduction for Long-term Memory\n1 Input: expert algorithm A learning over K actions with static regret guarantee (cf. Condition 1), expert algorithms A_1, . . .
, A_K learning over two actions {0, 1} with switching regret guarantee (cf. Condition 2), parameter η ≤ 1/5\n2 for t = 1, 2, . . . do\n3   Receive sampling distribution w_t ∈ ∆_K from A\n4   Receive sampling probability z_t(i) for action "1" from A_i for each i ∈ [K]\n5   Sample I_t ∼ p_t where p_t(i) ∝ z_t(i)w_t(i), ∀i, and receive ℓ_t ∈ [−1, 1]^K\n6   Feed loss vector c_t to A, where c_t(i) = −z_t(i)r_t(i) with r_t(i) = p_t^⊤ℓ_t − ℓ_t(i)\n7   Feed loss vector (0, 5η − r_t(i)) to A_i for each i ∈ [K]\n\nConfidence-rated actions. Our approach makes use of the confidence-rated expert setting of Blum and Mansour [7], a generalization of the expert problem (and the sleeping expert problem [18]). The protocol of this setting is the same as the expert problem, except that at the beginning of each round, the learner first receives a confidence score z_t(i) for each action i. The regret against a fixed action i is also scaled by its confidence and is now defined as E[∑_{t=1}^T z_t(i)r_t(i)]. The expert problem is clearly a special case with z_t(i) = 1 for all t and i. There are a number of known examples showing why this formulation is useful, and our work will add one more to this list.\n\nTo obtain a bound on this new regret measure, one can in fact simply reduce it to the regular expert problem [7, 19, 27]. Specifically, let A be some expert algorithm over the same K actions producing sampling distributions w_1, . . . , w_T ∈ ∆_K. The reduction works by sampling I_t according to p_t such that p_t(i) ∝ z_t(i)w_t(i), ∀i, and then feeding c_t to A where c_t(i) = −z_t(i)r_t(i), ∀i. Note that by the definition of p_t one has w_t^⊤c_t = ∑_i w_t(i)z_t(i)(ℓ_t(i) − p_t^⊤ℓ_t) = 0. Therefore, one can directly equate the confidence-rated regret and the regular static regret of the reduced problem:\n\nE[∑_{t=1}^T z_t(i)r_t(i)] = E[∑_{t=1}^T (w_t^⊤c_t − c_t(i))].\n\n3 General Framework for the Expert Problem\n\nIn this section, we introduce our general framework to obtain long-term memory regret bounds and demonstrate how it leads to various new algorithms for the expert problem. We start with a simpler version and then move on to a more elaborate construction that is essential to obtain best-of-both-worlds results.\n\n3.1 A simple approach for adversarial losses\n\nA simple version of our approach is described in Algorithm 1. At a high level, it simply makes use of the confidence-rated action framework described in Section 2. The reduction to the standard expert problem is executed in Lines 5 and 6, with a black-box expert algorithm A.\n\nIt remains to specify how to come up with the confidence scores z_t(i). We propose to learn these scores via a separate black-box expert algorithm A_i for each i. More specifically, each A_i is learning over two actions 0 and 1, where action 0 corresponds to confidence score 0 and action 1 corresponds to score 1. Therefore, the probability of picking action 1 at time t naturally represents a confidence score between 0 and 1, which we denote by z_t(i), overloading the notation (Line 4).\n\nAs for the losses fed to A_i, we fix the loss of action 0 to be 0 (since shifting losses by the same amount has no real effect), and set the loss of action 1 to be 5η − r_t(i) (Line 7).
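To make the reduction concrete, here is a minimal Python sketch of Algorithm 1. This is our own illustration, not the authors' code: we instantiate A with plain Hedge and each A_i with Fixed-share, and the step sizes `eta` and `alpha` are arbitrary illustrative choices rather than the tuned values of Theorem 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def hedge_step(w, loss, eta):
    """Multiplicative-weights update, one possible choice for the meta-algorithm A."""
    w = w * np.exp(-eta * loss)
    return w / w.sum()

def fixed_share_step(q, loss, eta, alpha):
    """Fixed-share update over two actions, one possible choice for a sub-routine A_i."""
    q = q * np.exp(-eta * loss)
    q = q / q.sum()
    return (1 - alpha) * q + alpha / 2   # uniform mixing yields switching regret

def algorithm1(losses, eta=0.1, alpha=0.01):
    T, K = losses.shape
    w = np.ones(K) / K                    # A's weights over the K actions
    q = np.tile([0.5, 0.5], (K, 1))       # each A_i's weights over {action 0, action 1}
    expected_loss = 0.0
    for t in range(T):
        z = q[:, 1]                        # confidence z_t(i) = prob. A_i picks action 1
        p = z * w
        p /= p.sum()                       # Line 5: p_t(i) proportional to z_t(i) w_t(i)
        I = rng.choice(K, p=p)             # sampled action; below we track E[loss] instead
        expected_loss += p @ losses[t]
        r = p @ losses[t] - losses[t]      # instantaneous regrets r_t(i)
        w = hedge_step(w, -z * r, eta)     # Line 6: c_t(i) = -z_t(i) r_t(i)
        for i in range(K):                 # Line 7: action 1 of A_i gets loss 5*eta - r_t(i)
            q[i] = fixed_share_step(q[i], np.array([0.0, 5 * eta - r[i]]), eta, alpha)
    return expected_loss
```

Running this on a toy instance where one action is always best, the played distribution concentrates on that action, so the cumulative expected loss stays well below that of uniform play.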
The role of the term −r_t(i) is intuitively clear — the larger the loss of action i compared to the algorithm, the less confident we should be about it; the role of the constant bias term 5η will become clear in the analysis (in fact, it can even be removed at the cost of a worse bound — see Appendix B.2).\n\nFinally, we specify what properties we require from the black-box algorithms A, A_1, . . . , A_K. In short, A needs to ensure a static regret bound, while A_1, . . . , A_K need to ensure a switching regret bound. See Figure 1 for an illustration of our reduction. The trick is that since A_1, . . . , A_K are learning over only two actions, this construction helps us to separate the dependence on K and the number of switches S.\n\n[Figure 1 diagram: a Long-Term Memory guarantee is obtained via the Confidence-Rated Expert problem, which reduces to Static Regret; Switching Regret sub-routines learn and provide the confidence scores.]\nFigure 1: Illustration of reduction. The main idea of our approach is to reduce the problem of obtaining a long-term memory guarantee to the confidence-rated expert problem and the problem of obtaining switching regret. Algorithms for the latter learn and provide confidence to the confidence-rated expert problem. The reduction from confidence-rated expert to obtaining standard static regret is known [7].\n\nThese (static or switching) regret bounds could be the standard worst-case √T-dependent bounds mentioned in Section 1, in which case we would obtain looser long-term memory guarantees (specifically, √n times worse — see Appendix B.2). Instead, we require these bounds to be data-dependent and in particular of the form specified below:\n\nCondition 1. There exists a constant C > 0 such that for any η ∈ (0, 1/5] and any loss sequence c_1, . . . , c_T ∈ [−2, 2]^K, algorithm A (possibly with knowledge of η) produces sampling distributions w_1, . . .
, w_T ∈ ∆_K and ensures one of the following static regret bounds for all i ∈ [K]:\n\n∑_{t=1}^T w_t^⊤c_t − ∑_{t=1}^T c_t(i) ≤ (C ln K)/η + η ∑_{t=1}^T |c_t(i)|,   (3)\n\nor\n\n∑_{t=1}^T w_t^⊤c_t − ∑_{t=1}^T c_t(i) ≤ (C ln K)/η + η ∑_{t=1}^T |w_t^⊤c_t − c_t(i)|.   (4)\n\nCondition 2. There exists a constant C > 0 such that for any η ∈ (0, 1/5], any loss sequence h_1, . . . , h_T ∈ [−3, 3]^2, and any S ∈ [T], algorithm A_i (possibly with knowledge of η) produces sampling distributions q_1, . . . , q_T ∈ ∆_2 and ensures one of the following switching regret bounds against any sequence b_1, . . . , b_T ∈ {0, 1} with ∑_{t=2}^T 1{b_t ≠ b_{t−1}} ≤ S − 1:²\n\n∑_{t=1}^T q_t^⊤h_t − ∑_{t=1}^T h_t(b_t) ≤ (CS ln T)/η + η ∑_{t=1}^T |h_t(b_t)|,   (5)\n\nor\n\n∑_{t=1}^T q_t^⊤h_t − ∑_{t=1}^T h_t(b_t) ≤ (CS ln T)/η + η ∑_{t=1}^T |q_t^⊤h_t − h_t(b_t)|,   (6)\n\nor\n\n∑_{t=1}^T q_t^⊤h_t − ∑_{t=1}^T h_t(b_t) ≤ (CS ln T)/η + η ∑_{t=1}^T ∑_{b∈{0,1}} q_t(b)|h_t(b)|.   (7)\n\nWe emphasize that these data-dependent bounds are all standard in the online learning literature,³ and we provide a few examples below (see Appendix A for brief proofs).\n\nProposition 1. The following algorithms all satisfy Condition 1: variants of Hedge [20, 34], Prod [12], Adapt-ML-Prod [19], AdaNormalHedge [27], and iProd/Squint [25].\n\nProposition 2.
The following algorithms all satisfy Condition 2: Fixed-share [23], a variant of Fixed-share (Algorithm 5 in Appendix A), and AdaNormalHedge.TV [27].\n\n²In terms of the notation in Algorithm 1, q_t = (1 − z_t(i), z_t(i)).\n³In fact, most standard bounds replace the absolute values we present here with squares, leading to even smaller bounds (up to a constant). We choose to use the looser ones with absolute values since this makes the conditions weaker while still being sufficient for all of our analysis.\n\nAlgorithm 2: A Parameter-free Reduction for Best-of-both-worlds\n1 Define: M = ⌊log₂(√T/5)⌋ + 1, η_j = min{1/5, 2^{j−1}/√T} for j ∈ [M]\n2 Input: expert algorithm A learning over KM actions with static regret guarantee (cf. Condition 3), expert algorithms {A_ij}_{i∈[K], j∈[M]} learning over two actions {0, 1} with switching regret guarantee (cf. Condition 2)\n3 for t = 1, 2, . . . do\n4   Receive sampling distribution w_t ∈ ∆_{KM} from A\n5   Receive sampling probability z_t(i, j) for action "1" from A_ij for each i ∈ [K] and j ∈ [M]\n6   Sample I_t ∼ p_t where p_t(i) ∝ ∑_{j=1}^M z_t(i, j)w_t(i, j), ∀i, and receive ℓ_t ∈ [−1, 1]^K\n7   Feed loss vector c_t to A, where c_t(i, j) = −z_t(i, j)r_t(i) with r_t(i) = p_t^⊤ℓ_t − ℓ_t(i)\n8   Feed loss vector (0, 5η_j|r_t(i)| − r_t(i)) to A_ij for each i ∈ [K] and j ∈ [M]\n\nWe are now ready to state the main result for Algorithm 1 (see Appendix B.1 for the proof).\n\nTheorem 3. Suppose Conditions 1 and 2 both hold. With η = min{1/5, √((S ln T + n ln K)/T)}, Algorithm 1 ensures R(i_{1:T}) = O(√(T(S ln T + n ln K))) for any loss sequence ℓ_1, . . . , ℓ_T and benchmark sequence i_1, . . . , i_T such that ∑_{t=2}^T 1{i_t ≠ i_{t−1}} ≤ S − 1 and |{i_1, . . . , i_T}| ≤ n.\n\nOur bound in Theorem 3 is slightly worse than the existing bound of O(√(T(S ln(nT/S) + n ln(K/n)))) [8, 2],⁴ but still improves over the typical switching regret O(√(T(S ln T + S ln K))) (Eq. (1)), especially when n is small and S and K are large. To better understand the implication of our bounds, consider the following thought experiment. If the learner knew about the switch points (that is, {t : i_t ≠ i_{t−1}}) that naturally divide the whole game into S intervals, she could simply pick any algorithm with optimal static regret (√("#rounds" ln K)) and apply S instances of this algorithm, one for each interval, which, via a direct application of the Cauchy-Schwarz inequality, leads to switching regret √(TS ln K). Compared to bound (1), this implies that the price of not knowing the switch points is √(TS ln T). Similarly, if the learner knew not only the switch points, but also the information on which intervals share the same competitor, then she could naturally apply n instances of the static algorithm, one for each set of intervals with the same competitor. Again by the Cauchy-Schwarz inequality, this leads to switching regret √(Tn ln K). Therefore, our bound implies that the price of not having any prior information about the benchmark sequence is still √(TS ln T).\n\nCompared to existing methods, our framework is more flexible and allows one to plug in any combination of the algorithms listed in Propositions 1 and 2. This flexibility is crucial and allows us to solve the problems discussed in the following sections. The approach of [2] makes use of a sleeping expert framework, a special case of the confidence-rated expert framework. However, their approach is not a general reduction and does not allow plugging in different algorithms. Finally, we note that our construction also shares some similarity with the black-box approach of [14] for a multi-task learning problem.\n\n⁴In fact, using the adaptive guarantees of AdaNormalHedge [27] or iProd/Squint [25] that replace the ln K dependence in Eq. (4) by a KL divergence term, one can further improve the term n ln K in our bound to n ln(K/n), matching previous bounds. Since this improvement is small, we omit the details.\n\n3.2 Best of both worlds\n\nTo further demonstrate the power of our approach, we now show how to use our framework to construct a parameter-free algorithm that enjoys the best of both adversarial and stochastic environments, resolving the open problem of [35] (see Algorithm 2). The key is to derive an adaptive switching regret bound that replaces the dependence on T by the sum of the magnitudes of the instantaneous regrets ∑_t |r_t(i)|, which previous works [19, 27] show is sufficient for adapting to the stochastic setting and achieving logarithmic regret.\n\nTo achieve this goal, the first modification we need is to change the bias term for the loss of action "1" for A_i from 5η to 5η|r_t(i)|. Following the proof of Theorem 3, one can show that the dependence on |{t : i_t = i}| now becomes ∑_{t: i_t = i} |r_t(i)| for the regret against i. If we could tune η optimally in terms of this data-dependent quantity, then this would imply logarithmic regret in the stochastic setting by the same reasoning as in [19, 27]. However, the difficulty is that the optimal tuning of η is unknown beforehand, and more importantly, different actions require tuning η differently.
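The discretized learning-rate grid that Algorithm 2 uses to sidestep this tuning problem (its Line 1) can be sketched as follows; the exact form M = ⌊log₂(√T/5)⌋ + 1 with η_j = min{1/5, 2^{j−1}/√T} is our reading of the garbled pseudocode, so treat it as an assumption:

```python
import math

def eta_grid(T):
    """Geometric grid of learning rates, as in Line 1 of Algorithm 2 (our reading):
    eta_j = min(1/5, 2^(j-1) / sqrt(T)) for j = 1, ..., M with M = floor(log2(sqrt(T)/5)) + 1."""
    M = int(math.floor(math.log2(math.sqrt(T) / 5))) + 1
    return [min(1 / 5, 2 ** (j - 1) / math.sqrt(T)) for j in range(1, M + 1)]

etas = eta_grid(10**4)   # sqrt(T) = 100, so the grid runs from 1/100 up toward 1/5
```

The grid spans roughly [1/√T, 1/5] with M = Θ(ln T) values, so for every action some copy runs with a learning rate within a factor of 2 of the optimal one.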
To address this issue, at a high level we discretize the learning rate and pick M = Θ(ln T) exponentially increasing values (Line 1); then we make M = Θ(ln T) copies of each action i ∈ [K], one for each learning rate η_j. More specifically, this means that the number of actions for A increases from K to KM, and so does the number of sub-routines with switching regret, now denoted by A_ij for i ∈ [K] and j ∈ [M]. Different copies of an action i share the same loss ℓ_t(i) for A, while action "1" for A_ij now suffers loss 5η_j|r_t(i)| − r_t(i) (Line 8). The rest of the construction remains the same. Note that selecting a copy of an action is the same as selecting the corresponding action, which explains the update rule of the sampling probability p_t in Line 6 that marginalizes over j. Also note that for a vector in R^{KM} (e.g., w_t, c_t, z_t), we use (i, j) to index its coordinates for i ∈ [K] and j ∈ [M].\n\nFinally, with this new construction, we need algorithm A to exhibit a more adaptive static regret bound and in some sense be aware of the fact that different actions now correspond to different learning rates. More precisely, we replace Condition 1 with the following condition:\n\nCondition 3. There exists a constant C > 0 such that for any η_1, . . . , η_M ∈ (0, 1/5] and any loss sequence c_1, . . . , c_T ∈ [−2, 2]^{KM}, algorithm A (possibly with knowledge of η_1, . . . , η_M) produces sampling distributions w_1, . . .
, w_T ∈ ∆_{KM} and ensures the following static regret bound for all i ∈ [K] and j ∈ [M]:⁵\n\n∑_{t=1}^T w_t^⊤c_t − ∑_{t=1}^T c_t(i, j) ≤ (C ln(KM))/η_j + η_j ∑_{t=1}^T |w_t^⊤c_t − c_t(i, j)|.   (8)\n\nOnce again, this requirement is achievable by many existing algorithms and we provide some examples below (see Appendix A for proofs).\n\nProposition 4. The following algorithms all satisfy Condition 3: a variant of Hedge (Algorithm 6 in Appendix A), Adapt-ML-Prod [19], AdaNormalHedge [27], and iProd/Squint [25].\n\nWe now state our main result for Algorithm 2 (see Appendix B.3 for the proof).\n\nTheorem 5. Suppose algorithm A satisfies Condition 3 and {A_ij}_{i∈[K], j∈[M]} all satisfy Condition 2. Algorithm 2 ensures that for any benchmark sequence i_1, . . . , i_T such that ∑_{t=2}^T 1{i_t ≠ i_{t−1}} ≤ S − 1 and |{i_1, . . . , i_T}| ≤ n, the following hold:\n\n• In the adversarial setting, we have R(i_{1:T}) = O(√(T(S ln T + n ln(K ln T))));\n\n• In the stochastic setting (defined in Section 2), we have R(i_{1:T}) = O(∑_{i=1}^n (S_i ln T + ln(K ln T))/α_i), where S_i = 1 + ∑_{t=2}^T 1{(i_{t−1} = i ∧ i_t ≠ i) ∨ (i_{t−1} ≠ i ∧ i_t = i)}, so that ∑_{i∈[n]} S_i ≤ 3S.⁶\n\nIn other words, with a negligible price of ln ln T for the adversarial setting, our algorithm achieves logarithmic regret in the stochastic setting with favorable dependence on S and n. The best prior result is achieved by AdaNormalHedge.TV [27], with regret O(√(TS ln(TK ln T))) for the adversarial case and O(∑_{i=1}^n S_i ln(TK ln T)/α_i) for the stochastic case. We also remark that a variant of the algorithm of [8] with a doubling trick can achieve a guarantee similar to ours, but weaker in the sense that each α_i is replaced by min_i α_i. To the best of our knowledge this was previously unknown, and we provide the details in Appendix B.4 for completeness.\n\n⁵In fact, an analogue of Eq. (3) with individual learning rates would also suffice, but we are not aware of any algorithms that achieve such a guarantee.\n⁶This definition of S_i is the same as the one in the proof of Theorem 3.\n\nAlgorithm 3: A Sparse MAB Algorithm with Long-term Memory\n1 Input: parameters η ≤ 1/500, γ, δ\n2 Define: regularizers ψ(w) = (1/η) ∑_{i=1}^K w(i) ln w(i) + γ ∑_{i=1}^K ln(1/w(i)) and φ(z) = (1/η) ln(1/z), Bregman divergence D_φ(z, z′) = φ(z) − φ(z′) − φ′(z′)(z − z′)\n3 Initialize: w_1 = (1/K)·1, where 1 ∈ R^K is the all-one vector, and z_1(i) = 1 for all i ∈ [K]\n4 for t = 1, 2, . . . do\n5   Compute p̃_t = (1 − η)p_t + (η/K)·1, where p_t(i) ∝ z_t(i)w_t(i), ∀i\n6   Sample I_t ∼ p̃_t, receive ℓ_t(I_t), and construct loss estimator ℓ̂_t(i) = (ℓ_t(i)/p̃_t(i)) 1{i = I_t}, ∀i\n7   Set r_t(i) = p_t^⊤ℓ̂_t − ℓ̂_t(i) and c_t(i) = −z_t(i)r_t(i) − ηz_t(i)ℓ̂_t(i)² for each i ∈ [K]\n8   Update w_{t+1} = argmin_{w∈∆_K} ∑_{τ=1}^t w^⊤c_τ + ψ(w)   ◃ update of A\n9   Update z_{t+1}(i) = argmin_{z∈[δ,1]} −r_t(i)z + D_φ(z, z_t(i)) for each i ∈ [K]   ◃ update of A_i\n\n4 Long-term Memory under Bandit Feedback\n\nIn this section, we move on to the bandit setting, where the learner only observes the loss of the selected action ℓ_t(I_t) instead of ℓ_t.
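The confidence update in Line 9 of Algorithm 3 has a simple closed form, since φ(z) = (1/η) ln(1/z) gives φ′(z) = −1/(ηz): the unconstrained minimizer satisfies 1/z_{t+1}(i) = 1/z_t(i) − η r_t(i), and convexity lets us project onto [δ, 1] by clipping. The following sketch is our own derivation, not code from the paper, and it checks the formula against a brute-force grid minimization:

```python
import numpy as np

def update_confidence(z, r, eta, delta):
    """Line 9 of Algorithm 3: z' = argmin_{z' in [delta, 1]} -r*z' + D_phi(z', z),
    with phi(z) = (1/eta) ln(1/z).  Closed form (our derivation): 1/z' = 1/z - eta*r,
    projected onto [delta, 1] by clipping (valid because the objective is convex)."""
    inv = 1.0 / z - eta * r
    z_new = 1.0 / inv if inv > 0 else 1.0   # nonpositive 1/z' means the minimum is at z' = 1
    return float(np.clip(z_new, delta, 1.0))
```

Note that a positive instantaneous regret r_t(i) (action i doing better than the algorithm) raises the confidence z_t(i), and a negative one lowers it, matching the intuition behind Line 7 of Algorithm 1.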
As mentioned in Section 1, one could directly generalize the approach of [8, 2, 13] to obtain a bound of order O(√(T K(S ln T + n ln K))), a natural generalization of the full-information guarantee, but such a bound is not a meaningful improvement over (2), due to the √K dependence that is unavoidable for MAB in the worst case. Therefore, we consider a special case where the dependence on K is much smaller: the sparse MAB problem [10]. Specifically, in this setting we make the additional assumption that all loss vectors are ρ-sparse for some ρ ∈ [K], that is, ‖ℓ_t‖_0 ≤ ρ for all t. It was shown in [10] that for sparse MAB the static regret is of order O(√(Tρ ln K) + K ln T), exhibiting a much more favorable dependence on K.

Negative result. To the best of our knowledge, there are no prior results on switching regret for sparse MAB. In light of bound (2), a natural conjecture would be that one can achieve switching regret of O(√(TρS ln(KT)) + KS ln T) with S switches. Perhaps surprisingly, we show that this is in fact impossible.

Theorem 6. For any T, S, K ≥ 2 and any MAB algorithm, there exists a sequence of loss vectors that are 2-sparse, such that the switching regret of this algorithm is at least Ω(√(T KS)).

The high-level idea of the proof is to force the algorithm to over-focus on one good action and thus miss an even better action later. This is similar to the constructions of [15, Lemma 3] and [37, Theorem 4.1], and we defer the proof to Appendix C.1. This negative result implies that sparsity does not help improve the typical switching regret bound (2).
In fact, since switching regret for MAB can be seen as a special case of the contextual bandit problem [6, 26], this result also immediately implies the following corollary, a sharp contrast with the positive result for the non-contextual case mentioned earlier (see Appendix C.1 for the definition of contextual bandits and related discussions).

Corollary 7. Sparse losses do not help improve the worst-case regret for contextual bandits.

Long-term memory to the rescue. Despite the above negative results, we next show how long-term memory can still help improve the switching regret for sparse MAB. Specifically, we use our general framework to develop a MAB algorithm whose switching regret is smaller than O(√(T KS)) whenever ρ and n are small while S and K are large. Note that this does not contradict Theorem 6, since in the construction of its proof, n is as large as min{S, K}.

At a high level, our algorithm (Algorithm 3) works by constructing the standard unbiased importance-weighted loss estimator ℓ̂_t (Line 6) and plugging it into our general framework (Algorithm 1). However, we emphasize that it is highly nontrivial to control the variance of these estimators without incurring a bad dependence on K in this framework, where two types of sub-routines interact with each other. To address this issue, we design specialized sub-algorithms A and A_i to learn w_t and z_t(i) respectively. For learning w_t, we essentially deploy the algorithm of [10] for sparse MAB, which is an instance of the standard follow-the-regularized-leader algorithm with a special hybrid regularizer combining the entropy and the log-barrier (Lines 2 and 8). However, note that the loss c_t we feed to this algorithm is not sparse, so we cannot directly apply the guarantee from [10]; it turns out that one can still utilize the implicit exploration of this algorithm, as shown in our analysis.
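To make the two sub-routine updates of Algorithm 3 concrete, the sketch below (our own illustration with arbitrary toy parameters; the paper does not prescribe a numerical solver) implements Lines 8 and 9. The FTRL step over the simplex with the hybrid entropy/log-barrier regularizer has no closed form, so we approximate it with multiplicative gradient iterations that keep every coordinate strictly positive, as the log-barrier requires. The one-dimensional update of z_t(i) with φ(z) = (1/η) ln(1/z), by contrast, does have a closed form: stationarity gives 1/z′ = 1/z_t(i) − η r_t(i), and by convexity the Bregman projection onto [δ, 1] reduces to clipping.

```python
import math

def z_update(z, r, eta, delta):
    # Line 9: argmin over z' in [delta, 1] of  -r*z' + D_phi(z', z),
    # with phi(z) = (1/eta) * ln(1/z).  Stationarity: 1/z' = 1/z - eta*r.
    denom = 1.0 - eta * r * z
    z_star = z / denom if denom > 0 else float("inf")  # no interior minimum -> right endpoint
    return min(1.0, max(delta, z_star))

def ftrl_hybrid(cum_loss, eta, gamma, iters=5000, step=1e-3):
    # Line 8 (approximate): argmin over the simplex of
    #   <w, cum_loss> + (1/eta) * sum_i w_i ln w_i + gamma * sum_i ln(1/w_i),
    # via small multiplicative gradient steps followed by renormalization,
    # which keep w in the interior of the simplex.
    K = len(cum_loss)
    w = [1.0 / K] * K
    for _ in range(iters):
        g = [cum_loss[i] + (math.log(w[i]) + 1.0) / eta - gamma / w[i] for i in range(K)]
        w = [w[i] * math.exp(-step * g[i]) for i in range(K)]
        s = sum(w)
        w = [x / s for x in w]
    return w

# Toy values for eta and gamma (not the tuned parameters of the analysis)
w = ftrl_hybrid(cum_loss=[2.0, 0.5, 1.0], eta=0.1, gamma=0.01)
z = z_update(z=0.5, r=-5.0, eta=0.1, delta=0.2)  # 1/z' = 2 + 0.5, so z' = 0.4
```

When 1 − η r z_t(i) ≤ 0, the objective is decreasing on the whole interval and the update saturates at z′ = 1, which matches the algorithm's bias towards action "1".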
Compared to Algorithm 1, we also incorporate an extra bias term −η z_t(i)ℓ̂_t(i)² in the definition of c_t (Line 7), which is important for canceling the large variance of the loss estimators.

For learning z_t(i) for each i, we design a new algorithm that is an instance of the standard Online Mirror Descent algorithm (see e.g. [22]). Recall that this is a one-dimensional problem, as we are trying to learn the distribution (1 − z_t(i), z_t(i)) over actions {0, 1}. We design a special one-dimensional regularizer φ(z) = (1/η) ln(1/z), which can be seen as a one-sided log-barrier,⁷ to bias towards action "1". Technically, this provides a special "local-norm" guarantee that is critical for our analysis and may be of independent interest (see Lemma 14 in Appendix C.2). In addition, we remove the bias term in the loss for action "1" (so it is only −r_t(i) now), as it does not help in the bandit case, and we also force z_t(i) to be at least δ for some parameter δ, which is important for achieving switching regret. Line 9 summarizes the update for z_t(i).

Finally, we also enforce a small amount of uniform exploration by sampling I_t from p̃_t, a smoothed version of p_t (Line 5). We present the main result for our algorithm below (proven in Appendix C.2).

Theorem 8. With η = max{S^{1/3} ρ^{−2/3} (nT)^{−1/3}, √(ln K/(Tρ))}, δ = √(S/(Tηn)), and γ = 200K², Algorithm 3 ensures

R(i_{1:T}) = O((ρS)^{1/3}(nT)^{2/3} + n√(Tρ ln K) + nK³ ln T)    (9)

for any sequence of ρ-sparse losses ℓ_1, . . . , ℓ_T and any benchmark sequence i_1, . . . , i_T such that ∑_{t=2}^T 1{i_t ≠ i_{t−1}} ≤ S − 1 and |{i_1, . . . , i_T}| ≤ n.

In the case when ρ and n are constants, our bound (9) becomes O(S^{1/3} T^{2/3} + K³ ln T), which improves over the existing bound O(√(T KS ln(T K))) when (T/S)^{1/3} < K < (T S)^{1/5} (also recall the example in Section 1 where our bound is sublinear in T while existing bounds become vacuous).

As a final remark, one might wonder whether similar best-of-both-worlds results are also possible for MAB in terms of switching regret, given the positive results for static regret [9, 33, 5, 32, 36, 38]. We point out that the answer is negative: the proof of [37, Theorem 4.1] implicitly implies that even with one switch, logarithmic regret is impossible for MAB in the stochastic setting.

5 Conclusion

In this work, we propose a simple reduction-based approach to obtaining long-term memory regret guarantees. By plugging various existing algorithms into this framework, we not only obtain new algorithms for this problem in the adversarial case, but also resolve the open problem of Warmuth and Koolen [35] that asks for a single algorithm achieving the best of both stochastic and adversarial environments in this setup. We also extend our results to the bandit setting and show both negative and positive results.

One clear open question is whether our bound for the bandit case (Theorem 8) can be improved, and more generally, what the best achievable bound in this case is.

Acknowledgments. The authors would like to thank Alekh Agarwal, Sébastien Bubeck, Dylan Foster, Wouter Koolen, Manfred Warmuth, and Chen-Yu Wei for helpful discussions. Kai Zheng and Liwei Wang were supported by the National Key R&D Program of China (no. 2018YFB1402600) and BJNSF (L172037). Haipeng Luo was supported by NSF Grant IIS-1755781.
Ilias Diakonikolas was supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.

⁷ The usual log-barrier regularizer (see e.g. [16, 3, 36]) would be (1/η)(ln(1/z) + ln(1/(1 − z))) in this case.

References

[1] D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. In International Conference on Algorithmic Learning Theory, pages 290–304. Springer, 2012.

[2] D. Adamskiy, M. K. Warmuth, and W. M. Koolen. Putting Bayes to sleep. In Advances in Neural Information Processing Systems, pages 135–143, 2012.

[3] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, 2017.

[4] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.

[5] P. Auer and C.-K. Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pages 116–120, 2016.

[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[7] A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8(Jun):1307–1324, 2007.

[8] O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.

[9] S. Bubeck and A. Slivkins. The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory, pages 42.1–42.23, 2012.

[10] S. Bubeck, M. Cohen, and Y. Li. Sparsity, variance and curvature in multi-armed bandits. In Algorithmic Learning Theory, pages 111–127, 2018.

[11] S. Bubeck, Y. Li, H. Luo, and C.-Y. Wei. Improved path-length regret bounds for bandits.
In Conference on Learning Theory, 2019.

[12] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.

[13] N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems, pages 980–988, 2012.

[14] P. Christiano. Manipulation-resistant online learning. PhD thesis, University of California, Berkeley, 2017.

[15] A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411, 2015.

[16] D. J. Foster, Z. Li, T. Lykouris, K. Sridharan, and E. Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4734–4742, 2016.

[17] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[18] Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, 1997.

[19] P. Gaillard, G. Stoltz, and T. Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196, 2014.

[20] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165–188, 2010.

[21] E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. In Electronic Colloquium on Computational Complexity (ECCC), volume 14, 2007.

[22] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[23] M. Herbster and M. K. Warmuth.
Tracking the best expert. Machine Learning, 32(2):151–178, 1998.

[24] K.-S. Jun, F. Orabona, S. Wright, R. Willett, et al. Online learning for changing environments using coin betting. Electronic Journal of Statistics, 11(2):5282–5310, 2017.

[25] W. M. Koolen and T. Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175, 2015.

[26] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.

[27] H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286–1304, 2015.

[28] H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford. Efficient contextual bandits in non-stationary worlds. In Conference on Learning Theory, pages 1739–1776, 2018.

[29] H. T. Nguyen and K. Franke. Adaptive intrusion detection system via online machine learning. In 2012 12th International Conference on Hybrid Intelligent Systems (HIS), pages 271–277. IEEE, 2012.

[30] B. A. A. Nunes, K. Veenstra, W. Ballenthin, S. Lukin, and K. Obraczka. A machine learning framework for TCP round-trip time estimation. EURASIP Journal on Wireless Communications and Networking, 2014(1):47, 2014.

[31] T. Santarra. Communicating Plans in Ad Hoc Multiagent Teams. PhD thesis, UC Santa Cruz, 2019.

[32] Y. Seldin and G. Lugosi. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, 2017.

[33] Y. Seldin and A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295, 2014.

[34] J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm.
In International Conference on Machine Learning, pages 1593–1601, 2014.

[35] M. K. Warmuth and W. M. Koolen. Open problem: Shifting experts on easy data. In Conference on Learning Theory, pages 1295–1298, 2014.

[36] C.-Y. Wei and H. Luo. More adaptive algorithms for adversarial bandits. In Conference on Learning Theory, pages 1263–1291, 2018.

[37] C.-Y. Wei, Y.-T. Hong, and C.-J. Lu. Tracking the best expert in non-stationary stochastic environments. In Advances in Neural Information Processing Systems, pages 3972–3980, 2016.

[38] J. Zimmert, H. Luo, and C.-Y. Wei. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, 2019.