{"title": "Multi-armed Bandits: Competing with Optimal Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 199, "page_last": 207, "abstract": "We consider sequential decision making problem in the adversarial setting, where regret is measured with respect to the optimal sequence of actions and the feedback adheres the bandit setting. It is well-known that obtaining sublinear regret in this setting is impossible in general, which arises the question of when can we do better than linear regret? Previous works show that when the environment is guaranteed to vary slowly and furthermore we are given prior knowledge regarding its variation (i.e., a limit on the amount of changes suffered by the environment), then this task is feasible. The caveat however is that such prior knowledge is not likely to be available in practice, which causes the obtained regret bounds to be somewhat irrelevant. Our main result is a regret guarantee that scales with the variation parameter of the environment, without requiring any prior knowledge about it whatsoever. By that, we also resolve an open problem posted by [Gur, Zeevi and Besbes, NIPS' 14]. An important key component in our result is a statistical test for identifying non-stationarity in a sequence of independent random variables. This test either identifies non-stationarity or upper-bounds the absolute deviation of the corresponding sequence of mean values in terms of its total variation. This test is interesting on its own right and has the potential to be found useful in additional settings.", "full_text": "Multi-armed Bandits:\n\nCompeting with Optimal Sequences\n\nOren Anava\n\noren@voleon.com\n\nThe Voleon Group\n\nBerkeley, CA\n\nZohar Karnin\nYahoo! 
Research\nNew York, NY\nzkarnin@yahoo-inc.com\n\nAbstract\n\nWe consider a sequential decision-making problem in the adversarial setting, where regret is measured with respect to the optimal sequence of actions and the feedback adheres to the bandit setting. It is well known that obtaining sublinear regret in this setting is impossible in general, which raises the question: when can we do better than linear regret? Previous works show that when the environment is guaranteed to vary slowly and, furthermore, we are given prior knowledge regarding its variation (i.e., a limit on the amount of changes suffered by the environment), then this task is feasible. The caveat, however, is that such prior knowledge is not likely to be available in practice, which renders the obtained regret bounds somewhat irrelevant.\nOur main result is a regret guarantee that scales with the variation parameter of the environment, without requiring any prior knowledge about it whatsoever. By that, we also resolve an open problem posed by Gur, Zeevi and Besbes [8]. An important key component in our result is a statistical test for identifying non-stationarity in a sequence of independent random variables. This test either identifies non-stationarity or upper-bounds the absolute deviation of the corresponding sequence of mean values in terms of its total variation. The test is interesting in its own right and has the potential to prove useful in additional settings.\n\n1 Introduction\n\nMulti-Armed Bandit (MAB) problems have been studied extensively in the past, with two important special cases: the Stochastic Multi-Armed Bandit and the Adversarial (Non-Stochastic) Multi-Armed Bandit. In both formulations, the problem can be viewed as a T-round repeated game between a player and nature. In each round, the player chooses one of k actions1 and observes the loss corresponding to this action only (the so-called bandit feedback). 
In the adversarial formulation, it is usually assumed that the losses are chosen by an all-powerful adversary that has full knowledge of our algorithm. In particular, the loss sequences need not comply with any distributional assumptions. On the other hand, in the stochastic formulation each action is associated with some mean value that does not change throughout the game. The feedback from choosing an action is an i.i.d. noisy observation of this action's mean value.\nThe performance of the player is traditionally measured using the static regret, which compares the total loss of the player with the total loss of the benchmark playing the best fixed action in hindsight. A stronger measure of the player's performance, sometimes referred to as dynamic regret2 (or just regret for brevity), compares the total loss of the player with that of the optimal benchmark, playing the best possible sequence of actions. Notice that in the stochastic formulation both measures coincide, assuming that the benchmark has access to the parameters defining the random process of the losses but not to the random bits generating the loss sequences. In the adversarial formulation this is clearly not the case, and it is not hard to show that attaining sublinear regret is impossible in general, whereas obtaining sublinear static regret is indeed possible.\n\n1 We sometimes use the terminology arm for an action throughout.\n2 The dynamic regret is occasionally referred to as shifting regret or tracking regret in the literature.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n
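To make the distinction concrete, here is a small illustration (with a randomly drawn loss matrix; all variable names are ours, not the paper's) of how the static benchmark, which fixes one arm, differs from the dynamic benchmark, which picks the best arm at every round:

```python
import random

random.seed(0)
T, k = 1000, 2
# Hypothetical loss matrix: losses[t][i] is the loss of arm i at round t.
losses = [[random.random() for _ in range(k)] for _ in range(T)]

player = sum(row[0] for row in losses)              # a player pulling arm 0 only
static_bench = min(sum(row[i] for row in losses) for i in range(k))
dynamic_bench = sum(min(row) for row in losses)     # best action at every round

static_regret = player - static_bench               # regret vs. best fixed arm
dynamic_regret = player - dynamic_bench             # regret vs. optimal sequence
assert dynamic_regret >= static_regret              # dynamic benchmark is stronger
```

Since the per-round minimum is never larger than any fixed arm's loss, the dynamic benchmark is always at least as strong as the static one.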
This can perhaps explain why most of the literature is concerned with optimizing the static regret rather than its dynamic counterpart.\nPrevious attempts to tackle the problem of regret minimization in the adversarial formulation mostly took advantage of some niceness parameter of nature (that is, some non-adversarial behavior of the loss sequences). This line of research has become increasingly popular, as full characterizations of the regret turn out to be feasible with respect to specific niceness parameters. In this work we focus on a broad family of such niceness parameters (usually called variation-type parameters) originating from the work of [8] in the context of (dynamic) regret minimization. Essentially, we consider a MAB setting in which the mean value of each action can vary over time in an adversarial manner, and the feedback to the player is a noisy observation of that mean value. The variation is then defined as the sum of distances between the vectors of mean values over consecutive rounds, or formally,\n\nVT := sum_{t=2}^{T} max_i |\u00b5t(i) \u2212 \u00b5t\u22121(i)|,     (1)\n\nwhere \u00b5t(i) denotes the mean value of action i at round t. Despite the presentation of VT using the maximum norm, any other norm will lead to similar qualitative formulations. Previous approaches to the problem at hand relied on strong (and sometimes even unrealistic) assumptions on the variation (we refer the reader to Section 1.3, in which related work is discussed in detail). The natural question is whether it is possible to design an algorithm that does not require any assumptions on the variation, yet can achieve o(T) regret whenever VT = o(T). In this paper we answer this question in the affirmative and prove the following.\nTheorem (Informal). Consider a MAB setting with two arms and time horizon T. Assume that at each round t \u2208 {1, . . . 
, T}, the random variables of obtainable losses correspond to a vector of mean values \u00b5t. Then, Algorithm 1 achieves a regret bound of \u02dcO(T^{0.771} + T^{0.82} VT^{0.18}).\n\nOur techniques rely on statistical tests designed to identify changes in the environment on the one hand, but exploit the best option observed so far in case there was no such significant environment change. We elaborate on the key ideas behind our techniques in Section 1.2.\n\n1.1 Model and Motivation\n\nA player is faced with a sequential decision-making task: In each round t \u2208 {1, . . . , T} = [T], the player chooses an action it \u2208 {1, . . . , k} = [k] and observes loss \u2113t(it) \u2208 [0, 1]. We assume that E[\u2113t(i)] = \u00b5t(i) for any i \u2208 [k] and t \u2208 [T], where {\u00b5t(i)}_{t=1}^T are fixed beforehand by the adversary (that is, the adversary is oblivious). For simplicity, we assume that {\u2113t(i)}_{t=1}^T are also generated beforehand. The goal of the player is to minimize the regret, which is henceforth defined as\n\nRT = sum_{t=1}^T \u00b5t(it) \u2212 sum_{t=1}^T \u00b5t(i*t),\n\nwhere i*t = arg min_{i \u2208 [k]} {\u00b5t(i)}. A sequence of actions {it}_{t=1}^T has no-regret if RT = o(T). It is well known that generating no-regret sequences in our setting is generally impossible, unless the benchmark sequence is somehow limited (for example, in its total number of action switches) or, alternatively, some characterization of {\u00b5t(i)}_{t=1}^T is given (in our case, {\u00b5t(i)}_{t=1}^T are characterized via the variation). While limiting the benchmark makes sense only when we have a strong reason to believe that an action sequence from the limited class has satisfactory performance, characterizing the environment is an approach that leads to guarantees of the following type:\n\nIf the environment is well-behaved (w.r.t. our characterization), then our performance is comparable with the optimal sequence of actions. 
If not, then no algorithm is capable of obtaining sublinear regret without further assumptions on the environment.\n\nObtaining algorithms with such a guarantee is an important task in many real-world applications. For example, an online forecaster must respond to time-related trends in her data, an investor seeks to detect trading trends as quickly as possible, a salesman should adjust himself to the constantly changing taste of his audience, and many other examples can be found. We believe that in many of these examples the environment is often likely to change slowly, making guarantees of the type we present highly desirable.\n\n1.2 Our Techniques\n\nAn intermediate (noiseless) setting. We begin with an easier setting, in which the observable losses are deterministic. That is, by choosing arm i at round t, rather than observing the realization of a random variable with mean value \u00b5t(i) we simply observe \u00b5t(i). Note that {\u00b5t}_{t=1}^T are still assumed to be generated adversarially. In this setting, the following intuitive solution can be shown to work (for two arms): Pull each arm once and observe two values. Now, in each round pull the arm with the smaller loss w.p. 1 \u2212 o(1) and the other arm w.p. o(1), where the latter is decreasing with time. As long as the mean values of the arms have not significantly shifted compared to their original values, continue. Once a significant shift is observed, reset all counters and start over. We note that while the algorithm is simple, its analysis is not straightforward and contains some counterintuitive ideas. 
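A minimal sketch of this noiseless-setting strategy for two arms. The reset threshold and the o(1) exploration rate below are illustrative choices of ours; the paper's analysis (Appendix A) tunes these quantities differently:

```python
import random

def noiseless_two_arm(mu, shift_threshold=0.1):
    # mu: list of (mu_t(0), mu_t(1)) pairs; feedback is exact (noiseless).
    # shift_threshold and the exploration rate are illustrative, not the paper's.
    T = len(mu)
    total_loss = 0.0
    t = 0
    while t < T:
        if t + 1 >= T:            # not enough rounds left to re-initialize
            total_loss += mu[t][0]
            break
        # pull each arm once to initialize the current block
        base = [mu[t][0], mu[t + 1][1]]
        total_loss += base[0] + base[1]
        t += 2
        s = 1                     # rounds since the block started
        while t < T:
            explore_p = 1.0 / (s + 1)   # o(1), decreasing with time
            leader = 0 if base[0] <= base[1] else 1
            arm = leader if random.random() > explore_p else 1 - leader
            loss = mu[t][arm]
            total_loss += loss
            t += 1
            s += 1
            if abs(loss - base[arm]) > shift_threshold:
                break             # significant shift observed: reset, start over
    return total_loss
```

On a stationary mean sequence the inner loop never resets, and the player's loss concentrates around the better arm's total loss.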
In particular, we show that the true (unknown) variation can be replaced with a crude proxy called the observed variation (to be defined later), while still maintaining mathematical tractability of the problem.\nTo see the importance of this proxy, let us first describe a different approach to the problem at hand, one that directly extends the approach of [8], who show that if an upper bound on VT is known in advance, then the optimal regret is attainable. One might therefore guess that having an unbiased estimator for VT would eliminate the need for this prior knowledge. Obtaining such an unbiased estimator is not hard (via importance sampling), but it turns out to be insufficient: the values of the variation to be identified are simply too small to be estimated accurately. This is where the observed variation comes into the picture; it is loosely defined as the loss difference between two successive pulls of the same arm. Clearly, the true variation is only larger, but as we show, it cannot be much larger without us noticing. We provide a complete analysis of the noiseless setting in Appendix A. This analysis is not directly used for dealing with the noisy setting, but acts as a warm-up and contains some of the key techniques used for it.\nBack to our (noisy) setting. Here also we focus on the case of k = 2 arms. When the losses are stochastic the same basic ideas apply, but several new major issues come up. In particular, here as well we present an algorithm that resets all counters and starts over once a significant change in the environment is detected. The similarity, however, ends here, mainly because the noisy feedback makes it hard to determine whether the changes we see are due to some environmental shift or due to the stochastic nature of the problem. 
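The observed variation mentioned above can be made concrete. The following sketch (our own toy formalization, in the noiseless case) sums the loss differences between successive pulls of the same arm, and compares the result against the per-arm (1-norm) variation, which upper-bounds it by the triangle inequality:

```python
def observed_variation(pulls, losses):
    # Sum of |loss difference| between successive pulls of the same arm.
    # pulls[t]: arm pulled at round t; losses[t]: the (noiseless) loss observed.
    last = {}
    ov = 0.0
    for arm, loss in zip(pulls, losses):
        if arm in last:
            ov += abs(loss - last[arm])
        last[arm] = loss
    return ov

def variation_l1(mu):
    # Per-arm (1-norm) variation: sum_t sum_i |mu_t(i) - mu_{t-1}(i)|.
    # (Eq. (1) uses the max norm; the paper notes any norm behaves similarly.)
    return sum(abs(a - c) + abs(b - d) for (a, b), (c, d) in zip(mu[1:], mu[:-1]))

# Toy mean sequence for two arms with alternating pulls: between successive
# pulls of an arm, its mean can drift by at most the variation accrued between.
mu = [(0.5, 0.6), (0.5, 0.6), (0.7, 0.6), (0.7, 0.4)]
pulls = [0, 1, 0, 1]
losses = [mu[t][pulls[t]] for t in range(len(mu))]
assert observed_variation(pulls, losses) <= variation_l1(mu)
```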
The straightforward way of overcoming this is to forcefully divide the time into 'bins' in which we continuously pull the same arm. By doing this, and averaging the observed losses within a bin, we can obtain feedback that is not as noisy. This meta-technique raises two major issues. The first is: how long should these bins be? A long period would eliminate the noise originating from the stochastic feedback, but would cripple our adaptive capabilities and make us more vulnerable to changes in the environment. The second issue is: if there was a change in the environment that is, in some sense, local to a single bin, how can we identify it? And assuming we did, when should we tolerate it?\nThe algorithm we present overcomes the first issue by starting with an exploration phase, where both arms are queried with equal probability. We advance to the next phase only once it is clear that the average loss of one arm is greater than the other, and furthermore, we have a solid estimate of the gap between them. In the next, exploitation phase, we mimic the above algorithm for deterministic feedback by pulling the arms in bins of length proportional to the (inverted squared) gap between the arms. The techniques from above take care of the regret compared to a strategy that must be fixed inside the bins, or alternatively, against the optimal strategy if we were somehow guaranteed that there are no significant environment changes within bins. This leads us to the second issue, but first consider the following example.\nExample 1. During the exploration phase, we associated arm #1 with an expected loss of 0.5 and arm #2 with an expected loss of 0.6. Now, consider a bin in which we pull arm #1. In the first half of the bin the expected loss is 0.25 and in the second it is 0.75. The overall expected loss is 0.5, hence without performing some test w.r.t. 
the pulls inside the bin we do not see any change in the\nenvironment, and as far as we know we mimicked the optimal strategy. The optimal strategy however\ncan clearly do much better and we suffer a linear regret in this bin. Furthermore, the variation during\nthe bin is constant! The good news is that in this scenario a simple test would determine that the\noutcome of the arm pulls inside the bin do not resemble those of an i.i.d. random variables, meaning\nthat the environment change can be detected.\n\n3\n\n\fFigure 1: The optimal policy of an adversary that minimizes variation while maximizing deviation.\n\nExample 1 clearly demonstrates the necessity of a statistical test inside a bin. However, there are\ncases in which the changes of the environment are unidenti\ufb01able and the regret suffered by any\nalgorithm will be linear, as can be seen in the following example.\nExample 2. Assume that arm #1 has mean value of 0.5 for all t, and arm #2 has mean value of\n1 with probability 0.5 and 0 otherwise. The feedback from pulling arm i at a speci\ufb01c round is a\nBernoulli random variable with the mean value of that arm. Clearly, there is no way to distinguish\nbetween these arms, and thus any algorithm would suffer linear regret. The point, however, is that\nthe variation in this example is also linear, and thus linear regret is unavoidable in general.\nExample 2 shows that if the adversary is willing to put enough effort (in terms of variation), then\nlinear regret is unavoidable. The intriguing question is whether the adversary can put some less\neffort (that is, to invest less than linear variation) and still cause us to suffer linear regret, while not\nproviding us the ability to notice that the environment has changed. The crux of our analysis is the\ndesign of two tests, one per phase (exploration or exploitation), each is able to identify changes\nwhenever it is possible or to ensure they do not hurt the regret too much whenever it is not. 
This building block, along with the 'outer bin regret' analysis mentioned above, allows us to achieve our result in this setting. The essence of our statistical tests is presented here, while formal statements and proofs are deferred to Section 2.\nOur statistical tests (informal presentation). Let X1, . . . , Xn \u2208 [0, 1] be a sequence of realizations, such that each Xi is generated from an arbitrary distribution with mean value \u00b5i. Our task is to determine whether it is likely that \u00b5i = \u00b50 for all i, where \u00b50 is a given constant. In case there is not enough evidence to reject this hypothesis, the test is required to bound the absolute deviation of \u00b5n = {\u00b5i}_{i=1}^n (henceforth denoted ||\u00b5n||_ad) in terms of its total variation3 ||\u00b5n||_tv. Assume for simplicity that \u00af\u00b51:n = (1/n) sum_{i=1}^n \u00b5i is close enough to \u00b50 (or even exactly equal to it), which eliminates the need to check the deviation of the average from \u00b50. We are thus left with checking the inner-sequence dynamics.\nIt is worthwhile to consider the problem from the adversary's point of view: The adversary has full control of the values of {\u00b5i}_{i=1}^n, and his task is to deviate as much as he can from the average without providing us the ability to identify this deviation. Now, consider a partition of [n] into consecutive segments, such that (\u00b5i \u2212 \u00b50) has the same sign for any i within a segment. Given this partition, it can be shown that the optimal policy of an adversary that tries to minimize the total variation of {\u00b5i}_{i=1}^n while maximizing its absolute deviation is to set \u00b5i to be equal within each segment. The length of a segment [a, b] is thus limited to at most 1/|\u00af\u00b5a:b \u2212 \u00b50|^2, or otherwise the deviation is notable (this follows by standard concentration arguments). 
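The quantities involved can be computed directly. A short sketch (notation mirrors footnote 3; segment length and amplitude are our choices) that checks the claimed deviation/variation relation on a piecewise-constant sequence of the kind an optimal adversary would play:

```python
def tv(mu):
    # total variation: ||mu||_tv = sum_i |mu_i - mu_{i-1}|
    return sum(abs(a - b) for a, b in zip(mu[1:], mu[:-1]))

def ad(mu):
    # absolute deviation: ||mu||_ad = sum_i |mu_i - mean(mu)|
    m = sum(mu) / len(mu)
    return sum(abs(x - m) for x in mu)

# Piecewise-constant sequence in the spirit of Figure 1: equal means within
# segments, alternating around mu0 = 0.5.
n, seg, eps = 512, 64, 0.05
mu = [0.5 + (eps if (i // seg) % 2 == 0 else -eps) for i in range(n)]
# The bound stated in the text: ||mu||_ad <= n^(2/3) ||mu||_tv^(1/3)
assert ad(mu) <= n ** (2.0 / 3.0) * tv(mu) ** (1.0 / 3.0)
```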
Figure 1 provides a visualization of this optimal policy. Summing the absolute deviation over the segments and using H\u00f6lder's inequality ensures that ||\u00b5n||_ad \u2264 n^{2/3} ||\u00b5n||_tv^{1/3}, or otherwise there exists a segment in which the distance between the realization average and \u00b50 is significantly large. Our test is thus the simple test that measures this distance for every segment. Notice that the test runs in polynomial time; further optimization might improve the polynomial degree, but is outside the scope of this paper.\nThe test presented above aims to bound the absolute deviation w.r.t. some given mean value. As such, it is appropriate only for the exploitation phase of our algorithm, in which a solid estimate of each arm's mean value is given. However, it turns out that bounding the absolute deviation with respect to some unknown mean value can be done using similar ideas, yet is slightly more complicated.\n\n3 We use standard notions of total variation and absolute deviation. That is, the total variation of a sequence {\u00b5i}_{i=1}^n is defined as ||\u00b5n||_tv = sum_{i=2}^n |\u00b5i \u2212 \u00b5i\u22121|, and its absolute deviation is ||\u00b5n||_ad = sum_{i=1}^n |\u00b5i \u2212 \u00af\u00b51:n|, where \u00af\u00b51:n = (1/n) sum_{i=1}^n \u00b5i.\n\nAlternative approaches. We point out that the approach of running a meta-bandit algorithm over (logarithmically many) instances of the algorithm proposed by [8] would be very difficult to pursue. In this approach, whenever an EXP3 instance is not chosen by the meta-bandit algorithm it is still forced to play an arm chosen by a different EXP3 instance. We are not aware of an analysis of EXP3, or of any other algorithm, equipped to handle such a harsh setting. 
Another idea that would be hard to pursue is tackling the problem using a doubling trick. This idea is common when parameters needed for the execution of an algorithm are unknown in advance, but can in fact be guessed and updated if necessary. In our case, the variation is not observed due to the bandit feedback, and moreover, estimating it using importance sampling will lead to estimators that are too crude to allow a doubling trick.\n\n1.3 Related Work\n\nThe question of whether (and when) it is possible to obtain bounds on other than the static regret has long been studied in a variety of settings, including Online Convex Optimization (OCO), Bandit Convex Optimization (BCO), Prediction with Expert Advice, and Multi-Armed Bandits (MAB). Stronger notions of regret include the dynamic regret (see for instance [17, 4]), the adaptive regret [11], the strongly adaptive regret [5], and more. From now on, we focus on the dynamic regret only. Regardless of the setting considered, it is not hard to construct a loss sequence such that obtaining sublinear dynamic regret is impossible (in general). Thus, the problem of minimizing it is usually weakened in one of the two following forms: (1) restricting the benchmark; and (2) characterizing the niceness of the environment.\nWith respect to the first weakening form, [17] showed that in the OCO setting the dynamic regret can be bounded in terms of CT = sum_{t=2}^T ||at \u2212 at\u22121||, where {at}_{t=1}^T is the benchmark sequence. In particular, restricting the benchmark sequence with CT = 0 gives the standard static regret result. [6] suggested that this type of result is attainable in the BCO setting as well, but we are not familiar with such a result. In the MAB setting, [1] defined the hardness of a benchmark sequence as the number of its action switches, and bounded the dynamic regret in terms of this hardness. Here again, the standard static regret bound is obtained if the hardness is restricted to 0. 
The concept of bounding the dynamic regret in terms of the total number of action switches was studied by [14], in the setting of Prediction with Expert Advice.\nWith respect to the second weakening form, one can find an immense amount of MAB literature that uses stochastic assumptions to model the environment. In particular, [16] coined the term restless bandits: a model in which the loss sequences change in time according to an arbitrary, yet known in advance, stochastic process. To cope with the hard nature of this model, subsequent works offered approximations, relaxations, and more detailed models [3, 7, 15, 2]. Perhaps the first attempt to handle arbitrary loss sequences in the context of dynamic regret and MAB appears in the work of [8]. In a setting identical to ours, the authors fully characterize the dynamic regret: \u0398(T^{2/3} VT^{1/3}), if a bound on VT is known in advance. We provide a high-level description of their approach.\nRoughly speaking, their algorithm divides the time horizon into (equally sized) blocks and applies the EXP3 algorithm of [1] in each of them. This guarantees sublinear static regret w.r.t. the best fixed action in the block. Now, since the number of blocks is set to be much larger than the value of VT (if VT = o(T)), it can be shown that in most blocks the variation inside the block is o(1), and the total loss of the best fixed action (within a block) turns out to be not very far from the total loss of the best sequence of actions. The size of the blocks (which is fixed and determined in advance as a function of T and VT) is tuned accordingly to obtain the optimal rate in this case. 
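The restart schedule of [8] can be sketched as follows. The block length below uses the standard tuning of roughly (T / VT)^(2/3), which should be taken as illustrative, up to constants and logarithmic factors, rather than as their exact formula:

```python
import math

def block_schedule(T, V_T):
    # Partition rounds 1..T into contiguous, (roughly) equal blocks of length
    # delta ~ (T / V_T)^(2/3); a fresh EXP3 instance is restarted in each block.
    # Illustrative tuning only (constants and log factors omitted).
    delta = max(1, math.ceil((T / max(V_T, 1e-9)) ** (2.0 / 3.0)))
    return [(s, min(s + delta - 1, T)) for s in range(1, T + 1, delta)]

blocks = block_schedule(10_000, 10.0)
assert blocks[0][0] == 1 and blocks[-1][1] == 10_000  # blocks cover 1..T
```

Because the schedule is fixed in advance as a function of T and VT, it requires prior knowledge of VT and restarts regardless of where the variation actually occurs, which is exactly the shortcoming the present paper addresses.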
The main shortcomings of this algorithm are the reliance on prior knowledge of VT, and the restarting procedure that does not take the variation into account.\nWe also note the work of [13], in which the two forms of weakening are combined to obtain dynamic regret bounds that depend both on the complexity of the benchmark and on the niceness of the environment. Another line of work that is close to ours (at least in spirit) aims to minimize the static regret in terms of the variation (see for instance [9, 10]).\nA word about existing statistical tests. There are many existing statistical tests, such as the z-test, the t-test, and more, that aim to determine whether sample data come from a distribution with a particular mean value. These tests, however, are not suitable for our setting since (1) they mostly require assumptions on the data generation (e.g., Gaussianity), and (2) they lack our desired bound on the total absolute deviation of the mean sequence in terms of its total variation. The latter is especially important in light of Example 2, which demonstrates that a mean sequence can deviate from its average without providing us any hint.\n\n2 Competing with Optimal Sequences\n\nBefore presenting our algorithm and analysis we introduce some general notation and definitions. Let Xn = {Xi}_{i=1}^n \u2208 [0, c]^n be a sequence of independent random variables, and denote \u00b5i = E[Xi]. For any n1, n2 \u2208 [n], where n1 \u2264 n2, we denote by \u00afX_{n1:n2} the average of X_{n1}, . . . , X_{n2}, and by \u00afXc_{n1:n2} the average of the other random variables. That is,\n\n\u00afX_{n1:n2} = (1/(n2 \u2212 n1 + 1)) sum_{i=n1}^{n2} Xi   and   \u00afXc_{n1:n2} = (1/(n \u2212 n2 + n1 \u2212 1)) ( sum_{i=1}^{n1\u22121} Xi + sum_{i=n2+1}^{n} Xi ).\n\nWe sometimes use the notation sum_{i \u2209 {n1,...,n2}} for the second sum when n is implied from the context. The expected values of \u00afX_{n1:n2} and \u00afXc_{n1:n2} are denoted by \u00af\u00b5_{n1:n2} and \u00af\u00b5c_{n1:n2}, respectively. We use two additional quantities defined w.r.t. n1, n2:\n\n\u03b51(n1, n2) := (1/(n2 \u2212 n1 + 1))^{1/2}   and   \u03b52(n1, n2) := (1/(n2 \u2212 n1 + 1) + 1/(n \u2212 n2 + n1 \u2212 1))^{1/2}.\n\nWe slightly abuse notation and define V_{n1:n2} := sum_{i=n1+1}^{n2} |\u00b5i \u2212 \u00b5i\u22121| as the total variation of a mean sequence \u00b5n = {\u00b5i}_{i=1}^n \u2208 [0, 1]^n over the interval {n1, . . . , n2}.\nDefinition 2.1. (weakly stationary, non-stationary) We say that \u00b5n = {\u00b5i}_{i=1}^n \u2208 [0, 1]^n is \u03b1-weakly stationary if V_{1:n} \u2264 \u03b1. We say that \u00b5n is \u03b1-non-stationary if it is not \u03b1-weakly stationary4.\nThroughout the paper, we mostly use these definitions with \u03b1 = 1/\u221an. In this case we shorten the notation and simply say that a sequence is weakly stationary (or non-stationary). In the sequel, we somewhat abuse notation and use capital letters (X1, . . . , Xn) both for random variables and realizations. The specific use should be clear from the context, if not spelled out explicitly. Next, we define a notion of a concentrated sequence that depends on a parameter T. In what follows, T will always be used as the time horizon.\nDefinition 2.2. (concentrated, strongly concentrated) We say that a sequence Xn = {Xi}_{i=1}^n \u2208 [0, c]^n is concentrated w.r.t. \u00b5n if for any n1, n2 \u2208 [n] it holds that:\n\n(1) |\u00afX_{n1:n2} \u2212 \u00af\u00b5_{n1:n2}| \u2264 (2.5 c^2 log(T))^{1/2} \u03b51(n1, n2).\n(2) |\u00afX_{n1:n2} \u2212 \u00afXc_{n1:n2} \u2212 \u00af\u00b5_{n1:n2} + \u00af\u00b5c_{n1:n2}| \u2264 (2.5 c^2 log(T))^{1/2} \u03b52(n1, n2).\n\nWe further say that Xn is strongly concentrated w.r.t. \u00b5n if any successive sub-sequence {Xi}_{i=n1}^{n2} \u2286 Xn is concentrated w.r.t. {\u00b5i}_{i=n1}^{n2}.\nWhenever the mean sequence is inferred from the context, we will simply say that Xn is concentrated (or strongly concentrated). The parameters in the above definition are set so that standard concentration bounds lead to the statement that any sequence of independent random variables is strongly concentrated with high probability. The formal statement is given below and is proven in Appendix B.\nClaim 2.3. Let XT = {Xi}_{i=1}^T \u2208 [0, c]^T be a sequence of independent random variables, such that T \u2265 2 and c > 0. Then, XT is strongly concentrated with probability at least 1 \u2212 1/T.\n\n2.1 Statistical Tests for Identifying Non-Stationarity\n\nTEST 1 (the offline test). The goal of the offline test is to determine whether a sequence of realizations Xn is likely to be generated from a mean sequence \u00b5n that is close (in a sense) to some given value \u00b50. This will later be used to determine whether a series of pulls of the same arm (inside a single bin) in the exploitation phase exhibits the same behavior as observed in the exploration phase. We would like to have a two-sided guarantee. If the means did not significantly shift, the algorithm must state that the sequence is weakly stationary. 
On the other hand, if the algorithm states that the sequence is weakly stationary, we require the absolute deviation of \u00b5n to be bounded in terms of its total variation. We provide an analysis of Test 1 in Appendix B.\n\nInput: a sequence Xn = {Xi}_{i=1}^n \u2208 [0, c]^n, and a constant \u00b50 \u2208 [0, 1].\nThe test: for any two indices n1, n2 \u2208 [n] such that n1 < n2, check whether\n\n|\u00afX_{n1:n2} \u2212 \u00b50| \u2265 (\u221a2.5 c + 2) log^{1/2}(T) \u03b51(n1, n2).\n\nOutput: non-stationary if such n1, n2 were found; weakly stationary otherwise.\n\nTEST 1: (the offline test) The test aims to identify variation during the exploitation phase.\n\nInput: a sequence XQ = {Xi}_{i=1}^Q \u2208 [0, c]^Q, that is revealed gradually (one Xi after the other).\nThe test: for n = 2, 3, . . . , Q:\n\n(1) observe Xn and set Xn = {Xi}_{i=1}^n.\n(2) for any two indices n1, n2 \u2208 [n] such that n1 < n2, check whether\n\n|\u00afX_{n1:n2} \u2212 \u00afXc_{n1:n2}| \u2265 (\u221a2.5 c + 1) log^{1/2}(T) \u03b52(n1, n2),\n\nand terminate the loop if such n1, n2 were found.\nOutput: non-stationary if the loop was terminated before n = Q; weakly stationary otherwise.\n\nTEST 2: (the online test) The test aims to identify variation during the exploration phase.\n\nTEST 2 (the online test). The online test gets a sequence XQ in an online manner (one variable after the other), and has to stop whenever non-stationarity is exhibited (or the sequence ends). Here, the value of Q is unknown to us beforehand, and might depend on the values of the sequence elements Xi. The rationale is the following: In the exploration phase of the main algorithm we sample the arms uniformly until discovering a significant gap between their average losses. 
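The two tests can be transcribed directly into code. The sketch below (unoptimized window enumeration, with c = 1 by default and T the time horizon; helper names are ours) follows the test boxes above:

```python
import math

def eps1(n1, n2):
    # epsilon_1(n1, n2) from Section 2 (1-based, inclusive indices)
    return (1.0 / (n2 - n1 + 1)) ** 0.5

def eps2(n1, n2, n):
    # epsilon_2(n1, n2) from Section 2
    return (1.0 / (n2 - n1 + 1) + 1.0 / (n - n2 + n1 - 1)) ** 0.5

def offline_test(X, mu0, T, c=1.0):
    # TEST 1: flag non-stationarity iff some window average strays from mu0.
    n = len(X)
    prefix = [0.0]
    for x in X:
        prefix.append(prefix[-1] + x)
    for n1 in range(1, n + 1):
        for n2 in range(n1 + 1, n + 1):
            avg = (prefix[n2] - prefix[n1 - 1]) / (n2 - n1 + 1)
            thr = (math.sqrt(2.5) * c + 2) * math.sqrt(math.log(T)) * eps1(n1, n2)
            if abs(avg - mu0) >= thr:
                return 'non-stationary'
    return 'weakly stationary'

def online_test(stream, T, c=1.0):
    # TEST 2: process the sequence online; stop as soon as the average inside
    # some window differs too much from the average outside it.
    X = []
    for x in stream:
        X.append(x)
        n = len(X)
        prefix = [0.0]
        for v in X:
            prefix.append(prefix[-1] + v)
        for n1 in range(1, n + 1):
            for n2 in range(n1 + 1, n + 1):
                out = n - n2 + n1 - 1
                if out <= 0:
                    continue  # complement of the window is empty
                inner = prefix[n2] - prefix[n1 - 1]
                gap = abs(inner / (n2 - n1 + 1) - (prefix[n] - inner) / out)
                thr = (math.sqrt(2.5) * c + 1) * math.sqrt(math.log(T)) * eps2(n1, n2, n)
                if gap >= thr:
                    return 'non-stationary'
    return 'weakly stationary'
```

On a constant sequence both tests report weakly stationary, while a sufficiently long level shift trips the corresponding threshold.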
While doing so, we would like to make sure that the regret is not large due to environment changes within the exploration process. We require a similar two-sided guarantee as in the previous test, with an additional requirement informally ensuring that if we exit the block in the exploration phase the bound on the absolute deviation still applies. We provide the formal analysis in Appendix B.\n\n2.2 Algorithm and Analysis\n\nHaving this set of testing tools, we proceed to provide a non-formal description of our algorithm. Basically, the algorithm divides the time horizon into blocks according to the variation it identifies. The blocks are denoted by {Bj}_{j=1}^N, where N is the total number of blocks generated by the algorithm. The rounds within block Bj are split into an exploration and an exploitation phase, henceforth denoted Ej,1 and Ej,2 respectively. Each exploitation phase is further divided into bins, where the size of the bins within a block is determined in the exploration phase and does not change throughout the block. The bins within block Bj are denoted by {Aj,a}_{a=1}^{Nj}, where Nj is the total number of bins in block Bj. Note that both N and Nj are random variables. We use t(j, \u03c4) to denote the \u03c4-th round in the exploration phase of block Bj, and t(j, a, \u03c4) to denote the \u03c4-th round of the a-th bin in the exploitation phase of block Bj. As before, notice that t(j, \u03c4) might vary from one run of the algorithm to another, yet is uniquely defined per one run of the algorithm (and the same holds for t(j, a, \u03c4)). Our algorithm is formally given in Algorithm 1, and the working scheme is visually presented in Figure 2. We discuss the extension of the algorithm to k arms in Appendix D.\nTheorem 2.4. Set \u03b8 = 1/2 and \u03bb = (\u221a37 \u2212 5)/2. Then, with probability at least 1 \u2212 10/T, the regret of Algorithm 1 is\n\nRT = sum_{t=1}^T \u00b5t(it) \u2212 sum_{t=1}^T \u00b5t(i*t) \u2264 O(log(T) T^{0.82} VT^{0.18} + log(T) T^{0.771}).\n\nProof sketch. Notice that the feedback we receive throughout the game is strongly concentrated with high probability, and thus it suffices to prove the theorem for this case. We analyze separately (a) blocks in which the algorithm did not reach the exploitation, and (b) blocks in which it did.\n\n4 We use stationarity-related terms to classify mean sequences. Our definition might not be consistent with stationarity-related definitions in the statistical literature, which are usually used to classify sequences of random variables based on higher moments or CDFs.\n\nFigure 2: The time horizon is divided into blocks, where each block is split into an exploration phase and an exploitation phase. The exploitation phase is further divided into bins.\n\nInput: parameters \u03bb and \u03b8.\nAlgorithm: In each block j = 1, 2, . . .\n(Exploration phase) In each round \u03c4 = 1, 2, . . .\n\n(1) Select action i_{t(j,\u03c4)} ~ Uni{1, 2} and observe loss \u2113_{t(j,\u03c4)}(i_{t(j,\u03c4)}).\n(2) Set X_{t(j,\u03c4)}(i) = 2\u2113_{t(j,\u03c4)}(i_{t(j,\u03c4)}) if i = i_{t(j,\u03c4)}, and 0 otherwise, and add X_{t(j,\u03c4)}(i) (separately, for i \u2208 {1, 2}) as an input to TEST 2.\n(3) If the test identifies non-stationarity (on either one of the actions), exit block. Otherwise, if \u0394 := |\u00afX_{t(j,1):t(j,\u03c4)}(1) \u2212 \u00afX_{t(j,1):t(j,\u03c4)}(2)| \u2265 16(\u221a10 + 2)^2 log(T) \u03c4^{\u2212\u03bb/2}, move to the next phase with \u02c6\u00b50(i) = \u00afX_{t(j,1):t(j,\u03c4)}(i) for i \u2208 {1, 2}.\n\n(Exploitation phase) Play in bins, each of size n = 4/\u0394^2. During each bin a = 1, 2, . . .\n\n(1) Select action i_{t(j,a,1)}, . . . 
, it(j,a,n) =\n\nand observe losses {(cid:96)t(j,a,\u03c4 )(it(j,a,\u03c4 ))}n\n\n(2) Run TEST 1 on {(cid:96)t(j,a,\u03c4 )(it(j,a,\u03c4 ))}n\n\nUni{1, 2}\n\u03c4 =1.\n\n\u03c4 =1, and exit the block if it returned non-stationary.\n\nAlgorithm 1: An algorithm for the non-stationary multi-armed bandit problem.\n\nAnalysis of part (a). From TEST 2, we know that as long as the test does not identify non-\nstationarity in the exploration phase E1, we can \u201ctrust\u201d the feedback we observe as if we are in\nthe stationary setting, i.e. standard stochastic MAB, up to an additive factor of |E1|2/3V 1/3\nto the\nregret. This argument holds even if TEST 2 identi\ufb01ed non-stationarity, by simply excluding the last\nround. Now, since our stopping condition of the exploration phase is roughly \u2206 \u2265 \u03c4\u2212\u03bb/2, we suffer\nan additional regret of |E1|1\u2212\u03bb/2 throughout the exploration phase. This gives an overall bound of\n|E1|2/3V 1/3\n+ |E1|1\u2212\u03bb/2 for the regret (formally proven in Lemma C.4). The terms of the form\n|E1|1\u2212\u03bb/2 are problematic, as summing them may lead to an expression linear in T . To avoid this\nwe use a lower bound on the variation VE1 guaranteed by the fact that TEST 2 caused the block to\nend during the exploration phase. This lower bound allows to express |E1|1\u2212\u03bb/2 as |E1|1\u2212\u03bb/3V \u03bb/3\nleading to a meaningful regret bound on the entire time horizon (as detailed in Lemma C.5).\n\nE1\n\nE1\n\nE1\n\nAnalysis of part (b). The regret suffered in the exploration phase is bounded by the same arguments\nas before, where the bound on |E1|1\u2212\u03bb/2 is replaced by |E1|1\u2212\u03bb/2 \u2264 |B|1\u2212\u03bb/3V 1\u2212\u03bb/3\nwith B being\nthe set of block rounds. 
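As a numeric aside (our own reading of the constants; the paper does not spell this calculation out), the $|E_1|^{1-\lambda/3} V^{\lambda/3}$ terms above line up with the exponents in Theorem 2.4 once $\lambda = (\sqrt{37} - 5)/2$ is plugged in:

```python
import math

lam = (math.sqrt(37) - 5) / 2  # lambda from Theorem 2.4

print(f"{lam:.4f}")            # 0.5414
print(f"{1 - lam / 3:.3f}")    # 0.820 -> the T^0.82 exponent
print(f"{lam / 3:.3f}")        # 0.180 -> the V_T^0.18 exponent
# (1 + lam) / 2 evaluates to about 0.771, matching the second exponent;
# whether this is how the paper derives it is our conjecture only.
print(f"{(1 + lam) / 2:.3f}")  # 0.771
```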
This bound is achieved via a lower bound on $V_B$, the variation in the block, guaranteed by the algorithm's behavior along with the fact that the block ended in the exploitation phase. For the regret in the exploitation phase, we first utilize the guarantees of TEST 1 to show that at the expense of an additive cost of $|E_2|^{2/3} V_{E_2}^{1/3}$ to the regret, we may assume that there is no change to the environment inside bins. From here on, the analysis becomes very similar to that of the deterministic setting, as the noise corresponding to a bin is guaranteed to be lower than the gap $\Delta$ between the arms, and thus has no effect on the algorithm's performance. The final regret bound for blocks of type (b) comes from adding up the above-mentioned bounds and is formally given in Lemma C.10.

References

[1] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
[2] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online stochastic optimization under correlated bandit feedback. In ICML, volume 32 of JMLR Workshop and Conference Proceedings, pages 1557–1565, 2014.
[3] Dimitris Bertsimas and José Niño-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operations Research, 48(1):80–90, 2000.
[4] Olivier Bousquet and Manfred K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363–396, 2002.
[5] Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 1405–1411, 2015.
[6] Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385–394.
SIAM, 2005.
[7] Sudipto Guha and Kamesh Munagala. Approximation algorithms for partial-information based stochastic control with Markovian rewards. In FOCS, pages 483–493. IEEE Computer Society, 2007.
[8] Yonatan Gur, Assaf J. Zeevi, and Omar Besbes. Stochastic multi-armed-bandit problem with non-stationary rewards. In NIPS, pages 199–207, 2014.
[9] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: regret bounded by variation in costs. Machine Learning, 80(2-3):165–188, 2010.
[10] Elad Hazan and Satyen Kale. Better algorithms for benign bandits. Journal of Machine Learning Research, 12:1287–1311, 2011.
[11] Elad Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In ICML, volume 382 of ACM International Conference Proceeding Series, pages 393–400. ACM, 2009.
[12] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[13] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In AISTATS, volume 38 of JMLR Workshop and Conference Proceedings, 2015.
[14] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212–261, 1994.
[15] Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for restless Markov bandits. Theor. Comput. Sci., 558:62–76, 2014.
[16] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, pages 287–298, 1988.
[17] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936.
AAAI Press, 2003.