{"title": "Cornering Stationary and Restless Mixing Bandits with Remix-UCB", "book": "Advances in Neural Information Processing Systems", "page_first": 3339, "page_last": 3347, "abstract": "We study the restless bandit problem where arms are associated with stationary $\\varphi$-mixing processes and where rewards are therefore dependent: the question that arises from this setting is that of carefully recovering some independence by `ignoring' the values of some rewards. As we shall see, the bandit problem we tackle requires us to address the exploration/exploitation/independence trade-off, which we do by considering the idea of a {\\em waiting arm} in the new Remix-UCB algorithm, a generalization of Improved-UCB for the problem at hand, that we introduce. We provide a regret analysis for this bandit strategy; two noticeable features of Remix-UCB are that i) it reduces to the regular Improved-UCB when the $\\varphi$-mixing coefficients are all $0$, i.e. when the i.i.d scenario is recovered, and ii) when $\\varphi(n)=O(n^{-\\alpha})$, it is able to ensure a controlled regret of order $\\Ot\\left( \\Delta_*^{(\\alpha- 2)/\\alpha} \\log^{1/\\alpha} T\\right),$ where $\\Delta_*$ encodes the distance between the best arm and the best suboptimal arm, even in the case when $\\alpha<1$, i.e. 
the case when the $\\varphi$-mixing coefficients {\\em are not} summable.", "full_text": "Cornering Stationary and Restless Mixing Bandits with Remix-UCB\n\nJulien Audiffren\nCMLA, ENS Cachan, Paris Saclay University\n94235 Cachan, France\naudiffren@cmla.ens-cachan.fr\n\nLiva Ralaivola\nQARMA, LIF, CNRS, Aix Marseille University\nF-13289 Marseille cedex 9, France\nliva.ralaivola@lif.univ-mrs.fr\n\nAbstract\n\nWe study the restless bandit problem where arms are associated with stationary $\\varphi$-mixing processes and where rewards are therefore dependent: the question that arises from this setting is that of carefully recovering some independence by \u2018ignoring\u2019 the values of some rewards. As we shall see, the bandit problem we tackle requires us to address the exploration/exploitation/independence trade-off, which we do by considering the idea of a waiting arm in the new Remix-UCB algorithm, a generalization of Improved-UCB for the problem at hand, that we introduce. We provide a regret analysis for this bandit strategy; two noticeable features of Remix-UCB are that i) it reduces to the regular Improved-UCB when the $\\varphi$-mixing coefficients are all 0, i.e. when the i.i.d. scenario is recovered, and ii) when $\\varphi(n) = O(n^{-\\alpha})$, it is able to ensure a controlled regret of order $\\widetilde{\\Theta}(\\Delta_*^{(\\alpha-2)/\\alpha} \\log^{1/\\alpha} T)$, where $\\Delta_*$ encodes the distance between the best arm and the best suboptimal arm, even in the case when $\\alpha < 1$, i.e. the case when the $\\varphi$-mixing coefficients are not summable.\n\n1 Introduction\n\nBandit with mixing arms. The bandit problem consists in an agent who has to choose at each step between K arms. A stochastic process is associated to each arm, and pulling an arm produces a reward which is the realization of the corresponding stochastic process. 
The objective of the agent is to maximize its long-term reward. In the abundant bandit literature, it is often assumed that the stochastic process associated to each arm is a sequence of independent and identically distributed (i.i.d.) random variables (see, e.g., [12]). In that case, the challenge the agent has to address is the well-known exploration/exploitation problem: she has to simultaneously make sure that she collects information from all arms to try to identify the most rewarding ones\u2014this is exploration\u2014and maximize the rewards along the sequence of pulls she performs\u2014this is exploitation. Many algorithms have been proposed to solve this trade-off between exploration and exploitation [2, 3, 6, 12]. We propose to go a step further than the i.i.d. setting and to work in the situation where the process associated with each arm is a stationary $\\varphi$-mixing process: the rewards are thus dependent from one another, with a strength of dependence that weakens over time. From an application point of view, this is a reasonable dependence structure: if a user clicks on some ad (a typical use of bandit algorithms) at some point in time, it is very likely that her choice will have an influence on what she will click in the near future, while it may have a much weaker impact on what ad she will choose to view in a more distant future. As shall appear in the sequel, working with such dependent observations poses the question of how informative some of the rewards are with respect to the value of an arm since, because of the dependencies and the strong correlation between close-by (in time) rewards, they might not reflect the true \u2018value\u2019 of the arms. However, as the dependencies weaken over time, some kind of independence might be recovered if some rewards are, in some sense, ignored. This actually requires us to deal with a new trade-off, the exploration/exploitation/independence trade-off, where the usual exploration/exploitation compromise has to be balanced with the need for some independence. Dealing with this new trade-off is the pivotal feature of our work.\n\nNon-i.i.d. bandits. A closely related setup that addresses the bandit problem with dependent rewards is when they are distributed according to Markov processes, such as Markov chains and Markov decision processes (MDPs) [16, 22], where the dependencies between rewards are of bounded range, which is what distinguishes those works from ours. Contributions in this area study two settings: the rested case, where the process attached to an arm evolves only when the arm is pulled, and the restless case, where all processes simultaneously evolve at each time step. In the present work, we focus on the restless setting. The adversarial bandit setup (see, e.g., [1, 4, 19]) can be seen as a non-i.i.d. setup, as the rewards chosen by the adversary might depend on the agent\u2019s past actions. However, even if the algorithms developed for this framework can be used in our setting, they might perform very poorly, as they are not designed to take advantage of any mixing structure. Finally, we may also mention the bandit scenario where the dependencies are between the arms instead of being within-arm time dependencies (e.g., [17]); this is orthogonal to what we propose to study here.\n\nMixing processes. Mixing process theory is hardly new. One of the seminal works on the study of mixing processes was done by Bernstein [5], who introduced the well-known block method, central to proving results on mixing processes. In statistical machine learning, one of the first papers on estimators for mixing processes is [23]. 
More recent works include the contributions of Mohri and Rostamizadeh [14, 15], which address the problem of stability bounds and Rademacher complexity for $\\varphi$- and $\\beta$-mixing processes; Kulkarni et al. [11] establish the consistency of regularized boosting algorithms learning from $\\beta$-mixing processes, Steinwart et al. [21] prove the consistency of support vector machines learning from $\\alpha$-mixing processes, and Steinwart and Christmann [20] establish a general oracle inequality for generic regularized learning algorithms and $\\alpha$-mixing observations. As far as we know, this is the first time that mixing processes are studied in a multi-armed bandit framework.\n\nContribution. Our main result states that a strategy based on the improved Upper Confidence Bound algorithm (Improved-UCB, in the sequel) proposed by Auer and Ortner [2] allows us to achieve a controlled regret in the restless mixing scenario. Namely, our algorithm, Remix-UCB (which stands for Restless Mixing UCB), achieves a regret of the form $\\widetilde{\\Theta}(\\Delta_*^{(\\alpha-2)/\\alpha} \\log^{1/\\alpha} T)$, where $\\Delta_*$ encodes the distance between the best arm and the best suboptimal arm, $\\alpha$ encodes the rate of decrease of the $\\varphi$ coefficients, i.e. $\\varphi(n) = O(n^{-\\alpha})$, and $\\widetilde{\\Theta}$ is a O-like notation that neglects logarithmic dependencies (see Section 2.2). It is worth noticing that all the results we give hold for $\\alpha < 1$, i.e. when the dependencies are no longer summable. When the mixing coefficients at hand are all zero, i.e. in the i.i.d. case, the regret of our algorithm naturally reduces to that of the classical Improved-UCB. Remix-UCB uses the assumption of known (convergence rates of) $\\varphi$-mixing coefficients, which is a classical standpoint that has been adopted by most of the papers studying the behavior of machine learning algorithms in the case of mixing processes (see, e.g., [9, 14, 15, 18, 21, 23]). The estimation of the mixing coefficients poses a learning problem of its own (see, e.g., [13] for the estimation of $\\beta$-mixing coefficients) and is beyond the scope of this paper.\n\nStructure of the paper. Section 2 defines our setup: $\\varphi$-mixing processes are recalled, together with a relevant concentration inequality for such processes [10, 15], and the notion of regret we focus on is given. Section 3 is devoted to the presentation of our algorithm, Remix-UCB, and to the statement of our main result regarding its regret. Finally, Section 4 discusses the obtained results.\n\n2 Overview of the Problem\n\n2.1 Concentration of Stationary $\\varphi$-mixing Processes\n\nLet $(\\Omega, \\mathcal{F}, \\mathbb{P})$ be a probability space. We recall the notions of stationarity and $\\varphi$-mixing processes.\n\nDefinition 1 (Stationarity). A sequence of random variables $X = \\{X_t\\}_{t \\in \\mathbb{Z}}$ is stationary if, for any $t$, $m \\geq 0$, $s \\geq 0$, $(X_t, \\ldots, X_{t+m})$ and $(X_{t+s}, \\ldots, X_{t+m+s})$ are identically distributed.\n\nDefinition 2 ($\\varphi$-mixing process). Let $X = \\{X_t\\}_{t \\in \\mathbb{Z}}$ be a stationary sequence of random variables. For any $i, j \\in \\mathbb{Z} \\cup \\{-\\infty, +\\infty\\}$, let $\\sigma_i^j$ denote the $\\sigma$-algebra generated by $\\{X_t : i \\leq t \\leq j\\}$. Then, for any positive $n$, the $\\varphi$-mixing coefficient $\\varphi(n)$ of the stochastic process $X$ is defined as\n\n$$\\varphi(n) = \\sup_{t,\\; A \\in \\sigma_{t+n}^{+\\infty},\\; B \\in \\sigma_{-\\infty}^{t},\\; \\mathbb{P}(B) > 0} \\left| \\mathbb{P}[A|B] - \\mathbb{P}[A] \\right|. \\quad (1)$$\n\n$X$ is $\\varphi$-mixing if $\\varphi(n) \\to 0$. $X$ is algebraically mixing if there exist $\\varphi_0 > 0$, $\\alpha > 0$ such that $\\varphi(n) = \\varphi_0 n^{-\\alpha}$.\n\nAs we recall later, concentration inequalities are the pivotal tools to devise multi-armed bandit strategies. 
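For intuition, Definition 2 can be instantiated exactly on a toy process. The sketch below is illustrative and not from the paper: it uses a two-state stationary Markov chain, for which conditioning on the whole past reduces to conditioning on the current state, so the supremum in (1) can be evaluated from the n-step transition probabilities.

```python
# Illustrative two-state stationary Markov chain (not from the paper): for a
# Markov chain, conditioning on the whole past sigma-algebra reduces to
# conditioning on the current state, so the sup in (1) can be evaluated from
# the n-step transition probabilities.
a, b = 0.1, 0.2                      # P(0 -> 1) = a, P(1 -> 0) = b
pi0, pi1 = b / (a + b), a / (a + b)  # stationary distribution

def step(p):
    """One-step evolution of a row distribution p = (p0, p1)."""
    return (p[0] * (1 - a) + p[1] * b, p[0] * a + p[1] * (1 - b))

def phi(n):
    """phi-mixing coefficient of the chain: worst-case |P[A|B] - P[A]| at lag n."""
    worst = 0.0
    for p in ((1.0, 0.0), (0.0, 1.0)):       # condition on X_t = 0 or X_t = 1
        for _ in range(n):
            p = step(p)
        # the sup over events A is the total-variation distance to pi
        worst = max(worst, 0.5 * (abs(p[0] - pi0) + abs(p[1] - pi1)))
    return worst

coeffs = [phi(n) for n in range(1, 6)]
print([round(c, 4) for c in coeffs])  # decays geometrically, like (1 - a - b)^n
```

Here the coefficients decay geometrically; the regime the paper targets is the harder, polynomially decaying one.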
Hoeffding\u2019s inequality [7, 8] is, for instance, at the root of a number of UCB-based methods. This inequality is, however, devoted to characterizing the deviation of the sum of independent variables from its expected value and cannot be used in the framework we are investigating. In the case of stationary $\\varphi$-mixing distributions, there is however the following concentration inequality, due to [10] and [15].\n\nTheorem 1 ([10, 15]). Let $\\psi_m : U^m \\to \\mathbb{R}$ be a function defined over a countable space $U$, and $X$ be a stationary $\\varphi$-mixing process. If $\\psi_m$ is $\\ell$-Lipschitz wrt the Hamming metric for some $\\ell > 0$, then\n\n$$\\forall \\varepsilon > 0, \\quad \\mathbb{P}_X\\left[ |\\psi_m(X) - \\mathbb{E}\\psi_m(X)| > \\varepsilon \\right] \\leq 2 \\exp\\left[ -\\frac{\\varepsilon^2}{2 m \\ell^2 \\Lambda_m^2} \\right], \\quad (2)$$\n\nwhere $\\Lambda_m = 1 + 2\\sum_{\\tau=1}^{m} \\varphi(\\tau)$ and $\\psi_m(X) = \\psi_m(X_0, \\ldots, X_m)$.\n\nHere, we do not have to use this concentration inequality in its full generality, as we will restrict ourselves to the situation where $\\psi_m$ is the mean of its arguments, i.e. $\\psi_m(X_{t_1}, \\ldots, X_{t_m}) = \\frac{1}{m}\\sum_{i=1}^m X_{t_i}$, which is obviously $1/m$-Lipschitz provided that the $X_t$'s have range $[0;1]$\u2014which will be one of our working assumptions. If, with a slight abuse of notation, $\\Lambda_m$ is now used to denote\n\n$$\\Lambda_m(\\mathbf{t}) = 1 + 2\\sum_{i=2}^{m} \\varphi(t_i - t_1), \\quad (3)$$\n\nfor an increasing sequence $\\mathbf{t} = (t_i)_{i=1}^m$ of time steps, then the concentration inequality that will serve our purpose is given in the next corollary.\n\nCorollary 1 ([10, 15]). Let $X$ be a stationary $\\varphi$-mixing process. The following holds: for all $\\varepsilon > 0$ and all $m$-sequences $\\mathbf{t} = (t_i)_{i=1}^m$ with $t_1 < \\ldots < t_m$,\n\n$$\\mathbb{P}_{\\{X_t\\}_{t \\in \\mathbf{t}}}\\left[ \\left| \\frac{1}{m}\\sum_{i=1}^m X_{t_i} - \\mathbb{E}X_1 \\right| > \\varepsilon \\right] \\leq 2 \\exp\\left[ -\\frac{m \\varepsilon^2}{2 \\Lambda_m^2(\\mathbf{t})} \\right]. \\quad (4)$$\n\n(Thanks to the stationarity of $\\{X_t\\}_{t \\in \\mathbb{Z}}$ and the linearity of the expectation, $\\mathbb{E}\\sum_{i=1}^m X_{t_i} = m\\mathbb{E}X_{t_1}$.)\n\nRemark 3. According to Kontorovich\u2019s paper [10], the function $\\Lambda_m$ should be $\\max_j \\{1 + 2\\sum_{i=j+1}^{m} \\varphi(t_i - t_j)\\}$. However, when the time lag between two consecutive time steps $t_i$ and $t_{i+1}$ is non-decreasing, which will be imposed by the Remix-UCB algorithm (see below), and the mixing coefficients are decreasing, which is a natural assumption that simply says that the amount of dependence between $X_t$ and $X_{t'}$ reduces when $|t - t'|$ increases, then $\\Lambda_m$ reduces to the more compact expression given by (3).\n\nNote that when there is independence, then $\\varphi(\\tau) = 0$ for all $\\tau$, $\\Lambda_m = 1$ and, as a consequence, Equation (4) reduces to Hoeffding\u2019s inequality: the precise values of the time instants in $\\mathbf{t}$ do not impact the value of the bound, and the length $m$ of $\\mathbf{t}$ is the central parameter that matters. This is in clear contrast with what happens in the dependent setting, where the bound on the deviation of $\\sum_{i=1}^m X_{t_i}/m$ from its expectation directly depends on the timepoints $t_i$ through $\\Lambda_m$. For two sequences $\\mathbf{t} = (t_i)_{i=1}^m$ and $\\mathbf{t}' = (t'_i)_{i=1}^m$ of $m$ timepoints, $\\sum_{i=1}^m X_{t_i}/m$ may be more sharply concentrated around $\\mathbb{E}X_1$ than $\\sum_{i=1}^m X_{t'_i}/m$ provided $\\Lambda_m(\\mathbf{t}) < \\Lambda_m(\\mathbf{t}')$, which can be a consequence of a more favorable spacing of the points in $\\mathbf{t}$ than in $\\mathbf{t}'$.\n\n2.2 Problem: Minimize the Expected Regret\n\nWe may now define the multi-armed bandit problem we consider and the regret we want to control.\n\nRestless $\\varphi$-mixing Bandits. 
We study the problem of sampling from a K-armed $\\varphi$-mixing bandit. In our setting, pulling arm $k$ at time $t$ provides the agent with a realization of the random variable $X^k_t$, where the family $\\{X^k_t\\}_{t \\in \\mathbb{Z}}$ satisfies the following assumptions: (A) for all $k$, $(X^k_t)_{t \\in \\mathbb{Z}}$ is a stationary $\\varphi$-mixing process with decreasing mixing coefficients $\\varphi_k$, and (B) for all $k$, $X^k_1$ takes its values in a discrete finite set (by stationarity, the same holds for any $X^k_t$, with $t \\neq 1$) included in $[0;1]$.\n\nRegret. The regret we want to bound is the classical pseudo-regret which, after $T$ pulls, is given by\n\n$$R(T) = T\\mu^* - \\mathbb{E}\\sum_{t=1}^{T} \\mu_{I_t}, \\quad (5)$$\n\nwhere $\\mu_k = \\mathbb{E}X^k_1$, $\\mu^* = \\max_k \\mu_k$, and $I_t$ is the index of the arm selected at time $t$. We want to devise a strategy that is capable of selecting, at each time $t$, the arm $I_t$ so that the obtained regret is minimal.\n\nBottleneck. The setting we assume entails the possibility of long-term dependencies between the rewards output by the arms. Hence, as evoked earlier, in order to choose which arm to pull, the agent is forced to address the exploration/exploitation/independence trade-off, where independence may be partially recovered by taking advantage of the observation that some spacings of timepoints induce sharper concentration of the empirical rewards than others. As emphasized later, targeting good spacings in the bandit framework translates into the idea of ignoring the rewards provided by some pulls when computing the empirical averages: this idea is carried by the concept of a waiting arm, which is formally defined later on. 
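The role that the spacing of timepoints plays in this trade-off can be checked numerically from (3) and (4). In the sketch below, the coefficient $\\varphi(n) = n^{-1/2}$ and the two pull schedules are our own illustrative choices, not quantities from the paper:

```python
import math

# Illustrative check of how the spacing of timepoints drives the bound (4):
# phi(n) = n ** -0.5 (alpha = 1/2, hence non-summable coefficients) and the
# two schedules below are our own choices, not the paper's.
phi = lambda n: n ** -0.5

def Lambda(ts):
    """Lambda_m(t) from Eq. (3): 1 + 2 * sum_{i >= 2} phi(t_i - t_1)."""
    return 1 + 2 * sum(phi(t - ts[0]) for t in ts[1:])

def deviation_bound(ts, eps):
    """Right-hand side of the concentration inequality (4)."""
    m = len(ts)
    return 2 * math.exp(-m * eps ** 2 / (2 * Lambda(ts) ** 2))

consecutive = list(range(1, 33))             # 32 back-to-back pulls
spread = [n ** 2 for n in range(1, 33)]      # 32 pulls, quadratically spaced

# Same number of rewards, but the wider spacing yields a smaller Lambda,
# hence a sharper concentration of the empirical mean.
print(round(Lambda(consecutive), 2), round(Lambda(spread), 2))
print(deviation_bound(consecutive, 0.3), deviation_bound(spread, 0.3))
```

With the same number of observed rewards, the spread-out schedule produces a markedly smaller deviation bound, which is exactly the lever the waiting arm exploits.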
The questions raised by the waiting arm that we address with the Remix-UCB algorithm are a) how often should the waiting arm be pulled so that the concentration of the empirical means is high enough to be relied on (so the usual exploration/exploitation trade-off can be tackled), and b) from the regret standpoint, how hindering is it to pull the waiting arm?\n\nO and $\\widetilde{\\Theta}$ analysis. In the analysis of Remix-UCB that we provide, just as is the case for most, if not all, analyses that exist for bandit algorithms, we will focus on the order of the regret and we will not be concerned about the precise constants involved in the derived results. We will therefore naturally rely heavily on the usual O notation and on the $\\widetilde{\\Theta}$ notation, which bears the following meaning.\n\nDefinition 4 ($\\widetilde{\\Theta}$ notation). For any two functions $f, g$ from $\\mathbb{R}$ to $\\mathbb{R}$, we say that $f = \\widetilde{\\Theta}(g)$ if there exist $\\alpha, \\beta > 0$ so that $|f| \\log^{\\alpha}|f| \\leq |g|$ and $|g| \\log^{\\beta}|g| \\leq |f|$.\n\n3 Remix-UCB: a UCB Strategy for Restless Mixing Bandits\n\nThis section contains our main contribution: the Remix-UCB algorithm. From now on, we use $a \\vee b$ (resp. $a \\wedge b$) for the maximum (resp. minimum) of two elements $a$ and $b$. We consider that the processes attached to the arms are algebraically mixing and that, for arm $k$, the exponent is $\\alpha_k > 0$: there exists $\\varphi_{k,0}$ such that $\\varphi_k(t) = \\varphi_{k,0} t^{-\\alpha_k}$\u2014this assumption is not very restrictive, as rates such as $t^{-\\alpha_k}$ are appropriate/natural to capture and characterize the decreasing behavior of the convergent sequence $(\\varphi_k(t))_t$. Also, we will sometimes say that arm $k$ is faster (resp. slower) than arm $k'$ for $k \\neq k'$, to convey the fact that $\\alpha_k > \\alpha_{k'}$ (resp. $\\alpha_k < \\alpha_{k'}$).\n\nFor any $k$ and any increasing sequence $\\tau = (\\tau(n))_{n=1}^{t}$ of $t$ timepoints, the empirical reward $\\hat{\\mu}^{\\tau}_k$ of arm $k$ given $\\tau$ is $\\hat{\\mu}^{\\tau}_k = \\frac{1}{t}\\sum_{n=1}^{t} X^k_{\\tau(n)}$. The subscripted notation $\\tau_k = (\\tau_k(n))_{1 \\leq n \\leq t}$ is used to denote the sequence of timepoints at which arm $k$ was selected. Finally, we define $\\Lambda^{\\tau_k}_k$ in a similar way as in (3), the difference with the former notation being the subscript $k$:\n\n$$\\Lambda^{\\tau_k}_k = 1 + 2\\sum_{n=2}^{t} \\varphi_k(\\tau_k(n) - \\tau_k(1)). \\quad (6)$$\n\nWe feel it important to discuss when Improved-UCB may be robust to the mixing process scenario.\n\n3.1 Robustness of Improved-UCB to Restless $\\varphi$-Mixing Bandits\n\nWe will not recall the Improved-UCB algorithm [2] in its entirety, as it will turn out to be a special case of our Remix-UCB algorithm, but it is instructive to identify the distinctive features that make it a relevant base algorithm for the handling of mixing processes. First, it is essential to keep in mind that Improved-UCB is designed for the i.i.d. case and that it achieves an optimal $O(\\log T)$ regret. Second, it is an algorithm that works in successive rounds/epochs, at the end of each of which a number of arms are eliminated because they are identified (with high probability) as being the least promising ones, from a regret point of view. More precisely, at each round, the same number of consecutive pulls is planned for each arm: this number is induced by Hoeffding\u2019s inequality [8] and devised in such a way that all remaining arms share the same confidence interval for their respective expected gains, the $\\mu_k = \\mathbb{E}X^k_1$, for $k$ in the set of remaining arms at the current round. From a technical standpoint, this is what makes it possible to draw conclusions on whether an arm is useless (i.e. to be eliminated) or not. 
It is enlightening to understand what the favorable and unfavorable setups are for Improved-UCB to keep working when facing restless mixing bandits. The following proposition depicts the favorable case.\n\nProposition 5. If $\\sum_t \\varphi_k(t) < +\\infty$ for all $k$, then the classical Improved-UCB run on the restless $\\varphi$-mixing bandit preserves its $O(\\log T)$ regret.\n\nProof. Straightforward. Given the assumption on the mixing coefficients, there exists $M > 0$ such that $\\max_{k \\in \\{1,\\cdots,K\\}} \\sum_{t \\geq 0} \\varphi_k(t) < M$. Therefore, from Theorem 1, for any arm $k$ and any sequence $\\tau$ of $|\\tau|$ consecutive timepoints, $\\mathbb{P}(|\\mu_k - \\hat{\\mu}^{\\tau}_k| > \\varepsilon) \\leq 2\\exp\\left[-\\frac{|\\tau|\\varepsilon^2}{2(1+2M)^2}\\right]$, which is akin to Hoeffding\u2019s inequality up to the multiplicative $(1+2M)^2$ constant in the exponential. This, and the lines to prove the $O(\\log T)$ regret of Improved-UCB [2], directly give the desired result.\n\nIn the general case where $\\sum_t \\varphi_k(t) < +\\infty$ does not hold for every $k$, nothing ensures that Improved-UCB keeps working, the idea of consecutive pulls being the essential culprit. To illustrate the problem, suppose that $\\varphi_k(n) = n^{-1/4}$ for all $k$. Then, after a sequence $\\tau = (t_1+1, t_1+2, \\ldots, t_1+t)$ of $t$ consecutive time instances where $k$ was selected, simple calculations give that $\\Lambda^{\\tau}_k = O(t^{3/4})$, and the concentration inequality from Corollary 1 for $\\hat{\\mu}^{\\tau}_k$ reads as\n\n$$\\mathbb{P}(|\\mu_k - \\hat{\\mu}^{\\tau}_k| > \\varepsilon) \\leq 2\\exp\\left(-C\\varepsilon^2 t^{-1/2}\\right), \\quad (7)$$\n\nwhere $C$ is some strictly positive constant. The quality of the confidence interval that can be derived from this concentration inequality degrades when additional pulls are performed, which counters the usual nature of concentration inequalities and prevents the obtention of a reasonable regret for Improved-UCB. This is a direct consequence of the dependency of the $\\varphi$-mixing variables. Indeed, if $\\varphi(n)$ decreases slowly, taking the average over multiple consecutive pulls may move the estimator away from the mean value of the stationary process.\n\nAnother way of understanding the difference between the i.i.d. case and the restless mixing case is to look at the sizes of the confidence intervals around the true value of an arm when the time $t$ to the next pull increases. Given Corollary 1, Improved-UCB run in the restless mixing scenario would advocate a pulling strategy based on the lengths $\\kappa_k$ of the confidence intervals given by\n\n$$\\forall k, \\quad \\kappa_k(t) = |\\tau_k|^{-1/2}\\sqrt{2\\left(\\Lambda^{\\tau_k}_k + 2\\varphi_k(t - \\tau_k(1))\\right)^2 \\log(t)}, \\quad (8)$$\n\nwhere $t$ is the overall time index. This shows that working in the i.i.d. case or in the mixing case can imply two different behaviors for the lengths of the confidence interval: in the i.i.d. scenario, $\\kappa_k$ has the same form as the classical UCB term (as $\\varphi_k = 0$ and $\\Lambda^{\\tau_k}_k = 1$) and is an increasing function of $t$, while in the $\\varphi$-mixing scenario the behavior may be non-monotonic, with a decreasing confidence interval up to some point after which the confidence interval becomes increasingly larger. As the purpose of exploration is to tighten the confidence interval as much as possible, the mixing framework points to carefully designed strategies. For instance, when an arm is slow, it is beneficial to wait between two successive pulls of this arm.\n\nBy alternating the pulls of the different arms, it is possible to wait up to $K$ units of time between two consecutive pulls of the same arm. However, this is not sufficient to recover enough independence between the observed values. For instance, in the case described in (7), after a sequence $\\tau = (t_1, t_1+K, \\ldots, t_1+tK)$, simple calculations give that $\\Lambda^{\\tau}_k = O((Kt)^{3/4})$, and the concentration inequality from Corollary 1 for $\\hat{\\mu}^{\\tau}_k$ reads as $\\mathbb{P}(|\\mu_k - \\hat{\\mu}^{\\tau}_k| > \\varepsilon) \\leq 2\\exp(-C K^{-3/2}\\varepsilon^2 t^{-1/2})$, which entails the same problem.\n\nThe problem exhibited above is that if the decrease of the $\\varphi_k$ is too slow, pulling an arm in the traditional way, with consecutive pulls, and updating the value of the empirical estimator may lower the certainty with which the estimation of the expected gain is performed. To solve this problem and reduce the confidence intervals that are computed for each arm, better independence between the values observed from a given arm is required. This can only be achieved by waiting for the time to pass by.\n\nAlgorithm 1 Remix-UCB, with parameters $K$, $(\\alpha_i)_{i=1\\cdots K}$, $T$, $G$ defined in (11)\n\n$B_0 \\leftarrow \\{1,\\cdots,K\\}$; $\\alpha \\leftarrow 1 \\wedge \\min_{i \\in B_0} \\alpha_i$; $\\hat{\\mu}_i \\leftarrow 0$, $n_i \\leftarrow 0$ for $i = 1,\\ldots,K$; $i^* \\leftarrow 1$\nfor $s = 1, \\ldots, \\lfloor G^{-1}(T) \\rfloor$ do\n  Select arm: if $|B_s| > 1$, then until total time $T_s = \\lceil G(s) \\rceil$, pull each arm $i \\in B_s$ at the times $\\tau_i(\\cdot)$ defined in (10). If no arm is ready to be pulled, pull the waiting arm $i^*$ instead.\n  Update:\n  1. Update the empirical mean $\\hat{\\mu}_i$ and the number of pulls $n_i$ for each arm $i \\in B_s$.\n  2. Obtain $B_{s+1}$ by eliminating from $B_s$ each arm $i$ such that\n  $$\\hat{\\mu}_i + \\sqrt{\\frac{2\\left(1 + 2\\sum_{j=1}^{n_i} \\varphi_i(\\tau_i(j))\\right)^2 \\log(T 2^{-2s})}{n_i}} < \\max_{k \\in B_s}\\left\\{ \\hat{\\mu}_k - \\sqrt{\\frac{2\\left(1 + 2\\sum_{j=1}^{n_k} \\varphi_k(\\tau_k(j))\\right)^2 \\log(T 2^{-2s})}{n_k}} \\right\\}.$$\n  3. Update $\\alpha \\leftarrow 1 \\wedge \\min_{i \\in B_{s+1}} \\alpha_i$ and $i^* \\leftarrow \\operatorname{argmax}_{i \\in B_{s+1}} \\left\\{ \\hat{\\mu}_i + \\sqrt{\\frac{2\\left(1 + 2\\sum_{j=1}^{n_i} \\varphi_i(\\tau_i(j))\\right)^2 \\log(T 2^{-2s})}{n_i}} \\right\\}$.\nend for\n\n
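As a rough illustration of the elimination rule (step 2 of the update), here is a minimal sketch; the function names and the numerical values are hypothetical, the per-arm sums of mixing coefficients are assumed precomputed, and the pull-scheduling part of the algorithm (Eq. (10)) is ignored:

```python
import math

# Minimal sketch of the elimination rule (step 2 of the update): an arm is
# dropped when its optimistic value falls below the most pessimistic value of
# the best arm. All names and numbers are hypothetical, and the
# pull-scheduling part of the algorithm is ignored here.
def radius(n_pulls, phi_sum, T, s):
    """Mixing-aware confidence radius sqrt(2 Lambda^2 log(T 2^{-2s}) / n)."""
    lam = 1 + 2 * phi_sum        # phi_sum = sum of phi_i(tau_i(j)), precomputed
    return math.sqrt(2 * lam ** 2 * math.log(T * 2.0 ** (-2 * s)) / n_pulls)

def eliminate(mu_hat, n, phi_sums, T, s):
    """Indices of the arms surviving one elimination round."""
    rad = [radius(n[i], phi_sums[i], T, s) for i in range(len(mu_hat))]
    best_lower = max(m - r for m, r in zip(mu_hat, rad))
    return [i for i, (m, r) in enumerate(zip(mu_hat, rad)) if m + r >= best_lower]

# Three arms with equal significant-pull counts and illustrative phi-sums.
mu_hat = [0.2, 0.55, 0.6]
n = [4096, 4096, 4096]
phi_sums = [0.17, 0.17, 0.17]
survivors = eliminate(mu_hat, n, phi_sums, T=10_000, s=1)
print(survivors)  # the clearly suboptimal arm 0 is eliminated
```

Note how a larger precomputed $\\varphi$-sum inflates the radius, so under slow mixing more significant pulls are needed before any elimination can occur.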
Since an arm must be pulled at each time $t$, simulating the time passing by may be implemented by the idea of pulling an arm but not updating the empirical mean $\\hat{\\mu}_k$ of this arm with the observed reward. At the same time, it is important to note that even if we do not update the empirical mean of the arm, the resort to the waiting arm may impact the regret. It is therefore crucial to ensure that we pull the best possible arm to limit the resulting regret, whence the arm with the best optimistic value being used as the waiting arm. Note that this arm may change over time. For the rest of the paper, $\\tau$ will only refer to significant pulls of an arm, that is, pulls that lead to an update of the empirical value of the arm.\n\n3.2 Algorithm and Regret Bound\n\nWe may now introduce Remix-UCB, depicted in Algorithm 1. As Improved-UCB, Remix-UCB works in epochs and eliminates, at each epoch, the significantly suboptimal arms.\n\nHigh-level view. Let $(\\theta_s)_{s \\in \\mathbb{N}}$ be a decreasing sequence of $\\mathbb{R}^*_+$ and $(\\delta_s)_{s \\in \\mathbb{N}} \\in \\mathbb{R}^{\\mathbb{N}}_+$. The main idea promoted by Remix-UCB is to divide the time available into epochs $1, \\ldots, s_{\\max}$ (the outer loop of the algorithm), such that at the end of each epoch $s$, for all the remaining arms $k$, the following holds: $\\mathbb{P}(\\hat{\\mu}^{\\tau_k}_k \\geq \\mu_k + \\theta_s) \\vee \\mathbb{P}(\\hat{\\mu}^{\\tau_k}_k \\leq \\mu_k - \\theta_s) \\leq \\delta_s$, where $\\tau_k$ identifies the time instants up to current time $t$ when arm $k$ was selected. Using (4), this means that, for all $k$, with high probability:\n\n$$|\\hat{\\mu}^{\\tau_k}_k - \\mu_k| \\leq n_k^{-1/2}\\sqrt{2(\\Lambda^{\\tau_k}_k)^2 \\log(1/\\delta_s)}. \\quad (9)$$\n\nThus, at the end of epoch $s$ we have, with high probability, a uniform control of the uncertainty with which the empirical rewards $\\hat{\\mu}^{\\tau_k}_k$ approximate their corresponding rewards $\\mu_k$. Based on this, the algorithm eliminates the arms that appear significantly suboptimal (step 2 of the update of Remix-UCB). Just as in Improved-UCB, the process is re-iterated with parameters $\\delta_s$ and $\\theta_s$ adjusted as $\\delta_s = 1/(T\\theta_s^2)$ and $\\theta_s = 1/2^s$, where $T$ is the time budget; the modification of the $\\delta_s$ and $\\theta_s$ values makes it possible to gain additional information, through new pulls, on the quality of the remaining arms, so arms associated with close-by rewards can be distinguished by the algorithm.\n\nPolicy for pulling arms at epoch $s$. The objective of the policy is to obtain a uniform control of the uncertainty/confidence intervals (9) of all the remaining arms. For some arm $k$ and a fixed time budget $T$, such a policy could be obtained as the solution of $\\min_{\\eta_s, (t_i)_{i=1}^{\\eta_s}} t_{\\eta_s}$ such that $\\frac{(\\Lambda^{\\tau^s})^2}{n_{s-1}+\\eta_s} < \\varepsilon$, where the times of pulls $t_i$ must be increasing and greater than $t_0$, the last element of $\\tau^{s-1}$, $\\tau^s = \\tau^{s-1} \\cup (t_1, \\ldots, t_{\\eta_s})$, and $n_{s-1}$ (the number of times this arm has already been pulled), $\\varepsilon$, and $\\tau^{s-1}$ are given. This conveys our aim to reach the targeted confidence interval as fast and as efficiently as possible. However, this problem does not have a closed-form solution and, even if it could be solved efficiently, we are more interested in assessing whether it is possible to devise relevant sequences of timepoints that induce a controlled regret, even if they do not solve the optimization problem. To this end, we only focus on the best sampling rate of the arms, which is an approximation of the previous minimization problem: for each $k$, we search for sampling schemes of the form $\\tau_k(n) = t_n = O(n^{\\beta})$ for $\\beta \\geq 1$. For the case where the $\\varphi_k$ are not summable ($\\alpha_k \\leq 1$), we have the following result.\n\nProposition 6. Let $\\alpha_k \\in (0;1]$ (recall that $\\varphi_k(n) = n^{-\\alpha_k}$). 
The optimal sampling rate $\\tau_k$ for arm $k$ is $\\tau_k(n) = \\widetilde{\\Theta}(n^{1/\\alpha_k})$.\n\nProof. The idea of the proof is that if the sampling is too frequent (i.e. $\\beta$ close to 1), then the dependency between the values of the arm reduces the information obtained by taking the average. In other words, $\\sum_n \\varphi_k(\\tau_k(n))$ increases too quickly. On the other hand, if the sampling is too scarce (i.e. $\\beta$ is very large), the information obtained at each pull is important, but the total number of pulls in a given time $T$ is approximately $T^{1/\\beta}$ and thus is too low. The optimal solution to this trade-off is to take $\\beta = 1/\\alpha$, which directly comes from the fact that this is the point where $\\sum_n \\varphi_k(\\tau_k(n))$ becomes logarithmic. The complete proof is available in the supplementary material.\n\nIf $\\alpha_k < 1$ for all $k$, this result means that the best policy (with a sampling scheme of the form $O(n^{\\beta})$) should update the empirical means associated with each arm $k$ at a rate $O(n^{1/\\alpha_k})$; contrary to the i.i.d. case, it is therefore not relevant to try and update the empirical rewards at each time step. There henceforth must be gaps between updates of the means: it is precisely the role of the waiting arm to make these gaps possible. As seen in the depiction of Remix-UCB, when pulled, the waiting arm provides a reward that will count towards the cumulative gains of the agent and help her control her regret, but that will not be used to update any empirical mean.\n\nAs for a precise pulling strategy to implement given Proposition 6, it must be understood that it is the slowest arm that determines the best uniform control possible, since it is the one which will be selected the least number of times: it is unnecessary to pull the fastest arms more often than the slowest arm. Therefore, if $i_1, \\ldots, i_{k_s}$ are the $k_s$ remaining arms at epoch $s$, and $\\alpha = 1 \\wedge \\min_{i \\in \\{i_1,\\ldots,i_{k_s}\\}} \\alpha_i$,\u00b9 then an arm selection strategy based on the rate of the slowest arm suggests to pull arm $i_m$ and update $\\hat{\\mu}^{\\tau_{i_m}}_{i_m}$ for the $n$-th time at time instants\n\n$$\\tau_{i_m}(n) = \\begin{cases} (\\tau_{i_1}(n-1) + k_s) \\vee \\lceil n^{1/\\alpha} \\rceil & \\text{if } m = 1, \\\\ \\tau_{i_1}(n) + m - 1 & \\text{otherwise} \\end{cases} \\quad (10)$$\n\n(i.e. all arms are pulled at the same $O(n^{1/\\alpha})$ frequency) and to pull the waiting arm while waiting.\n\nTime budget per epoch. In the Remix-UCB algorithm, the function $G$ defines the size of the rounds. The definition of $G$ is rather technical: we have $G(s) = \\max_{k \\in B_s} G_k(s)$, where\n\n$$G_k(s) = \\inf\\left\\{ t \\in \\mathbb{N}_+ : 2(\\Lambda^{\\tau_k}_k)^2 \\log(1/\\delta_s) \\leq t\\theta_s^2 \\right\\}, \\quad (11)$$\n\nwhere the $\\tau_k(n)$ are defined above. In other words, $G_k$ encodes the minimum amount of time necessary to reach the aimed length of confidence interval by following the aforementioned policy. But the most interesting property of $G$ is that $G(s) = \\widetilde{\\Theta}((\\theta_s^{-2}\\log(1/\\delta_s))^{1/\\alpha})$. This is the key element used in the proof of the regret bound, which can be found in Theorem 2 below.\n\nPutting it all together. At epoch $s$, the Remix-UCB algorithm starts by selecting the best empirical arm and flags it as the waiting arm. It then determines the speed $\\alpha$ of the slowest arm, after which it computes a time budget $T_s = G(s)$. Then, until this time horizon is reached, it pulls arms following the policy described above. 
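The trade-off resolved by Proposition 6 is easy to see numerically. In the sketch below, $\\alpha = 1/2$ is an illustrative exponent, and `penalty` is our own name for the sum $\\sum_n \\varphi(\\tau(n))$ discussed in the proof:

```python
import math

# Numerical look at the trade-off behind Proposition 6, with the illustrative
# exponent alpha = 1/2 (phi(n) = n^{-1/2}); `penalty` is our own name for the
# sum over significant pulls of phi(tau(n)) discussed in the proof.
alpha = 0.5
phi = lambda n: float(n) ** -alpha

def penalty(n_pulls, beta):
    """sum_{n >= 2} phi(tau(n)) for the sampling scheme tau(n) = ceil(n^beta)."""
    return sum(phi(math.ceil(n ** beta)) for n in range(2, n_pulls + 1))

for pulls in (100, 10_000):
    consecutive = penalty(pulls, beta=1.0)     # too frequent: grows like sqrt(n)
    tuned = penalty(pulls, beta=1 / alpha)     # tau(n) ~ n^2: grows like log(n)
    print(pulls, round(consecutive, 1), round(tuned, 1))
```

With $\\beta = 1$ the penalty grows polynomially with the number of pulls, whereas with $\\beta = 1/\\alpha$ it grows only logarithmically, which is the point where the spacing stops hurting the concentration.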
Finally, after the time budget is reached, the algorithm eliminates the arms whose empirical mean is significantly lower than the best available empirical mean.

Note that when all the $\varphi_k$ are summable, we have $\alpha = 1$, and thus the algorithm never pulls the waiting arm: Remix-UCB then mainly differs from Improved-UCB by its strategy of alternate pulls. The result below provides an upper bound on the regret of the Remix-UCB algorithm:

Theorem 2. For every arm $k$, let $1 \ge \alpha_k > 0$ and $\varphi_k(n) = n^{-\alpha_k}$. Let $\alpha = \min_{k\in\{1,\cdots,K\}} \alpha_k$ and $\Delta_* = \min_{k\in\{1,\cdots,K\}}\{\Delta_k > 0\}$. If $\alpha \le 1$, the regret of Remix-UCB is bounded in order by

$$\tilde\Theta\left( \Delta_*^{(\alpha-2)/\alpha} \log(T)^{1/\alpha} \right). \qquad (12)$$

$^1$Since $1/\alpha$ encodes the rate of sampling, it cannot be greater than 1.

Proof. The proof follows the same lines as the proof of the upper bound on the regret of the Improved-UCB algorithm. The important modifications are the sizes of the blocks, which in the mixing case depend on the $\varphi$-mixing coefficients and might grow arbitrarily large, and the waiting arm, which does not exist in the i.i.d. setting. The dominant term in the regret mentioned in Theorem 2 is related to the pulls of the waiting arm. Indeed, the waiting arm is pulled with an ever-increasing frequency, but its quality tends to increase over time, as the arms with the smallest values are eliminated. The complete proof is available in the supplementary material.

4 Discussion and Particular Cases

We here discuss Theorem 2 and some of its variations for special cases of $\varphi$-mixing processes. First, in the i.i.d. case, the regret of Improved-UCB is upper bounded by $O\big(\Delta_*^{-1}\log(T)\big)$ [2]. Observe that (12) comes down to this bound when $\alpha$ tends to 1.
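This limiting behavior is easy to verify numerically. The helper below is a hypothetical sanity check, not part of the paper's analysis: it evaluates the order of (12) with the $\tilde\Theta$ factors dropped and compares it to the i.i.d. Improved-UCB order at $\alpha = 1$.

```python
import math

def remix_regret_order(delta_star, T, alpha):
    # Order of the bound (12), dropping the Theta-tilde factors:
    # Delta_*^((alpha - 2)/alpha) * log(T)^(1/alpha)
    return delta_star ** ((alpha - 2) / alpha) * math.log(T) ** (1 / alpha)

# At alpha = 1 the exponents collapse to the i.i.d. Improved-UCB order
# Delta_*^{-1} * log(T):
assert abs(remix_regret_order(0.1, 10**6, 1.0)
           - math.log(10**6) / 0.1) < 1e-6

# Slower mixing (smaller alpha) yields a strictly worse order:
assert remix_regret_order(0.1, 10**6, 0.5) > remix_regret_order(0.1, 10**6, 1.0)
```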
Also, note that (12) is an upper bound on the regret in the algebraically mixing case. It reflects the fact that in this particular case, it is possible to ignore the dependency of the mixing process. It also implies that, even if $\alpha < 1$, i.e. even if the dependency cannot be ignored, by properly using the $\varphi$-mixing property of the different stationary processes, it is possible to obtain an upper bound of poly-logarithmic order.

Another question is to see what happens when $\alpha_k = 1$, which is an important threshold in our study. Indeed, if $\alpha_k = 1$ the $\varphi_k$ are not summable, but from Proposition 6 we have that $\tau_k(n) \approx O(n)$, i.e. the arms should be sampled as often as possible. Theorem 2 states that the regret is upper bounded in this case by $\tilde\Theta(\Delta_*^{-1}\log T)$. However, it is not possible to know whether this bound is comparable to that of the i.i.d. case because of the $\tilde\Theta$. Still, from the proof of Theorem 2 we get the following result:

Corollary 2. For every arm $k$, let $1 \ge \alpha_k > 0$ and $\varphi_k(n) = n^{-\alpha_k}$. Let $\alpha = \min_{k\in\{1,\cdots,K\}}\alpha_k$. Then, if $\alpha = 1$, the regret of Algorithm 1 is upper bounded in order by

$$O\left(\Delta_*^{-1}\, G_\alpha(\log(T))\right) \qquad (13)$$

where $\Delta_* = \min_{k\in\{1,\cdots,K\}}\{\Delta_k > 0\}$ and $G_\alpha$ is the solution of $G_\alpha^{-1}(x) = x^\alpha/(\log(x))^2$.

Although we do not have an explicit formula for the regret in the case $\alpha = 1$, it is interesting to note that (13) is strictly negligible with respect to (12) for all $\alpha < 1$, but strictly dominates $O\big(\Delta_*^{-1}\log(T)\big)$. This comes from the fact that, while in the case $\alpha = 1$ the waiting arm is no longer used, the time budget necessary to complete step $s$ is still higher than in the i.i.d. case.

When $\varphi(n)$ decreases at a logarithmic speed ($\varphi(n) \approx 1/\log(n)^\alpha$ for some $\alpha > 0$), it is still possible to apply the same reasoning as the one developed in this paper. But in this case, Remix-UCB will only achieve a regret of $\tilde\Theta\left(\exp\left[(T/\Delta_*)^{1/\alpha}\right]\right)$, which is no longer logarithmic in $T$. In other words, if the $\varphi$-mixing coefficients decrease too slowly, the information given by the concentration inequality in Theorem 1 is not sufficient to deduce interesting information about the mean value of the arms. In this case, the successive values of the $\varphi$-mixing processes are too dependent, and the randomness in the sequence of values is almost negligible; an adversarial bandit algorithm such as Exp4 [4] may give better results than Remix-UCB.

5 Conclusion

We have studied an extension of the multi-armed bandit problem to the stationary $\varphi$-mixing framework in the restless case, providing a functional algorithm and an upper bound on the regret in a general framework. Future work might include the study of a lower bound on the regret in the mixing-process case: our first findings on the issue are that the analysis of the worst-case scenario in the mixing framework bears significant challenges. Another interesting point would be the study of the more difficult case of $\beta$-mixing processes. A rather different, but very interesting question that we may address in the future is the possibility to exploit a possible structure of the correlation between rewards over time. For instance, in the case where the correlation of an arm with the close past is much higher than the correlation with the distant past, it might be interesting to see whether the analysis done in [16] can be extended to exploit this correlation structure.

Acknowledgments.
This work is partially supported by the ANR-funded project GRETA – Greediness: theory and algorithms (ANR-12-BS02-004-01) and the ND project.

References

[1] Audibert JY, Bubeck S (2009) Minimax policies for adversarial and stochastic bandits. In: Annual Conference on Learning Theory

[2] Auer P, Ortner R (2010) UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61(1-2):55–65

[3] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multi-armed bandit problem. Machine Learning 47(2-3):235–256

[4] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1):48–77

[5] Bernstein S (1927) Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen 97(1):1–59

[6] Bubeck S, Cesa-Bianchi N (2012) Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Foundations and Trends in Machine Learning, vol 5. NOW

[7] Hoeffding W (1948) A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 19(3):293–325

[8] Hoeffding W (1963) Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301):13–30, DOI 10.2307/2282952, URL http://dx.doi.org/10.2307/2282952

[9] Karandikar RL, Vidyasagar M (2002) Rates of uniform convergence of empirical means with mixing processes. Statistics & Probability Letters 58(3):297–307

[10] Kontorovich L, Ramanan K (2008) Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability 36(6):2126–2158

[11] Kulkarni S, Lozano A, Schapire RE (2005) Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In: Advances in Neural Information Processing Systems, pp 819–826

[12] Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4–22

[13] McDonald D, Shalizi C, Schervish M (2011) Estimating beta-mixing coefficients. arXiv preprint arXiv:1103.0941

[14] Mohri M, Rostamizadeh A (2009) Rademacher complexity bounds for non-i.i.d. processes. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in Neural Information Processing Systems 21, pp 1097–1104

[15] Mohri M, Rostamizadeh A (2010) Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research 11:789–814

[16] Ortner R, Ryabko D, Auer P, Munos R (2012) Regret bounds for restless Markov bandits. In: Proceedings of the Int. Conf. on Algorithmic Learning Theory, pp 214–228

[17] Pandey S, Chakrabarti D, Agarwal D (2007) Multi-armed bandit problems with dependent arms. In: Proceedings of the 24th International Conference on Machine Learning, ACM, pp 721–728

[18] Ralaivola L, Szafranski M, Stempfel G (2010) Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary β-mixing processes. Journal of Machine Learning Research 11:1927–1956

[19] Seldin Y, Slivkins A (2014) One practical algorithm for both stochastic and adversarial bandits. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp 1287–1295

[20] Steinwart I, Christmann A (2009) Fast learning from non-iid observations. In: Advances in Neural Information Processing Systems, pp 1768–1776

[21] Steinwart I, Hush D, Scovel C (2009) Learning from dependent observations. Journal of Multivariate Analysis 100(1):175–194

[22] Tekin C, Liu M (2012) Online learning of rested and restless bandits. IEEE Transactions on Information Theory 58(8):5588–5611, URL http://dblp.uni-trier.de/db/journals/tit/tit58.html#TekinL12

[23] Yu B (1994) Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability 22(1):94–116