{"title": "The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information", "book": "Advances in Neural Information Processing Systems", "page_first": 817, "page_last": 824, "abstract": "We present Epoch-Greedy, an algorithm for multi-armed bandits with observable side information. Epoch-Greedy has the following properties: No knowledge of a time horizon $T$ is necessary. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. The regret scales as $O(T^{2/3} S^{1/3})$ or better (sometimes, much better). Here $S$ is the complexity term in a sample complexity bound for standard supervised learning.", "full_text": "The Epoch-Greedy Algorithm for Contextual\n\nMulti-armed Bandits\n\nJohn Langford\nYahoo! Research\n\njl@yahoo-inc.com\n\nTong Zhang\n\ntongz@rci.rutgers.edu\n\nDepartment of Statistics\n\nRutgers University\n\nAbstract\n\nWe present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also\nknown as bandits with side information). Epoch-Greedy has the following prop-\nerties:\n\n1. No knowledge of a time horizon T is necessary.\n2. The regret incurred by Epoch-Greedy is controlled by a sample complexity\n\nbound for a hypothesis class.\n\n3. The regret scales as O(T 2/3S1/3) or better (sometimes, much better). Here S\nis the complexity term in a sample complexity bound for standard supervised\nlearning.\n\n1 Introduction\n\nThe standard k-armed bandits problem has been well-studied in the literature (Lai & Robbins, 1985;\nAuer et al., 2002; Even-dar et al., 2006, for example).\nIt can be regarded as a repeated game\nbetween two players, with every stage consisting of the following: The world chooses k rewards\nr1, ..., rk \u2208 [0, 1]; the player chooses an arm i \u2208 {1, k} without knowledge of the world\u2019s chosen\nrewards, and then observes the reward ri. 
The contextual bandits setting considered in this paper is the same except for a modification of the first step, in which the player also observes context information x which can be used to determine which arm to pull.\n\nThe contextual bandits problem has many applications and is often more suitable than the standard bandits problem, because settings with no context information are rare in practice. The setting considered in this paper is directly motivated by the problem of matching ads to web-page contents on the internet. In this problem, a number of ads (arms) are available to be placed on a number of web pages (context information). Each page visit can be regarded as a random draw of the context information (one may also include the visitor\u2019s online profile as context information, if available) from an underlying distribution that is not controlled by the player. A certain amount of revenue is generated when the visitor clicks on an ad. The goal is to put the most relevant ad on each page to maximize the expected revenue. Although one may potentially put multiple ads on each web page, we focus on the problem where only one ad is placed on each page (which is like pulling an arm given context information). The more precise definition is given in Section 2.\n\nPrior Work. The problem of bandits with context has been analyzed previously (Pandey et al., 2007; Wang et al., 2005), typically under additional assumptions such as a correct prior or knowledge of the relationship between the arms. This problem is also known as associative reinforcement learning (Strehl et al., 2006, for example) or bandits with side information. A few results under as weak or weaker assumptions are directly comparable.\n\n1. The Exp4 algorithm (Auer et al., 1995) notably makes no assumptions about the world. Epoch-Greedy has a worse regret bound in T (O(T^{2/3}) rather than O(T^{1/2})) and is only analyzed under an IID assumption. 
An important advantage of Epoch-Greedy is a much better dependence on the size of the set of predictors. For a finite set of m predictors, Epoch-Greedy has a regret bound no worse than O(T^{2/3} (ln m)^{1/3}); when the number of predictors is infinite but the class has finite VC dimension d, Exp4 has a vacuous regret bound, while the ln m term in the Epoch-Greedy bound can be replaced by a complexity term depending on d. Sometimes we can achieve much better dependence on T, depending on the structure of the hypothesis space. For example, we will show that it is possible to achieve an O(ln T) regret bound using Epoch-Greedy, while this is not possible with Exp4 or any simple modification of it. Another substantial advantage is reduced computational complexity. The ERM step in Epoch-Greedy can be replaced with any standard learning algorithm that achieves approximate loss minimization, with guarantees that degrade gracefully with the approximation factor. Exp4, on the other hand, requires computation proportional to the explicit count of hypotheses in the hypothesis space.\n\n2. The random trajectories method (Kearns et al., 2000) for learning policies in reinforcement learning with hard horizon T = 1 is essentially the same setting. In that paper, bounds are stated for a batch-oriented setting where examples are formed and then used for choosing a hypothesis. Epoch-Greedy takes advantage of this idea, but it also has an analysis showing that it trades off the number of exploration and exploitation steps so as to maximize the sum of rewards incurred during both exploration and exploitation.\n\nWhat we do. We present and analyze the Epoch-Greedy algorithm for multi-armed bandits with context. This has all the nice properties stated in the abstract, resulting in a practical algorithm for solving this problem.\n\nThe paper is broken up into the following sections.\n\n1. In Section 2 we present basic definitions and background.\n2. Section 3 presents the Epoch-Greedy algorithm along with a regret bound analysis which holds without knowledge of T.\n3. 
Section 4 analyzes the instantiation of the Epoch-Greedy algorithm in several settings.\n\n2 Contextual bandits\n\nWe first formally define contextual bandit problems and algorithms to solve them.\n\nDefinition 2.1 (Contextual bandit problem) In a contextual bandits problem, there is a distribution P over (x, r_1, . . . , r_k), where x is context, a \u2208 {1, . . . , k} is one of the k arms to be pulled, and r_a \u2208 [0, 1] is the reward for arm a. The problem is a repeated game: on each round, a sample (x, r_1, . . . , r_k) is drawn from P, the context x is announced, and then for precisely one arm a chosen by the player, its reward r_a is revealed.\n\nDefinition 2.2 (Contextual bandit algorithm) A contextual bandits algorithm B determines an arm a \u2208 {1, . . . , k} to pull at each time step t, based on the previous observation sequence (x_1, a_1, r_{a,1}), . . . , (x_{t\u22121}, a_{t\u22121}, r_{a,t\u22121}), and the current context x_t.\n\nOur goal is to maximize the expected total reward \u03a3_{t=1}^{T} E_{(x_t,~r_t)\u223cP} [r_{a,t}]. Note that we use the notation r_{a,t} = r_{a_t} to improve readability. Similar to supervised learning, we assume that we are given a set H consisting of hypotheses h : X \u2192 {1, . . . , k}. Each hypothesis maps side information x to an arm a. A natural goal is to choose arms to compete with the best hypothesis in H. We introduce the following definition.\n\nDefinition 2.3 (Regret) The expected reward of a hypothesis h is\n\nR(h) = E_{(x,~r)\u223cP} [r_{h(x)}].\n\nConsider any contextual bandits algorithm B. Let Z^T = {(x_1, ~r_1), . . . , (x_T, ~r_T)}, and let the expected regret of B with respect to a hypothesis h be:\n\n\u2206R(B, h, T) = T R(h) \u2212 E_{Z^T\u223cP^T} \u03a3_{t=1}^{T} r_{B(x_t),t}.\n\nThe expected regret of B up to time T with respect to hypothesis space H is defined as\n\n\u2206R(B, H, T) = sup_{h\u2208H} \u2206R(B, h, T).\n\nThe main challenge of the contextual bandits problem is that when we pull an arm, the rewards of the other arms are not observed. Therefore it is necessary to try all arms (explore) in order to form an accurate estimate. In this context, the methods we investigate in this paper make explicit distinctions between exploration and exploitation steps. In an exploration step, the goal is to form unbiased samples by randomly pulling all arms to improve the accuracy of learning. Because it does not focus on the best arm, this step leads to large immediate regret but can potentially reduce regret in the future exploitation steps. In an exploitation step, the learning algorithm suggests the best hypothesis learned from the samples formed in the exploration steps, and the arm given by the hypothesis is pulled: the goal is to maximize immediate reward (or minimize immediate regret). Since the samples in the exploitation steps are biased (toward the arm suggested by the learning algorithm using previous exploration samples), we do not use them to learn the hypothesis for future steps. That is, in the methods we consider, exploitation does not help us improve learning accuracy in the future.\n\nMore specifically, in an exploration step, in order to form unbiased samples, we pull an arm a \u2208 {1, . . . , k} uniformly at random. Therefore the expected regret compared to the best hypothesis in H can be as large as O(1). 
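As a concrete illustration of the repeated game in Definitions 2.1 and 2.2, the sketch below simulates the protocol: the world draws (x, r_1, ..., r_k), announces x, and reveals only the reward of the pulled arm. The distribution P, the arm payoffs, and the hypotheses here are toy assumptions invented for illustration, not from the paper.

```python
import random

# A minimal sketch of the contextual bandit protocol (Definitions 2.1/2.2).
# Toy assumption: context x is in {0, 1, 2} and the arm matching x pays more.
K = 3  # number of arms

def draw_from_P():
    """Draw (x, r_1..r_k) from a toy distribution P."""
    x = random.randrange(3)
    rewards = [1.0 if a == x else 0.2 for a in range(K)]
    return x, rewards

def play(algorithm, T):
    """Run T rounds; the player sees only x, past observations, and r_a."""
    total, history = 0.0, []
    for t in range(T):
        x, rewards = draw_from_P()   # world draws a sample from P
        a = algorithm(history, x)    # player chooses an arm from x and history
        total += rewards[a]          # only r_a is revealed
        history.append((x, a, rewards[a]))
    return total

random.seed(0)
# A hypothesis h : x -> arm; here the identity map is the best hypothesis.
reward = play(lambda hist, x: x, 1000)
print(reward)  # prints 1000.0: the identity hypothesis earns reward 1 every round
```

Competing with the best hypothesis in H (here, the identity map) is exactly the regret notion of Definition 2.3.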
In an exploitation step, the expected regret can be much smaller. Therefore a central theme we examine in this paper is how to balance the trade-off between exploration and exploitation, so as to achieve a small overall expected regret up to some time horizon T.\n\nNote that if we decide to pull a specific arm a with side information x, we do not observe the rewards r_{a\u2032} for a\u2032 \u2260 a. In order to apply standard sample complexity analysis, we first show that exploration samples, where a is picked uniformly at random, can create a standard learning problem without missing observations. This is simply achieved by setting fully observed rewards r\u2032 such that\n\nr\u2032_{a\u2032}(r_a) = k I(a\u2032 = a) r_a, (1)\n\nwhere I(\u00b7) is the indicator function. The basic idea behind this transformation from partially observed to fully observed data dates back to the analysis of \u201cSample Selection Bias\u201d (Heckman, 1979). The above rule is easily generalized to any other distribution p(a) over actions by replacing k with 1/p(a).\n\nThe following lemma shows that this method of filling in the missing reward components is unbiased.\n\nLemma 2.1 For all arms a\u2032: E_{~r\u223cP|x} [r_{a\u2032}] = E_{~r\u223cP|x, a\u223cU(1,...,k)} [r\u2032_{a\u2032}(r_a)]. Therefore for any hypothesis h(x), we have R(h) = E_{(x,~r)\u223cP, a\u223cU(1,...,k)} [r\u2032_{h(x)}(r_a)].\n\nProof We have:\n\nE_{~r\u223cP|x, a\u223cU(1,...,k)} [r\u2032_{a\u2032}(r_a)] = E_{~r\u223cP|x} \u03a3_{a=1}^{k} k^{\u22121} r\u2032_{a\u2032}(r_a) = E_{~r\u223cP|x} \u03a3_{a=1}^{k} k^{\u22121} k r_a I(a\u2032 = a) = E_{~r\u223cP|x} [r_{a\u2032}].\n\nLemma 2.1 implies that we can estimate the reward R(h) of any hypothesis h(x) using an expectation with respect to exploration samples (x, a, r_a). The right hand side can then be replaced by empirical samples as \u03a3_t r_{a,t} I(h(x_t) = a_t) for hypotheses in a hypothesis space H. 
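The unbiasedness claimed in Lemma 2.1 is easy to check numerically: averaging the filled-in reward of equation (1) over a uniformly chosen arm recovers the true reward of every arm exactly. The reward vector below is an arbitrary toy example.

```python
# Verify Lemma 2.1: E_{a ~ U(1..k)}[k * I(a' = a) * r_a] = r_{a'}.
k = 4
r = [0.9, 0.1, 0.5, 0.3]  # full (normally unobserved) toy reward vector

def filled_reward(a_prime, a):
    """Equation (1): fully observed surrogate built from the one revealed r_a."""
    return k * (1 if a_prime == a else 0) * r[a]

for a_prime in range(k):
    # Exact expectation over a uniform arm a: the average of k terms.
    est = sum(filled_reward(a_prime, a) for a in range(k)) / k
    assert abs(est - r[a_prime]) < 1e-12
print("unbiased")
```

The same check works for any non-uniform exploration distribution p(a) after replacing the factor k by 1/p(a), as noted above.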
The quality of this estimation can be obtained with uniform convergence learning bounds.\n\n3 Exploration with the Epoch-Greedy algorithm\n\nThe problem with treating contextual bandits as standard bandits is that the information in x is lost. That is, the optimal arm to pull should be a function of the context x, but this is not captured by the standard bandits setting. An alternative approach is to regard each hypothesis h as a separate artificial \u201carm\u201d, and then apply a standard bandits algorithm to these artificial arms. Using this approach with m hypotheses, we can get a bound of O(m). However, this solution ignores the fact that many hypotheses can share the same arm, so that choosing an arm yields information about many hypotheses. For this reason, with a simple algorithm, we can get a bound that depends on m logarithmically, instead of the O(m) dependence of the standard bandits solution discussed above.\n\nAs discussed earlier, the key issue in the algorithm is to determine when to explore and when to exploit, so as to achieve an appropriate balance. If we are given the time horizon T in advance, and would like to optimize performance for the given T, then it is always advantageous to perform a first phase of exploration steps, followed by a second phase of exploitation steps (until time step T). The reason there is no advantage to taking any exploitation step before the last exploration step is that, by swapping the two steps, we can more accurately pick the optimal hypothesis in the exploitation step due to having more samples from exploration. With fixed T, assume that we have taken n steps of exploration and obtained an average regret bound of \u03b5_n for each exploitation step at that point; then we can bound the regret of the exploration phase by n, and that of the exploitation phase by \u03b5_n (T \u2212 n). The total regret is n + (T \u2212 n) \u03b5_n. 
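For a concrete sense of this trade-off, suppose the generalization bound after n exploration steps is eps_n = sqrt(k ln m / n), the finite-class rate used later in Section 4.1. Numerically minimizing n + (T - n) * eps_n then recovers an exploration budget growing like T^{2/3}, which is the source of the O(T^{2/3}) overall regret. The values of k, m, and T below are arbitrary illustrative choices.

```python
import math

# Trade-off from Section 3: explore n steps (regret at most n), then exploit
# for T - n steps, each costing eps_n = sqrt(k ln m / n) in expected regret.
k, m, T = 10, 1000, 100_000  # toy values, for illustration only

def total_regret(n):
    eps_n = math.sqrt(k * math.log(m) / n)
    return n + (T - n) * eps_n

best_n = min(range(1, T), key=total_regret)
print(best_n, total_regret(best_n))
# best_n grows like (T * sqrt(k ln m) / 2)^{2/3}, giving O(T^{2/3}) total regret.
```

The next paragraphs show how the same trade-off can be realized without knowing T, by interleaving exploration and exploitation in epochs.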
Using this bound, we shall switch from exploration to exploitation at the point n that minimizes the sum.\n\nWithout knowing T in advance, but with the same generalization bound, we can run exploration/exploitation in epochs, where at the beginning of each epoch \u2113, we perform one step of exploration, followed by \u23081/\u03b5_\u2113\u2309 steps of exploitation. We then start the next epoch. After epoch L, the total average regret is no more than \u03a3_{n=1}^{L} (1 + \u03b5_n \u23081/\u03b5_n\u2309) \u2264 3L. Moreover, the index L* of the epoch containing time T is no more than the optimal regret bound min_n [n + (T \u2212 n) \u03b5_n] (with known T and optimal stopping point). Therefore the performance of our method (which does not need to know T) is no worse than three times the optimal bound with known T and optimal stopping point. This motivates the algorithm in Figure 1. The idea described above is related to forcing in (Lai & Yakowitz, 1995).\n\nProposition 3.1 Consider a sequence of nonnegative and monotone non-increasing numbers {\u03b5_n}. Let L* = min{L : \u03a3_{\u2113=1}^{L} (1 + \u23081/\u03b5_\u2113\u2309) \u2265 T}. Then\n\nL* \u2264 min_{n\u2208[0,T]} [n + (T \u2212 n) \u03b5_n].\n\nProof Let n* = arg min_{n\u2208[0,T]} [n + (T \u2212 n) \u03b5_n]. The bound is trivial if n* \u2265 L*. We only need to consider the case n* \u2264 L* \u2212 1. By assumption, \u03a3_{\u2113=1}^{L*\u22121} (1 + 1/\u03b5_\u2113) \u2264 T \u2212 1. Since \u03a3_{\u2113=1}^{L*\u22121} 1/\u03b5_\u2113 \u2265 \u03a3_{\u2113=n*}^{L*\u22121} 1/\u03b5_\u2113 \u2265 (L* \u2212 n*)/\u03b5_{n*}, we have L* \u2212 1 + (L* \u2212 n*)/\u03b5_{n*} \u2264 T \u2212 1. Rearranging, we have L* \u2264 n* + (T \u2212 L*) \u03b5_{n*} \u2264 n* + (T \u2212 n*) \u03b5_{n*}, which is the claimed bound.\n\nIn Figure 1, s(Z_1^n) is a sample-dependent (integer valued) exploitation step count. Proposition 3.1 suggests that choosing s(Z_1^n) = \u23081/\u03b5_n(Z_1^n)\u2309, where \u03b5_n(Z_1^n) is a sample-dependent average generalization bound, yields performance comparable to the optimal bound with known time horizon T.\n\nDefinition 3.1 (Epoch-Greedy Exploitation Cost) Consider a hypothesis space H consisting of hypotheses that take values in {1, 2, . . . , k}. Let Z_t = (x_t, a_t, r_{a,t}) for t = 1, . . . , n be independent random samples, where a_t is uniformly distributed in {1, . . . , k}, and r_{a,t} \u2208 [0, 1] is the observed (random) reward. Let Z_1^n = {Z_1, . . . , Z_n}, and define the empirical reward maximization estimator\n\n\u02c6h(Z_1^n) = arg max_{h\u2208H} \u03a3_{t=1}^{n} r_{a,t} I(h(x_t) = a_t).\n\nGiven any fixed n, \u03b4 \u2208 [0, 1], and observation Z_1^n, we denote by s(Z_1^n) a data-dependent exploitation step count. Then the per-epoch exploitation cost is defined as:\n\n\u03bc_n(H, s) = E_{Z_1^n} [ (sup_{h\u2208H} R(h) \u2212 R(\u02c6h(Z_1^n))) s(Z_1^n) ].\n\nEpoch-Greedy (s(W_\u2113)) /*parameter s(W_\u2113): exploitation steps*/\ninitialize: exploration samples W_0 = {} and t_1 = 1\niterate \u2113 = 1, 2, . . .\n  t = t_\u2113, and observe x_t /*do one-step exploration*/\n  select an arm a_t \u2208 {1, . . . 
, k} uniformly at random\n  receive reward r_{a,t} \u2208 [0, 1]\n  W_\u2113 = W_{\u2113\u22121} \u222a {(x_t, a_t, r_{a,t})}\n  find the best hypothesis \u02c6h_\u2113 \u2208 H by solving max_{h\u2208H} \u03a3_{(x,a,r_a)\u2208W_\u2113} r_a I(h(x) = a)\n  t_{\u2113+1} = t_\u2113 + s(W_\u2113) + 1\n  for t = t_\u2113 + 1, \u00b7\u00b7\u00b7 , t_{\u2113+1} \u2212 1 /*do s(W_\u2113) steps of exploitation*/\n    select arm a_t = \u02c6h_\u2113(x_t)\n    receive reward r_{a,t} \u2208 [0, 1]\n  end for\nend iterate\n\nFigure 1: Exploration by \u03b5-greedy in epochs\n\nTheorem 3.1 For all T, n_\u2113, L such that T \u2264 L + \u03a3_{\u2113=1}^{L} n_\u2113, the expected regret of Epoch-Greedy in Figure 1 is bounded by\n\n\u2206R(Epoch-Greedy, H, T) \u2264 L + \u03a3_{\u2113=1}^{L} \u03bc_\u2113(H, s) + T \u03a3_{\u2113=1}^{L} P[s(Z_1^\u2113) < n_\u2113].\n\nThis theorem statement is very general, because we want to allow sample-dependent bounds to be used. When sample-independent bounds are used, the following simple corollary holds:\n\nCorollary 3.1 Assume we choose s(Z_1^\u2113) = s_\u2113 \u2264 \u230a1/\u03bc_\u2113(H, 1)\u230b (\u2113 = 1, . . .), and let L_T = min{L : L + \u03a3_{\u2113=1}^{L} s_\u2113 \u2265 T}. Then the expected regret of Epoch-Greedy in Figure 1 is bounded by\n\n\u2206R(Epoch-Greedy, H, T) \u2264 2 L_T.\n\nProof (of the main theorem) Let B be the Epoch-Greedy algorithm. One of the following events will occur:\n\n\u2022 A: s(Z_1^\u2113) < n_\u2113 for some \u2113 = 1, . . . , L.\n\u2022 B: s(Z_1^\u2113) \u2265 n_\u2113 for all \u2113 = 1, . . . , L.\n\nIf event A occurs, then since each reward is in [0, 1], up to time T the regret cannot be larger than T. Thus the total expected contribution of A to the regret \u2206R(B, H, T) is at most\n\nT P(A) \u2264 T \u03a3_{\u2113=1}^{L} P[s(Z_1^\u2113) < n_\u2113]. (2)\n\nIf event B occurs, then t_{\u2113+1} \u2212 t_\u2113 \u2265 n_\u2113 + 1 for \u2113 = 1, . . . , L, and thus t_{L+1} > T. 
Therefore the expected contribution of B to the regret \u2206R(B, H, T) is at most the sum of the expected regret in the first L epochs.\n\nBy definition and construction, after the first step of epoch \u2113, W_\u2113 consists of \u2113 random observations Z_j = (x_j, a_j, r_{a,j}), where a_j is drawn uniformly at random from {1, . . . , k}, and j = 1, . . . , \u2113. This is independent of the number of exploitation steps before epoch \u2113. Therefore we can treat W_\u2113 as \u2113 independent samples. This means that the expected regret associated with the exploitation steps in epoch \u2113 is \u03bc_\u2113(H, s). Since the exploration step in each epoch contributes at most 1 to the expected regret, the total expected regret for each epoch \u2113 is at most 1 + \u03bc_\u2113(H, s). Therefore the total expected regret for epochs \u2113 = 1, . . . , L is at most L + \u03a3_{\u2113=1}^{L} \u03bc_\u2113(H, s). Combined with (2), we obtain the desired bound.\n\nIn the theorem, we bound the expected regret of each exploration step by one. Clearly this assumes the worst-case scenario and can often be improved. Some consequences of the theorem with specific function classes are given in Section 4.\n\n4 Examples\n\nTheorem 3.1 is quite general. In this section, we present a few simple examples to illustrate the potential applications.\n\n4.1 Finite hypothesis space worst case bound\n\nConsider the finite hypothesis space situation, with m = |H| < \u221e. We apply Theorem 3.1 with a worst-case deviation bound. Let x_1, . . . , x_n \u2208 [0, k] be iid random variables such that E x_i \u2264 1. Then Bernstein's inequality implies that there exists a constant c_0 > 0 such that for all \u03b7 \u2208 (0, 1), with probability 1 \u2212 \u03b7:\n\n|\u03a3_{i=1}^{n} x_i \u2212 \u03a3_{i=1}^{n} E x_i| \u2264 c_0 \u221a(ln(1/\u03b7) \u03a3_{i=1}^{n} E x_i^2) + c_0 k ln(1/\u03b7) \u2264 c_0 \u221a(n k ln(1/\u03b7)) + c_0 k ln(1/\u03b7).\n\nIt follows that there exists a universal constant c > 0 such that\n\n\u03bc_n(H, 1) \u2264 c^{\u22121} \u221a(k ln m / n).\n\nTherefore in Figure 1, if we choose\n\ns(Z_1^\u2113) = \u230ac \u221a(\u2113/(k ln m))\u230b,\n\nthen \u03bc_\u2113(H, s) \u2264 1: this is consistent with the choice recommended in Proposition 3.1. In order to obtain a performance bound for this scheme using Theorem 3.1, we can simply take\n\nn_\u2113 = \u230ac \u221a(\u2113/(k ln m))\u230b.\n\nThis implies that P(s(Z_1^\u2113) < n_\u2113) = 0. Moreover, with this choice, for any T we can pick an L that satisfies the condition T \u2264 L + \u03a3_{\u2113=1}^{L} n_\u2113. It implies that there exists a universal constant c_0 > 0 such that for any given T, we can take\n\nL = \u230ac_0 T^{2/3} (k ln m)^{1/3}\u230b\n\nin Theorem 3.1. In summary, if we choose s(Z_1^\u2113) = \u230ac \u221a(\u2113/(k ln m))\u230b in Figure 1, then\n\n\u2206R(Epoch-Greedy, H, T) \u2264 2L \u2264 2 c_0 T^{2/3} (k ln m)^{1/3}.\n\nReducing the problem to standard bandits, as discussed at the beginning of Section 3, leads to a bound of O(m ln T) (Lai & Robbins, 1985; Auer et al., 2002). Therefore when m is large, the Epoch-Greedy algorithm in Figure 1 can perform significantly better. In this particular situation, Epoch-Greedy does not do as well as Exp4 in (Auer et al., 1995), which implies a regret of O(\u221a(k T ln m)). However, the advantage of Epoch-Greedy is that any learning bound can be applied. For many hypothesis classes, the ln m factor can be improved for Epoch-Greedy. In fact, a similar result can be obtained for classes with infinitely many hypotheses but finite VC dimension. Moreover, as we will see next, under additional assumptions it is possible to obtain much better bounds in terms of T for Epoch-Greedy, such as O(k ln m + k ln T). 
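The scheme of Figure 1, instantiated with the finite-class choice s_l = floor(c * sqrt(l / (k ln m))) from this section, can be sketched in a few lines. The environment, the constant c, and the small hypothesis class below are all invented for illustration; the ERM step is a brute-force argmax over H, as in the figure.

```python
import math, random

random.seed(1)
K = 3                                        # number of arms
H = [lambda x, a=a: a for a in range(K)]     # toy class: constant hypotheses
H.append(lambda x: x % K)                    # plus one context-sensitive hypothesis
m = len(H)

def draw_from_P():
    # Toy world: context x in 0..5; arm x % K pays Bernoulli(0.8), others 0.2.
    x = random.randrange(6)
    p = [0.8 if a == x % K else 0.2 for a in range(K)]
    return x, [1.0 if random.random() < q else 0.0 for q in p]

def erm(W):
    # Empirical reward maximization over exploration samples W = [(x, a, r_a)].
    return max(H, key=lambda h: sum(r for (x, a, r) in W if h(x) == a))

def epoch_greedy(T, c=0.4):
    W, t, total, ell = [], 0, 0.0, 0
    while t < T:
        ell += 1
        x, rewards = draw_from_P()           # one exploration step
        a = random.randrange(K)
        W.append((x, a, rewards[a]))
        total += rewards[a]; t += 1
        h = erm(W)
        s = int(c * math.sqrt(ell / (K * math.log(m))))  # s_l exploitation steps
        for _ in range(s):
            if t >= T:
                break
            x, rewards = draw_from_P()
            total += rewards[h(x)]           # exploit the ERM hypothesis's arm
            t += 1
    return total / T

avg = epoch_greedy(5000)
print(avg)  # tends toward 0.8, the per-round reward of the best hypothesis, as T grows
```

Note how epochs lengthen over time: early epochs are almost pure exploration (s_l = 0), while later epochs spend most of their steps exploiting the current ERM hypothesis.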
This extends the classical O(ln T) bound for standard bandits, and is not possible to achieve using Exp4 or simple variations of it.\n\n4.2 Finite hypothesis space with unknown expected reward gap\n\nThis example illustrates the importance of allowing a sample-dependent s(Z_1^n). We still assume a finite hypothesis space, with m = |H| < \u221e. However, we would like to improve the performance bound by imposing additional assumptions. In particular, we note that the standard bandits problem has regret of the form O(ln T), while in the worst case our method for the contextual bandits problem has regret O(T^{2/3}). A natural question is then: what assumptions can we impose so that the Epoch-Greedy algorithm has a regret of the form O(ln T)?\n\nThe main technical reason that the standard bandits problem has regret O(ln T) is that there is a gap between the expected reward of the best bandit and that of the second best bandit: the constant hidden in the O(ln T) bound depends on this gap, and the bound becomes trivial (infinite) when the gap approaches zero. In this example we show that a similar assumption for contextual bandits problems leads to a similar regret bound of O(ln T) for the Epoch-Greedy algorithm.\n\nLet H = {h_1, . . . , h_m}, and assume without loss of generality that R(h_1) \u2265 R(h_2) \u2265 \u00b7\u00b7\u00b7 \u2265 R(h_m). Suppose we know that R(h_1) \u2265 R(h_2) + \u2206 for some \u2206 > 0, but the value of \u2206 is not known in advance.\n\nAlthough \u2206 is not known, it can be estimated from the data Z_1^n. Let the empirical reward of h \u2208 H be\n\n\u02c6R(h|Z_1^n) = (k/n) \u03a3_{t=1}^{n} r_{a,t} I(h(x_t) = a_t).\n\nLet \u02c6h_1 be the hypothesis with the highest empirical reward on Z_1^n, and \u02c6h_2 be the hypothesis with the second highest empirical reward. We define the empirical gap as\n\n\u02c6\u2206(Z_1^n) = \u02c6R(\u02c6h_1|Z_1^n) \u2212 \u02c6R(\u02c6h_2|Z_1^n).\n\nLet h_1 be the hypothesis with the highest true expected reward; we suffer a regret when \u02c6h_1 \u2260 h_1. Again, the standard large deviation bound implies that there exists a universal constant c > 0 such that for all j \u2265 1:\n\nP(\u02c6\u2206(Z_1^n) \u2265 (j \u2212 1)\u2206, \u02c6h_1 \u2260 h_1) \u2264 m e^{\u2212c k^{\u22121} n (1+j^2) \u2206^2},\nP(\u02c6\u2206(Z_1^n) \u2264 0.5\u2206) \u2264 m e^{\u2212c k^{\u22121} n \u2206^2}.\n\nNow we can set s(Z_1^n) = \u230am^{\u22121} e^{(2k)^{\u22121} c n \u02c6\u2206(Z_1^n)^2}\u230b. With this choice, there exists a constant c\u2032 > 0 such that\n\n\u03bc_n(H, s) \u2264 \u03a3_{j=1}^{\u2308\u2206^{\u22121}\u2309} sup{s(Z_1^n) : \u02c6\u2206(Z_1^n) \u2264 j\u2206} P(\u02c6\u2206(Z_1^n) \u2208 [(j \u2212 1)\u2206, j\u2206], \u02c6h_1 \u2260 h_1)\n\u2264 \u03a3_{j=1}^{\u2308\u2206^{\u22121}\u2309} m^{\u22121} e^{(2k)^{\u22121} c n j^2 \u2206^2} P(\u02c6\u2206(Z_1^n) \u2208 [(j \u2212 1)\u2206, j\u2206], \u02c6h_1 \u2260 h_1)\n\u2264 \u03a3_{j=1}^{\u2308\u2206^{\u22121}\u2309} e^{(2k)^{\u22121} c n j^2 \u2206^2 \u2212 c k^{\u22121} n (1+j^2) \u2206^2}\n= \u03a3_{j=1}^{\u2308\u2206^{\u22121}\u2309} e^{\u2212c k^{\u22121} n (0.5 j^2 + 1) \u2206^2}\n\u2264 c\u2032 \u221a(k/n) \u2206^{\u22121} e^{\u2212c k^{\u22121} n \u2206^2}.\n\nThere exists a constant c\u2033 > 0 such that for any L:\n\nL + \u03a3_{\u2113=1}^{L} \u03bc_\u2113(H, s) \u2264 L + c\u2032 \u03a3_{\u2113=1}^{\u221e} \u221a(k/\u2113) \u2206^{\u22121} e^{\u2212c k^{\u22121} \u2113 \u2206^2} \u2264 L + c\u2033 k \u2206^{\u22122}.\n\nNow, consider any time horizon T. 
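The empirical quantities of this section are straightforward to compute from exploration samples. The sketch below (toy data and toy hypothesis class, invented for illustration) computes the importance-weighted empirical rewards and the empirical gap between the top two hypotheses:

```python
import random

random.seed(2)
K = 3
H = [lambda x, a=a: a for a in range(K)]  # toy finite class: constant hypotheses

# Exploration samples Z_t = (x_t, a_t, r_{a,t}) with a_t uniform in {1..k}.
# Toy world: arm 0 pays Bernoulli(0.9), the others Bernoulli(0.3), so the
# constant hypothesis "always pull 0" is the best in H with a large gap.
def sample():
    x = random.randrange(5)
    a = random.randrange(K)
    p = 0.9 if a == 0 else 0.3
    return x, a, 1.0 if random.random() < p else 0.0

Z = [sample() for _ in range(2000)]
n = len(Z)

def emp_reward(h):
    # Empirical reward: (k/n) * sum_t r_{a,t} * I(h(x_t) = a_t)
    return (K / n) * sum(r for (x, a, r) in Z if h(x) == a)

scores = sorted((emp_reward(h) for h in H), reverse=True)
gap_hat = scores[0] - scores[1]  # empirical gap between top two hypotheses
print(scores[0], gap_hat)        # roughly 0.9 and 0.6 in expectation
```

Plugging gap_hat into the data-dependent exploitation count of this section makes the number of exploitation steps per epoch grow exponentially once the gap is resolved, which is what drives the O(ln T) regret.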
If we set n_\u2113 = 0 when \u2113 < L, n_L = T, and\n\nL = \u23088k(ln m + ln(T + 1)) / (c \u2206^2)\u2309,\n\nthen\n\nP(s(Z_1^L) \u2264 n_L) \u2264 P(\u02c6\u2206(Z_1^L) \u2264 0.5\u2206) \u2264 m e^{\u2212c k^{\u22121} L \u2206^2} \u2264 1/T\n\n(the first inequality holds because if \u02c6\u2206(Z_1^L) \u2265 0.5\u2206, then the choice of L makes s(Z_1^L) = \u230am^{\u22121} e^{(2k)^{\u22121} c L \u02c6\u2206(Z_1^L)^2}\u230b \u2265 T + 1 > n_L). That is, if we choose s(Z_1^n) = \u230am^{\u22121} e^{(2k)^{\u22121} c n \u02c6\u2206(Z_1^n)^2}\u230b in Figure 1, then\n\n\u2206R(Epoch-Greedy, H, T) \u2264 2L + 1 + c\u2033 k \u2206^{\u22122} \u2264 2 \u23088k(ln m + ln(T + 1)) / (c \u2206^2)\u2309 + 1 + c\u2033 k \u2206^{\u22122}.\n\nThe regret for this choice is O(ln T), which is better than the O(T^{2/3}) of Section 4.1. However, the constant depends on the gap \u2206, which can be small. It is possible to combine the two strategies (that is, use the s(Z_1^n) choice of Section 4.1 when \u02c6\u2206(Z_1^n) is small) and obtain bounds that not only work well when the gap \u2206 is large, but are also not much worse than the bound of Section 4.1 when \u2206 is small. As a special case, we can apply the method in this section to solve the standard bandits problem. The O(k ln T) bound of the Epoch-Greedy method matches those of more specialized algorithms for the standard bandits problem, although our algorithm has a larger constant.\n\n5 Conclusion\n\nWe consider a generalization of the multi-armed bandits problem, where observable context can be used to determine which arm to pull, and we investigate the sample complexity of the exploration/exploitation trade-off for the Epoch-Greedy algorithm.\n\nThe Epoch-Greedy analysis leaves one important open problem. Epoch-Greedy is much better at dealing with large hypothesis spaces, or hypothesis spaces with special structure, due to its ability to employ any data-dependent sample complexity bound. However, for a finite hypothesis space, in the worst-case scenario, Exp4 has a better dependency on T. 
In such situations, it is possible that a better-designed algorithm can achieve the strengths of both.\n\nReferences\n\nAuer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47, 235\u2013256.\n\nAuer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS.\n\nEven-dar, E., Mannor, S., & Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 7, 1079\u20131105.\n\nHeckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153\u2013161.\n\nKearns, M., Mansour, Y., & Ng, A. Y. (2000). Approximate planning in large POMDPs via reusable trajectories. NIPS.\n\nLai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4\u201322.\n\nLai, T., & Yakowitz, S. (1995). Machine learning and nonparametric bandit theory. IEEE TAC, 40, 1199\u20131209.\n\nPandey, S., Agarwal, D., Chakrabarti, D., & Josifovski, V. (2007). Bandits for taxonomies: a model-based approach. SIAM Data Mining Conference.\n\nStrehl, A. L., Mesterharm, C., Littman, M. L., & Hirsh, H. (2006). Experience-efficient learning in associative bandit problems. ICML.\n\nWang, C.-C., Kulkarni, S. R., & Poor, H. V. (2005). Bandit problems with side observations. IEEE Transactions on Automatic Control, 50, 338\u2013355.\n", "award": [], "sourceid": 785, "authors": [{"given_name": "John", "family_name": "Langford", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}]}