{"title": "PAC-Bayesian Analysis of Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1683, "page_last": 1691, "abstract": "We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) $N$ goes as $\sqrt{N I_{\rho_t}(S;A)}$, where $I_{\rho_t}(S;A)$ is the mutual information between states and actions (the side information) used by the algorithm at round $t$. If the algorithm uses all the side information, the regret bound scales as $\sqrt{N \ln K}$, where $K$ is the number of actions (arms). However, if the side information $I_{\rho_t}(S;A)$ is not fully used, the regret bound is significantly tighter. In the extreme case, when $I_{\rho_t}(S;A) = 0$, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information that it actually uses. 
We also present an algorithm for multiarmed bandits with side information with computational complexity that is linear in the number of actions.", "full_text": "PAC-Bayesian Analysis of Contextual Bandits\n\nYevgeny Seldin1,4 Peter Auer2 François Laviolette3 John Shawe-Taylor4 Ronald Ortner2\n\n1Max Planck Institute for Intelligent Systems, Tübingen, Germany\n2Chair for Information Technology, Montanuniversität Leoben, Austria\n3Département d'informatique, Université Laval, Québec, Canada\n4Department of Computer Science, University College London, UK\n\nseldin@tuebingen.mpg.de, {auer,ronald.ortner}@unileoben.ac.at,\nfrancois.laviolette@ift.ulaval.ca, jst@cs.ucl.ac.uk\n\nAbstract\n\nWe derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) $N$ goes as $\sqrt{N I_{\rho_t}(S;A)}$, where $I_{\rho_t}(S;A)$ is the mutual information between states and actions (the side information) used by the algorithm at round $t$. If the algorithm uses all the side information, the regret bound scales as $\sqrt{N \ln K}$, where $K$ is the number of actions (arms). However, if the side information $I_{\rho_t}(S;A)$ is not fully used, the regret bound is significantly tighter. In the extreme case, when $I_{\rho_t}(S;A) = 0$, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information that it actually uses. 
We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.\n\n1 Introduction\n\nMultiarmed bandits with side information are an elegant mathematical model for many real-life interactive systems, such as personalized online advertising, personalized medical treatment, and so on. This model is also known as contextual bandits or associative bandits (Kaelbling, 1994, Strehl et al., 2006, Langford and Zhang, 2007, Beygelzimer et al., 2011). In multiarmed bandits with side information the learner repeatedly observes states (side information) $\{s_1, s_2, \dots\}$ (for example, symptoms of a patient) and has to perform actions (for example, prescribe drugs), such that the expected regret is minimized. The regret is usually measured by the difference between the reward that could be achieved by the best (unknown) fixed policy (for example, the number of patients that would be cured if we knew the best drug for each set of symptoms) and the reward obtained by the algorithm (the number of patients that were actually cured).\n\nMost of the existing analyses of multiarmed bandits with side information have focused on the adversarial (worst-case) model, where the sequence of rewards associated with each state-action pair is chosen by an adversary. However, many real-life problems are not adversarial. We derive a data-dependent analysis for stochastic multiarmed bandits with side information. In the stochastic setting the rewards for each state-action pair are drawn from a fixed unknown distribution. The sequence of states is also drawn from a fixed unknown distribution. We restrict ourselves to problems with a finite number of states $N$ and a finite number of actions $K$ and leave the generalization to continuous state and action spaces to future work. We also do not assume any structure of the state space. Thus, for us a state is just a number between 1 and $N$. 
For example, in online advertising the state can be the country from which a web page is accessed.\n\nThe result presented in this paper exhibits an adaptive dependency on the side information (state identity) that is actually used by the algorithm. This allows us to provide the algorithm a large amount of side information and let the algorithm decide which part of this side information is actually relevant to the task. For example, in online advertising we can increase the state resolution and provide the algorithm the town from which the web page was accessed, but if this refined state information is not used by the algorithm the regret bound will not deteriorate. This can be contrasted with existing analyses of adversarial multiarmed bandits, where the regret bound depends on a predefined complexity of the underlying expert class (Beygelzimer et al., 2011). Thus, the existing analysis of adversarial multiarmed bandits would either become looser if we added more side information or limit the usage of the side information a priori through its internal structure. (We note that through the relation between PAC-Bayesian analysis and the analysis of adversarial online learning described in Banerjee (2006) it might be possible to extend our analysis to the adversarial setting, but we leave this research direction to future work.)\n\nThe idea of regularization by relevant mutual information goes back to the Information Bottleneck principle in supervised and unsupervised learning (Tishby et al., 1999). Tishby and Polani (2010) further suggested measuring the complexity of a policy in reinforcement learning by the mutual information between states and actions used by the policy. We note, however, that our starting point is the regret bound, and we derive the regularization term from our analysis without introducing it a priori. 
The analysis also provides a time- and data-dependent weighting of the regularization term.\n\nOur results are based on PAC-Bayesian analysis (Shawe-Taylor and Williamson, 1997, Shawe-Taylor et al., 1998, McAllester, 1998, Seeger, 2002), which was developed for supervised learning within the PAC (Probably Approximately Correct) learning framework (Valiant, 1984). In PAC-Bayesian analysis the complexity of a model is defined by a user-selected prior over a hypothesis space. Unlike in VC-dimension-based approaches and their successors, where the complexity is defined for a hypothesis class, in PAC-Bayesian analysis the complexity is defined for individual hypotheses. The analysis provides an explicit trade-off between individual model complexity and its empirical performance, and a high-probability guarantee on the expected performance.\n\nAn important distinction between supervised learning and problems with limited feedback, such as multiarmed bandits and reinforcement learning more generally, is the fact that in supervised learning the training set is given, whereas in reinforcement learning the training set is generated by the learner as it plays the game. In supervised learning every hypothesis in a hypothesis class can be evaluated on all the samples, whereas in reinforcement learning the rewards of one action cannot be used to evaluate another action. Recently, Seldin et al. (2011b,a) generalized PAC-Bayesian analysis to martingales and suggested a way to apply it under limited feedback. Here, we apply this generalization to multiarmed bandits with side information.\n\nThe remainder of the paper is organized as follows. We start with definitions in Section 2 and provide our main results in Section 3, which include an instantaneous regret bound and a new algorithm for stochastic multiarmed bandits with side information. In Section 4 we present an experiment that illustrates our theoretical results. 
Then, we dive into the proof of our main results in Section 5 and discuss the paper in Section 6.\n\n2 Definitions\n\nIn this section we provide all essential definitions for our main results in the following section. We start with the definition of stochastic multiarmed bandits with side information. Let $S$ be a set of $|S| = N$ states and let $A$ be a set of $|A| = K$ actions, such that any action can be performed in any state. Let $s \in S$ denote the states and $a \in A$ denote the actions. Let $R(a,s)$ be the expected reward for performing action $a$ in state $s$. At each round $t$ of the game the learner is presented a state $S_t$ drawn i.i.d. according to an unknown distribution $p(s)$. The learner draws an action $A_t$ according to his choice of a distribution (policy) $\pi_t(a|s)$ and obtains a stochastic reward $R_t$ with expected value $R(A_t, S_t)$. Let $\{S_1, S_2, \dots\}$ denote the sequence of observed states, $\{\pi_1, \pi_2, \dots\}$ the sequence of policies played, $\{A_1, A_2, \dots\}$ the sequence of actions played, and $\{R_1, R_2, \dots\}$ the sequence of observed rewards. Let $T_t = \{\{S_1,\dots,S_t\}, \{\pi_1,\dots,\pi_t\}, \{A_1,\dots,A_t\}, \{R_1,\dots,R_t\}\}$ denote the history of the game up to time $t$.\n\nAssume that $\pi_t(a|s) > 0$ for all $t$, $a$, and $s$. For $t \ge 1$, $a \in \{1,\dots,K\}$, and the sequence of observed states $\{S_1,\dots,S_t\}$, define a set of random variables $R^{a,S_t}_t$:\n\n$$R^{a,S_t}_t = \begin{cases} \frac{1}{\pi_t(a|S_t)} R_t, & \text{if } A_t = a, \\ 0, & \text{otherwise.} \end{cases}$$\n\n(The variables $R^{a,s}_t$ are defined only for the observed state $s = S_t$.) Note that whenever defined, $\mathbb{E}[R^{a,S_t}_t | T_{t-1}, S_t] = R(a, S_t)$. The definition of $R^{a,s}_t$ is generally known as importance-weighted sampling (Sutton and Barto, 1998). Importance-weighted sampling is required for the application of PAC-Bayesian analysis, as will be shown in the technical part of the paper. 
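As a sanity check on this construction, the short sketch below (the policy, reward means, and round count are illustrative placeholders, not from the paper) verifies numerically that averaging the importance-weighted variables over rounds recovers the expected reward of every action, including rarely played ones:

```python
import numpy as np

def importance_weighted_rewards(pi_s, a_played, reward, K):
    """One round of importance-weighted sampling in the observed state.

    pi_s: length-K action probabilities pi_t(.|S_t) in the observed state.
    a_played: the action actually drawn from pi_s.
    reward: the observed stochastic reward R_t.
    Returns the vector R^{a,S_t}_t: reward / pi_s[a] for the played action,
    0 for all others, so that E[R^{a,S_t}_t | history] = R(a, S_t).
    """
    r = np.zeros(K)
    r[a_played] = reward / pi_s[a_played]
    return r

# Unbiasedness check on a single state (illustrative numbers).
rng = np.random.default_rng(0)
K = 3
pi = np.array([0.5, 0.3, 0.2])        # policy in this state
true_R = np.array([0.6, 0.5, 0.4])    # Bernoulli reward means R(a, s)
T = 50_000
actions = rng.choice(K, size=T, p=pi)
rewards = (rng.random(T) < true_R[actions]).astype(float)
est = np.zeros(K)
for a, r in zip(actions, rewards):
    est += importance_weighted_rewards(pi, a, r, K)
est /= T   # est[a] approximates R(a, s) for every a
```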
Define $n_t(s) = \sum_{\tau=1}^t \mathbb{I}\{S_\tau = s\}$ as the number of times state $s$ appeared up to time $t$ ($\mathbb{I}$ is the indicator function). We define the empirical rewards of state-action pairs as:\n\n$$\hat R_t(a,s) = \begin{cases} \frac{1}{n_t(s)} \sum_{\{\tau = 1,\dots,t \,:\, S_\tau = s\}} R^{a,s}_\tau, & \text{if } n_t(s) > 0, \\ 0, & \text{otherwise.} \end{cases}$$\n\nNote that whenever $n_t(s) > 0$ we have $\mathbb{E}\hat R_t(a,s) = R(a,s)$. For every state $s$ we define the "best" action in that state as $a^*(s) = \arg\max_a R(a,s)$ (if there are multiple "best" actions, one of them is chosen arbitrarily). We then define the expected and empirical regret for performing any other action $a$ in state $s$ as:\n\n$$\Delta(a,s) = R(a^*(s),s) - R(a,s), \qquad \hat\Delta_t(a,s) = \hat R_t(a^*(s),s) - \hat R_t(a,s).$$\n\nLet $\hat p_t(s) = \frac{n_t(s)}{t}$ be the empirical distribution over states observed up to time $t$. For any policy $\rho(a|s)$ we define the empirical reward, empirical regret, and expected regret of the policy as: $\hat R_t(\rho) = \sum_s \hat p_t(s) \sum_a \rho(a|s) \hat R_t(a,s)$, $\hat\Delta_t(\rho) = \sum_s \hat p_t(s) \sum_a \rho(a|s) \hat\Delta_t(a,s)$, and $\Delta(\rho) = \sum_s p(s) \sum_a \rho(a|s) \Delta(a,s)$.\n\nWe define the marginal distribution over actions that corresponds to a policy $\rho(a|s)$ and the uniform distribution over $S$ as $\bar\rho(a) = \frac{1}{N}\sum_s \rho(a|s)$, and the mutual information between actions and states corresponding to the policy $\rho(a|s)$ and the uniform distribution over $S$ as\n\n$$I_\rho(S;A) = \frac{1}{N} \sum_{s,a} \rho(a|s) \ln \frac{\rho(a|s)}{\bar\rho(a)}.$$\n\nFor the proof of our main result, and also in order to explain the experiments, we have to define a hypothesis space for our problem. This definition is not used in the statement of the main result. Let $H$ be a hypothesis space, such that each member $h \in H$ is a deterministic mapping from $S$ to $A$. Denote by $a = h(s)$ the action assigned by hypothesis $h$ to state $s$. It is easy to see that the size of the hypothesis space is $|H| = K^N$. 
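For illustration, $I_\rho(S;A)$ can be computed directly from an $N \times K$ policy matrix; the sketch below (the two example policies are made up for this illustration) realizes the two extremes $I_\rho(S;A) = 0$ (the side information is ignored) and $I_\rho(S;A) = \ln K$ (a deterministic policy whose action profile is uniform):

```python
import numpy as np

def policy_mutual_information(rho):
    """I_rho(S;A) = (1/N) * sum_{s,a} rho(a|s) * ln(rho(a|s) / rho_bar(a)),
    where rho is an (N, K) row-stochastic policy matrix and rho_bar is the
    marginal over actions under the uniform distribution on states."""
    N, _ = rho.shape
    rho_bar = rho.mean(axis=0)
    # Convention 0 * ln(0) = 0; guard the log argument to avoid warnings.
    terms = np.where(rho > 0, rho * np.log(np.where(rho > 0, rho, 1.0) / rho_bar), 0.0)
    return float(terms.sum() / N)

N, K = 100, 20
# Policy that ignores the state entirely: zero mutual information.
uniform = np.full((N, K), 1.0 / K)
# Deterministic policy spreading the K actions evenly over the states.
deterministic = np.zeros((N, K))
deterministic[np.arange(N), np.arange(N) % K] = 1.0

i_uniform = policy_mutual_information(uniform)    # -> 0
i_det = policy_mutual_information(deterministic)  # -> ln K
```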
Denote by $R(h) = \sum_{s \in S} p(s) R(h(s), s)$ the expected reward of a hypothesis $h$. Define:\n\n$$\hat R_t(h) = \frac{1}{t} \sum_{\tau=1}^t R^{h(S_\tau),S_\tau}_\tau.$$\n\nNote that $\mathbb{E}\hat R_t(h) = R(h)$. Let $h^* = \arg\max_{h \in H} R(h)$ be the "best" hypothesis (the one that chooses the "best" action in each state). (If there are multiple hypotheses achieving the maximal reward, pick any of them.) Define:\n\n$$\Delta(h) = R(h^*) - R(h), \qquad \hat\Delta_t(h) = \hat R_t(h^*) - \hat R_t(h).$$\n\nAny policy $\rho(a|s)$ defines a distribution over $H$: we can draw an action $a$ for each state $s$ according to $\rho(a|s)$ and thus obtain a hypothesis $h \in H$. We use $\rho(h)$ to denote the respective probability of drawing $h$. For a policy $\rho$ we define $\Delta(\rho) = \mathbb{E}_{\rho(h)}[\Delta(h)]$ and $\hat\Delta_t(\rho) = \mathbb{E}_{\rho(h)}[\hat\Delta_t(h)]$. By marginalization these definitions are consistent with our preceding definitions of $\Delta(\rho)$ and $\hat\Delta_t(\rho)$.\n\nFinally, let $n_h(a) = \sum_{s=1}^N \mathbb{I}\{h(s) = a\}$ be the number of states in which action $a$ is played by the hypothesis $h$. Let $A_h = \left\{\frac{n_h(a)}{N}\right\}_{a \in A}$ be the normalized cardinality profile (histogram) over the actions played by hypothesis $h$ (with respect to the uniform distribution over $S$). Let $H(A_h) = -\sum_a \frac{n_h(a)}{N} \ln \frac{n_h(a)}{N}$ be the entropy of this cardinality profile. In other words, $H(A_h)$ is the entropy of the action choice of hypothesis $h$ (with respect to the uniform distribution over $S$). Note that the optimal policy $\rho^*(a|s)$ (the one that selects the "best" action in each state) is deterministic and we have $I_{\rho^*}(S;A) = H(A_{h^*})$.\n\n3 Main Results\n\nOur main result is a data- and complexity-dependent regret bound for a general class of prediction strategies of a smoothed exponential form. 
Let $\rho_t(a)$ be an arbitrary distribution over actions, let\n\n$$\rho^{exp}_t(a|s) = \frac{\rho_t(a) e^{\gamma_t \hat R_t(a,s)}}{Z(\rho^{exp}_t, s)}, \qquad (1)$$\n\nwhere $Z(\rho^{exp}_t, s) = \sum_a \rho_t(a) e^{\gamma_t \hat R_t(a,s)}$ is a normalization factor, and let\n\n$$\tilde\rho^{exp}_t(a|s) = (1 - K\varepsilon_{t+1})\rho^{exp}_t(a|s) + \varepsilon_{t+1} \qquad (2)$$\n\nbe a smoothed exponential policy. The following theorem provides a regret bound for playing $\tilde\rho^{exp}_t$ at round $t+1$ of the game. For generality, we assume that rounds $1,\dots,t$ were played according to arbitrary policies $\pi_1,\dots,\pi_t$.\n\nTheorem 1. Assume that in game rounds $1,\dots,t$ policies $\{\pi_1,\dots,\pi_t\}$ were played and assume that $\min_{a,s} \pi_t(a|s) \ge \varepsilon_t$ for an arbitrary $\varepsilon_t$ that is independent of $T_t$. Let $\rho_t(a)$ be an arbitrary distribution over $A$ that can depend on $T_t$ and satisfies $\min_a \rho_t(a) \ge \epsilon_{t+1}$. Let $c > 1$ be an arbitrary number that is independent of $T_t$. Then, with probability greater than $1 - \delta$ over $T_t$, simultaneously for all policies $\tilde\rho^{exp}_t$ defined by (2) that satisfy\n\n$$\frac{N I_{\rho^{exp}_t}(S;A) + K(\ln N + \ln K) + \ln\frac{2m_t}{\delta}}{2(e-2)t} \le \frac{\varepsilon_t}{c^2} \qquad (3)$$\n\nwe have:\n\n$$\Delta(\tilde\rho^{exp}_t) \le (1+c)\sqrt{\frac{2(e-2)\left(N I_{\rho^{exp}_t}(S;A) + K(\ln N + \ln K) + \ln\frac{2m_t}{\delta}\right)}{t\,\varepsilon_t}} + \frac{\ln\frac{1}{\epsilon_{t+1}}}{\gamma_t} + K\varepsilon_{t+1}, \qquad (4)$$\n\nwhere $m_t = \ln\left(\sqrt{\frac{(e-2)t}{\ln\frac{2}{\delta}}}\right)/\ln(c)$, and for all $\rho^{exp}_t$ that do not satisfy (3), with the same probability:\n\n$$\Delta(\tilde\rho^{exp}_t) \le \frac{2\left(N I_{\rho^{exp}_t}(S;A) + K(\ln N + \ln K) + \ln\frac{2m_t}{\delta}\right)}{t\,\varepsilon_t} + \frac{\ln\frac{1}{\epsilon_{t+1}}}{\gamma_t} + K\varepsilon_{t+1}.$$\n\nNote that the mutual information in Theorem 1 is calculated with respect to $\rho^{exp}_t$ and not $\tilde\rho^{exp}_t$. Theorem 1 allows tuning the learning rate $\gamma_t$ based on the sample. It also provides an instantaneous regret bound for any algorithm that plays the policies $\{\tilde\rho^{exp}_1, \tilde\rho^{exp}_2, \dots\}$ throughout the game. 
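Equations (1) and (2) for a single state are straightforward to transcribe; in the sketch below the values of $\rho_t(a)$, $\hat R_t(a,s)$, $\gamma_t$, and $\varepsilon_{t+1}$ are arbitrary illustrative numbers, not values prescribed by the theorem:

```python
import numpy as np

def smoothed_exp_policy(rho_a, R_hat_s, gamma, eps):
    """Eq. (1): exponential weighting of a marginal rho_t(a) by the
    empirical rewards in state s; Eq. (2): eps-smoothing so that every
    action keeps probability at least eps."""
    w = rho_a * np.exp(gamma * R_hat_s)
    rho_exp = w / w.sum()                                  # Eq. (1)
    rho_smooth = (1 - len(rho_a) * eps) * rho_exp + eps    # Eq. (2)
    return rho_exp, rho_smooth

rho_a = np.full(4, 0.25)                  # marginal over K = 4 actions
R_hat = np.array([0.6, 0.5, 0.5, 0.4])    # empirical rewards in state s
rho_exp, rho_smooth = smoothed_exp_policy(rho_a, R_hat, gamma=5.0, eps=0.01)
# Both are proper distributions; rho_smooth keeps every action above eps.
```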
In order to obtain such a bound we just have to take a decreasing sequence $\{\varepsilon_1, \varepsilon_2, \dots\}$ and substitute $\delta$ in Theorem 1 with $\delta_t = \frac{\delta}{t(t+1)}$. Then, by the union bound, the result holds with probability greater than $1 - \delta$ for all rounds of the game simultaneously. This leads to Algorithm 1 for stochastic multiarmed bandits with side information. Note that each round of the algorithm takes O(K) time.\n\nTheorem 1 is based on the following regret decomposition and the subsequent theorem and two lemmas that bound the three terms in the decomposition:\n\n$$\Delta(\tilde\rho^{exp}_t) = [\Delta(\rho^{exp}_t) - \hat\Delta_t(\rho^{exp}_t)] + \hat\Delta_t(\rho^{exp}_t) + [R(\rho^{exp}_t) - R(\tilde\rho^{exp}_t)]. \qquad (5)$$\n\nTheorem 2. Under the conditions of Theorem 1 on $\{\pi_1,\dots,\pi_t\}$ and $c$, simultaneously for all policies $\rho$ that satisfy (3), with probability greater than $1 - \delta$:\n\n$$\Delta(\rho) - \hat\Delta_t(\rho) \le (1+c)\sqrt{\frac{2(e-2)\left(N I_\rho(S;A) + K(\ln N + \ln K) + \ln\frac{2m_t}{\delta}\right)}{t\,\varepsilon_t}}, \qquad (6)$$\n\nAlgorithm 1: Algorithm for stochastic contextual bandits. 
(See text for definitions of $\varepsilon_t$ and $\gamma_t$.)\n\nInput: $N$, $K$\n$\hat R(a,s) \leftarrow 0$ for all $a,s$ (these are cumulative [unnormalized] rewards)\n$\rho(a) \leftarrow \frac{1}{K}$ for all $a$\n$n(s) \leftarrow 0$ for all $s$\n$t \leftarrow 1$\nwhile not terminated do\n  Observe state $S_t$.\n  if $\varepsilon_t \ge \frac{1}{K}$ or $n(S_t) = 0$ then\n    $\rho(a|S_t) \leftarrow \rho(a)$ for all $a$\n  else\n    $\rho(a|S_t) \leftarrow (1 - K\varepsilon_t)\frac{\rho(a) e^{\gamma_t \hat R(a,S_t)/n(S_t)}}{\sum_{a'} \rho(a') e^{\gamma_t \hat R(a',S_t)/n(S_t)}} + \varepsilon_t$ for all $a$\n  $\rho(a) \leftarrow \frac{N-1}{N}\rho(a) + \frac{1}{N}\rho(a|S_t)$ for all $a$\n  Draw action $A_t$ according to $\rho(a|S_t)$ and play it.\n  Observe reward $R_t$.\n  $n(S_t) \leftarrow n(S_t) + 1$\n  $\hat R(A_t,S_t) \leftarrow \hat R(A_t,S_t) + \frac{R_t}{\rho(A_t|S_t)}$\n  $t \leftarrow t + 1$\n\nand for all $\rho$ that do not satisfy (3), with the same probability:\n\n$$\Delta(\rho) - \hat\Delta_t(\rho) \le \frac{2\left(N I_\rho(S;A) + K(\ln N + \ln K) + \ln\frac{2m_t}{\delta}\right)}{t\,\varepsilon_t}.$$\n\nNote that Theorem 2 holds for all possible $\rho$-s, including those that do not have an exponential form.\n\nLemma 1. For any distribution $\rho^{exp}_t$ of the form (1), where $\rho_t(a) \ge \epsilon$ for all $a$, we have:\n\n$$\hat\Delta_t(\rho^{exp}_t) \le \frac{\ln\frac{1}{\epsilon}}{\gamma_t}.$$\n\nLemma 2. Let $\tilde\rho$ be an $\varepsilon$-smoothed version of a policy $\rho$, such that $\tilde\rho(a|s) = (1 - K\varepsilon)\rho(a|s) + \varepsilon$; then\n\n$$R(\rho) - R(\tilde\rho) \le K\varepsilon.$$\n\nThe proof of Theorem 2 is provided in Section 5 and the proofs of Lemmas 1 and 2 are provided in the supplementary material.\n\nComments on Theorem 1. Theorem 1 exhibits what we were looking for: the regret of a policy $\tilde\rho^{exp}_t$ depends on the trade-off between its complexity, $N I_{\rho^{exp}_t}(S;A)$, and the empirical regret, which is bounded by $\frac{\ln\frac{1}{\epsilon_{t+1}}}{\gamma_t}$. We note that $0 \le I_{\rho_t}(S;A) \le \ln K$; hence, the result is interesting when $N \gg K$, since otherwise the $K \ln K$ term in the bound neutralizes the advantage we get from having small mutual information values. 
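Putting the pieces together, the following is a compact simulation of Algorithm 1 on a toy environment. The environment, the horizon, and the schedules for $\varepsilon_t$ and $\gamma_t$ are illustrative assumptions for this sketch (the paper suggests $\varepsilon_t = (Kt)^{-1/3}$ and tunes $\gamma_t$ via a trade-off detailed in the supplementary material); it is not a faithful reproduction of the paper's experimental setup:

```python
import numpy as np

def run_algorithm1(R_true, p_s, T, rng):
    """Toy run of Algorithm 1. R_true: (N, K) Bernoulli reward means;
    p_s: state distribution. Returns the average reward per round."""
    N, K = R_true.shape
    R_hat = np.zeros((N, K))     # cumulative importance-weighted rewards
    rho = np.full(K, 1.0 / K)    # running-average marginal over actions
    n = np.zeros(N, dtype=int)   # state visit counts
    total = 0.0
    for t in range(1, T + 1):
        eps = min((K * t) ** (-1.0 / 3.0), 1.0 / K)  # exploration floor
        gamma = (K * t) ** (1.0 / 3.0)               # learning rate (assumed schedule)
        s = rng.choice(N, p=p_s)
        if eps >= 1.0 / K or n[s] == 0:
            pi = rho.copy()
        else:
            z = gamma * R_hat[s] / n[s]
            w = rho * np.exp(z - z.max())            # stable exponential weights
            pi = (1 - K * eps) * w / w.sum() + eps
        rho = (N - 1) / N * rho + pi / N             # running-average marginal
        a = rng.choice(K, p=pi)
        r = float(rng.random() < R_true[s, a])
        total += r
        n[s] += 1
        R_hat[s, a] += r / pi[a]                     # importance-weighted update
    return total / T

rng = np.random.default_rng(1)
N, K = 5, 3
R_true = np.full((N, K), 0.5)
R_true[np.arange(N), np.arange(N) % K] = 0.9   # one good arm per state
avg = run_algorithm1(R_true, np.full(N, 1.0 / N), T=20_000, rng=rng)
# avg should clearly exceed the ~0.63 per-round mean of uniform play
```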
The assumption that $N \gg K$ is reasonable for many applications. We believe that the dependence of the first term of the regret bound (4) on $\varepsilon_t$ is an artifact of our crude upper bound on the variance of the sampling process (given in Lemma 3 in the proof of Theorem 2) and that this term should not be in the bound. This is supported by an empirical study of stochastic multiarmed bandits (Seldin et al., 2011a). With the current bound the best choice for $\varepsilon_t$ is $\varepsilon_t = (Kt)^{-1/3}$, which, by integration over the game rounds, yields $O(K^{1/3} t^{2/3})$ dependence of the cumulative regret on the number of arms and game rounds. However, if we manage to derive a tighter analysis and remove $\varepsilon_t$ from the first term in (4), the best choice of $\varepsilon_t$ will be $\varepsilon_t = (Kt)^{-1/2}$ and the dependence of the cumulative regret on the number of arms and time horizon will improve to $O((Kt)^{1/2})$. One way to achieve this is to apply EXP3.P-style updates (Auer et al., 2002b); however, Seldin et al. (2011a) empirically show that in stochastic environments the EXP3 algorithm of Auer et al. (2002b), which is closely related to Algorithm 1, has significantly better performance. Thus, it is desirable to derive a better analysis for the EXP3 algorithm in stochastic environments. We note that although the UCB algorithm for stochastic multiarmed bandits (Auer et al., 2002a) is asymptotically better than the EXP3 algorithm, it is not compatible with PAC-Bayesian analysis and we are not aware of a way to derive a UCB-type algorithm and analysis for multiarmed bandits with side information whose dependence on the number of states would be better than $O(N \ln K)$. Seldin et al. (2011a) also demonstrate that empirically it takes a large number of rounds until the asymptotic advantage of UCB over EXP3 translates into a real advantage in practice.\n\nIt is not trivial to minimize (4) with respect to $\gamma_t$ analytically. 
Generally, higher values of $\gamma_t$ decrease the second term of the bound, but also lead to more concentrated policies (conditional distributions) $\rho^{exp}_t(a|s)$ and thus higher mutual information values $I_{\rho^{exp}_t}(S;A)$. A simple way to address this trade-off is to set $\gamma_t$ such that the contribution of the second term is as close to the contribution of the first term as possible. This can be approximated by taking the value of the mutual information from the previous round (or an approximation of the value of the mutual information from the previous round). More details on parameter setting for the algorithm are provided in the supplementary material.\n\nComments on Algorithm 1. By the regret decomposition (5) and Theorem 2, the regret at round $t+1$ is minimized by a policy $\rho_t(a|s)$ that minimizes a certain trade-off between the mutual information $I_\rho(S;A)$ and the empirical reward $\hat R_t(\rho)$. This trade-off is analogous to the rate-distortion trade-off in information theory (Cover and Thomas, 1991). Minimization of the rate-distortion trade-off is achieved by iterative updates of the following form, which are known as the Blahut-Arimoto (BA) algorithm:\n\n$$\rho^{BA}_t(a|s) = \frac{\rho^{BA}_t(a) e^{\gamma_t \hat R_t(a,s)}}{\sum_a \rho^{BA}_t(a) e^{\gamma_t \hat R_t(a,s)}}, \qquad \rho^{BA}_t(a) = \frac{1}{N} \sum_s \rho^{BA}_t(a|s).$$\n\nRunning a similar type of iterations in our case would be prohibitively expensive, since they would require iterating over all states $s \in S$ at each round of the game. We approximate these iterations by approximating the marginal distribution over the actions by a running average:\n\n$$\tilde\rho^{exp}_{t+1}(a) = \frac{N-1}{N}\tilde\rho^{exp}_t(a) + \frac{1}{N}\tilde\rho^{exp}_t(a|S_t). \qquad (7)$$\n\n
Note that Theorem 1 holds for any choice of\n\nSince \u21e2exp\nt\nfor \u02dc\u21e2exp\n\u21e2t(a), including (7).\n(a) propagates information between different states, but Theo-\nWe point out an interesting fact: \u21e2exp\nK , which corresponds to application of EXP3\nrem 1 also holds for the uniform distribution \u21e2(a) = 1\nalgorithm in each state independently. If these independent multiarmed bandits independently con-\nverge to similar strategies, we still get a tighter regret bound. This happens because the correspond-\ning subspace of the hypothesis space is signi\ufb01cantly smaller than the total hypothesis space, which\nenables us to put a higher prior on it (Seldin and Tishby, 2010). Nevertheless, propagation of infor-\nmation between states via the distribution \u21e2exp\n(a) helps to achieve even faster convergence of the\nregret, as we can see from the experiments in the next section.\nComparison with state-of-the-art. We are not aware of algorithms for stochastic multiarmed ban-\ndits with side information. The best known to us algorithm for adversarial multiarmed bandits with\n\nside information is EXP4.P by Beygelzimer et al. (2011). EXP4.P has O(pKt ln|H|) regret and\nO(K|H|) complexity per game round. In our case |H| = KN, which means that EXP4.P would\nhave O(pKtN ln K) regret and O(KN +1) computational complexity. For hard problems, where\nall side information has to be used, our regret bound is inferior to the regret bound of Beygelzimer\net al. (2011) due to O(t2/3) dependence on the number of game rounds. However, we believe that\nthis can be improved by a more careful analysis of the existing algorithm. For simple problems\nthe dependence of our regret bound on the number of states is signi\ufb01cantly better, up to the point\nthat when the side information is irrelevant for the task we can get O(pK ln N ) dependence on the\nnumber of states versus O(pN ln K) in EXP4.P. 
For $N \gg K$ this leads to tighter regret bounds for small $t$ even despite the "incorrect" dependence on $t$ of our bound, and if we improve the analysis it will lead to tighter regret bounds for all $t$. As we have already said, our algorithm is able to filter relevant information from large amounts of side information automatically, whereas in EXP4.P the usage of side information has to be restricted externally through the construction of the hypothesis class.\n\n[Figure 1: Behavior of: (a) cumulative regret $\Delta(t)$, (b) bound on instantaneous regret $\Delta(\tilde\rho^{exp}_t)$, and (c) the approximation of the mutual information $I_{\rho^{exp}_t}(S;A)$, for $H(A_{h^*}) \in \{0, 1, 2, 3\}$. "Baseline" in the first graph corresponds to playing $N$ independent multiarmed bandits, one in each state. Each line in the graphs corresponds to an average over 10 repetitions of the experiment.]\n\nThe second important advantage of our algorithm is the exponential improvement in computational complexity. This is achieved by switching from the space of experts to the state-action space in all our calculations.\n\n4 Experiments\n\nWe present an experiment on synthetic data that illustrates our results. 
We take $N = 100$, $K = 20$, a uniform distribution over states ($p(s) = 0.01$), and consider four settings, with $H(A_{h^*}) = \ln(1) = 0$, $H(A_{h^*}) = \ln(3) \approx 1$, $H(A_{h^*}) = \ln(7) \approx 2$, and $H(A_{h^*}) = \ln(20) \approx 3$, respectively. In the first case, the same action is the best in all states (and hence $H(A_{h^*}) = 0$ for the optimal hypothesis $h^*$). In the second case, for the first 33 states the best action is number 1, for the next 33 states the best action is number 2, and for the remaining third of the states the best action is number 3 (thus, depending on the state, one of the three actions is the "best" and $H(A_{h^*}) = \ln(3)$). In the third case, there are seven groups of 14 states and each group has its own best action. In the last case, there are 20 groups of 5 states and each of the $K = 20$ actions is the best in exactly one of the 20 groups. For all states, the reward of the best action in a state has a Bernoulli distribution with bias 0.6 and the rewards of all other actions in that state have a Bernoulli distribution with bias 0.5. We run the experiment for $T = 4{,}000{,}000$ rounds and calculate the cumulative regret $\Delta(t) = \sum_{\tau=1}^t \Delta(\tilde\rho^{exp}_\tau)$ and the instantaneous regret bound given in (4). For computational efficiency, the mutual information $I_{\rho^{exp}_t}(S;A)$ is approximated by a running average (see supplementary material for details).\n\nAs we can see from the graphs (see Figure 1), the algorithm exhibits sublinear cumulative regret (note the axes' scales). Furthermore, for simple problems (with small $H(A_{h^*})$) the regret grows more slowly than for complex problems. "Baseline" in Figure 1.a shows the performance of an algorithm with the same parameter values that runs $N$ multiarmed bandits, one in each state, independently of the other states. We see that for all problems except the hardest one our algorithm performs better than the baseline, and for the hardest problem it performs almost as well as the baseline. 
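The four reward settings are easy to reconstruct in code; in this sketch the state groups are taken to be roughly equal via array_split (so the group sizes may differ by one state from the description above), and the helper recovers $H(A_{h^*})$ for each setting:

```python
import numpy as np

def make_reward_matrix(N, K, n_groups):
    """(N, K) expected-reward matrix: states are split into n_groups
    roughly equal groups, each with its own best action of mean 0.6;
    all other actions have mean 0.5."""
    R = np.full((N, K), 0.5)
    for a, states in enumerate(np.array_split(np.arange(N), n_groups)):
        R[states, a] = 0.6
    return R

def best_action_entropy(R):
    """H(A_{h*}): entropy of the histogram of best actions over states."""
    counts = np.bincount(R.argmax(axis=1), minlength=R.shape[1])
    p = counts[counts > 0] / R.shape[0]
    return float(-(p * np.log(p)).sum())

N, K = 100, 20
entropies = [best_action_entropy(make_reward_matrix(N, K, g))
             for g in (1, 3, 7, 20)]
# entropies is approximately [0, ln 3, ln 7, ln 20] = [0, ~1.1, ~1.9, ~3.0]
```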
The regret bound in Figure 1.b provides meaningful values for the simplest problem after 1 million rounds (which is on average 500 samples per state-action pair) and after 4 million rounds for all the problems (the graph starts at $t = 10{,}000$). Our estimates of the mutual information $I_{\rho^{exp}_t}(S;A)$ reflect $H(A_{h^*})$ for the corresponding problems (for $H(A_{h^*}) = 0$ it converges to zero, for $H(A_{h^*}) \approx 1$ it is approximately one, etc.).\n\n5 Proof of Theorem 2\n\nThe proof of Theorem 2 is based on the PAC-Bayes-Bernstein inequality for martingales (Seldin et al., 2011b). Let $KL(\rho\|\mu)$ denote the KL-divergence between two distributions (Cover and Thomas, 1991). Let $\{Z_1(h), \dots, Z_n(h) : h \in H\}$ be martingale difference sequences indexed by $h$ with respect to the filtration $\sigma(U_1), \dots, \sigma(U_n)$, where $U_i = \{Z_1(h), \dots, Z_i(h) : h \in H\}$ is the subset of martingale difference variables up to index $i$ and $\sigma(U_i)$ is the $\sigma$-algebra generated by $U_i$. This means that $\mathbb{E}[Z_i(h) | \sigma(U_{i-1})] = 0$, where $Z_i(h)$ may depend on $Z_j(h')$ for all $j < i$ and $h' \in H$. There might also be interdependence between $\{Z_i(h) : h \in H\}$. Let $\hat M_i(h) = \sum_{j=1}^i Z_j(h)$ be the corresponding martingales. Let $V_i(h) = \sum_{j=1}^i \mathbb{E}[Z_j(h)^2 | \sigma(U_{j-1})]$ be the cumulative variances of the martingales $\hat M_i(h)$. For a distribution $\rho$ over $H$ define $\hat M_i(\rho) = \mathbb{E}_{\rho(h)}[\hat M_i(h)]$ and $V_t(\rho) = \mathbb{E}_{\rho(h)}[V_t(h)]$ as weighted averages of the martingales and their cumulative variances according to the distribution $\rho$.\n\nTheorem 3 (PAC-Bayes-Bernstein Inequality). Assume that $|Z_i(h)| \le b$ for all $h$ with probability 1. Fix a prior distribution $\mu$ over $H$. Pick an arbitrary number $c > 1$. 
Then with probability greater than $1 - \delta$ over $U_n$, simultaneously for all distributions $\rho$ over $H$ that satisfy\n\n$$\sqrt{\frac{KL(\rho\|\mu) + \ln\frac{2m}{\delta}}{(e-2)V_n(\rho)}} \le \frac{1}{cb}$$\n\nwe have\n\n$$|\hat M_n(\rho)| \le (1+c)\sqrt{(e-2)V_n(\rho)\left(KL(\rho\|\mu) + \ln\frac{2m}{\delta}\right)},$$\n\nwhere $m = \ln\left(\sqrt{\frac{(e-2)n}{\ln\frac{2}{\delta}}}\right)/\ln(c)$, and for all other $\rho$:\n\n$$|\hat M_n(\rho)| \le 2b\left(KL(\rho\|\mu) + \ln\frac{2m}{\delta}\right).$$\n\nNote that $M_t(h) = t(\Delta(h) - \hat\Delta_t(h))$ are martingales and their cumulative variances are\n\n$$V_t(h) = \sum_{\tau=1}^t \mathbb{E}\left[\left(\left[R^{h^*(S_\tau),S_\tau}_\tau - R^{h(S_\tau),S_\tau}_\tau\right] - \left[R(h^*) - R(h)\right]\right)^2 \Big| T_{\tau-1}\right].$$\n\nIn order to apply Theorem 3 we have to derive an upper bound on $V_t(\rho^{exp}_t)$,¹ define a prior $\mu(h)$ over $H$, and calculate (or upper bound) the KL-divergence $KL(\rho^{exp}_t\|\mu)$. This is done in the following three lemmas.\n\nLemma 3. If $\{\varepsilon_1, \varepsilon_2, \dots\}$ is a decreasing sequence, such that $\varepsilon_t \le \min_{a,s} \pi_t(a|s)$, then for all $h$:\n\n$$V_t(h) \le \frac{2t}{\varepsilon_t}.$$\n\nThe proof of the lemma is provided in the supplementary material. Lemma 3 provides an immediate, but crude, uniform upper bound on $V_t(h)$, which yields $V_t(\rho^{exp}_t) \le \frac{2t}{\varepsilon_t}$. Since our algorithm concentrates on $h$-s with small $\Delta(h)$, which, in turn, concentrate on the best action in each state, the variance $V_t(h)$ for the corresponding $h$-s is expected to be of the order of $2Kt$ and not $\frac{2t}{\varepsilon_t}$. However, we have not yet been able to prove that the probability $\rho^{exp}_t(h)$ of the remaining hypotheses (those with large $\Delta(h)$) gets sufficiently small (of order $K\varepsilon_t$), so that the weighted cumulative variance would be of order $2Kt$. Nevertheless, this seems to hold in practice starting from relatively small values of $t$ (Seldin et al., 2011a). 
Improving the upper bound on $V_t(\rho_t^{exp})$ will improve the regret bound, but for the moment we present the regret bound based on the crude upper bound $V_t(\rho_t^{exp}) \le \frac{2t}{\varepsilon_t}$. The remaining two lemmas, which define a prior $\mu$ over $\mathcal{H}$ and bound $KL(\rho\|\mu)$, are due to Seldin and Tishby (2010).

Lemma 4. It is possible to define a distribution $\mu$ over $\mathcal{H}$ that satisfies
\[
\mu(h) \ge e^{-N H(A_h) - K \ln N - K \ln K}. \tag{8}
\]

Lemma 5. For the distribution $\mu$ that satisfies (8) and any distribution $\rho(a|s)$:
\[
KL(\rho\|\mu) \le N I_\rho(S;A) + K \ln N + K \ln K.
\]

Substitution of the upper bounds on $V_t(\rho_t^{exp})$ and $KL(\rho_t^{exp}\|\mu)$ into Theorem 3 yields Theorem 2.

6 Discussion

We presented a PAC-Bayesian analysis of stochastic multiarmed bandits with side information. Our analysis provides a data-dependent algorithm and a data-dependent regret analysis for this problem. The selection of task-relevant side information is delegated from the user to the algorithm. We also provide a general framework for deriving data-dependent algorithms and analyses for many other stochastic problems with limited feedback. The upper bound on the variance of our algorithm remains to be improved; this will be addressed in future work.

$^1$Seldin et al. (2011b) show that $V_n(\rho)$ can be replaced by an upper bound everywhere in Theorem 3.

Acknowledgments

We would like to thank all the people with whom we discussed this work and, in particular, Nicolò Cesa-Bianchi, Gábor Bartók, Elad Hazan, Csaba Szepesvári, Miroslav Dudík, Robert Schapire, John Langford, and the anonymous reviewers, whose comments helped us to improve the final version of this manuscript.
This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886, and by the European Community's Seventh Framework Programme (FP7/2007-2013), under grant agreement No. 231495. This publication only reflects the authors' views.

References

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2002b.

Arindam Banerjee. On Bayesian bounds. In Proceedings of the International Conference on Machine Learning (ICML), 2006.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

Leslie Pack Kaelbling. Associative reinforcement learning: Functions in k-DNF. Machine Learning, 15, 1994.

John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS), 2007.

David McAllester. Some PAC-Bayesian theorems. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1998.

Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 2002.

Yevgeny Seldin and Naftali Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11, 2010.

Yevgeny Seldin, Nicolò Cesa-Bianchi, Peter Auer, François Laviolette, and John Shawe-Taylor.
PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. 2011a. In review. Preprint available at http://arxiv.org/abs/1110.6755.

Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. 2011b. In review. Preprint available at http://arxiv.org/abs/1110.6886.

John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1997.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.

Alexander L. Strehl, Chris Mesterharm, Michael L. Littman, and Haym Hirsh. Experience-efficient learning in associative bandit problems. In Proceedings of the International Conference on Machine Learning (ICML), 2006.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

Naftali Tishby and Daniel Polani. Information theory of decisions and actions. In Vassilis Cutsuridis, Amir Hussain, John G. Taylor, and Daniel Polani, editors, Perception-Reason-Action Cycle: Models, Algorithms and Systems. Springer, 2010.

Naftali Tishby, Fernando Pereira, and William Bialek. The information bottleneck method. In Allerton Conference on Communication, Control and Computation, 1999.

Leslie G. Valiant. A theory of the learnable.
Communications of the Association for Computing Machinery, 27(11), 1984.