{"title": "An Empirical Evaluation of Thompson Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 2249, "page_last": 2257, "abstract": "Thompson sampling is one of the oldest heuristics to address the exploration / exploitation trade-off, but it is surprisingly not very popular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against.", "full_text": "An Empirical Evaluation of Thompson Sampling\n\nOlivier Chapelle\nYahoo! Research\nSanta Clara, CA\nchap@yahoo-inc.com\n\nLihong Li\nYahoo! Research\nSanta Clara, CA\nlihong@yahoo-inc.com\n\nAbstract\n\nThompson sampling is one of the oldest heuristics to address the exploration / exploitation\ntrade-off, but it is surprisingly unpopular in the literature. We present\nhere some empirical results using Thompson sampling on simulated and real data,\nand show that it is highly competitive. And since this heuristic is very easy to\nimplement, we argue that it should be part of the standard baselines to compare\nagainst.\n\n1 Introduction\n\nVarious algorithms have been proposed to solve exploration / exploitation or bandit problems. One\nof the most popular is Upper Confidence Bound or UCB [7, 3], for which strong theoretical\nguarantees on the regret can be proved. Another representative is the Bayes-optimal approach of Gittins\n[4] that directly maximizes expected cumulative payoffs with respect to a given prior distribution.\nA lesser-known family of algorithms is the so-called probability matching. The idea of this heuristic\nis old and dates back to [16]. This is the reason why this scheme is also referred to as Thompson\nsampling.\nThe idea of Thompson sampling is to randomly draw each arm according to its probability of being\noptimal. 
In contrast to a full Bayesian method like the Gittins index, one can often implement Thompson\nsampling efficiently. Recent results using Thompson sampling seem promising [5, 6, 14, 12]. Its\nlimited popularity may be due to the lack of theoretical analysis. Only two\npapers have tried to provide such analysis, but they were only able to prove asymptotic convergence\n[6, 11].\nIn this work, we present some empirical results, first on a simulated problem and then on two\nreal-world ones: display advertisement selection and news article recommendation. In all cases, despite\nits simplicity, Thompson sampling achieves state-of-the-art results, and in some cases significantly\noutperforms other alternatives like UCB. The findings suggest the necessity to include Thompson\nsampling as part of the standard baselines to compare against, and to develop finite-time regret bounds\nfor this empirically successful algorithm.\n\n2 Algorithm\n\nThe contextual bandit setting is as follows. At each round we have a context x (optional) and a set\nof actions A. After choosing an action a \u2208 A, we observe a reward r. The goal is to find a policy\nthat selects actions such that the cumulative reward is as large as possible.\nThompson sampling is best understood in a Bayesian setting as follows. The set of past observations\nD is made of triplets (x_i, a_i, r_i), which are modeled using a parametric likelihood function P(r|a, x, \u03b8)\ndepending on some parameters \u03b8. Given some prior distribution P(\u03b8) on these parameters, the\nposterior distribution of these parameters is given by the Bayes rule, P(\u03b8|D) \u221d \u220f_i P(r_i|a_i, x_i, \u03b8) P(\u03b8).\nIn the realizable case, the reward is a stochastic function of the action, context and the unknown,\ntrue parameter \u03b8\u2217. 
Ideally, we would like to choose the action maximizing the expected reward,\nmax_a E(r|a, x, \u03b8\u2217).\nOf course, \u03b8\u2217 is unknown. If we are just interested in maximizing the immediate reward\n(exploitation), then one should choose the action that maximizes E(r|a, x) = \u222b E(r|a, x, \u03b8) P(\u03b8|D) d\u03b8.\n\nBut in an exploration / exploitation setting, the probability matching heuristic consists in randomly\nselecting an action a according to its probability of being optimal. That is, action a is chosen with\nprobability\n\n  \u222b I[ E(r|a, x, \u03b8) = max_{a'} E(r|a', x, \u03b8) ] P(\u03b8|D) d\u03b8,\n\nwhere I is the indicator function. Note that the integral does not have to be computed explicitly: it\nsuffices to draw a random parameter \u03b8 at each round as explained in Algorithm 1. Implementation\nof the algorithm is thus efficient and straightforward in most applications.\n\nAlgorithm 1 Thompson sampling\n\nD = \u2205\nfor t = 1, . . . , T do\n  Receive context x_t\n  Draw \u03b8_t according to P(\u03b8|D)\n  Select a_t = arg max_a E_r(r|x_t, a, \u03b8_t)\n  Observe reward r_t\n  D = D \u222a (x_t, a_t, r_t)\nend for\n\nIn the standard K-armed Bernoulli bandit, each action corresponds to the choice of an arm. The\nreward of the i-th arm follows a Bernoulli distribution with mean \u03b8\u2217_i. It is standard to model the mean\nreward of each arm using a Beta distribution since it is the conjugate distribution of the binomial\ndistribution. The instantiation of Thompson sampling for the Bernoulli bandit is given in algorithm\n2. It is straightforward to adapt the algorithm to the case where different arms use different Beta\ndistributions as their priors.\n\nAlgorithm 2 Thompson sampling for the Bernoulli bandit\nRequire: \u03b1, \u03b2 prior parameters of a Beta distribution\nS_i = 0, F_i = 0, \u2200i. {Success and failure counters}\nfor t = 1, . . . , T do\n  for i = 1, . . . 
, K do\n    Draw \u03b8_i according to Beta(S_i + \u03b1, F_i + \u03b2).\n  end for\n  Draw arm \u00ee = arg max_i \u03b8_i and observe reward r\n  if r = 1 then\n    S_\u00ee = S_\u00ee + 1\n  else\n    F_\u00ee = F_\u00ee + 1\n  end if\nend for\n\n3 Simulations\n\nWe present some simulation results with Thompson sampling for the Bernoulli bandit problem and\ncompare them to the UCB algorithm. The reward probability of each of the K arms is modeled\nby a Beta distribution which is updated after an arm is selected (see algorithm 2). The initial prior\ndistribution is Beta(1,1).\nThere are various variants of the UCB algorithm, but they all have in common that the confidence\nparameter should increase over time. Specifically, we chose the arm for which the following upper\nconfidence bound [8, page 278] is maximum:\n\n  k/m + \u221a( 2 (k/m) log(1/\u03b4) / m ) + 2 log(1/\u03b4) / m,    \u03b4 = \u221a(1/t),    (1)\n\nwhere m is the number of times the arm has been selected and k its total reward. This is a tight\nupper confidence bound derived from Chernoff\u2019s bound.\nIn this simulation, the best arm has a reward probability of 0.5 and the K \u2212 1 other arms have a\nprobability of 0.5 \u2212 \u03b5. In order to speed up the computations, the parameters are only updated after\nevery 100 iterations. The regret as a function of T for various settings is plotted in figure 1. An\nasymptotic lower bound has been established in [7] for the regret of a bandit algorithm:\n\n  R(T) \u2265 log(T) [ \u2211_{i=1}^{K} (p\u2217 \u2212 p_i) / D(p_i || p\u2217) + o(1) ],    (2)\n\nwhere p_i is the reward probability of the i-th arm, p\u2217 = max_i p_i and D is the Kullback-Leibler\ndivergence. This lower bound is logarithmic in T with a constant depending on the p_i values. 
The\nplots in figure 1 show that the regrets are indeed logarithmic in T (the linear trend on the right hand\nside) and it turns out that the observed constants (slope of the lines) are close to the optimal constants\ngiven by the lower bound (2). Note that the offset of the red curve is irrelevant because of the o(1)\nterm in the lower bound (2). In fact, the red curves were shifted such that they pass through the\nlower left-hand corner of the plot.\n\n[Figure 1 plots: regret vs. T, four panels (K=10, \u03b5=0.1; K=100, \u03b5=0.1; K=10, \u03b5=0.02; K=100, \u03b5=0.02); curves: Thompson, UCB, asymptotic lower bound]\n\nFigure 1: Cumulative regret for K \u2208 {10, 100} and \u03b5 \u2208 {0.02, 0.1}. The plots are averaged over\n100 repetitions. The red line is the lower bound (2) shifted such that it goes through the origin.\n\nFigure 2: Regret of optimistic Thompson sampling [11] in the same setting as the lower left plot of\nfigure 1.\n\nAs with any Bayesian algorithm, one can wonder about the robustness of Thompson sampling to\nprior mismatch. The results in figure 1 already include some prior mismatch because the Beta prior\nwith parameters (1,1) has a large variance while the true probabilities were selected to be close to\n0.5. We have also done some other simulations (not shown) where there is a mismatch in the prior\nmean. In particular, when the reward probability of the best arm is 0.1 and the 9 others have a\nprobability of 0.08, Thompson sampling, with the same prior as before, is still better than UCB\nand is still asymptotically optimal.\nWe can thus conclude that in these simulations, Thompson sampling is asymptotically optimal and\nachieves a smaller regret than the popular UCB algorithm. 
It is important to note that for UCB,\nthe confidence bound (1) is tight; we have tried some other confidence bounds, including the one\noriginally proposed in [3], but they resulted in larger regrets.\n\nOptimistic Thompson sampling The intuition behind UCB and Thompson sampling is that, for\nthe purpose of exploration, it is beneficial to boost the predictions of actions for which we are\nuncertain. But Thompson sampling modifies the predictions in both directions and there is apparently no\nbenefit in decreasing a prediction. This observation led to a recently proposed algorithm called\nOptimistic Bayesian sampling [11] in which the modified score is never smaller than the mean. More\nprecisely, in algorithm 1, E_r(r|x_t, a, \u03b8_t) is replaced by max(E_r(r|x_t, a, \u03b8_t), E_{r,\u03b8|D}(r|x_t, a, \u03b8)).\nSimulations in [12] showed some gains using this optimistic version of Thompson sampling. We\ncompared in figure 2 the two versions of Thompson sampling in the case K = 10 and \u03b5 = 0.02.\nOptimistic Thompson sampling achieves a slightly better regret, but the gain is marginal. A\npossible explanation is that when the number of arms is large, it is likely that, in standard Thompson\nsampling, the selected arm already has a boosted score.\n\nPosterior reshaping Thompson sampling is a heuristic advocating to draw samples from the\nposterior, but one might consider changing that heuristic to draw samples from a modified distribution.\nIn particular, sharpening the posterior would have the effect of increasing exploitation while\nwidening it would favor exploration. In our simulations, the posterior is a Beta distribution with\nparameters a and b, and we have tried to change it to parameters a/\u03b1, b/\u03b1. Doing so does not change the\nposterior mean, but multiplies its variance by a factor close to \u03b1^2.\nFigure 3 shows the average and distribution of regret for different values of \u03b1. 
Values of \u03b1 smaller\nthan 1 decrease the amount of exploration and often result in lower regret. But the price to pay is a\nhigher variance: in some runs, the regret is very large. The average regret is asymptotically not as\ngood as with \u03b1 = 1, but tends to be better in the non-asymptotic regime.\n\nImpact of delay In a real-world system, the feedback is typically not processed immediately\nbecause of various runtime constraints. Instead it usually arrives in batches over a certain period of\ntime. We now try to quantify the impact of this delay by doing some simulations that mimic the\nproblem of news article recommendation [9] that will be described in section 5.\n\nFigure 3: Thompson sampling where the parameters of the Beta posterior distribution have been\ndivided by \u03b1. The setting is the same as in the lower left plot of figure 1 (1000 repetitions). Left:\naverage regret as a function of T. Right: distribution of the regret at T = 10^7. Since the outliers can\ntake extreme values, those above 6000 are compressed at the top of the figure.\n\nTable 1: Influence of the delay: regret when the feedback is provided every \u03b4 steps.\n\n\u03b4      1       3       10      32      100     316     1000\nUCB    24,145  24,695  25,662  28,148  37,141  77,687  226,220\nTS     9,105   9,199   9,049   9,451   11,550  21,594  59,256\nRatio  2.65    2.68    2.84    2.98    3.22    3.60    3.82\n\nWe consider a dynamic set of 10 items. At a given time, with probability 10^\u22123 one of the items retires\nand is replaced by a new one. The true reward probability of a given item is drawn according to a\nBeta(4,4) distribution. The feedback is received only every \u03b4 time units. Table 1 shows the average\nregret (over 100 repetitions) of Thompson sampling and UCB at T = 10^6. An interesting quantity\nin this simulation is the relative regret of UCB and Thompson sampling. 
It appears that Thompson\nsampling is more robust than UCB when the delay is long. Thompson sampling alleviates the\ninfluence of delayed feedback by randomizing over actions; on the other hand, UCB is deterministic\nand suffers a larger regret in case of a sub-optimal choice.\n\n4 Display Advertising\n\nWe now consider an online advertising application. Given a user visiting a publisher page, the\nproblem is to select the best advertisement for that user. A key element in this matching problem is\nthe click-through rate (CTR) estimation: what is the probability that a given ad will be clicked given\nsome context (user, page visited)? Indeed, in a cost-per-click (CPC) campaign, the advertiser only\npays when the ad gets clicked. This is the reason why it is important to select ads with high CTRs.\nThere is of course a fundamental exploration / exploitation dilemma here: in order to learn the CTR\nof an ad, it needs to be displayed, leading to a potential loss of short-term revenue. More details on\ndisplay advertising and the data used for modeling can be found in [1].\nIn this paper, we consider standard regularized logistic regression for predicting CTR. There are\nseveral features representing the user, page, ad, as well as conjunctions of these features. Some\nof the features include identifiers of the ad, advertiser, publisher and visited page. These features\nare hashed [17] and each training sample ends up being represented as a sparse binary vector of\ndimension 2^24.\nIn our model, the posterior distribution on the weights is approximated by a Gaussian distribution\nwith a diagonal covariance matrix. As in the Laplace approximation, the mean of this distribution is\nthe mode of the posterior and the inverse variance of each weight is given by the curvature. 
The use of this convenient approximation of the posterior is twofold. It first serves as a prior on the weights\nto update the model when a new batch of training data becomes available, as described in algorithm\n3. And it is also the distribution used in Thompson sampling.\n\n[Figure 3 plots: left, regret vs. T for \u03b1 = 2, 1, 0.5, 0.25 with the asymptotic lower bound; right, regret vs. \u03b1]\n\nAlgorithm 3 Regularized logistic regression with batch updates\nRequire: Regularization parameter \u03bb > 0.\nm_i = 0, q_i = \u03bb. {Each weight w_i has an independent prior N(m_i, q_i^{\u22121})}\nfor t = 1, . . . , T do\n  Get a new batch of training data (x_j, y_j), j = 1, . . . , n.\n  Find w as the minimizer of: (1/2) \u2211_{i=1}^{d} q_i (w_i \u2212 m_i)^2 + \u2211_{j=1}^{n} log(1 + exp(\u2212y_j w\u22a4x_j)).\n  m_i = w_i\n  q_i = q_i + \u2211_{j=1}^{n} x_{ij}^2 p_j (1 \u2212 p_j), p_j = (1 + exp(\u2212w\u22a4x_j))^{\u22121} {Laplace approximation}\nend for\n\nEvaluating an explore / exploit policy is difficult because we typically do not know the reward of an\naction that was not chosen. A possible solution, as we shall see in section 5, is to use a replayer in\nwhich previous, randomized exploration data can be used to produce an unbiased offline estimator\nof the new policy [10]. Unfortunately, their approach cannot be used in our case here because\nit reduces the effective data size substantially when the number of arms K is large, yielding too\nhigh variance in the evaluation results. [15] studies another promising approach using the idea of\nimportance weighting, but the method applies only when the policy is static, which is not the case\nfor online bandit algorithms that constantly adapt to their history.\nFor the sake of simplicity, therefore, we considered in this section a simulated environment. 
More\nprecisely, the context and the ads are real, but the clicks are simulated using a weight vector w\u2217.\nThis weight vector could have been chosen arbitrarily, but it was in fact a perturbed version of some\nweight vector learned from real clicks. The input feature vectors x are thus as in the real-world\nsetting, but the clicks are artificially generated with probability P(y = 1|x) = (1 + exp(\u2212w\u2217\u22a4x))^{\u22121}.\nAbout 13,000 contexts, representing a small random subset of the total traffic, are presented every\nhour to the policy which has to choose an ad among a set of eligible ads. The number of eligible ads\nfor each context depends on numerous constraints set by the advertiser and the publisher. It varies\nbetween 1 and 5,910 with a mean of 1,364 and a median of 514 (over a set of 66,373 ads). Note that\nin this experiment, the number of eligible ads is smaller than what we would observe in live traffic\nbecause we restricted the set of advertisers.\nThe model is updated every hour as described in algorithm 3. A feature vector is constructed for\nevery (context, ad) pair and the policy decides which ad to show. A click for that ad is then generated\nwith probability (1 + exp(\u2212w\u2217\u22a4x))^{\u22121}. This labeled training sample is then used at the end of the\nhour to update the model. The total number of clicks received during this one-hour period is the\nreward. But in order to eliminate unnecessary variance in the estimation, we instead computed the\nexpectation of that number since the click probabilities are known.\nSeveral explore / exploit strategies are compared; they only differ in the way the ads are selected; all\nthe rest, including the model updates, is identical as described in algorithm 3. These strategies are:\n\nThompson sampling This is algorithm 1 where each weight is drawn independently according to\nits Gaussian posterior approximation N(m_i, q_i^{\u22121}) (see algorithm 3). As in section 3, we\nalso consider a variant in which the standard deviations q_i^{\u22121/2} are first multiplied by a factor\n\u03b1 \u2208 {0.25, 0.5}. This favors exploitation over exploration.\n\nLinUCB This is an extension of the UCB algorithm to the parametric case [9]. It selects the\nad based on mean and standard deviation. It also has a factor \u03b1 to control the\nexploration / exploitation trade-off. More precisely, LinUCB selects the ad for which\n\u2211_{i=1}^{d} m_i x_i + \u03b1 \u221a( \u2211_{i=1}^{d} q_i^{\u22121} x_i^2 ) is maximum.\n\nExploit-only Select the ad with the highest mean.\n\nRandom Select the ad uniformly at random.\n\n\u03b5-greedy Mix between exploitation and random: with \u03b5 probability, select a random ad; otherwise,\nselect the one with the highest mean.\n\nTable 2: CTR regrets on the display advertising data.\n\nMethod      TS                    LinUCB                \u03b5-greedy              Exploit  Random\nParameter   0.25   0.5    1       0.5    1      2       0.005  0.01   0.02    -        -\nRegret (%)  4.45   3.72   3.81    4.99   4.22   4.14    5.05   4.98   5.22    5.00     31.95\n\nFigure 4: CTR regret over the 4 days test period for 3 algorithms: Thompson sampling with \u03b1 = 0.5,\nLinUCB with \u03b1 = 2, Exploit-only. The regret in the first hour is large, around 0.3, because the\nalgorithms predict randomly (no initial model provided).\n\nResults A preliminary result is about the quality of the variance prediction. The diagonal Gaussian\napproximation of the posterior does not seem to harm the variance predictions. In particular, they\nare very well calibrated: when constructing a 95% confidence interval for CTR, the true CTR is in\nthis interval 95.1% of the time.\nThe regrets of the different explore / exploit strategies can be found in table 2. 
Thompson sampling\nachieves the best regret and interestingly the modified version with \u03b1 = 0.5 gives slightly better\nresults than the standard version (\u03b1 = 1). This confirms the results of the previous section (figure 3)\nwhere \u03b1 < 1 yielded better regrets in the non-asymptotic regime.\nExploit-only does pretty well, at least compared to random selection. This seems at first a bit\nsurprising given that the system has no prior knowledge about the CTRs. A possible explanation is that\nthe change in context induces some exploration, as noted in [13]. Also, the fact that exploit-only\nis so much better than random might explain why \u03b5-greedy does not beat it: whenever this\nstrategy chooses a random action, it suffers a large regret on average which is not compensated by its\nexploration benefit.\nFinally, figure 4 shows the regret of three algorithms across time. As expected, the regret has a\ndecreasing trend over time.\n\n5 News Article Recommendation\n\nIn this section, we consider another application of Thompson sampling in personalized news article\nrecommendation on the Yahoo! front page [2, 9]. Each time a user visits the portal, a news article out\nof a small pool of hand-picked candidates is recommended. The candidate pool is dynamic: old\narticles may retire and new articles may be added in. The average size of the pool is around 20.\nThe goal is to choose the most interesting article to users, or formally, maximize the total number of\nclicks on the recommended articles. In this case, we treat articles as arms, and define the payoff to\nbe 1 if the article is clicked on and 0 otherwise. Therefore, the average per-trial payoff of a policy is\nits overall CTR.\n\nFigure 5: Normalized CTRs of various algorithms on the news article recommendation data with\ndifferent update delays: {10, 30, 60} minutes. 
The normalization is with respect to a random baseline.\n\nEach user was associated with a binary raw feature vector of over 1000 dimensions, which indicates\ninformation about the user such as age, gender, geographical location, behavioral targeting, etc. These\nfeatures are typically sparse, so using them directly makes learning more difficult and is\ncomputationally expensive. One can find a lower-dimensional feature subspace by, say, following previous\npractice [9]. Here, we adopted the simpler principal component analysis (PCA), which did not\nappear to affect the bandit algorithms much in our experience. In particular, we performed a PCA and\nprojected the raw user feature onto the first 20 principal components. Finally, a constant feature 1 is\nappended, so that the final user feature contains 21 components. The constant feature serves as the\nbias term in the CTR model described next.\nWe use logistic regression, as in Algorithm 3, to model article CTRs: given a user feature vector\nx \u2208 \u211d^21, the probability of click on an article a is (1 + exp(\u2212x\u22a4w_a))^{\u22121} for some weight vector\nw_a \u2208 \u211d^21 to be learned. The same parameter algorithm and exploration heuristics are applied as in\nthe previous section. Note that we have a different weight vector for each article, which is affordable\nas the numbers of articles and features are both small. Furthermore, given the size of data, we have\nnot found article features to be helpful. Indeed, it is shown in our previous work [9, Figure 5] that\narticle features are helpful in this domain only when data are highly sparse.\nGiven the small size of the candidate pool, we adopt the unbiased offline evaluation method of [10]\nto compare various bandit algorithms. In particular, we collected randomized serving events for\na random fraction of user visits; in other words, these random users were recommended an article\nchosen uniformly from the candidate pool. 
From 7 days in June 2009, over 34M randomized serving\nevents were obtained.\nAs in section 3, we varied the update delay to study how various algorithms degrade. Three values\nwere tried: 10, 30, and 60 minutes. Figure 5 summarizes the overall CTRs of four families of\nalgorithms together with the exploit-only baseline. As in the previous section, (optimistic) Thompson\nsampling appears competitive across all delays. While the deterministic UCB works well with short\ndelay, its performance drops significantly as the delay increases. In contrast, randomized algorithms\nare more robust to delay, and when there is a one-hour delay, (optimistic) Thompson sampling is\nsignificantly better than others (given the size of our data).\n\n6 Conclusion\n\nThe extensive experimental evaluation carried out in this paper reveals that Thompson sampling is a\nvery effective heuristic for addressing the exploration / exploitation trade-off. In its simplest form,\nit does not have any parameter to tune, but our results show that tweaking the posterior to reduce\nexploration can be beneficial. In any case, Thompson sampling is very easy to implement and should\nthus be considered as a standard baseline. Also, since it is a randomized algorithm, it is robust in the\ncase of delayed feedback.\nFuture work includes, of course, a theoretical analysis of its finite-time regret. The benefit of this\nanalysis would be twofold. First, it would hopefully contribute to making Thompson sampling as\npopular as other algorithms for which regret bounds exist. Second, it could provide guidance on\ntweaking the posterior in order to achieve a smaller regret.\n\n[Figure 5 plot: normalized CTR vs. delay (min) at 10, 30, 60; series: TS 0.5, TS 1, OTS 0.5, OTS 1, UCB 1, UCB 2, UCB 5, EG 0.05, EG 0.1, Exploit]\n\nReferences\n[1] D. Agarwal, R. Agrawal, R. Khanna, and N. Kota. 
Estimating rates of rare events with multiple hierarchies through scalable log-linear models. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 213\u2013222, 2010.\n\n[2] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems 21, pages 17\u201324, 2008.\n\n[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235\u2013256, 2002.\n\n[4] John C. Gittins. Multi-armed Bandit Allocation Indices. Wiley Interscience Series in Systems and Optimization. John Wiley & Sons Inc, 1989.\n\n[5] Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft\u2019s Bing search engine. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-10), pages 13\u201320, 2010.\n\n[6] O.-C. Granmo. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics, 3(2):207\u2013234, 2010.\n\n[7] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6:4\u201322, 1985.\n\n[8] J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(1):273\u2013306, 2005.\n\n[9] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661\u2013670, 2010.\n\n[10] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. 
In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297\u2013306, 2011.\n\n[11] Benedict C. May, Nathan Korda, Anthony Lee, and David S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:01, Statistics Group, Department of Mathematics, University of Bristol, 2011. Submitted to the Annals of Applied Probability.\n\n[12] Benedict C. May and David S. Leslie. Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:02, Statistics Group, Department of Mathematics, University of Bristol, 2011.\n\n[13] J. Sarkar. One-armed bandit problems with covariates. The Annals of Statistics, 19(4):1978\u20132002, 1991.\n\n[14] S. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639\u2013658, 2010.\n\n[15] Alexander L. Strehl, John Langford, Lihong Li, and Sham M. Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23 (NIPS-10), pages 2217\u20132225, 2011.\n\n[16] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3\u20134):285\u2013294, 1933.\n\n[17] K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, and A. Smola. Feature hashing for large scale multitask learning. In ICML, 2009.\n", "award": [], "sourceid": 1232, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Lihong", "family_name": "Li", "institution": null}]}