{"title": "PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 4624, "page_last": 4633, "abstract": "We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Polya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal performance. Our approach, Polya-Gamma augmented Thompson Sampling (PG-TS), achieves state-of-the-art performance on simulated and real data. PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret. Its explicit estimation of the posterior distribution of the context feature covariance leads to substantial empirical gains over approximate approaches. PG-TS is the first approach to demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic contextual bandits.", "full_text": "PG-TS: Improved Thompson Sampling for Logistic\n\nContextual Bandits\n\nLewis Sigler Institute for Integrative Genomics\n\nDepartment of Computer Science\n\nBianca Dumitrascu\u2217\n\nPrinceton University\nPrinceton, NJ 08540\n\nbiancad@princeton.edu\n\nKaren Feng\u2217\n\nPrinceton University\nPrinceton, NJ 08540\n\nkarenfeng@princeton.edu\n\nBarbara E Engelhardt\n\nDepartment of Computer Science\n\nPrinceton University\nPrinceton, NJ 08540\nbee@princeton.edu\n\nAbstract\n\nWe address the problem of regret minimization in logistic contextual bandits, where\na learner decides among sequential actions or arms given their respective contexts\nto maximize binary rewards. 
Using a fast inference procedure with P\u00f3lya-Gamma\ndistributed augmentation variables, we propose an improved version of Thompson\nSampling, a Bayesian formulation of contextual bandits with near-optimal perfor-\nmance. Our approach, P\u00f3lya-Gamma augmented Thompson Sampling (PG-TS),\nachieves state-of-the-art performance on simulated and real data. PG-TS explores\nthe action space ef\ufb01ciently and exploits high-reward arms, quickly converging to\nsolutions of low regret. Its explicit estimation of the posterior distribution of the\ncontext feature covariance leads to substantial empirical gains over approximate\napproaches. PG-TS is the \ufb01rst approach to demonstrate the bene\ufb01ts of P\u00f3lya-\nGamma augmentation in bandits and to propose an ef\ufb01cient Gibbs sampler for\napproximating the analytically unsolvable integral of logistic contextual bandits.\n\n1\n\nIntroduction\n\nA contextual bandit is an online learning framework for modeling sequential decision-making\nproblems. Contextual bandits have been applied to problems ranging from advertising [1] and\nrecommendations [22, 21] to clinical trials [37] and mobile health [33]. In a contextual bandit\nalgorithm, a learner is given a choice among K actions or arms, for which contexts are available\nas d-dimensional feature vectors, across T sequential rounds. During each round, the learner uses\ninformation from previous rounds to estimate associations between contexts and rewards. The\nlearner\u2019s goal in each round is to select the arm that minimizes the cumulative regret, which is the\ndifference between the optimal oracle rewards and the observed rewards from the chosen arms. To\ndo this, the learner must balance exploring arms that improve the expected reward estimates and\nexploiting the current expected reward estimates to select arms with the largest expected reward. 
In this work, we focus on scenarios with binary rewards.

To address the exploration-exploitation trade-off in sequential decision making, two directions are generally considered: Upper Confidence Bound (UCB) algorithms and Thompson Sampling (TS).

∗ indicates equal authorship

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

UCB algorithms are based on the principle of optimism in the face of uncertainty [3, 6, 15] and rely on choosing actions according to expected rewards perturbed by their respective upper confidence bounds. Based on Bayesian ideas, TS [34] assumes a prior distribution over the parameters governing the relationship between contexts and rewards. At each step, an action corresponding to a random parameter sampled from the posterior distribution is chosen. Upon observing the reward for each round, the posterior distribution is updated via Bayes' rule. TS has been successfully applied in a wide range of settings [2, 32, 9, 28].

While UCB algorithms have simple implementations and good theoretical regret bounds [22], TS achieves better empirical performance in many simulated and real-world settings without sacrificing simplicity [9, 15]. Furthermore, TS is amenable to scaling through hashing, making it attractive for large-scale applications [20]. In addition, recent studies have bridged the theoretical gap between TS and UCB-based methods by analyzing regret and Bayesian regret of TS approaches in both generalized linear bandit and reinforcement learning settings [2, 28, 26, 29, 4, 5].

In this work, we focus on improving the TS approach for contextual bandits with logistic rewards [9, 15]. The logistic rewards setting is of pragmatic interest because of its natural application to problems such as modeling click-through rates in advertisement applications [22].
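To make the TS loop just described concrete, here is a minimal non-contextual Beta-Bernoulli sketch: sample a parameter from each arm's posterior, play the argmax, and update the chosen arm's posterior with the observed reward. This is a toy instance for intuition only (the arm probabilities and horizon are illustrative assumptions), not the logistic contextual setting of this paper.

```python
import random

def thompson_sampling(true_probs, T, seed=0):
    """Beta-Bernoulli Thompson sampling over len(true_probs) arms for T rounds.

    Keeps an independent Beta(alpha, beta) posterior per arm; each round,
    draws one plausible reward probability per arm, plays the argmax, and
    updates that arm's posterior with the observed binary reward.
    """
    rng = random.Random(seed)
    K = len(true_probs)
    alpha = [1.0] * K  # Beta(1, 1) uniform prior per arm
    beta = [1.0] * K
    pulls = [0] * K
    for _ in range(T):
        # Posterior sampling step: one draw per arm.
        theta = [rng.betavariate(alpha[a], beta[a]) for a in range(K)]
        a = max(range(K), key=lambda i: theta[i])
        # Simulated environment: Bernoulli reward for the chosen arm.
        r = 1 if rng.random() < true_probs[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], T=2000)
```

After enough rounds, the highest-reward arm dominates the pull counts, which is the exploit side of the trade-off; the early rounds, where posteriors are wide, supply the exploration.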
Computationally,\nthe functional form of its logistic regression likelihood leads to an intractable posterior \u2013 the necessary\nintegrals are not available in closed form and dif\ufb01cult to approximate. This intractability makes\nthe sampling step of TS with binary or categorical rewards challenging. From an optimization\nperspective, the logistic loss is exp-concave, thus allowing second-order methods in a purely online\nsetting [19, 25]. However, the convergence rate is exponential in the number of features d, making\nthese solutions impractical in most real-world settings [19].\nExisting Bayesian solutions to logistic contextual bandits rely on regularized logistic regression\nwith batch updates in which the posterior distribution is estimated via Laplace approximations. The\nLaplace approximation is a second-order moment matching method that estimates the posterior with\na multivariate Gaussian distribution. Despite offering asymptotic convergence guarantees under\nrestricted assumptions [7], the Laplace approximation struggles when the dimension of the context\n(arm features) is larger than the number of arms, and when the features themselves are non-Gaussian.\nBoth of these situations arise in the online learning setting, creating a need for novel TS approaches\nto inference. Recent work suggests that a double sampling approach via MCMC can improve TS [35].\nThis approach provides MCMC schemes for bandits with binary and Gaussian rewards, but these\nalgorithms do not generalize to the logistic contextual bandit.\nWe propose P\u00f3lya-Gamma augmented Thompson sampling (PG-TS), a fully Bayesian alternative\nto Laplace-TS. PG-TS uses a Gibbs sampler built on parameter augmentation with a P\u00f3lya-Gamma\ndistribution [27, 36, 31]. 
We compare results from PG-TS to state-of-the-art approaches on simulations that include toy models with specified and unspecified priors, and on two data sets previously considered in the contextual bandit literature.

The remainder of this paper is organized as follows. Section 2 reviews relevant background and introduces the problem. The details of Pólya-Gamma augmentation are provided in Section 3. Section 4 includes an empirical evaluation and shows substantial performance improvements in favor of PG-TS over existing approaches. We conclude in Section 5.

2 Background

In the following, x ∈ R^d denotes a d-dimensional column vector with scalar entries x_j, indexed by integers j = {1, 2, . . . , d}; x^⊤ is the transpose of x. X denotes a square matrix, while X refers to a random variable. We use ‖·‖ for the 2-norm, while ‖x‖_A denotes x^⊤Ax for a matrix A. Let 1_B(x) be the indicator function of a set B, defined as 1 if x ∈ B and 0 otherwise. MVN(b, B) denotes a multivariate normal distribution with mean b and covariance B, and I_d is the d × d identity matrix.

2.1 Contextual Bandits with Binary Rewards

We consider contextual bandits with binary rewards and a finite, but possibly large, number of arms K. These models belong to the class of generalized linear bandits with binary rewards [15]. Let A be the set of arms. At each time step t, the learner observes contexts x_{t,a} ∈ R^d, where d is the number of features per arm.
The learner then chooses an arm a_t and receives a reward r_t ∈ {0, 1}. The expectation of this reward is related to the context through a parameter θ* ∈ R^d and a logistic link function µ: E[r|x] = µ(x^⊤θ*), where µ(z) = exp(z)/(1 + exp(z)).

For example, in a news article recommendation setting, the recommendation algorithm (learner) has access to a discrete number of news articles (arms) A and interacts with users across discrete trials t = 1, 2, . . ., where the logistic reward is whether or not the user clicks on the recommended article. The articles and the users are characterized by attributes (context), such as genre and popularity (articles), or age and gender (users). At trial t, the learner observes the current user u_t, the available articles a ∈ A, and the corresponding contexts x_{t,a}. The context is a d-dimensional summary of both the user's and the available articles' attributes. At each time point, the goal of the learner is to provide the user with an article recommendation (arm choice) that they then may choose to click (reward of 1) or not (reward of 0). The relationship between rewards and contexts is mediated through an underlying coefficient vector θ*, which can be interpreted as an encoding of the users' preferences with respect to the various context features of the articles.

Formally, let D_t be the set of triplets (x_{i,a_i}, a_i, r_i) for i = 1, . . . , t, representing the past t observations of the contexts, the actions chosen, and their corresponding rewards. The objective of the learner is to minimize the cumulative regret given D_{t−1} after a fixed budget of t steps.
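This reward model (a logistic link applied to a linear function of the context) can be sketched as follows. The dimensions, the random contexts, and the helper names are illustrative assumptions, not part of the paper.

```python
import math
import random

def mu(z):
    """Logistic link: E[r | x] = mu(x^T theta*)."""
    return 1.0 / (1.0 + math.exp(-z))

def make_bandit(K=5, d=3, seed=1):
    """Draw a hypothetical true parameter theta* and K fixed arm contexts."""
    rng = random.Random(seed)
    theta_star = [rng.gauss(0, 1) for _ in range(d)]
    contexts = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
    return theta_star, contexts

def expected_rewards(theta_star, contexts):
    """mu(x_a^T theta*) for each arm a; a binary reward for arm a is then
    drawn as Bernoulli of this value."""
    return [mu(sum(xj * tj for xj, tj in zip(x, theta_star))) for x in contexts]

theta_star, contexts = make_bandit()
p = expected_rewards(theta_star, contexts)
oracle = max(p)  # per-round expected reward of the optimal arm
# Pulling arm a in a round contributes (oracle - p[a]) to the expected regret.
```

The cumulative regret defined next is just this per-round gap summed over the horizon.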
The regret is the expected difference between the optimal reward received by always playing the optimal arm a* and the reward received following the actual arm choices made by the learner:

    R_t = Σ_{i=1}^{t} [ µ(x_{i,a*}^⊤ θ*) − µ(x_{i,a_i}^⊤ θ*) ].    (1)

The parameter θ is reestimated after each round t using a generalized linear model estimator [15]. The point estimate of the coefficient at round t, θ_t, can be computed using approaches for online convex optimization [18, 19]. However, these approaches scale exponentially with the context dimension d, leading to computationally intractable solutions for many real-world contextual logistic bandit problems [19, 25].

2.2 Thompson Sampling for Contextual Logistic Bandits

TS provides a flexible and computationally tractable framework for inference in contextual logistic bandits. TS for the contextual bandit is broadly defined in Bayesian terms, where a prior distribution p(θ) over the parameter θ is updated iteratively using a set of historical observations D_{t−1} = {(x_{i,a_i}, a_i, r_i) | i = 1, . . . , t − 1}. The posterior distribution p(θ|D_{t−1}) is calculated using Bayes' rule and is proportional to Π_{i=1}^{t−1} p(r_i | a_i, x_{i,a_i}, θ) p(θ). A random sample θ_t is drawn from this posterior, corresponding to a stochastic estimate of θ* after t steps. The optimal arm is then the arm offering the highest reward with respect to the current estimate θ_t. In other words, arm a is chosen with probability p(a_t = a | D_{t−1}), expressed formally as

    p(a_t = a | D_{t−1}) = ∫ 1_{A_t^max(θ_t)}(a) p(θ_t | D_{t−1}) dθ_t,    (2)

where A_t^max(θ_t) = argmax_{a′ ∈ A} E[r_t | a′, x_{t,a′}, θ_t] is the set of arms with maximum expected reward at step t if the true parameter were θ_t.

After t steps, the joint probability mass function over the rewards r_1, r_2, . . . , r_t observed upon taking actions a_1, a_2, . . . , a_t is Π_{i=1}^{t} p(r_i | a_i, x_{i,a_i}, θ_i), or

    Π_{i=1}^{t} µ(x_{i,a_i}^⊤ θ_i)^{r_i} [1 − µ(x_{i,a_i}^⊤ θ_i)]^{1−r_i},    (3)

where θ_1, θ_2, . . . , θ_t are the estimates of θ* at each trial up to t.

In the case of logistic regression for binary rewards, the posterior derived from this joint probability is intractable. Laplace-TS addresses this issue by approximating the posterior with a multivariate Gaussian distribution with a diagonal covariance matrix, following a Laplace approximation. The mean of this distribution is the maximum a posteriori estimate, and the inverse variance of each feature is the curvature [15].

Laplace approximations are effective in finding smooth densities peaked around their posterior modes, and are thus applicable to the logistic posterior, which is strictly exp-concave [18]. This approach has shown superior empirical performance versus UCB algorithms [9] and other TS-based approximation methods [30]. Laplace-TS is therefore an appropriate benchmark in the evaluation of contextual bandit algorithms using TS approaches.

3 Pólya-Gamma Augmentation for Logistic Contextual Bandits

The Laplace approximation leads to simple, iterative algorithms, which in the offline setting lead to accurate estimates across a potentially large number of sparse models [7]. In this section, we propose PG-TS, an alternative approach stemming from recent developments in augmentation for Bayesian inference in logit models [27, 31].

3.1 The Pólya-Gamma Augmentation Scheme

Consider a logit model with t binary observations r_i ∼ Bin(1, µ(x_i^⊤θ)), parameter θ ∈ R^d, and corresponding regressors x_i ∈ R^d, i = 1, . . . , t.
To estimate the posterior p(θ|D_t), classic MCMC methods use independent and identically distributed (i.i.d.) samples. Such samples can be challenging to obtain, especially if the dimension d is large [10]. To address this challenge, we reframe the discrete rewards as functions of latent variables with Pólya-Gamma (PG) distributions over a continuous space [27]. The PG latent variable construction relies on the theoretical properties of PG random variables to exploit the fact that the logistic likelihood is a mixture of Gaussians with PG mixing distributions [27, 12, 13].

Definition 1. Let X be a real-valued random variable. X follows a Pólya-Gamma distribution with parameters b > 0 and c ∈ R, written X ∼ PG(b, c), if

    X = (1 / (2π²)) Σ_{k=1}^{∞} G_k / [ (k − 1/2)² + c²/(4π²) ],

where the G_k ∼ Ga(b, 1) are independent gamma variables.

The identity central to the PG augmentation scheme [27] is

    (e^ψ)^a / (1 + e^ψ)^b = 2^{−b} e^{κψ} ∫_0^∞ e^{−ωψ²/2} p(ω) dω,    (4)

where ψ ∈ R, a ∈ R, b > 0, κ = a − b/2, and ω ∼ PG(b, 0). When ψ = x_t^⊤θ, this identity allows us to write the logistic likelihood contribution of step t as

    L_t(θ) = [exp(x_t^⊤θ)]^{r_t} / (1 + exp(x_t^⊤θ)) ∝ exp(κ_t x_t^⊤θ) ∫_0^∞ exp[−ω_t (x_t^⊤θ)²/2] p(ω_t; 1, 0) dω_t,

where κ_t = r_t − 1/2 and p(ω_t; 1, 0) is the density of a PG-distributed random variable with parameters (1, 0). In turn, the conditional posterior of θ given latent variables ω = [ω_1, . . . , ω_t] and past rewards r = [r_1, . . . , r_t] is a conditional Gaussian:

    p(θ | ω, r) ∝ p(θ) Π_{i=1}^{t} L_i(θ | ω_i) ∝ p(θ) Π_{i=1}^{t} exp{ −(ω_i/2) (x_i^⊤θ − κ_i/ω_i)² }.

With a multivariate Gaussian prior θ ∼ MVN(b, B), this identity leads to an efficient Gibbs sampler. The main parameters are drawn from a Gaussian distribution, which is parameterized with latent variables drawn from the PG distribution [27]. The two steps are:

    (ω_i | θ) ∼ PG(1, x_i^⊤θ),    (5)
    (θ | r, ω) ∼ MVN(m_ω, V_ω),    (6)

with V_ω = (X^⊤ΩX + B^{−1})^{−1} and m_ω = V_ω(X^⊤κ + B^{−1}b), where κ = [κ_1, . . . , κ_t].

Conveniently, efficient algorithms for sampling from the PG distribution exist [27]. Based on ideas from Devroye [12, 13], which avoid the need to truncate the infinite sum in Eq. (4), the algorithm relies on an accept-reject strategy whose proposal distribution requires only exponential, uniform, and Gaussian random variables. With an acceptance probability uniformly lower bounded by 0.9992 (at most 9 rejected draws out of every 10,000 proposed), the resulting algorithm is more efficient than all previously proposed augmentation schemes in terms of both effective sample size and effective sampling rate [27]. Furthermore, the PG sampling procedure leads to a uniformly ergodic mixture transition distribution of the iterative estimates {θ_i}_{i=0}^∞ [10]. This result guarantees the existence of central limit theorems for sample averages involving {θ_i}_{i=0}^∞ and allows for consistent estimators of the asymptotic variance. The advantage of PG augmentation has been proven in multiple Gibbs sampling and variational inference approaches, including binomial models [27], multinomial models [24], and negative binomial regression models with logit link functions [38, 31].
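The two conditional updates in Eqs. (5)-(6) can be sketched as follows. This is an illustrative implementation only: the PG(1, c) draw truncates the infinite sum of Definition 1 rather than using the exact Devroye-style accept-reject sampler referenced above, and the dimensions and synthetic data are assumptions made for the example.

```python
import numpy as np

def sample_pg1(c, rng, n_terms=200):
    """Approximate draw from PG(1, c) by truncating the sum in Definition 1.

    For b = 1, G_k ~ Ga(1, 1) is a standard exponential. The paper relies on
    an exact Devroye-style sampler instead; truncation is a simple stand-in.
    """
    k = np.arange(1, n_terms + 1)
    g = rng.exponential(1.0, size=n_terms)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))).sum() / (2 * np.pi ** 2)

def gibbs_sweep(theta, X, r, b, B_inv, rng):
    """One sweep of the PG Gibbs sampler of Eqs. (5)-(6).

    Draws omega_i | theta ~ PG(1, x_i^T theta) for each observation, then
    theta | r, omega ~ MVN(m_omega, V_omega) with
    V_omega = (X^T Omega X + B^{-1})^{-1},
    m_omega = V_omega (X^T kappa + B^{-1} b).
    """
    psi = X @ theta
    omega = np.array([sample_pg1(c, rng) for c in psi])
    kappa = r - 0.5
    V = np.linalg.inv(X.T @ (omega[:, None] * X) + B_inv)
    m = V @ (X.T @ kappa + B_inv @ b)
    return rng.multivariate_normal(m, V)

# Tiny usage sketch on synthetic logistic data (sizes are illustrative).
rng = np.random.default_rng(0)
d, t = 3, 50
theta_true = np.array([1.0, -1.0, 0.5])
X = rng.normal(size=(t, d))                              # rows are x_i^T
r = (rng.random(t) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)
b, B_inv = np.zeros(d), np.eye(d)                        # prior MVN(b, B), B = I
theta = np.zeros(d)
for _ in range(100):
    theta = gibbs_sweep(theta, X, r, b, B_inv, rng)
```

Because both conditionals are drawn exactly, each sweep is a valid Gibbs step; the chain's samples approximate draws from p(θ | D_t).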
In the next section, we leverage its strengths to perform online, fully Bayesian inference for logistic contextual bandits with state-of-the-art performance.

3.2 PG-TS Algorithm Definition

Our algorithm, PG-TS, uses the PG augmentation scheme to represent the binomial distributions of the sequential rewards in terms of latent variables with Gaussian distributions, allowing tractable Bayesian logistic regression in a Thompson sampling setting.

We consider a multivariate Gaussian distribution over the parameter θ ∼ MVN(b, B) with prior mean b and covariance B. For simplicity, let X_t be the t × d design matrix whose rows are the contexts of the arms chosen up to round t. Ω_t is the diagonal matrix of the PG auxiliary variables [ω_1, . . . , ω_t], and let κ_t = [r_1 − 1/2, . . . , r_t − 1/2]. Further, let r_t = [r_1, . . . , r_t] be the history of rewards.

The PG-TS algorithm uses a Gibbs sampler based on the PG augmentation scheme to approximate the logistic likelihood corresponding to observations up to the current step. At each step, sampling from the posterior is exact. The ergodicity of the sampler guarantees that, as the number of trials increases, the algorithm consistently estimates both the mean and the variance of the parameter θ [36].

Algorithm 1 PG-TS
Input: b, B, M; D = ∅; θ_0 ∼ MVN(b, B)
for t = 1, 2, . . . do
    Receive contexts x_{t,a} ∈ R^d
    θ_t^(0) ← θ_{t−1}
    for m = 1 to M do
        for i = 1 to t − 1 do
            ω_i | θ_t^(m−1) ∼ PG(1, x_{i,a_i}^⊤ θ_t^(m−1))
        end for
        Ω_{t−1} = diag(ω_1, ω_2, . . . , ω_{t−1})
        κ_{t−1} = [r_1 − 1/2, . . . , r_{t−1} − 1/2]^⊤
        V_ω ← (X_{t−1}^⊤ Ω_{t−1} X_{t−1} + B^{−1})^{−1}
        m_ω ← V_ω (X_{t−1}^⊤ κ_{t−1} + B^{−1} b)
        θ_t^(m) | r_{t−1}, ω ∼ MVN(m_ω, V_ω)
    end for
    θ_t ← θ_t^(M)
    Select arm a_t ← argmax_a µ(x_{t,a}^⊤ θ_t)
    Observe reward r_t ∈ {0, 1}
    D ← D ∪ {(x_{t,a_t}, a_t, r_t)}
end for

We sample from the PG distribution [24, 27] using M = 100 burn-in steps. This number is empirically tuned, as evaluating how close a sampled θ_t is to the true GLM estimator θ_GLM as a function of the burn-in length M is a challenging problem. Thus, frequentist-style regret bounds cannot be derived for the PG-based algorithm, unlike for other formulations of this problem [2]. In our empirical studies, we find PG-TS with M = 100 sufficient for reliable mixing, as measured by the competitive regret achieved. When M = 1, the resulting algorithm, PG-TS-stream, is reminiscent of a streaming Gibbs inference scheme. In practice, this leads to a faster algorithm. As shown in the Results, PG-TS-stream shows competitive performance in terms of cumulative rewards in both simulated and real-world data scenarios.

4 Results of PG-TS for contextual bandit applications

We evaluate and compare our PG-TS method with Laplace-TS, which has been shown to outperform its UCB competitors in all settings considered here [9]. We evaluate our algorithm in three scenarios: simulated data sets with parameters sampled from Gaussian and mixed Gaussian distributions, a toy data set based on the Forest Cover Type data set from the UCI repository [15], and an offline evaluation method for bandit algorithms that relies on real-world log data [23].

4.1 Generating Simulated Data

Figure 1: Comparison of the average cumulative regret of the PG-TS, PG-TS-stream, and Laplace-TS algorithms on the simulated data set with Gaussian θ* over 100 runs with 1,000 trials (standard deviation shown as shaded region).
Both PG-TS and\nPG-TS-stream outperform Laplace-TS in consis-\ntently achieving lower cumulative regret.\n\nGaussian simulation. We generated a simu-\nlated data set with 100 arms and 10 features per\ncontext across 1, 000 trials. We generated con-\ntexts xt,a \u2208 R10 from multivariate Gaussian dis-\ntributions xt,a \u223c M V N (\u22123, I10) for all arms\na. The true parameters were simulated from a\nmultivariate Gaussian with mean 0 and identity\ncovariance matrix, \u03b8\u2217 \u223c M V N (0, I10). The\nresulting reward associated with the optimal arm\nwas 0.994 and the mean reward was 0.195. We\nset the hyperparameters b = 0, and B = I10.\nWe averaged the experiments over 100 runs.\nWe \ufb01rst considered the effect of the burn-in pa-\nrameter M on the resulting average cumulative\nregret (Eq. 1; Fig. S1 Supplementary Material).\nAs expected, larger M led to lower regret, as\nthe Markov chain had more time to mix. We\nnote that M > 100 burn-in iterations was not\nnoticeably better than M = 100, while the com-\nputational time grew. Interestingly, the average cumulative regret of PG-TS-stream with M = 1 was\ncomparable to that of PG-TS. This suggests that, after a number of steps greater than the number of\niterations necessary for mixing, the sampler in PG-TS-stream has had suf\ufb01cient time to mix.\nIn this simulation, both PG-TS strategies outperformed their Laplace counterpart, which failed to\nconverge on average (Fig. 1). This behavior is expected: due to its simple Gaussian approximation,\nLaplace-TS does not always converge to the global optimum of the logistic likelihood in non-\nasymptotic settings.\nFurthermore, the PG-TS algorithms outperform Laplace-TS in terms of balancing exploration and\nexploitation: Laplace-TS gets stuck on sub-optimal arm choices, while PG-TS continues to explore\nrelative to the estimated variance of the posterior distribution of \u03b8 to \ufb01nd the optimal arm (Fig. 
2).

Mixture of Gaussians: Prior misspecification. Laplace approximations are sensitive to multimodality. We therefore explored a prior misspecification scenario, where the true parameter θ* is sampled from a four-component Gaussian mixture model, as opposed to the Gaussian distribution assumed by both algorithms. As before, we simulated a data set with 100 arms, each with 10 features, and marginally independent contexts x_{t,a} ∼ MVN(0, I_10), across 5,000 trials.

The true parameters were generated from a mixture model with variances σ_j² ∼ Inverse-Gamma(3, 1) and means µ_j ∼ N(−3, σ_j²) for j = 1, . . . , 4, and mixture weights φ ∼ Dirichlet(1, 3, 5, 7), such that θ*(i) ∼ Σ_{j=1}^{4} φ_j N(µ_j, σ_j²), with θ* = [θ*_1, θ*_2, . . . , θ*_10]. The reward associated with the optimal arm was 0.999 and the mean reward was 0.306. We found that the misspecified model does not prevent the PG-TS algorithms from consistently finding the correct arm, while Laplace-TS exhibits poor average behavior (Fig. S3 Supplementary Materials).

4.2 PG-TS applied to Forest Cover Type Data

We further compared these methods using the Forest Cover Type data from the UCI Machine Learning repository [8]. These data contain 581,012 labeled observations from regions of a forest area. The labels indicate the dominant species of trees (cover type) in each region. Following the preprocessing pipeline proposed by [15], we centered and standardized the 10 non-categorical variables and added a constant covariate; we then partitioned the 581,012 samples into k = 32 clusters using unsupervised mini-batch k-means clustering. We took the cluster centroids to be the contexts corresponding to each of our arms. To fit the logistic reward model, we binarized the label of each data point, assigning a reward of 1 to the first class, "Spruce/Fir", and a reward of 0 otherwise. We then set the reward for each arm to be the average reward of the data points in the corresponding cluster; these rewards ranged from 0.020 to 0.579. The task then becomes the problem of finding the cluster with the highest proportion of Spruce/Fir forest cover in a setting with 32 arms and 11 context features. As a baseline, we implemented the generalized linear model upper confidence bound algorithm (GLM-UCB) [15]. On this forest cover task, the PG-TS algorithms show improved cumulative regret with respect to both the Laplace-TS and the GLM-UCB procedures, with PG-TS the slightly better of the two (Fig. 3).

Figure 2: Comparison of arm choices for the PG-TS (Left) and Laplace-TS algorithms (Right) on simulated data with Gaussian θ* across 1,000 trials. Arms were sorted by expected reward in decreasing order, with arm 0 giving the highest reward and arm 99 the lowest. The selected arms are colored by the distance of their expected reward from the optimal reward (regret). Laplace-TS gets stuck on a sub-optimal arm, while PG-TS explores successfully and settles on the optimal one.

Figure 3: Left: comparison of the average cumulative regret of the PG-TS, PG-TS-stream, Laplace-TS, and GLM-UCB algorithms on the Forest Cover Type data over 100 runs with 1,000 trials (one standard deviation shaded). PG-TS significantly outperforms Laplace-TS and GLM-UCB, with slight improvement over PG-TS-stream. Right: median frequencies of the six best arms' draws. The arms were sorted by expected reward in decreasing order, with arm 0 giving the highest reward and arm 5 the lowest. PG-TS explores better than Laplace-TS, which gets stuck in a sub-optimal arm.
Both PG-TS and PG-TS-stream explored the arm space more successfully, and exploited high-reward arms with a higher frequency than their competitors (Fig. 3).

4.3 PG-TS Applied to News Article Recommendation

We evaluated the performance of PG-TS in the context of news article recommendation on the public benchmark Yahoo! Today Module data through an unbiased offline evaluation protocol [22]. As before, users are assumed to click on articles in an i.i.d. manner. The available articles represent the pool of arms, the binary payoff is whether a user clicks on a recommended article, and the expected payoff of an article is its click-through rate (CTR). Our goal is to choose the article with the maximum expected CTR at each visit, which is equivalent to maximizing the total expected reward. The full data set contains 45,811,883 user visits from the first 10 days of May 2009; for each user visit, the module features one article from a changing pool of K ≈ 20 articles, which the user either clicks (r = 1) or does not click (r = 0). For efficiency, we use 200,000 of these events in our evaluation; ≤ 24,000 of these are valid events for each of the evaluated algorithms.

Figure 4: Comparison of the average click-through rate (CTR) achieved by the PG-TS, PG-TS-stream, and Laplace-TS algorithms with a 10-minute delay (Left) and with varying delay (Right) on 24,000 events in the Yahoo! Today Module data set over 20 runs. Left: the moving average CTR, observed every 1,000 observations. Right: the standard deviation of the average CTR is shown. PG-TS achieves higher CTR across all delays, especially for short delays.
Each article is associated with a feature vector (context) x ∈ R^6 that includes a constant intercept feature, preprocessed using a conjoint analysis with a bilinear model [11]; note that we do not use user features as context. In this evaluation, we maintain separate estimates θ_a for each arm. We also update the model in batches (groups of observations across time delays) to better match the real-world scenario, where computation is expensive and delay is necessary. In all settings, PG-TS consistently and significantly outperforms the Laplace-TS approach (Fig. 4). In particular, PG-TS shows a significant improvement in CTR across all delays, and benefits most from short delays. Despite showing only marginal improvement over Laplace-TS, PG-TS-stream offers the advantage of a flexible, fast data-streaming approach without compromising performance on this task.

5 Discussion

We introduced PG-TS, a fully Bayesian algorithm based on the Pólya-Gamma augmentation scheme for contextual bandits with logistic rewards. This is the first method in which Pólya-Gamma augmentation is leveraged to improve bandit performance. Our approach addresses two deficiencies in current methods. First, PG-TS provides an efficient online approximation scheme for the analytically intractable logistic posterior. Second, because PG-TS explicitly estimates context feature covariances, it is more effective in balancing exploration and exploitation than Laplace-TS, which assumes independence of the context features.
We showed through extensive evaluation in both simulated\nand real-world data that our approach offers improved empirical performance while maintaining\ncomparable computational costs by leveraging the simplicity of the Thompson sampling framework.\nWe plan to extend our framework to address computational challenges in high-dimensional data via\nhash-amenable extensions [20].\nMotivated by our results and by recent work on PG inference in dependent multinomial models [24],\nwe aim to extend our work to the problem of multi-armed bandits with categorical rewards. We\nfurther envision a generalization of this approach to sampling in bandit problems where additional\nstructure is imposed on the contexts; for example, settings where arm contexts are sampled from\ndynamic linear topic models [17], or settings in which social network information is available for\nusers and contexts [16].\nFuture work will address the more general reinforcement learning setting of Bayes-Adaptive MDP\nwith discrete state and action sets [14]. In this case, the state transition probabilities are multinomial\ndistributions; therefore, our online P\u00f3lya-Gamma Gibbs sampling procedure can be extended to\napproximate the emerging intractable posteriors.\n\n8\n\n\fAcknowledgments\n\nWe would like to thank Scott Linderman, Diana Cai, and Jean Feng for insightful discussions and\ntheir helpful feedback. Lastly, we thank all the anonymous reviewers for their valuable comments.\n\nReferences\n[1] N. Abe and A. Nakamura. Learning to optimally schedule internet banner advertisements. In\n\nICML, volume 99, pages 12\u201321, 1999.\n\n[2] M. Abeille and A. Lazaric. Linear Thompson Sampling Revisited. In AISTATS 2017-20th\n\nInternational Conference on Arti\ufb01cial Intelligence and Statistics, 2017.\n\n[3] R. Agrawal. Sample mean based index policies by o (log n) regret for the multi-armed bandit\n\nproblem. Advances in Applied Probability, 27(4):1054\u20131078, 1995.\n\n[4] S. Agrawal and N. 
Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[5] S. Agrawal and N. Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM), 64(5):30, 2017.

[6] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[7] R. F. Barber, M. Drton, and K. M. Tan. Laplace approximation in high-dimensional Bayesian regression. In Statistical Analysis for High-Dimensional Data, pages 15–36. Springer, 2016.

[8] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth. The UCI KDD archive of large data sets for data mining research and experimentation. ACM SIGKDD Explorations Newsletter, 2(2):81–85, 2000.

[9] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

[10] H. M. Choi and J. P. Hobert. The Pólya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics, 7:2054–2064, 2013.

[11] W. Chu, S.-T. Park, T. Beaupre, N. Motgi, A. Phadke, S. Chakraborty, and J. Zachariah. A case study of behavior-driven conjoint analysis on Yahoo! Front Page Today module. In Proc. of KDD, 2009.

[12] L. Devroye. Introduction. In Non-Uniform Random Variate Generation, pages 1–26. Springer, 1986.

[13] L. Devroye. On exact simulation algorithms for some distributions related to Jacobi theta functions. Statistics & Probability Letters, 79(21):2251–2259, 2009.

[14] M. O. Duff. Design for an optimal probe. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 131–138, 2003.

[15] S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case.
In Advances in Neural Information Processing Systems, pages 586–594, 2010.

[16] C. Gentile, S. Li, and G. Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757–765, 2014.

[17] C. Glynn, S. T. Tokdar, D. L. Banks, and B. Howard. Bayesian analysis of dynamic linear topic models. arXiv preprint arXiv:1511.03947, 2015.

[18] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[19] E. Hazan, T. Koren, and K. Y. Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197–209, 2014.

[20] K.-S. Jun, A. Bhargava, R. Nowak, and R. Willett. Scalable generalized linear bandits: Online computation and hashing. arXiv preprint arXiv:1706.00136, 2017.

[21] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.

[22] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

[23] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.

[24] S. Linderman, M. Johnson, and R. P. Adams. Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation. In Advances in Neural Information Processing Systems, pages 3456–3464, 2015.

[25] H. B. McMahan and M. Streeter. Open problem: Better bounds for online logistic regression. In Conference on Learning Theory, pages 44-1, 2012.

[26] I.
Osband and B. Van Roy. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.

[27] N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013.

[28] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

[29] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.

[30] D. Russo, B. Van Roy, A. Kazerouni, and I. Osband. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.

[31] J. Scott and J. W. Pillow. Fully Bayesian inference for neural models with negative-binomial spiking. In Advances in Neural Information Processing Systems, pages 1898–1906, 2012.

[32] M. Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943–950, 2000.

[33] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.

[34] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[35] I. Urteaga and C. H. Wiggins. Bayesian bandits: balancing the exploration-exploitation tradeoff via double sampling. arXiv preprint arXiv:1709.03162, 2017.

[36] J. Windle, N. G. Polson, and J. G. Scott. Sampling Pólya-Gamma random variates: alternate and approximate techniques. arXiv preprint arXiv:1405.0506, 2014.

[37] M. Woodroofe. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.

[38] M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression.
In Proceedings of the 29th International Conference on Machine Learning, volume 2012, page 1343. NIH Public Access, 2012.