{"title": "Thompson Sampling and Approximate Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 8804, "page_last": 8813, "abstract": "We study the effects of approximate inference on the performance of Thompson sampling in the $k$-armed bandit problems. Thompson sampling is a successful algorithm for online decision-making but requires posterior inference, which often must be approximated in practice. We show that even small constant inference error (in $\\alpha$-divergence) can lead to poor performance (linear regret) due to under-exploration (for $\\alpha<1$) or over-exploration (for $\\alpha>0$) by the approximation. While for $\\alpha > 0$ this is unavoidable, for $\\alpha \\leq 0$ the regret can be improved by adding a small amount of forced exploration even when the inference error is a large constant.", "full_text": "Thompson Sampling with Approximate Inference\n\nCollege of Information and Computer Science\n\nMy Phan\n\nUniversity of Massachusetts\n\nAmherst, MA\n\nmyphan@cs.umass.edu\n\nYasin Abbasi-Yadkori\n\nVinAI\n\nHanoi, Vietnam\n\nyasin.abbasi@gmail.com\n\nJustin Domke\n\nCollege of Information and Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA\n\ndomke@cs.umass.edu\n\nAbstract\n\nWe study the effects of approximate inference on the performance of Thompson\nsampling in the k-armed bandit problems. Thompson sampling is a successful\nalgorithm for online decision-making but requires posterior inference, which often\nmust be approximated in practice. 
We show that even a small constant inference error (in α-divergence) can lead to poor performance (linear regret) due to under-exploration (for α < 1) or over-exploration (for α > 0) by the approximation. While for α > 0 this is unavoidable, for α ≤ 0 the regret can be improved by adding a small amount of forced exploration even when the inference error is a large constant.\n\n1 Introduction\n\nThe stochastic k-armed bandit problem is a sequential decision-making problem where at each time-step t, a learning agent chooses an action (arm) among k possible actions and observes a random reward. Thompson sampling (Russo et al., 2018) is a popular approach in bandit problems based on sampling from a posterior in each round. It has been shown to have good performance in terms of both frequentist regret and Bayesian regret for the k-armed bandit problem under certain conditions.\nThis paper investigates Thompson sampling when only an approximate posterior is available. This is motivated by the fact that in complex models, approximate inference methods such as Markov Chain Monte Carlo or Variational Inference must be used. Along this line, Lu & Van Roy (2017) propose a novel inference method – Ensemble sampling – and analyze its regret for linear contextual bandits. To the best of our knowledge this is the most closely related theoretical analysis of Thompson sampling with approximate inference.\nThis paper analyzes the regret of Thompson sampling with approximate inference. Rather than considering a particular inference algorithm, we parameterize the error using the α-divergence, a typical measure of inference accuracy. 
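To make this error measure concrete before its formal definition in Section 2.3, the following is a small numerical sketch (ours, not part of the paper) that evaluates the α-divergence between two one-dimensional Gaussians by simple quadrature. The function names (`gauss_pdf`, `alpha_div`) and grid settings are our own choices; the formula is the one given later in Eq. (3), and we restrict the demonstration to α in (0, 1), where the integrand is well behaved.

```python
import numpy as np

# Sketch (not from the paper): the alpha-divergence
#   D_alpha(P, Q) = (1 - \int p(x)^alpha q(x)^(1-alpha) dx) / (alpha (1 - alpha))
# evaluated numerically for two 1-D Gaussians.

def gauss_pdf(x, mu, sigma):
    """Density of Norm(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def alpha_div(alpha, mu_p, sigma_p, mu_q, sigma_q, grid=200001, lim=20.0):
    """D_alpha(P, Q) by a fine Riemann sum; accurate since the integrand decays fast."""
    x = np.linspace(-lim, lim, grid)
    dx = x[1] - x[0]
    integrand = gauss_pdf(x, mu_p, sigma_p) ** alpha * gauss_pdf(x, mu_q, sigma_q) ** (1.0 - alpha)
    return (1.0 - integrand.sum() * dx) / (alpha * (1.0 - alpha))

# alpha = 0.5 gives four times the squared Hellinger distance.
print(alpha_div(0.5, 0.0, 1.0, 1.0, 1.0))
```

At α = 0.5 the value agrees with the closed-form Hellinger expression for Gaussians, consistent with the list of special cases in Section 2.3.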
Our contributions are as follows:\n\n• Even small inference errors can lead to linear regret with naive Thompson sampling. Given any error threshold ε > 0 and any α, we show that approximate posteriors with error at most ε in α-divergence at all times can result in linear regret (both frequentist and Bayesian). For α > 0 and for any reasonable prior, we show linear regret due to over-exploration by the approximation (Theorem 1, Corollary 1). For α < 1 and for priors satisfying certain conditions, we show linear regret due to under-exploration by the approximation, which prevents the posterior from concentrating (Theorem 2, Corollary 2).\n\n• Forced exploration can restore sub-linear regret. For α ≤ 0 we show that adding forced exploration to Thompson sampling can make the posterior concentrate and restore sub-linear regret (Theorem 3) even when the error threshold is a very large constant. We illustrate this effect by showing that the performances of Ensemble sampling (Lu & Van Roy, 2017) and mean-field Variational Inference (Blei et al., 2017) can be improved in this way either theoretically (Section 5.1) or in simulations (Section 6).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Background and Notations.\n\n2.1 The k-armed Bandit Problem.\n\nWe consider the k-armed bandit problem parameterized by the mean reward vector m = (m1, ..., mk) ∈ R^k, where mi denotes the mean reward of arm (action) i. At each round t, the learner chooses an action At and observes the outcome Yt which, conditioned on At, is independent of the history up to and not including time t, H_{t-1} = (A1, Y1, ..., A_{t-1}, Y_{t-1}). For a time horizon T, the goal of the algorithm π is to maximize the expected cumulative reward up to time T.\nLet Ω ⊆ R^k be the domain of the mean and let Ωi ⊆ Ω denote the region where the ith arm has the largest mean. 
Let the function A∗ : Ω → {a1, ..., ak} denoting the best action be defined as: A∗(m) = i if m ∈ Ωi.\nIn the frequentist setting we assume that there exists a true mean m∗ which is fixed and unknown to the learner. Therefore, a policy π∗ that always chooses A∗(m∗) will obtain the highest reward. The performance of a policy π is measured by its expected regret compared to the optimal policy π∗, which is defined as:\n\nRegret(T, π, m∗) = T · m∗_{A∗(m∗)} − E[ Σ_{t=1}^T m∗_{At} ] .\n\n(1)\n\nOn the other hand, in the Bayesian setting, an agent expresses her beliefs about the mean vector in terms of a prior Π0, and therefore, the mean is treated as a random variable M = (M1, ..., Mk) distributed according to the prior Π0. The Bayesian regret is the expectation of the regret under the prior of the parameter M:\n\nBayesRegret(T, π) = E_{Π0}[ Regret(T, π, M) ] .\n\n(2)\n\n2.2 Thompson Sampling with Approximate Inference\n\nIn the frequentist setting, in order to perform Thompson sampling we define a prior which is only used in the algorithm. On the other hand, in the Bayesian setting the prior is given.\nLet Πt be the posterior distribution of M | H_{t-1} with density function πt(m). Thompson sampling obtains a sample m̂ from Πt and then selects arm At as follows: At = i if m̂ ∈ Ωi. In each round, we assume an approximate sampling method is available that generates a sample from an approximate distribution Qt. We use qt to denote the density function of Qt.\nPopular approximate sampling methods include Markov Chain Monte Carlo (MCMC) (Andrieu et al., 2003), Sequential Monte Carlo (Doucet & Johansen, 2009) and Variational Inference (VI) (Blei et al., 2017). 
There are packages that conveniently implement VI and MCMC methods, such as Stan (Carpenter et al., 2017), Edward (Tran et al., 2016), PyMC (Salvatier et al., 2016) and Infer.NET (Minka et al., 2018).\nTo provide a general analysis of approximate sampling methods, we will use the α-divergence (Section 2.3) to quantify the distance between the posterior Πt and the approximation Qt.\n\n2.3 The Alpha Divergence\n\nThe α-divergence between two distributions P and Q with density functions p(x) and q(x) is defined as:\n\nD_α(P, Q) = (1 − ∫ p(x)^α q(x)^{1−α} dx) / (α(1 − α)) .\n\n(3)\n\nThe α-divergence generalizes many divergences, including KL(Q, P) (α → 0), KL(P, Q) (α → 1), the Hellinger distance (α = 0.5) and the χ² divergence (α = 2), and is a common way to measure errors in inference methods. MCMC errors are measured by the Total Variation distance, which can be upper bounded by the KL divergence using Pinsker's inequality (α = 0 or α = 1). Variational Inference tries to minimize the reverse KL divergence (information projection) between the target distribution and the approximation (α = 0). Ensemble sampling (Lu & Van Roy, 2017) provides error guarantees using the reverse KL divergence (α = 0). Expectation Propagation tries to minimize the KL divergence (α = 1) and χ² Variational Inference tries to minimize the χ² divergence (α = 2).\n\nFigure 1: The Gaussian Q which minimizes D_α(P, Q) for different values of α, where the target distribution P is a mixture of two Gaussians. Based on Figure 1 from (Minka, 2005).\n\nWhen α is small, the approximation fits the posterior's dominant mode. When α is large, the approximation covers the posterior's entire support (Minka, 2005), as illustrated in Figure 1. Therefore changing α will affect the exploration-exploitation trade-off in bandit problems.\n\n2.4 Problem Statement.\n\nProblem Statement. 
For the k-armed bandit problem, given α and ε > 0, if at all time-steps t we sample from an approximate distribution Qt such that D_α(Πt, Qt) < ε, will the regret be sub-linear in t?\n\n3 Motivating Example\n\nIn this section we present a simple example to show the effects of inference errors on the frequentist regret.\nExample. Consider a 2-armed bandit problem where the reward distributions are Norm(0.6, 0.2²) and Norm(0.5, 0.2²) for arms 1 and 2 respectively. The prior Π0 is Norm(µ0, 0.5²I), where µ0 = [0.1, 0.9]^T is the vector of prior means of arms 1 and 2 respectively, and I denotes the identity matrix. Let Πt = Norm(µt, Σt) be the posterior at time t. Approximations Qt and Zt are calculated such that KL(Πt, Qt) = 2 and KL(Zt, Πt) = 1.5 by multiplying the covariance Σt by a constant: Qt = Norm(µt, 4.5²Σt) and Zt = Norm(µt, 0.3²Σt). The KL divergence between two Gaussian distributions is provided in Appendix F.\nWe perform the following simulations 1000 times and plot the mean cumulative regret up to time T = 100 in Figure 2b using three different policies:\n\n1. (Exact Thompson Sampling) At each time-step t, sample from the true posterior Πt.\n2. (Approximation Qt) At each time-step t, compute Qt from Πt and sample from Qt.\n3. (Approximation Zt) At each time-step t, compute Zt from Πt and sample from Zt.\n\nThe regrets of sampling from the approximations Qt and Zt are in both cases larger than that of exact Thompson sampling. Intuitively, the regret of Qt is larger because Qt explores more than the true posterior (Figure 2a). In Section 4 we show that when α > 0 the approximation can incur this type of error, leading to linear regret. On the other hand, the regret of Zt is larger because Zt explores less than the exact Thompson sampling algorithm and therefore commits to the sub-optimal arm (Figure 2a). In Section 5 we show that when α < 1 the approximation can change the posterior concentration rate, leading to linear regret. We also show that adding a uniform sampling step can help the posterior to concentrate when α ≤ 0, and make the regret sub-linear.\n\n(a) Over-dispersed (approximation Qt) and under-dispersed (approximation Zt) sampling yield different posteriors after T = 100 time-steps. m1 and m2 are the means of arms 1 and 2. Qt picks arm 2 more often than exact Thompson sampling and Zt mostly picks arm 2. The posteriors of exact Thompson sampling and Qt concentrate mostly in the region where m1 > m2 while Zt's spans both regions.\n\n(b) The regrets of sampling from the approximations Qt and Zt are both larger than that of exact Thompson sampling from the true posterior Πt. Shaded regions show 95% confidence intervals.\n\nFigure 2: Approximation Qt (with high variance) and approximation Zt (with small variance) are defined in Section 3, where D_1(Πt, Qt) = 2 and D_0(Πt, Zt) = 1.5. Arm 1 is the true best arm.\n\n4 Regret Analysis When α > 0\n\nIn this section we analyze the regret when α > 0. Our result shows that the approximate method might pick the sub-optimal arm with constant probability in every time-step, leading to linear regret.\nTheorem 1 (Frequentist Regret). Let α > 0, the number of arms be k = 2 and m∗1 > m∗2. Let Π0 be a prior where P_Π0(M2 > M1) > 0. For any error threshold ε > 0, there is a deterministic mapping f(Π) such that for all t ≥ 0:\n\n1. Sampling from Qt = f(Πt) chooses arm 2 with a constant probability.\n2. D_α(Πt, Qt) < ε.\n\nTherefore sampling from Qt for T/10 time-steps and using any policy for the remaining time-steps will cause linear frequentist regret.\n\nTypically, approximate inference methods minimize divergences. 
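To illustrate the Section 3 example and the forced-exploration remedy analyzed in Section 5.1, here is a minimal self-contained simulation (our own sketch, not the authors' code). It compares exact Thompson sampling, an over-dispersed Qt-style sampler (posterior standard deviation inflated by 4.5), an under-dispersed Zt-style sampler (deflated by 0.3), and the Zt-style sampler with forced uniform exploration at rate 1/t. The bandit and prior parameters follow the example; the function name `run` and the run count are our choices.

```python
import numpy as np

def run(scale, explore=False, T=100, runs=200, seed=0):
    """Mean cumulative regret of (approximately sampled) Thompson sampling."""
    rng = np.random.default_rng(seed)
    true_means = np.array([0.6, 0.5])          # arm 1 (index 0) is optimal
    prior_mean = np.array([0.1, 0.9])          # prior favours the wrong arm
    prior_var, noise_var = 0.5 ** 2, 0.2 ** 2
    total = 0.0
    for _ in range(runs):
        n = np.zeros(2)                        # pull counts per arm
        s = np.zeros(2)                        # reward sums per arm
        for t in range(1, T + 1):
            # Conjugate Gaussian posterior for each arm (known reward variance).
            prec = 1.0 / prior_var + n / noise_var
            mu = (prior_mean / prior_var + s / noise_var) / prec
            sd = scale / np.sqrt(prec)         # (mis-)scaled posterior std
            if explore and rng.random() < 1.0 / t:
                a = int(rng.integers(2))       # forced uniform exploration
            else:
                a = int(np.argmax(rng.normal(mu, sd)))
            r = rng.normal(true_means[a], 0.2)
            n[a] += 1.0
            s[a] += r
            total += true_means[0] - true_means[a]
    return total / runs

exact = run(1.0)                 # exact Thompson sampling
over = run(4.5)                  # over-dispersed, like Q_t
under = run(0.3)                 # under-dispersed, like Z_t
fixed = run(0.3, explore=True)   # Z_t plus forced exploration (Theorem 3 style)
print(exact, over, under, fixed)
```

In runs with these settings, both mis-scaled samplers accumulate noticeably more regret than exact Thompson sampling, and forced exploration substantially reduces the under-dispersed sampler's regret, mirroring the qualitative behaviour in Figures 2b and 3a.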
Broadly speaking, this theorem shows that making a divergence a small constant, alone, is not enough to guarantee sub-linear regret. We do not mean to imply that low regret is impossible, but simply that making an α-divergence a small constant alone is not sufficient.\nAt every time-step, the mapping f constructs the approximation Qt from the posterior Πt by moving probability mass from the region Ω1 where m1 > m2 to the region Ω2 where m2 > m1. Then Qt will choose arm 2 with a constant probability at every time-step. The constant average regret per time-step is discussed in Appendix A.4.\nTherefore, if we sample from Qt = f(Πt) for 0.1T time-steps and use any policy in the remaining 0.9T time-steps, we will still incur linear regret from the 0.1T time-steps. On the other hand, when α ≤ 0, we show in Section 5.1 that sampling an arm uniformly at random for log T time-steps and sampling from an approximate distribution that satisfies the divergence constraint for T − log T time-steps will result in sub-linear regret.\n\nAgrawal & Goyal (2013) show that the frequentist regret of exact Thompson sampling is O(√T) with Gaussian or Beta priors and bounded rewards. Theorem 1 implies that when the assumptions in (Agrawal & Goyal, 2013) are satisfied but there is a small constant inference error at every time-step, the regret is no longer guaranteed to be sub-linear.\nIf the assumption m∗1 > m∗2 in Theorem 1 is satisfied with a non-zero probability (P_Π0(M1 > M2) > 0), the Bayesian regret will also be linear:\nCorollary 1 (Bayesian Regret). Let α > 0 and the number of arms be k = 2. Let Π0 be a prior where P_Π0(M1 > M2) > 0 and P_Π0(M2 > M1) > 0. 
Then for any error threshold ε > 0, there is a deterministic mapping f(Π) such that for all t ≥ 0 the two statements in Theorem 1 hold.\nTherefore sampling from Qt for T/10 time-steps and using any policy for the remaining time-steps will cause linear Bayesian regret.\n\nRusso & Roy (2016) prove that the Bayesian regret of Thompson sampling for k-armed bandits with sub-Gaussian rewards is O(√T). Corollary 1 implies that even when the assumptions in Russo & Roy (2016) are satisfied, under certain conditions and with approximation errors, the regret is no longer guaranteed to be sub-linear.\n\n5 Regret Analysis When α < 1\n\nIn this section we analyze the regret when α < 1. Our result shows that for any error threshold, if the posterior Πt places too much probability mass on the wrong arm then the approximation Qt is allowed to avoid the optimal arm. If the sub-optimal arms do not provide information about the arms' ranking, the posterior Πt+1 does not concentrate. Therefore Qt+1 is also allowed to be close in α-divergence while avoiding the optimal arm, leading to linear regret in the long term.\nTheorem 2 (Frequentist Regret). Let α < 1, the number of arms be k = 2 and m∗1 > m∗2. Let Π0 be a prior where M2 and M1 − M2 are independent. There is a deterministic mapping f(Π) such that for all t ≥ 0:\n\n1. Sampling from Qt = f(Πt) chooses arm 2 with probability 1.\n2. For any ε > 0, there exists 0 < z ≤ 1 such that if P_Π0(M2 > M1) = z and arm 2 is chosen at all times before t, then D_α(Πt, Qt) < ε.\n\nTherefore sampling from Qt at all time-steps results in linear frequentist regret.\nWe discuss why the above results are not immediately obvious. When α → 
0, the α-divergence becomes KL(Qt, Πt). We might believe that the regret should be sub-linear in this case because the posterior Πt becomes more concentrated, and so the total variation between Qt and Πt must decrease. For example, Ordentlich & Weinberger (2004) show the following distribution-dependent Pinsker's inequality between KL(Q, P) and the total variation TV(P, Q) for discrete distributions P and Q:\n\nKL(Q, P) ≥ φ(P) · TV(P, Q)² .\n\n(4)\n\nHere, φ(P) is a quantity that will increase to infinity if P becomes more concentrated. However, the algorithm in Theorem 2 constructs an approximate distribution that never picks the optimal arm, so the posterior Πt cannot concentrate and the regret is linear. The error threshold ε causing linear frequentist regret is correlated with the probability mass the prior places on the true best arm (Appendix B.4).\nWith some assumptions on the rewards, Gopalan et al. (2014) show that the problem-dependent frequentist regret is O(log T) for finitely-supported, correlated priors with π0(m∗) > 0. Liu & Li (2016) study the prior-dependent frequentist regret of 2-armed-and-2-models bandits, and show that with some smoothness assumptions on the reward likelihoods, the regret is O(√(T / P_Π0(M2 > M1))) if arm 1 is the better arm. Theorem 2 implies that when the assumptions in (Gopalan et al., 2014) or (Liu & Li, 2016) are satisfied, if M2 and M1 − M2 are independent and there are approximation errors, the regret is no longer guaranteed to be sub-linear.\nIf the assumption m∗1 > m∗2 in Theorem 2 is satisfied with a non-zero probability (P_Π0(M1 > M2) > 0), the Bayesian regret will also be linear:\nCorollary 2 (Bayesian Regret). Let α < 1 and the number of arms be k = 2. Let Π0 be a prior where P_Π0(M1 > M2) > 0, and M2 and M1 − M2 are independent. 
There is a deterministic mapping f(Π) such that for all t ≥ 0 the two statements in Theorem 2 hold.\nTherefore sampling from Qt at all time-steps results in linear Bayesian regret.\n\nRusso & Roy (2016) prove that the Bayesian regret of Thompson sampling for k-armed bandits with sub-Gaussian rewards is O(√T). Corollary 2 implies that even when the assumptions in Russo & Roy (2016) are satisfied, under certain conditions and with approximation errors, the regret is no longer guaranteed to be sub-linear.\nWe note that, unlike the case when α > 0, if we use another policy in o(T) time-steps to make the posterior concentrate and sample from Qt for the remaining time-steps, the regret can be sub-linear. We provide a concrete algorithm in Section 5.1 for the case when α ≤ 0.\n\n5.1 Algorithms with Sub-linear Regret for α ≤ 0\n\nIn the previous section, we saw that when α < 1, the approximation has linear regret because the posterior does not concentrate. In this section we show that when α ≤ 0, it is possible to achieve sub-linear regret even when ε is a very large constant by adding a simple exploration step to force the posterior to concentrate (the case of α > 0 cannot be improved, according to Theorem 1). We first look at the necessary and sufficient condition that will make the posterior concentrate, and then provide an algorithm that satisfies it. Russo (2016) and Qin et al. (2017) both show the following result under different assumptions:\nLemma 1 (Lemma 14 from Russo (2016)). Let m∗ ∈ R^k be the true parameter and let a∗ = A∗(m∗) be the true best arm. 
If for all arms i, Σ_{t=1}^∞ P(At = i | H_{t-1}) = ∞, then\n\nlim_{t→∞} P(A∗(M) = a∗ | H_{t-1}) = 1 with probability 1 .\n\n(5)\n\nIf there exists an arm i such that Σ_{t=1}^∞ P(At = i | H_{t-1}) < ∞, then lim inf_{t→∞} P(A∗(M) = i | H_{t-1}) > 0 with probability 1.\nRusso (2016) makes the following assumptions, which allow correlated priors:\nAssumption 1. Let the reward distributions be in the canonical one-dimensional exponential family with density p(y|m) = b(y) exp(m T(y) − A(m)), where b, T and A are known functions and A(m) is assumed to be twice differentiable. The parameter space Ω = (m_min, m_max)^k is a bounded open hyper-rectangle, the prior density is uniformly bounded with 0 < inf_{m∈Ω} π0(m) ≤ sup_{m∈Ω} π0(m) < ∞, and the log-partition function has a bounded first derivative with sup_{m∈[m_min, m_max]} |A′(m)| < ∞.\nQin et al. (2017) make the following assumptions:\nAssumption 2. Let the prior be an uncorrelated multivariate Gaussian. Let the reward distribution of arm i be Norm(mi, σ²) with a common known variance σ² but unknown mean mi.\n\nEven though we consider the error in sampling from the posterior distribution, the regret is a result of choosing the wrong arm. We define Π̄t as the posterior distribution of the best arm and Q̄t as the approximation of Π̄t, with density functions\n\nπ̄t(i) = P(A∗ = i | H_{t-1}) and q̄t(i) = P(At = i | H_{t-1}).\n\nWe now define an algorithm where each arm will be chosen infinitely often, satisfying the condition of Lemma 1.\nTheorem 3 (Bayesian and Frequentist Regret). Consider the case when Assumption 1 or 2 is satisfied. Let α ≤ 0 and pt = o(1) be such that Σ_{t=1}^∞ pt = ∞. 
For any number of arms k, any prior Π0 and any error threshold ε > 0, the following algorithm has o(T) frequentist regret: at every time-step t,\n\n• with probability 1 − pt, sample from an approximate posterior Qt such that D_α(Π̄t, Q̄t) < ε;\n• with probability pt, sample an arm uniformly at random.\n\nSince the Bayesian regret is the expectation of the frequentist regret over the prior, for any prior if the frequentist regret is sub-linear at all points the Bayesian regret will be sub-linear.\n\nThe following lemma shows that the error in choosing the arms is upper bounded by the error in choosing the parameters. Therefore whenever the condition D_α(Πt, Qt) < ε is satisfied, the condition D_α(Π̄t, Q̄t) < ε will be satisfied and Theorem 3 is applicable.\nLemma 2.\n\nD_α(Π̄t, Q̄t) ≤ D_α(Πt, Qt) .\n\nWe also note that we can achieve sub-linear regret even when ε is a very large constant. We revisit Eq. 4 to provide the intuition: KL(Q, P) ≥ φ(P) · TV(P, Q)². Here, φ(P) is a quantity that will increase to infinity if P becomes more concentrated. Hence, if KL(Qt, Πt) < ε for any constant ε and Πt becomes concentrated, the total variation TV(Qt, Πt) will decrease. Therefore, Qt will become concentrated, resulting in sub-linear regret.\nApplication. Lu & Van Roy (2017) propose an approximate sampling method called Ensemble sampling, in which they maintain a set of M models to approximate the posterior, and analyze its regret for linear contextual bandits when M is Ω(log(T)). For the k-armed bandit problem and when M is Θ(log(T)), Ensemble sampling satisfies the condition KL(Qt, Πt) < ε in Theorem 3 with high probability. 
In this case, Lu & Van Roy (2017) show a regret bound that scales linearly with T. We discuss in Appendix E how to apply Theorem 3 to get sub-linear regret with Ensemble sampling when M is Θ(log(T)).\n\n6 Simulations\n\nFor each approximation method we repeat the following simulations 1000 times and plot the mean cumulative regret, using five different policies:\n\n1. (Exact Thompson sampling) Use exact posterior sampling to choose an action and update the posterior (for reference).\n2. (Approximation method) Use the approximation method to choose an action and update the posterior. We use the approximation naively without any modification.\n3. (Forced Exploration) With a probability (the exploration rate), choose an action uniformly at random and update the posterior. Otherwise, use the approximation method to choose an action and update the posterior. This is the method suggested by Thm. 3.\n4. (Approximate Sample) Use the approximation method to choose an action. Use exact posterior sampling to update the posterior.\n5. (Approximate Update) Use exact posterior sampling to choose an action. Use the approximate method to update the posterior.\n\nThe last two policies are performed to understand how the approximation affects the posterior (discussed in Section 6.3). We update the posterior using the closed-form formula, given in Appendix G, for the case when both the prior and reward distribution are Gaussian.\n\n6.1 Adding Forced Exploration to the Motivating Example\n\nIn this section we revisit the example in Section 3. We apply Qt, Zt and Ensemble sampling with M = 2 models to the bandit problem described in the example. 
We set the exploration rate at time t to be 1/t and T = 100; we show the results in Figure 3a and discuss them in Section 6.3.\n\n6.2 Simulations of Ensemble Sampling and Variational Inference for 50-armed Bandits\n\nNow we add forced exploration to mean-field Variational Inference (VI) and Ensemble sampling with M = 5 models for a 50-armed bandit instance. We generate the prior and the reward distribution as follows: the prior is Norm(0, Σ0). To generate a positive semi-definite matrix Σ0, we generate a random matrix A of size (k, k) whose entries are uniformly sampled from [0, 1) and set Σ0 = A^T A / k. The true mean m∗ is sampled from the prior. The reward distribution of arm i is Norm(m∗i, 1).\nMean-field VI approximates the posterior by finding an uncorrelated multivariate Gaussian distribution Qt that minimizes KL(Qt, Πt). If the posterior is Πt = Norm(µt, Σt) then Qt has the closed-form solution Qt = Norm(µt, Diag(Σt^{-1})^{-1}), which we used to perform the simulations. We set the exploration rate at time t to be 50/t and T = 3000; we show the results in Figure 3b and discuss them in Section 6.3.\n\n(a) Applying approximations Qt, Zt and Ensemble sampling to the motivating example (Section 6.1).\n\n(b) Applying mean-field Variational Inference (VI) and Ensemble sampling on a 50-armed bandit (Section 6.2).\n\nFigure 3: Updating the posterior by exact Thompson sampling or adding forced exploration does not help the over-explored approximation Qt, but lowers the regrets of the under-explored approximations Zt, Ensemble sampling and mean-field VI. Shaded regions show 95% confidence intervals.\n\n6.3 Discussion\n\nWe observe in Figure 3a that the regret of Qt calculated from the posterior updated by exact Thompson sampling does not change significantly. Moreover, exact posterior sampling with the posterior updated by Qt has the same regret as exact Thompson sampling. 
These two observations imply that Qt has the same effect on the posterior as exact Thompson sampling. Therefore adding forced exploration is not helpful.\nOn the other hand, in Figures 3a and 3b the regrets of Zt, Ensemble sampling and mean-field VI calculated from the posterior updated by exact Thompson sampling decrease significantly. Moreover, exact posterior sampling with the posterior updated by the approximations has similar regret to using the approximations. This behaviour is likely because the approximation causes the posterior to concentrate in the wrong region.¹ In combination, these two observations suggest that these methods do not explore enough for the posterior to concentrate. Therefore adding forced exploration is helpful, which is compatible with the result in Theorem 3.\n\n¹Note that in the case where there are 2 arms (Figure 3a), exact posterior sampling with the posterior updated by the approximate method has slightly lower regret than naively using the approximate method. This is only because there are only 2 regions, so exact posterior sampling explores more than the approximation in the other region, which happens to be the correct one.\n\n7 Related Work\n\nThere have been many works on sub-linear Bayesian and frequentist regret for exact Thompson sampling; we discussed the most relevant ones in detail in Sections 4 and 5.\nEnsemble sampling (Lu & Van Roy, 2017) gives a theoretical analysis of Thompson sampling with one particular approximate inference method. Lu & Van Roy (2017) maintain a set of M models to approximate the posterior, and analyze its regret for linear contextual bandits when M is Ω(log(T)). For the k-armed bandit problem and when M is Θ(log(T)), Ensemble sampling satisfies the condition KL(Qt, Πt) < ε in Theorem 3 with high probability. 
In this case, the regret bound of Ensemble sampling scales linearly with T.\nWe show in Theorem 2 that when the constraint KL(Qt, Πt) < ε is satisfied, which implies by Lemma 2 that KL(Q̄t, Π̄t) < ε is satisfied, there can exist approximation algorithms that have linear regret in T. This result provides a linear lower bound, which is complementary to the linear regret upper bound of Ensemble sampling in (Lu & Van Roy, 2017). Moreover, we show in Appendix E that we can apply Theorem 3 to get sub-linear regret with Ensemble sampling with Θ(log(T)) models.\nIn reinforcement learning, there is a notion that certain approximations are "stochastically optimistic" and that this has implications for regret (Osband et al., 2016). This is similar in spirit to our analysis in terms of α-divergence, in that the characteristics of inference errors are important.\nThere have been a number of empirical works using approximate methods to perform Thompson sampling. Riquelme et al. (2018) implement variational inference, MCMC, Gaussian processes and other methods on synthetic and real-world data sets and measure the regret. Urteaga & Wiggins (2018) derive a variational method for contextual bandits. Kawale et al. (2015) use particle filtering to implement Thompson sampling for matrix factorization.\nFinally, if exact inference is not possible, it remains an open question whether it is better to use Thompson sampling with approximate inference, or to use a different bandit method that does not require inference with respect to the posterior. For example, Kveton et al. 
(2019) propose an algorithm based on the bootstrap.\n\n8 Conclusion\n\nIn this paper we analyzed the performance of approximate Thompson sampling when at each time-step t, the algorithm obtains a sample from an approximate distribution Qt such that the α-divergence between the true posterior and Qt remains at most a constant ε at all time-steps.\nOur results have the following implications. To achieve sub-linear regret, we can use α > 0 for only o(T) time-steps. Therefore we should use α ≤ 0 with forced exploration to make the posterior concentrate. This method theoretically guarantees sub-linear regret even when ε is a large constant.\n\nAcknowledgments\n\nWe thank Huy Le for providing the proof of Lemma 9.\n\nReferences\n\nAgrawal, S. and Goyal, N. Further optimal regret bounds for Thompson sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2013), volume 31 of Proceedings of Machine Learning Research, pp. 99–107. PMLR, 2013.\n\nAndrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003. ISSN 1573-0565. doi: 10.1023/A:1020281327116.\n\nBlei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017. doi: 10.1080/01621459.2017.1285773.\n\nCarpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.\n\nCichocki, A. and Amari, S. Families of alpha- beta- and gamma- divergences: Flexible and robust measures of similarities. Entropy, 12:1532–1568, 2010.\n\nDoucet, A. and Johansen, A. 
A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12:656–704, 2009.

Gopalan, A., Mannor, S., and Mansour, Y. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 100–108, Beijing, China, 22–24 Jun 2014. PMLR.

Kawale, J., Bui, H. H., Kveton, B., Tran-Thanh, L., and Chawla, S. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pp. 1297–1305. Curran Associates, Inc., 2015.

Kveton, B., Szepesvari, C., Vaswani, S., Wen, Z., Lattimore, T., and Ghavamzadeh, M. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3601–3610, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Liu, C.-Y. and Li, L. On the prior sensitivity of Thompson sampling. In Algorithmic Learning Theory, pp. 321–336, Cham, 2016. Springer International Publishing. ISBN 978-3-319-46379-7.

Lu, X. and Van Roy, B. Ensemble sampling. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3260–3268. Curran Associates, Inc., 2017.

Minka, T. Divergence measures and message passing. Technical Report MSR-TR-2005-173, January 2005.

Minka, T., Winn, J., Guiver, J., Zaykov, Y., Fabian, D., and Bronskill, J. Infer.NET 0.3, 2018. Microsoft Research Cambridge. http://dotnet.github.io/infer.

Ordentlich, E. and Weinberger, M. J. A distribution dependent refinement of Pinsker's inequality. In International Symposium on Information Theory (ISIT 2004), pp. 29, 2004.

Osband, I., Van Roy, B., and Wen, Z.
Generalization and exploration via randomized value functions. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pp. 2377–2386. JMLR.org, 2016.

Qin, C., Klabjan, D., and Russo, D. Improving the expected improvement algorithm. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 5381–5391. Curran Associates, Inc., 2017.

Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations (ICLR 2018), 2018.

Russo, D. Simple Bayesian algorithms for best arm identification. In 29th Annual Conference on Learning Theory (COLT 2016), volume 49 of Proceedings of Machine Learning Research, pp. 1417–1418. PMLR, 2016.

Russo, D. and Van Roy, B. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(68):1–30, 2016.

Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018. ISSN 1935-8237. doi: 10.1561/2200000070.

Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016.

Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.

Urteaga, I. and Wiggins, C. Variational inference for the multi-armed contextual bandit. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS 2018), volume 84 of Proceedings of Machine Learning Research, pp. 698–706.
PMLR, 2018.