{"title": "Bandit Learning with Positive Externalities", "book": "Advances in Neural Information Processing Systems", "page_first": 4918, "page_last": 4928, "abstract": "In many platforms, user arrivals exhibit a self-reinforcing behavior: future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit {\\em positive externalities}. We study multiarmed bandit (MAB) problems with positive externalities. We show that the self-reinforcing preferences may lead standard benchmark algorithms such as UCB to exhibit linear regret. We develop a new algorithm, Balanced Exploration (BE), which explores arms carefully to avoid suboptimal convergence of arrivals before sufficient evidence is gathered. We also introduce an adaptive variant of BE which successively eliminates suboptimal arms. We analyze their asymptotic regret, and establish optimality by showing that no algorithm can perform better.", "full_text": "Bandit Learning with Positive Externalities\n\nManagement Science and Engineering\n\nManagement Science and Engineering\n\nVirag Shah\n\nStanford University\n\nCalifornia, USA 94305\nvirag@stanford.edu\n\nJose Blanchet\n\nStanford University\n\nCalifornia, USA 94305\n\njblanche@stanford.edu\n\nRamesh Johari\n\nManagement Science and Engineering\n\nStanford University\n\nCalifornia, USA 94305\nrjohari@stanford.edu\n\nAbstract\n\nIn many platforms, user arrivals exhibit a self-reinforcing behavior: future user\narrivals are likely to have preferences similar to users who were satis\ufb01ed in the past.\nIn other words, arrivals exhibit positive externalities. We study multiarmed bandit\n(MAB) problems with positive externalities. We show that the self-reinforcing\npreferences may lead standard benchmark algorithms such as UCB to exhibit linear\nregret. 
We develop a new algorithm, Balanced Exploration (BE), which explores arms carefully to avoid suboptimal convergence of arrivals before sufficient evidence is gathered. We also introduce an adaptive variant of BE which successively eliminates suboptimal arms. We analyze their asymptotic regret, and establish optimality by showing that no algorithm can perform better.

1 Introduction

A number of different platforms use multiarmed bandit (MAB) algorithms today to optimize their service: e.g., search engines and information retrieval platforms; e-commerce platforms; and news sites. Many such platforms exhibit a natural self-reinforcement in the arrival process of users: future arrivals may be biased towards users who expect to have positive experiences based on the past outcomes of the platform. For example, if a news site generates articles that are liberal (resp., conservative), then it is most likely to attract additional users who are liberal (resp., conservative) [2]. In this paper, we study the optimal design of MAB algorithms when user arrivals exhibit such positive self-reinforcement.

We consider a setting in which a platform faces many types of users that can arrive. Each user type is distinguished by preferring a subset of the item types above all others. The platform is not aware of either the type of the user, or the item-user payoffs. Following the discussion above, arrivals exhibit positive externalities (also called positive network effects) among the users [13]: in particular, if one type of item generates positive rewards, users who prefer that type of item become more likely to arrive in the future.

Our paper quantifies the consequences of positive externalities for bandit learning in a benchmark model where the platform is unable to observe the user's type on arrival. In the model we consider, introduced in Section 3, there is a set of m arms.
A given arriving user prefers a subset of these arms over the others; in particular, all arms other than the preferred arms generate zero reward. A preferred arm a generates a Bernoulli reward with mean μ_a. To capture positive externalities, the probability

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

                                    α = 0      0 < α < 1                    α = 1                            α > 1
Lower Bound                         Ω(ln T)    Ω(T^{1−α} ln^α T)            Ω(ln² T)                         Ω(ln^α T)
UCB                                 O(ln T)    Ω(T)                         Ω(T)                             Ω(T)
Random-explore-then-commit          O(ln T)    Ω(T^{1−α} ln^{α/(1−α)} T)    Ω(T^{μ_b/(μ_b+θ_{a*}μ_{a*})})    Ω(T)
Balanced Exploration (BE)           Õ(ln T)    Õ(T^{1−α} ln^α T)            Õ(ln² T)                         Õ(ln^α T)
BE with Arm Elimination (BE-AE)     O(ln T)    O(T^{1−α} ln^α T)            O(ln² T)                         O(ln^α T)

Table 1: Total regret under different settings. Here a* = arg max_a μ_a, and b = arg max_{a≠a*} μ_a. For the Random-explore-then-commit algorithm, we assume that the initial bias θ_a for each arm a is a positive integer (cf. Section 3). The notation f(T) = Õ(g(T)) means there exists k > 0 such that f(T) = O(g(T) ln^k g(T)).

that a user preferring arm a arrives at time t is proportional to (S_a(t−1) + θ_a)^α, where S_a(t−1) is the total reward observed from arm a in the past and θ_a captures the initial conditions. The positive constant α captures the strength of the externality: when α is large the positive externality is strong. The platform aims to maximize cumulative reward up to time horizon T.
We evaluate our performance by measuring regret against an "offline" oracle that always chooses the arm a* = arg max_a μ_a. Because of the positive externality, this choice causes the user population to shift entirely to users preferring arm a* over time; in particular, the oracle achieves asymptotically optimal performance to leading order in T. We study the asymptotic scaling of cumulative regret against the oracle as T → ∞.

At the heart of this learning problem is a central tradeoff. On one hand, because of the positive externality, the platform operator is able to move the user population towards the profit-maximizing population. On the other hand, due to self-reinforcing preferences the impact of mistakes is amplified: if rewards are generated on suboptimal arms, the positive externality causes more users that prefer those arms to arrive in the future. We are able to explicitly quantify the impact of this tradeoff in our model.

Our main results are as follows.

Lower bound. In Section 4, we provide an explicit lower bound on the best achievable regret for each α. Strikingly, the optimal regret is structurally quite different from classical lower bounds for MAB problems; see Table 1. Its development sheds light on the key differences between MABs with positive externalities and those without.

Suboptimality of classical approaches. In Section 5, we show that the UCB algorithm is not only suboptimal, but in fact has positive probability of never obtaining a reward on the best arm a*, and thus obtains linear regret. This is because UCB does not explore sufficiently to find the best arm. However, we show that just exploring more aggressively is also insufficient; a random-explore-then-commit policy which explores in an unstructured fashion remains suboptimal. This demonstrates the need for a new approach to exploration.

Optimal algorithm.
In Section 6, we develop a new algorithmic approach towards optimizing the exploration-exploitation tradeoff. Interestingly, this algorithm is cautious in the face of uncertainty to avoid making long-lasting mistakes. Our algorithm, Balanced Exploration (BE), keeps the user population "balanced" during the exploration phase; by doing so, it exploits an arm only when there is sufficient certainty regarding its optimality. Its adaptive variant, Balanced Exploration with Arm Elimination (BE-AE), intelligently eliminates suboptimal arms while balancing exploration among the remainder. BE has the benefit of not depending on system parameters, while BE-AE uses such information (e.g., α). We establish their optimality by developing an upper bound on their regret for each α; this nearly matches the lower bound (for BE), and exactly matches the lower bound (for BE-AE).

Further, in Section 7 we provide simulation results to obtain quantitative insights into the relative performance of different algorithms. We conclude the paper by summarizing the main qualitative insights obtained from our work.

2 Related work

As noted above, our work incorporates positive externalities in user arrivals. Positive externalities are also referred to as positive network effects or positive network externalities. (Note that the phrase "network" is often used here, even when the effects do not involve explicit network connections between the users.) See [13], as well as [21, 20], for background. Positive externalities are extensively discussed in most standard textbooks on microeconomic theory; see, e.g., Chapter 11 of [17].

It is well accepted that online search and recommendation engines produce feedback loops that can lead to self-reinforcement of popular items [3, 6, 19, 9].
Our model captures this phenomenon by employing a self-reinforcing arrival process, inspired by classical urn processes [4, 12].

We note that the kind of self-reinforcing behavior observed in our model may be reminiscent of "herding" behavior in Bayesian social learning [7, 23, 1]. In these models, arriving Bayesian rational users take actions based on their own private information, and the outcomes experienced by past users. The central question in that literature is the following: do individuals base their actions on their own private information, or do they follow the crowd? By contrast, in our model it is the platform which takes actions, without directly observing preferences of the users.

If the user preferences are known, then a platform might choose to personalize its services to satisfy each user individually. This is the theme of much recent work on contextual bandits; see, e.g., [16, 22, 18] and [8] for a survey of early work. In such a model, it is important that either (1) enough observable covariates are available to group different users together as decisions are made; or (2) users are long-lived so that the platform has time to learn about them.

In contrast to contextual bandits, in our model the users' types are not known, and they are short-lived (one interaction per user). Of course, the reality of many platforms is somewhere in between: some user information may be available, though imperfect. We view our setting as a natural benchmark model for analysis of the impact of self-reinforcing arrivals. Through this lens, our work suggests that there are significant consequences for learning when the user population itself can change over time, an insight that we expect to be robust across a wide range of settings.

3 Preliminaries

In this section we describe the key features of the model we study.
We first describe the model, including a precise description of the arrival process that captures positive externalities. Next, we describe our objective: minimization of regret relative to the expected reward of a natural oracle policy.

3.1 Model

Arms and rewards. Let A = {1, ..., m} be the set of available arms. During each time t ∈ {1, 2, ...} a new user arrives and an arm is "pulled" by the platform; we denote the arm pulled at time t by I_t. We view pulling an arm as presenting the corresponding option to the newly arrived user. Each arriving user prefers a subset of the arms, denoted by J_t. We describe below how J_t is determined. If arm a is pulled at time t and the user at time t prefers arm a (i.e., a ∈ J_t), then the reward obtained at time t is an independent Bernoulli random variable with mean μ_a. We assume μ_a > 0 for all arms. If the user at time t does not prefer the arm pulled, then the reward obtained at time t is zero. We let X_t denote the reward obtained at time t.

For t ≥ 1, let T_a(t) represent the number of times arm a is pulled up to and including time t, and let S_a(t) represent the total reward accrued by pulling arm a up to and including time t. Thus T_a(t) = |{1 ≤ s ≤ t : I_s = a}|, and S_a(t) = |{1 ≤ s ≤ t : I_s = a, X_s = 1}|. We define T_a(0) = S_a(0) = 0.

Unique best arm. We assume there exists a unique a* ∈ A such that:

a* = arg max_a μ_a.

This assumption is standard and made for technical convenience; all our results continue to hold without it.

Arrivals with positive externalities. We now define the arrival process {J_t}_{t≥1} that determines users' preferences over arms; this arrival process is the novel feature of our model.
We assume there are fixed constants θ_a > 0 for a ∈ A (independent of T), denoting the initial "popularity" of arm a. For t ≥ 0, define:

N_a(t) = S_a(t) + θ_a,  a ∈ A.

Observe that by definition N_a(0) = θ_a. In our arrival process, arms with higher values of N_a(t) are more likely to be preferred. Formally, we assume that the t-th user prefers arm a (i.e., a ∈ J_t) with probability λ_a(t), independently of other arms, where:

λ_a(t) = f(N_a(t−1)) / Σ_{a'=1}^m f(N_{a'}(t−1)),

where f(·) is a positive, increasing function; we refer to f as the externality function. In our analysis we primarily focus on the parametric family f(x) = x^α, where α ∈ (0, ∞).

Intuitively, the idea is that agents who prefer arm a are more likely to arrive if arm a has been successful in the past. This is a positive externality: users who prefer arm a are more likely to generate rewards when arm a is pulled, and this will in turn increase the likelihood that an arrival preferring arm a comes in the future. The parameter α controls the strength of this externality: the positive externality is stronger when α is larger.

If f is linear (α = 1), then we can interpret our model in terms of an urn process. In this view, θ_a resembles the initial number of balls of color a in the urn at time t = 1, and N_a(t) resembles the total number of balls of color a added into the urn after t draws. Thus, the probability that the t-th draw is of color a is proportional to N_a(t). In contrast to the standard urn model, in our model we have additional control: namely, we can pull an arm, and thus govern the probability with which a new ball of the same color is added into the urn.

3.2 The oracle and regret

Maximizing expected reward. Throughout our presentation, we use T to denote the time horizon over which performance is being optimized.
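The arrival process above is easy to simulate. The following is a minimal sketch (with hypothetical function names, not from the paper) of the preference probabilities λ_a(t) and of drawing the preference set J_t, where each arm is preferred independently with probability proportional to f(N_a(t−1)) = (S_a(t−1) + θ_a)^α:

```python
import random

def preference_probs(S, theta, alpha):
    """lambda_a(t): probability that the arriving user prefers arm a,
    with externality function f(x) = x**alpha applied to N_a = S_a + theta_a."""
    weights = [(S[a] + theta[a]) ** alpha for a in range(len(S))]
    total = sum(weights)
    return [w / total for w in weights]

def sample_preference_set(S, theta, alpha, rng):
    """Draw J_t: each arm is preferred independently with prob. lambda_a(t)."""
    return {a for a, p in enumerate(preference_probs(S, theta, alpha))
            if rng.random() < p}
```

For example, with two arms, θ_1 = θ_2 = 1, α = 1, and accrued rewards S = (8, 0), arm 1 is preferred with probability 9/10, illustrating how early rewards tilt future arrivals.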
(The remainder of our paper characterizes upper and lower bounds on performance as the time horizon T grows large.) We let Γ_T denote the total reward accrued up to time T:

Γ_T = Σ_{t=1}^T X_t.

The goal of the platform is to choose a sequence {I_t} to maximize E[Γ_T]. As usual, I_t must be a function only of the past history (i.e., prior to time t).

The oracle policy. As is usual in multiarmed bandit problems, we measure our performance against a benchmark policy that we refer to as the Oracle.

Definition 1 (Oracle). The Oracle algorithm knows the optimal arm a*, and pulls it at all times t = 1, 2, ....

Let Γ*_T denote the reward of the Oracle. Note that the Oracle may not be optimal for fixed finite T; in particular, unlike in the standard stochastic MAB problem, the expected cumulative reward E[Γ*_T] is not μ_{a*}T, as several arrivals may not prefer the optimal arm.

The next proposition provides tight bounds on E[Γ*_T]. For the proof, see the Appendix.

Proposition 1. Suppose α > 0. Let θ_α = Σ_{a≠a*} θ_a^α. The expected cumulative reward E[Γ*_T] for the Oracle satisfies:

1. E[Γ*_T] ≤ μ_{a*}T − μ_{a*}θ_α Σ_{k=1}^T 1/((k + θ_{a*} − 1)^α + θ_α).

2. E[Γ*_T] ≥ μ_{a*}T − θ_α Σ_{k=1}^T 1/(k + θ_{a*})^α − 1.

In particular, we have:

E[Γ*_T] = μ_{a*}T − Θ(T^{1−α}) if 0 < α < 1;  E[Γ*_T] = μ_{a*}T − Θ(ln T) if α = 1;  E[Γ*_T] = μ_{a*}T − Θ(1) if α > 1.

The discontinuity at α = 1 in the asymptotic bound above arises since Σ_{k=1}^T k^{−α} diverges for each α ≤ 1 but converges for α > 1. Further, the divergence is logarithmic for α = 1 but polynomial for each α < 1.

Note that in all cases, the reward asymptotically is of order μ_{a*}T.
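The Oracle's reward is straightforward to estimate by Monte Carlo, which is also how E[Γ*_T] is computed for the simulations of Section 7. A minimal sketch (hypothetical function name, not from the paper): the Oracle always pulls a*, but a reward accrues only when the arriving user actually prefers a*.

```python
import random

def oracle_reward(mu, theta, alpha, T, seed=0):
    """One Monte Carlo run of the Oracle of Definition 1: always pull
    a* = argmax_a mu_a; reward accrues only if the arrival prefers a*."""
    rng = random.Random(seed)
    m = len(mu)
    a_star = max(range(m), key=lambda a: mu[a])
    S = [0] * m                    # only S_{a*} can grow under the Oracle
    total = 0
    for _ in range(T):
        w = [(S[a] + theta[a]) ** alpha for a in range(m)]
        lam = w[a_star] / sum(w)   # prob. that this arrival prefers a*
        if rng.random() < lam and rng.random() < mu[a_star]:
            S[a_star] += 1
            total += 1
    return total
```

Averaging `oracle_reward` over many seeds approximates E[Γ*_T]; early arrivals often do not prefer a*, which is exactly the Θ(ln T) shortfall (for α = 1) in Proposition 1.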
This is the best achievable performance to leading order in T, showing that the oracle is asymptotically optimal.

Our goal: Regret minimization. Given any policy, define the regret against the Oracle as:

R_T = Γ*_T − Γ_T.    (1)

Our goal in the remainder of the paper is to minimize the expected regret E[R_T]. In particular, we focus on characterizing regret performance asymptotically to leading order in T (both lower bounds and achievable performance), for different values of the externality exponent α.

4 Lower bounds

In this section, we develop lower bounds on the achievable regret of any feasible policy. As we will find, these lower bounds are quite distinct from the usual Ω(ln T) lower bound (see [15, 8]) on regret for the standard stochastic MAB problem. This fundamentally different structure arises because of the positive externalities in the arrival process.

To understand our construction of the lower bound, consider the case where the externality function is linear (α = 1); the other cases follow similar logic. Our basic idea is that in order to determine the best arm, any optimal algorithm will need to explore all arms at least ln T times. However, this means that after t_0 = Θ(ln T) time, the total reward on the suboptimal arms will be of order Σ_{b≠a*} N_b(t_0) = Θ(ln T). Because of the effect of the positive externality, any algorithm will then need to "recover" from having accumulated rewards on these suboptimal arms. We show that even if the optimal arm a* is pulled from time t_0 onwards, a regret of Ω(ln² T) is incurred simply because arrivals who do not prefer arm a* continue to arrive in sufficient numbers.

The next theorem provides regret lower bounds for all values of α. The proof can be found in the Appendix.

Theorem 1.

1.
For α < 1, there exists no policy with E[R_T] = o(T^{1−α} ln^α T) on all sets of Bernoulli reward distributions.

2. For α = 1, there exists no policy with E[R_T] = o(ln² T) on all sets of Bernoulli reward distributions.

3. For α > 1, there exists no policy with E[R_T] = o(ln^α T) on all sets of Bernoulli reward distributions.

The remainder of the paper is devoted to studying the regret performance of classic algorithms (such as UCB), and developing an algorithm that achieves the lower bounds above.

5 Suboptimality of classical approaches

We devote this section to developing structural insight into the model, by characterizing the performance of two classical approaches for the standard stochastic MAB problem: the UCB algorithm [5, 8] and a random-explore-then-commit algorithm.

5.1 UCB

We first show that the standard upper confidence bound (UCB) algorithm, which does not account for the positive externality, performs poorly. (Recall that in the standard MAB setting, UCB achieves the asymptotically optimal O(ln T) regret bound [15, 8].)

Formally, the UCB algorithm is defined as follows.

Definition 2 (UCB(γ)). Fix γ > 0. For each a ∈ A, let μ̂_a(0) = 0 and for each t > 0 let μ̂_a(t) := S_a(t−1)/T_a(t−1), under the convention that μ̂_a(t) = 0 if T_a(t−1) = 0. For each a ∈ A let u_a(0) = 0 and for each t > 0 let

u_a(t) := μ̂_a(t) + √(γ ln t / T_a(t−1)).

Choose:

I_t ∈ arg max_{a∈A} u_a(t),

with ties broken uniformly at random.

Under our model, consider an event where a* ∉ J_t but I_t = a*: i.e., a* is pulled but the arriving user does not prefer arm a*.
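For concreteness, the UCB(γ) rule of Definition 2 can be sketched under this arrival model as follows; this is a minimal illustration (hypothetical function name, ties broken by arm index rather than uniformly at random, for brevity):

```python
import math
import random

def ucb_run(mu, theta, alpha, T, gamma=3.0, seed=0):
    """Sketch of UCB(gamma) under the positive-externality arrival model."""
    rng = random.Random(seed)
    m = len(mu)
    S, pulls = [0] * m, [0] * m   # S_a(t) and T_a(t)
    for t in range(1, T + 1):
        # arrival: arm a is preferred independently w.p. (S_a+theta_a)^alpha / sum
        w = [(S[a] + theta[a]) ** alpha for a in range(m)]
        tot = sum(w)
        J = {a for a in range(m) if rng.random() < w[a] / tot}
        def u(a):  # u_a(t) = mu_hat_a(t) + sqrt(gamma ln t / T_a(t-1))
            if pulls[a] == 0:
                return float('inf')
            return S[a] / pulls[a] + math.sqrt(gamma * math.log(t) / pulls[a])
        a = max(range(m), key=u)
        pulls[a] += 1
        if a in J and rng.random() < mu[a]:  # zero reward if a is not preferred
            S[a] += 1
    return S, pulls
```

Running this with the parameters of Section 7 makes the self-reinforcement visible: an early string of unlucky pulls of a* both lowers u_{a*}(t) and shrinks the fraction of arrivals preferring a*.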
Under UCB, such events are self-reinforcing: they not only lower the upper confidence bound for arm a*, resulting in fewer future pulls of arm a*, but they also reduce the preference of future users towards arm a*.

It is perhaps not surprising, then, that UCB performs poorly. However, the impact of this self-reinforcement under UCB is so severe that we obtain a striking result: there is a strictly positive probability that the optimal arm a* will never see a positive reward, as shown by the following theorem. An immediate consequence of this result is that the regret of UCB is linear in the horizon length. The proof can be found in the Appendix.

Theorem 2. Suppose γ > 0. Suppose that f(x) is Ω(ln^{1+ε}(x)) for some ε > 0. For the UCB(γ) algorithm, there exists an ε₀ > 0 such that

P( lim_{T→∞} S_{a*}(T) = 0 ) ≥ ε₀.

In particular, the regret of UCB(γ) is Ω(T).

5.2 Random-explore-then-commit

UCB fails because it does not explore sufficiently. In this section, we show that more aggressive unstructured exploration is not sufficient to achieve optimal regret. In particular, we consider a policy that chooses arms independently and uniformly at random for some period of time, and then commits to the empirical best arm for the rest of the time.

Definition 3 (REC(τ)). Fix τ ∈ Z₊. For each 1 ≤ t ≤ τ, choose I_t uniformly at random from the set A. Let â* ∈ arg max_a S_a(τ), with ties broken at random. For τ < t ≤ T, I_t = â*.

The following proposition provides performance bounds for the REC(τ) policy in our model. The proof of this result takes advantage of multitype continuous-time Markov branching processes [4, 12]; it is given in the Appendix.

Proposition 2. Suppose that θ_a for each a ∈ A is a positive integer. Let b = arg max_{a≠a*} μ_a. The following statements hold for the REC(τ) policy for any τ:

1.
If 0 < α < 1 then we have E[R_T] = Ω(T^{1−α} ln^{α/(1−α)} T).

2. If α = 1 then we have E[R_T] = Ω(T^{μ_b/(μ_b + θ_{a*} μ_{a*})}).

3. If α > 1 then we have E[R_T] = Ω(T).

Thus, for α ≤ 1, the REC(τ) policy may improve on the performance of UCB by delivering sublinear regret. Nevertheless, this regret scaling remains suboptimal for each α. In the next section, we demonstrate that carefully structured exploration can deliver an optimal regret scaling (matching the lower bounds in Theorem 1).

6 Optimal algorithms

In this section, we present an algorithm that achieves the lower bounds presented in Theorem 1. The main idea of our algorithm is to structure exploration by balancing exploration across arms; this ensures that the algorithm is not left to "correct" a potentially insurmountable imbalance in population once the optimal arm has been identified.

We first present a baseline algorithm called Balanced Exploration (BE) that nearly achieves the lower bound, but illustrates the key benefit of balancing; this algorithm has the advantage that it needs no knowledge of system parameters. We then present a natural modification of this algorithm called Balanced Exploration with Arm Elimination (BE-AE) that achieves the lower bound in Theorem 1, though it uses some knowledge of system parameters in doing so.

6.1 Balanced exploration

The BE policy is cautious during the exploration phase in the following sense: it pulls the arm with the least accrued reward, to give it further opportunity to ramp up its score just in case its poor performance was bad luck. At the end of the exploration phase, it exploits the empirical best arm for the rest of the horizon.

To define BE, we require an auxiliary sequence w_k, k = 1, 2, ..., used to set the exploration time. The only requirement on this sequence is that w_k → ∞ as k →
∞; e.g., w_k could be ln ln k for each positive integer k. The BE algorithm is defined as follows.

Definition 4 (Balanced Exploration (BE) Algorithm). Given T, let n = w_T ln T.

1. Exploration phase: Explore until the (random) time τ_n = min{t : S_b(t) ≥ n ∀ b ∈ A} ∧ T, i.e., explore until each arm has accrued at least n rewards, while if any arm accrues fewer than n rewards by time T, then τ_n = T. Formally, for 1 ≤ t ≤ τ_n, pull arm x(t) ∈ arg min_{a∈A} S_a(t−1), with ties broken at random.

2. Exploitation phase: Let â* ∈ arg min_{a∈A} T_a(τ_n), with ties broken at random. For τ_n + 1 ≤ t ≤ T, pull the arm â*.

Note that this algorithm only uses prior knowledge of the time horizon T, but no other system parameters; in particular, we do not need information on the strength of the positive externality, captured by α. Our main result is the following. The proof can be found in the Appendix.

Theorem 3. Suppose w_k, k = 0, 1, 2, ..., is any sequence such that w_k → ∞ as k → ∞. Then the regret of the BE algorithm is as follows:

1. If 0 < α < 1 then E[R_T] = O(w_T^α T^{1−α} ln^α T).
2. If α = 1 then E[R_T] = O(w_T ln² T).
3. If α > 1 then E[R_T] = O(w_T^α ln^α T).

In particular, observe that if w_k = ln ln k, then we conclude E[R_T] = Õ(T^{1−α} ln^α T) (if 0 < α < 1); E[R_T] = Õ(ln² T) (if α = 1); and E[R_T] = Õ(ln^α T) (if α > 1). Recall that the notation f(T) = Õ(g(T)) means there exists k > 0 such that f(T) = O(g(T) ln^k g(T)).

6.2 Balanced exploration with arm elimination

The BE algorithm very nearly achieves the lower bounds in Theorem 1.
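The two phases of BE can be sketched compactly. The following minimal simulation (hypothetical function name; w_T = ln ln T as suggested above) explores the least-rewarded arm until every arm has n = w_T ln T rewards, then commits to the arm that needed the fewest pulls to get there:

```python
import math
import random

def be_run(mu, theta, alpha, T, seed=0):
    """Sketch of the BE algorithm of Definition 4, with w_T = ln ln T."""
    rng = random.Random(seed)
    m = len(mu)
    n = max(1, int(math.log(math.log(T)) * math.log(T)))  # n = w_T ln T
    S, pulls, total = [0] * m, [0] * m, 0
    committed = None
    for t in range(1, T + 1):
        w = [(S[a] + theta[a]) ** alpha for a in range(m)]
        tot = sum(w)
        J = {a for a in range(m) if rng.random() < w[a] / tot}
        if committed is None and min(S) >= n:
            # hat a*: the arm that reached n rewards with the fewest pulls
            committed = min(range(m), key=lambda b: pulls[b])
        # exploration pulls the arm with least accrued reward
        a = committed if committed is not None else min(range(m), key=lambda b: S[b])
        pulls[a] += 1
        if a in J and rng.random() < mu[a]:
            S[a] += 1
            total += 1
    return total, committed
```

Because exploration equalizes accrued rewards rather than pulls, the arrival population stays balanced, and the arm with the highest μ_a is the one that typically needs the fewest pulls to accrue its n rewards.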
The additional "inflation" (captured by the additional factor w_T) arises in order to ensure the algorithm achieves low regret despite not having information on system parameters.

We now present an algorithm which eliminates the inflation in regret by intelligently eliminating arms that have poor performance during the exploration phase, using upper and lower confidence bounds. The algorithm assumes that T, m, α, and θ_a for each a are known to the platform (though we discuss the assumption on the knowledge of θ_a further below). With these informational assumptions, λ_a(t) for each t can be computed by the platform. Below, μ̂_a(t) is an unbiased estimate of μ_a given observations till time t, while u_a(t) and l_a(t) are its upper and lower confidence bounds.

Definition 5 (Balanced Exploration with Arm Elimination (BE-AE) Algorithm). Given T, m, and α, as well as θ_a for each a ∈ A, for each time t and each arm a define:

μ̂_a(t) = (T_a(t))^{−1} Σ_{k=1}^t (X_k / λ_a(k)) 1(I_k = a).

Further, let c = min_{a,b∈A} θ_a / (m(1 + θ_b)). Define u_a(t) = μ̂_a(t) + 5√(ln T / (c T_a(t))), and l_a(t) = μ̂_a(t) − 5√(ln T / (c T_a(t))).

Let A(t) be the set of active arms at time t. At time t = 1 all arms are active, i.e., A(1) = A. At each time t pull arm

I_t ∈ arg min_{a∈A(t)} S_a(t−1),

with ties broken lexicographically. Eliminate arm a from the active set if there exists an active arm b ∈ A(t) such that u_a(t) < l_b(t).

The following theorem shows that the BE-AE algorithm achieves optimal regret, i.e., it meets the lower bounds in Theorem 1. The proof can be found in the Appendix.

Theorem 4. For fixed m and α, the regret under the BE-AE algorithm satisfies the following:

1. If 0 < α < 1 then E[R_T] = O(T^{1−α} ln^α T).
2. If α = 1 then E[R_T] = O(ln² T).
3.
If α > 1 then E[R_T] = O(ln^α T).

As noted above, our algorithm requires some knowledge of system parameters. We briefly describe an approach that we conjecture delivers the same performance as BE-AE, but without knowledge of θ_a for a ∈ A. Given a small ε > 0, first run the exploration phase of the BE algorithm for n = ε ln T time without removing any arm. For t subsequent to the end of this exploration phase, i.e., once ε ln T samples are obtained for each arm, we have N_a(t) = ε ln T + θ_a. Thus, the effect of θ_a on λ_a(t) becomes negligible, and one can approximate λ_a(t) by letting N_b(t) = S_b(t) for each arm b. We then continue with the BE-AE algorithm as defined above (after completion of the exploration phase). We conjecture that the regret performance of this algorithm will match BE-AE as defined above. Proving this result, and more generally removing dependence on T, m, and α, remain interesting open directions.

7 Simulations

Below, we summarize our simulation setup and then describe our main findings.

Simulation setup. We simulate our model with m = 2 arms, externality strength α = 1, arm reward parameters μ₁ = 0.5 and μ₂ = 0.3, and initial biases θ₁ = θ₂ = 1. For Fig. 1a, we simulate each algorithm one hundred times for each set of parameters. We plot the pseudo-regret realization from each simulation, i.e., E[Γ*_T] − Γ_T, where E[Γ*_T] is the expected reward for the Oracle, computed via Monte Carlo simulation, and Γ_T is the total reward achieved by the algorithm. Thus, a lower pseudo-regret realization implies better performance. For Fig. 1b, each point is obtained by simulating the corresponding algorithm one thousand times. The time horizon T is as mentioned in the figures.

Parameters for each algorithm. We simulate UCB(γ) with γ = 3.
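The BE-AE rule of Definition 5 can likewise be sketched for simulation. This is a minimal illustration (hypothetical function name, not from the paper): balanced pulls over active arms, an inverse-propensity-weighted mean estimate, and confidence-bound elimination with radius p·√(ln T / T_a(t)); the parameter p is exposed because, as discussed below, the theoretical choice p = 5/√c is conservative for finite T.

```python
import math
import random

def be_ae_run(mu, theta, alpha, T, p=0.5, seed=0):
    """Sketch of BE-AE (Definition 5) with a tunable confidence radius p."""
    rng = random.Random(seed)
    m = len(mu)
    active = set(range(m))
    S, pulls, ipw = [0] * m, [0] * m, [0.0] * m
    for t in range(1, T + 1):
        w = [(S[a] + theta[a]) ** alpha for a in range(m)]
        tot = sum(w)
        lam = [wi / tot for wi in w]
        J = {a for a in range(m) if rng.random() < lam[a]}
        a = min(active, key=lambda b: (S[b], b))  # least reward; lexicographic ties
        pulls[a] += 1
        x = 1 if (a in J and rng.random() < mu[a]) else 0
        S[a] += x
        ipw[a] += x / lam[a]   # inverse-propensity weighting: E[X_t/lam_a] = mu_a
        def bounds(b):         # (l_b(t), u_b(t))
            if pulls[b] == 0:
                return float('-inf'), float('inf')
            mu_hat = ipw[b] / pulls[b]
            rad = p * math.sqrt(math.log(T) / pulls[b])
            return mu_hat - rad, mu_hat + rad
        bb = {b: bounds(b) for b in active}
        # eliminate b if some active arm's lower bound exceeds b's upper bound
        active = {b for b in active
                  if not any(bb[b][1] < bb[c][0] for c in active)}
    return active, S, pulls
```

Note that the arm with the largest lower bound can never be eliminated, so the active set is always nonempty.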
For Random-explore-then-commit, we set the exploration time as √T (empirically, this performs significantly better than ln T). For BE, we set w_T = β ln ln T with β = 2 (see Definition 4). For BE-AE (cf. Definition 5), we recall that the upper and lower confidence bounds are set as u_a(t) = μ̂_a(t) + p√(ln T / T_a(t)) and l_a(t) = μ̂_a(t) − p√(ln T / T_a(t)) for p = 5c^{−1/2}, where c = min_{a,b∈A} θ_a/(m(1 + θ_b)). This choice of p was set in the paper for technical reasons, but unfortunately it is suboptimal for finite T. The choice of p = 1/2 achieves significantly better performance for this experimental setup. The performance is sensitive to small changes in p, as the plots illustrate when choosing p = 5/2. In contrast, in our experiments, we found that the performance of BE is relatively robust to the choice of β.

(a) Realized pseudo-regret for T = 3 × 10⁴. (b) Expected regret as a function of the time horizon T.

Figure 1: Performance comparison of algorithms in different parameter regimes. All simulations have m = 2 arms, externality strength α = 1, arm reward parameters μ₁ = 0.5 and μ₂ = 0.3, and initial arm bias θ₁ = θ₂ = 1.

Main findings. The following are our main findings from the above simulations.

First, even for α = 1, REC appears to perform as poorly as UCB. Recall that in Section 5 we show theoretically that the regret is linear for UCB for each α, and for REC for α > 1. For α = 1, we are only able to show that REC exhibits polynomial regret.

Second, for finite T, the performance of the (asymptotically optimal) BE-AE algorithm is quite sensitive to the choice of algorithm parameters, and thus may perform poorly in certain regimes.
By contrast, the (nearly asymptotically optimal) BE algorithm appears to exhibit more robust performance.

8 Discussion and conclusions

It is common that platforms make online decisions under uncertainty, and that these decisions impact future user arrivals. However, most MAB models in the past have decoupled the evolution of arrivals from the learning process. Our model, though stylized by design, provides several non-standard yet interesting insights which we believe are relevant to many platforms. In particular:

1. In the presence of self-reinforcing preferences, there is a cost to being optimistic in the face of uncertainty, as mistakes are amplified.

2. It is possible to mitigate the impact of transients arising from positive externalities by structuring the exploration procedure carefully.

3. Once enough evidence is obtained regarding optimality of a strategy, one may even use the externalities to one's advantage by purposefully shifting the arrivals to a profit-maximizing population.

Of course, real-world scenarios are complex and involve other types of externalities which may reverse some of these gains. For example, the presence of negative externalities may preclude the ability to have "all" arrivals prefer the chosen option. Alternatively, arrivals may have "limited memory", so that future arrivals might eventually forget the effect of the externality. Overall, we believe that this is an interesting yet under-explored space of research, and that positive externalities of the kind we study may play a pivotal role in the effectiveness of learning algorithms.

References

[1] Daron Acemoglu, Munther A. Dahleh, Ilan Lobel, and Asuman Ozdaglar. Bayesian learning in social networks. The Review of Economic Studies, 78(4):1201–1236, 2011.

[2] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 US election: divided they blog.
In Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43. ACM, 2005.

[3] Chris Anderson. The long tail. Wired Magazine, 12(10):170–177, 2004.

[4] Krishna B. Athreya and Samuel Karlin. Embedding of urn schemes into continuous time Markov branching processes and related limit theorems. Ann. Math. Statist., 39(6):1801–1817, 1968.

[5] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002.

[6] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[7] Sushil Bikhchandani, David Hirshleifer, and Ivo Welch. A theory of fads, fashion, custom, and cultural change as informational cascades. Journal of Political Economy, 100(5):992–1026, 1992.

[8] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1), 2012.

[9] Soumen Chakrabarti, Alan Frieze, and Juan Vera. The influence of search engines on preferential attachment. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2005.

[10] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Springer, 1998.

[11] David A. Freedman. On tail probabilities for martingales. Ann. Probab., 3(1):100–118, 1975.

[12] Svante Janson. Functional limit theorems for multitype branching processes and generalized Pólya urns. Stochastic Processes and their Applications, 110(2):177–245, 2004.

[13] Michael L. Katz and Carl Shapiro. Systems competition and network effects. Journal of Economic Perspectives, 8(2):93–115, 1994.

[14] Petra Küster.
Generalized Markov branching processes with state-dependent offspring distributions. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 64(4):475–503, December 1983.

[15] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22, March 1985.

[16] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.

[17] Andreu Mas-Colell, Michael Dennis Whinston, Jerry R. Green, et al. Microeconomic Theory, volume 1. Oxford University Press, 1995.

[18] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, pages 693–721, 2013.

[19] Jacob Ratkiewicz, Santo Fortunato, Alessandro Flammini, Filippo Menczer, and Alessandro Vespignani. Characterizing and modeling the dynamics of online popularity. Physical Review Letters, 105(15):158701, 2010.

[20] Carl Shapiro and Hal R. Varian. Information Rules: A Strategic Guide to the Network Economy. Harvard Business Press, 1998.

[21] Oz Shy. A short survey of network economics. Review of Industrial Organization, 38(2):119–149, 2011.

[22] Aleksandrs Slivkins. Contextual bandits with similarity information. In Proceedings of the 24th Annual Conference on Learning Theory, 2011.

[23] Lones Smith and Peter Sørensen. Pathological outcomes of observational learning. Econometrica, 68(2):371–398, 2000.

[24] David Williams. Probability with Martingales. Cambridge University Press, 1991.

[25] G. Udny Yule. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F. R. S.
Philosophical Transactions of the Royal Society of London B: Biological Sciences, 213(402-410):21–87, 1925.