{"title": "Thompson Sampling with Information Relaxation Penalties", "book": "Advances in Neural Information Processing Systems", "page_first": 3554, "page_last": 3563, "abstract": "We consider a finite-horizon multi-armed bandit (MAB) problem in a Bayesian setting, for which we propose an information relaxation sampling framework. With this framework, we define an intuitive family of control policies that include Thompson sampling (TS) and the Bayesian optimal policy as endpoints. Analogous to TS, which, at each decision epoch pulls an arm that is best with respect to the randomly sampled parameters, our algorithms sample entire future reward realizations and take the corresponding best action. However, this is done in the presence of \u201cpenalties\u201d that seek to compensate for the availability of future information.\n\nWe develop several novel policies and performance bounds for MAB problems that vary in terms of improving performance and increasing computational complexity between the two endpoints. Our policies can be viewed as natural generalizations of TS that simultaneously incorporate knowledge of the time horizon and explicitly consider the exploration-exploitation trade-off. We prove associated structural results on performance bounds and suboptimality gaps. Numerical experiments suggest that this new class of policies perform well, in particular in settings where the finite time horizon introduces significant exploration-exploitation tension into the problem.", "full_text": "Thompson Sampling with Information Relaxation\n\nPenalties\n\nSeungki Min\n\nColumbia Business School\n\nCostis Maglaras\n\nColumbia Business School\n\nCiamac C. 
Moallemi\n\nColumbia Business School\n\nAbstract\n\nWe consider a \ufb01nite-horizon multi-armed bandit (MAB) problem in a Bayesian\nsetting, for which we propose an information relaxation sampling framework.\nWith this framework, we de\ufb01ne an intuitive family of control policies that include\nThompson sampling (TS) and the Bayesian optimal policy as endpoints. Analogous\nto TS, which, at each decision epoch pulls an arm that is best with respect to\nthe randomly sampled parameters, our algorithms sample entire future reward\nrealizations and take the corresponding best action. However, this is done in\nthe presence of \u201cpenalties\u201d that seek to compensate for the availability of future\ninformation.\nWe develop several novel policies and performance bounds for MAB problems that\nvary in terms of improving performance and increasing computational complexity\nbetween the two endpoints. Our policies can be viewed as natural generalizations\nof TS that simultaneously incorporate knowledge of the time horizon and explicitly\nconsider the exploration-exploitation trade-off. We prove associated structural\nresults on performance bounds and suboptimality gaps. Numerical experiments\nsuggest that this new class of policies perform well, in particular in settings where\nthe \ufb01nite time horizon introduces signi\ufb01cant exploration-exploitation tension into\nthe problem.\n\n1\n\nIntroduction\n\nDating back to the earliest work [2, 10], multi-armed bandit (MAB) problems have been considered\nwithin a Bayesian framework, in which the unknown parameters are modeled as random variables\ndrawn from a known prior distribution. 
In this setting, the problem can be viewed as a Markov\ndecision process (MDP) with a state that is an information state describing the beliefs of unknown\nparameters that evolve stochastically upon each play of an arm according to Bayes\u2019 rule.\nUnder the objective of expected performance, where the expectation is taken with respect to the\nprior distribution over unknown parameters, the (Bayesian) optimal policy (OPT) is characterized\nby Bellman equations immediately following from the MDP formulation. In the discounted in\ufb01nite-\nhorizon setting, the celebrated Gittins index [10] determines an optimal policy, despite the fact that\nits computation is still challenging. In the non-discounted \ufb01nite-horizon setting, which we consider,\nthe problem becomes more dif\ufb01cult [1], and except for some special cases, the Bellman equations\nare neither analytically nor numerically tractable, due to the curse of dimensionality. In this paper,\nwe focus on the Bayesian setting, and attempt to apply ideas from dynamic programming (DP) to\ndevelop tractable policies with good performance.\nTo this end, we apply the idea of information relaxation [4], a technique that provides a systematic\nway of obtaining the performance bounds on the optimal policy. In multi-period stochastic DP\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fproblems, admissible policies are required to make decisions based only on previously revealed\ninformation. The idea of information relaxation is to consider non-anticipativity as a constraint\nimposed on the policy space that can be relaxed, while simultaneously introducing a penalty for this\nrelaxation into the objective, as in the usual Lagrangian relaxations of convex duality theory. 
Under\nsuch a relaxation, the decision maker (DM) is allowed to access future information and is asked\nto solve an optimization problem so as to maximize her total reward, in the presence of penalties\nthat punish any violation of the non-anticipativity constraint. When the penalties satisfy a condition\n(dual feasibility, formally de\ufb01ned in \u00a73), the expected value of the maximal reward adjusted by the\npenalties provides an upper bound on the expected performance of the (non-anticipating) optimal\npolicy.\nThe idea of relaxing the non-anticipativity constraint has been studied in different contexts [17, 6,\n18, 11], and was later formulated as a formal framework by [4], upon which our methodology is\ndeveloped. This framework has been applied to a variety of applications including optimal stopping\nproblems [7], linear-quadratic control [12], dynamic portfolio execution [13], and more (see [3]).\nTypically, the application of this method to a speci\ufb01c class of MDPs requires custom analysis. In\nparticular, it is not always easy to determine penalty functions that (1) yield a relaxation that is\ntractable to solve, and (2) provide tight upper bounds on the performance of the optimal policy.\nMoreover, the established information relaxation theory focuses on upper bounds and provides no\nguidance on the development of tractable policies.\nOur contribution is to apply the information relaxation techniques to the \ufb01nite-horizon stochastic\nMAB problem, explicitly exploiting the structure of a Bayesian learning process. In particular,\n\n1. we propose a series of information relaxations and penalties of increasing computational\n\ncomplexity;\n\n2. we systematically obtain the upper bounds on the best achievable expected performance that\n\ntrade off between tightness and computational complexity;\n\n3. 
and we develop associated (randomized) policies that generalize Thompson sampling (TS)\n\nin the \ufb01nite-horizon setting.\n\nIn our framework, which we call information relaxation sampling, each of the penalty functions (and\ninformation relaxations) determines one policy and one performance bound given a particular problem\ninstance speci\ufb01ed by the time horizon and the prior beliefs. As a base case for our algorithms, we\nhave TS [21] and the conventional regret benchmark that has been used for Bayesian regret analysis\nsince [15]. At the other extreme, the optimal policy OPT and its expected performance follow from\nthe \u201cideal\u201d penalty (which, not surprisingly, is intractable to compute). By picking increasingly strict\ninformation penalties, we can improve the policy and the associated bound between the two extremes\nof TS and OPT.\nAs an example, one of our algorithms, IRS.FH, provides a very simple modi\ufb01cation of TS that\nnaturally incorporates time horizon T . Recalling that TS makes a decision based on sampled\nparameters from the posterior distribution in each epoch, we focus on the fact that knowing the\nparameters is as informative as having an in\ufb01nite number of future reward observations in terms of\nthe best arm identi\ufb01cation. By contrast, IRS.FH makes a decision based on future Bayesian estimates,\nupdated with only T \u2212 1 future reward realizations for each arm, where the rewards are sampled\nbased on the prior belief at the moment. When T = 1 (equivalently, at the last decision epoch), such a\npolicy takes a myopically best action based only on the current estimates, which is indeed an optimal\ndecision, whereas TS would still explore unnecessarily. While keeping the recursive structure of the\nsequential decision-making process of TS, IRS.FH naturally performs less exploration than TS as the\nremaining time horizon diminishes. 
This mitigates a common practical criticism of TS: it explores\ntoo much.\nBeyond this, we propose other algorithms that more explicitly quantify the bene\ufb01t of exploration and\nmore explicitly trade off between exploration and exploitation, at the cost of additional computational\ncomplexity. As we increase the complexity, we achieve policies that improve performance, and\nseparately provide tighter tractable computational upper bounds on the expected performance of\nany policy for a particular problem instance. By providing natural generalizations of TS, our\nwork provides both a deeper understanding of TS and improved policies that do not require tuning.\nSince TS has been shown to be asymptotically regret optimal [5], our improvements can at best be\n(asymptotically) constant factor improvements by that metric. On the other hand, TS is extremely\n\n2\n\n\fpopular in practice, and we demonstrate in numerical examples that the improvements can be\nsigni\ufb01cant and are likely to be of practical interest.\nMoreover, we develop upper bounds on performance that are useful in their own right. Suppose that\na decision maker faces a particular problem instance and is considering any particular MAB policy\n(be it one we suggest or otherwise). By simulating the policy, a lower bound on the performance of\nthe optimal policy can be found. We introduce a series of upper bounds that can also be evaluated in\nany problem instance via simulation. Paired with the lower bound, these provide a computational,\nsimulation-based \u201ccon\ufb01dence interval\u201d that can be helpful to the decision maker. For example, if the\nupper bound and lower bound are close, the suboptimality gap of the policy under consideration is\nguaranteed to be small, and it is not worth investing in better policies.\n\n2 Notation and Preliminaries\n\nProblem. We consider a classical stochastic MAB problem with K independent arms and \ufb01nite-\nhorizon T . At each decision epoch t = 1, . . . 
, T, the decision maker (DM) pulls an arm a_t ∈ A ≜ {1, . . . , K} and earns a stochastic reward associated with arm a_t. More formally, the reward from the nth pull of arm a is denoted by R_{a,n}, which is drawn independently from an unknown distribution R_a(θ_a), where θ_a ∈ Θ_a is the parameter associated with arm a. We also have a prior distribution P_a(y_a) over the unknown parameter θ_a, where y_a ∈ Y_a, which we call the belief, is a hyperparameter describing the prior distribution: θ_a ∼ P_a(y_a) and R_{a,n}|θ_a ∼ R_a(θ_a) for all n ∈ N and all a ∈ A.
We define the outcome ω ≜ ((θ_a)_{a∈A}, (R_{a,n})_{a∈A,n∈N}), which incorporates all the uncertainties that the DM encounters. Given the prior belief vector y ≜ (y_1, . . . , y_K) ∈ Y, we let I(y) be the prior distribution of the outcome ω induced by the P_a's and R_a's.
We additionally define the true mean reward µ_a and its Bayesian estimate µ̂_{a,n} as follows:

    µ_a(θ_a) ≜ E[R_{a,n} | θ_a],    µ̂_{a,n}(ω; y_a) ≜ E[µ_a(θ_a) | R_{a,1}, . . . , R_{a,n}].    (1)

Throughout the paper, we assume that the rewards are absolutely integrable under the prior distribution: i.e., E[|R_{a,n}|] < ∞, or more explicitly, E_{r∼R_a(P_a(y_a))}[|r|] < ∞, where R_a(P_a(y_a)) denotes the (unconditional) distribution of the reward R_{a,n} as a doubly stochastic random variable.
Policy. Given an action sequence up to time t, a_{1:t} ≜ (a_1, . . . , a_t) ∈ A^t, define the number of pulls n_t(a_{1:t}, a) ≜ Σ_{s=1}^t 1{a_s = a} for each arm a, and the corresponding reward realization r_t(a_{1:t}, ω) ≜ R_{a_t, n_t(a_{1:t}, a_t)}. The natural filtration F_t(a_{1:t}, ω; T, y) ≜ σ(T, y, (a_s, r_s(a_{1:s}, ω))_{s∈[t]}) encodes the observations revealed up to time t (inclusive).
Let a^π_{1:t} be the action sequence taken by a policy π. A policy π is called non-anticipating if its every action a^π_t is F_{t−1}-measurable, and we define Π_F to be the set of all non-anticipating policies, including randomized ones. The (Bayesian) performance of a policy π is defined as the expected total reward over the randomness associated with the outcome, i.e.,

    V(π, T, y) ≜ E_{ω∼I(y)}[ Σ_{t=1}^T r_t(a^π_{1:t}, ω) ].    (2)

MDP formulation. We assume that we are equipped with a Bayesian update function U : Y × A × R → Y so that after observing R_{a,1} = r from some arm a, the belief vector is updated from y to U(y, a, r) according to Bayes' rule, where only the ath component is updated in this step.
In a Bayesian framework, the MAB problem has a recursive structure. Given a time horizon T and prior belief y, suppose the DM has just earned r by pulling an arm a at time t = 1. The remaining problem for the DM is equivalent to a problem with time horizon T − 1 and prior belief U(y, a, r). Following from this Markovian structure, we obtain the Bellman equations for the MAB problem:

    Q*(T, y, a) ≜ E_{r∼R_a(P_a(y_a))}[r + V*(T − 1, U(y, a, r))],    V*(T, y) ≜ max_{a∈A} Q*(T, y, a),    (3)

with V*(0, y) ≜ 0 for all y ∈ Y.
While the Bellman equation is intractable to analyze, it offers a characterization of the Bayesian optimal policy (OPT) and the best achievable performance V*: i.e., V*(T, y) = V(OPT, T, y) = sup_{π∈Π_F} V(π, T, y).

3 Information Relaxation Sampling

We propose a general framework, which we refer to as information relaxation sampling (IRS), that takes as input a "penalty function" z_t(·) and produces as outputs a policy π^z and an associated performance bound W^z.
Information relaxation penalties and inner problem. If we relax the non-anticipativity constraint imposed on the policy space Π_F (i.e., that a^π_t is F_{t−1}-measurable), the DM is allowed to first observe all future outcomes in advance and then pick an action (i.e., a^π_t is σ(ω)-measurable). To compensate for this relaxation, we impose a penalty on the DM for violating the non-anticipativity constraint. We introduce a penalty function z_t(a_{1:t}, ω; T, y) to denote the penalty that the DM incurs at time t when taking an action sequence a_{1:t}, given a particular instance specified by ω, T and y. The clairvoyant DM can find the action sequence that is optimal for a particular outcome ω in the presence of the penalties z_t by solving the following (deterministic) optimization problem, referred to as the inner problem:

    maximize_{a_{1:T}∈A^T}  Σ_{t=1}^T r_t(a_{1:t}, ω) − z_t(a_{1:t}, ω; T, y).    (*)

Definition 1 (Dual feasibility). A penalty function z_t is dual feasible if it is ex-ante zero-mean, i.e.,

    E[z_t(a_{1:t}, ω; T, y) | F_{t−1}(a_{1:t−1}, ω; T, y)] = 0,    ∀a_{1:t} ∈ A^t, ∀t ∈ [T].    (4)

To clarify the notion of conditional expectation, we remark that the mapping a_{1:t} ↦ z_t(a_{1:t}, ω; T, y) is a stochastic function of the action sequence a_{1:t}, since the outcome ω is random.¹ The dual feasibility condition requires that a DM who makes decisions on the natural filtration receives zero penalties in expectation.
IRS performance bound. Let W^z(T, y) be the expected maximal value of the inner problem (*) when the outcome ω is randomly drawn from its prior distribution I(y), i.e., the expected total payoff that a clairvoyant DM can achieve in the presence of the penalties:

    W^z(T, y) ≜ E_{ω∼I(y)}[ max_{a_{1:T}∈A^T} { Σ_{t=1}^T r_t(a_{1:t}, ω) − z_t(a_{1:t}, ω; T, y) } ].    (5)

We can obtain this value numerically via simulation: draw outcomes ω^(1), ω^(2), . . . , ω^(S) independently from I(y), solve the inner problem for each outcome separately, and then average the maximal values across these samples. The following theorem shows that W^z is indeed a valid performance bound for the stochastic MAB problem.
Theorem 1 (Weak duality and strong duality). If the penalty function z_t is dual feasible, W^z is an upper bound on the optimal value V*: for any T and y,

    (Weak duality)    W^z(T, y) ≥ V*(T, y).    (6)

There exists a dual feasible penalty function, referred to as the ideal penalty z^ideal_t, such that

    (Strong duality)    W^ideal(T, y) = V*(T, y).    (7)

The ideal penalty function z^ideal_t has the following functional form:

    z^ideal_t(a_{1:t}, ω) ≜ r_t(a_{1:t}, ω) − E[r_t(a_{1:t}, ω) | F_{t−1}(a_{1:t−1}, ω)]
        + V*(T − t, y_t(a_{1:t}, ω)) − E[V*(T − t, y_t(a_{1:t}, ω)) | F_{t−1}(a_{1:t−1}, ω)].    (8)

A good penalty function precisely penalizes the additional profit extracted from using the future information ω. At the extreme, the ideal penalty z^ideal_t, though intractable, removes any incentive to deviate from OPT and results in strong duality. In (8), y_t(a_{1:t}, ω) represents the posterior belief that the DM would hold at time t after observing the reward realizations associated with a_{1:t} given ω.
¹ As in usual probability theory, Z(ω) ≜ E[X(ω)|Y(ω)] represents the expected value of a random variable X(ω) given the information Y(ω), and Z(ω) is itself a random variable that depends on ω.
IRS policy. Given a penalty function z_t, we characterize a randomized and non-anticipating IRS policy π^z ∈ Π_F as follows.
The policy π^z specifies "which arm to pull when the remaining time is T and the current belief is y." Given T and y, it (i) first samples an outcome ω̃ from I(y) randomly, (ii) solves the inner problem to find a best action sequence ã*_{1:T} with respect to ω̃ in the presence of the penalties z_t, and (iii) takes the first action ã*_1 that the clairvoyant optimal solution ã*_{1:T} suggests. Analogous to Thompson sampling, it repeats steps (i)–(iii) at every decision epoch, while updating the remaining time T and belief y upon each reward realization.

Algorithm 1: Information relaxation sampling (IRS) policy
Function IRS(T, y; z)
1    Sample ω̃ ∼ I(y) (equivalently, θ̃_a ∼ P_a(y_a) and R̃_{a,n} ∼ R_a(θ̃_a), ∀a ∈ A, ∀n ∈ [T])
2    Find the best action sequence with respect to ω̃ under the penalties z_t:
         ã*_{1:T} ← argmax_{a_{1:T}∈A^T} { Σ_{t=1}^T r_t(a_{1:t}, ω̃) − z_t(a_{1:t}, ω̃; T, y) }
3    return ã*_1
Procedure IRS-Outer(T, y; z)
1    y_0 ← y
2    for t = 1, 2, . . . , T do
3        Play a_t ← IRS(T − t + 1, y_{t−1}; z)
4        Earn and observe a reward r_t and update the belief y_t ← U(y_{t−1}, a_t, r_t)
     end

Remark 1. The ideal penalty yields the Bayesian optimal policy: i.e., V(π^ideal, T, y) = V*(T, y).
Choice of penalty functions. IRS policies include Thompson sampling and the Bayesian optimal policy as the two extremal cases. We propose a set of penalty functions spanning these two. Deferring the detailed explanations to §3.1–§3.4, we briefly list the penalty functions:

    z^TS_t(a_{1:t}, ω) ≜ r_t(a_{1:t}, ω) − E[r_t(a_{1:t}, ω) | θ_1, . . . , θ_K]    (9)
    z^IRS.FH_t(a_{1:t}, ω) ≜ r_t(a_{1:t}, ω) − E[r_t(a_{1:t}, ω) | µ̂_{1,T−1}(ω), . . . , µ̂_{K,T−1}(ω)]    (10)
    z^IRS.V-ZERO_t(a_{1:t}, ω) ≜ r_t(a_{1:t}, ω) − E[r_t(a_{1:t}, ω) | F_{t−1}(a_{1:t−1}, ω)]    (11)
    z^IRS.V-EMAX_t(a_{1:t}, ω) ≜ r_t(a_{1:t}, ω) − E[r_t(a_{1:t}, ω) | F_{t−1}(a_{1:t−1}, ω)]
        + W^TS(T − t, y_t(a_{1:t}, ω)) − E[W^TS(T − t, y_t(a_{1:t}, ω)) | F_{t−1}(a_{1:t−1}, ω)]    (12)

To aid understanding, we provide an identity as an example: E[r_t(a_{1:t}, ω) | F_{t−1}(a_{1:t−1}, ω)] = E[µ_{a_t}(θ_{a_t}) | R_{a_t,1}, . . . , R_{a_t,n_{t−1}(a_{1:t−1},a_t)}] = µ̂_{a_t,n_{t−1}(a_{1:t−1},a_t)}(ω) – they all represent the mean reward that the DM expects to get from arm a_t right before making a decision at time t.
Remark 2. All penalty functions (8)–(12) are dual feasible.
As we sequentially increase the complexity, from z^TS to z^ideal, the penalty function more accurately penalizes the benefit of knowing the future outcomes, more explicitly preventing the DM from exploiting the future information. As summarized in Table 1, this makes the inner problem closer to the original stochastic optimization problem, which results in a better-performing policy and a tighter performance bound. As a result, we obtain a family of algorithms that are intuitive and tractable, exhibiting a trade-off between quality and computational efficiency.

3.1 Thompson Sampling

With the penalty function z^TS_t(a_{1:t}, ω) = r_t(a_{1:t}, ω) − µ_{a_t}(θ_{a_t}), the inner problem (*) reduces to

    max_{a_{1:T}∈A^T} { Σ_{t=1}^T r_t(a_{1:t}, ω) − z^TS_t(a_{1:t}, ω) } = max_{a_{1:T}∈A^T} { Σ_{t=1}^T µ_{a_t}(θ_{a_t}) } = T × max_{a∈A} µ_a(θ_a).    (13)

The resulting performance bound W^TS(T, y) is E[T × max_{a∈A} µ_a(θ_a)], which is the conventional benchmark in a Bayesian setting [15, 19]. The corresponding IRS policy π^TS recovers Thompson sampling: when the sampled outcome ω̃ is used instead, it plays the arm ã*_1 = argmax_a µ_a(θ̃_a), where each θ̃_a is sampled from P_a(y_a). Recall that this sampling-based decision making is repeated in each epoch, while updating the belief sequentially, as described in IRS-OUTER in Algorithm 1.

Penalty function | Policy | Performance bound | Inner problem | Run time
z^TS_t | TS | W^TS | Find a best arm given parameters. | O(K)
z^IRS.FH_t | π^IRS.FH | W^IRS.FH | Find a best arm given finite observations. | O(K)† or O(KT)
z^IRS.V-ZERO_t | π^IRS.V-ZERO | W^IRS.V-ZERO | Find an optimal allocation of T pulls. | O(KT²)
z^IRS.V-EMAX_t | π^IRS.V-EMAX | W^IRS.V-EMAX | Find an optimal action sequence. | O(KT^K)
z^ideal_t | OPT | V* | Solve Bellman equations. | –

Table 1: List of algorithms associated with the penalty functions (8)–(12). Run time represents the time complexity of solving one instance of the inner problem, that is, the time required to obtain one sample of the performance bound W^z or to make a single decision under the policy π^z. †In IRS.FH, O(K) is achievable when the prior distribution P_a is a conjugate prior of the reward distribution R_a.

3.2 IRS.FH

Recall that µ̂_{a,T−1}(ω) is the Bayesian estimate of the mean reward of an arm a inferred from the first T − 1 reward realizations R_{a,1}, . . . , R_{a,T−1}.
Given (10), the optimal solution to the inner problem (*) is to pull an arm with the highest µ̂_{a,T−1}(ω) from beginning to end:

    max_{a_{1:T}∈A^T} { Σ_{t=1}^T r_t(a_{1:t}, ω) − z^IRS.FH_t(a_{1:t}, ω) } = max_{a_{1:T}∈A^T} { Σ_{t=1}^T µ̂_{a_t,T−1}(ω) } = T × max_{a∈A} µ̂_{a,T−1}(ω).    (14)

IRS.FH is almost identical to TS except that µ_a(θ_a) is replaced with µ̂_{a,T−1}(ω). Note that µ̂_{a,T−1}(ω) is less informative than µ_a(θ_a) for the DM, since she will never be able to learn µ_a(θ_a) perfectly within a finite horizon. In terms of estimation, knowing the parameters is equivalent to having an infinite number of observations. The inner problem of TS asks the DM to "identify the best arm based on an infinite number of samples," whereas that of IRS.FH asks her to "identify the best arm based on a finite number of samples," which takes the length of the time horizon into account explicitly.
Focusing on the policies π^IRS.FH and π^TS (where the randomly generated µ_a(θ̃_a) and µ̂_{a,T−1}(ω̃) are used), we observe that the distribution of µ̂_{a,T−1}(ω̃) is more concentrated, while both have the same mean µ̄_a ≜ E[µ_a(θ̃_a)] = µ̂_{a,0}. Since the variances of µ̂_{a,T−1}(ω̃) and µ_a(θ̃_a) govern the degree of random exploration (deviating from the myopic decision of pulling an arm with the largest µ̄_a), π^IRS.FH naturally explores less than TS, in particular as it approaches the end of the horizon (T ↘ 1).
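To make the IRS.FH decision rule concrete, here is a minimal sketch of a single decision epoch for the Gaussian case (our own naming and a deliberately direct implementation, not the paper's reference code; the sufficient-statistics shortcut discussed in the following paragraph replaces the explicit loop over the T − 1 sampled rewards):

```python
import random

def irs_fh_action(T, prior_means, prior_vars, noise_vars, rng=random):
    """One decision of the IRS.FH policy for a Gaussian MAB (sketch).

    For each arm: sample a parameter from the current belief, sample
    T - 1 future rewards under it, and form the Bayesian posterior mean
    mu_hat_{a,T-1}; then pull the arm with the largest estimate.
    With T == 1 this reduces to the myopic (optimal last-step) choice,
    while Thompson sampling corresponds to conditioning on infinitely
    many future observations.
    """
    best_arm, best_est = None, float("-inf")
    for a, (m, s2, v) in enumerate(zip(prior_means, prior_vars, noise_vars)):
        theta = rng.gauss(m, s2 ** 0.5)                       # theta_a ~ P_a(y_a)
        rewards = [rng.gauss(theta, v ** 0.5) for _ in range(T - 1)]
        # conjugate Gaussian update: posterior mean after T - 1 observations
        precision = 1.0 / s2 + len(rewards) / v
        mu_hat = (m / s2 + sum(rewards) / v) / precision
        if mu_hat > best_est:
            best_arm, best_est = a, mu_hat
    return best_arm
```

Note that when T = 1 the reward list is empty and `mu_hat` collapses to the prior mean, so the sketch pulls the myopically best arm, exactly the single-period-optimality behavior described in the text.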
For the performance bounds, by the same reasoning, we have W^IRS.FH = E[T × max_a µ̂_{a,T−1}(ω)] ≤ W^TS = E[T × max_a µ_a(θ_a)], meaning that IRS.FH yields a performance bound that is tighter than the conventional regret benchmark.
Sampling µ̂_{a,T−1}(ω̃) at once. In order to obtain µ̂_{a,T−1}(ω̃) for a synthesized outcome ω̃, one may apply Bayes' rule sequentially for each reward realization, which takes O(KT) computations in total. It can be done in O(K) when the belief can be updated in a batch through the use of sufficient statistics. In Beta-Bernoulli and Gaussian MABs, for example, µ̂_{a,T−1}(ω̃) can be represented as a convex combination of the current estimate µ̄_a and the sample mean (1/(T−1)) Σ_{n=1}^{T−1} R̃_{a,n}, where Σ_{n=1}^{T−1} R̃_{a,n} is distributed as Binomial(T − 1, θ̃_a) in the Beta-Bernoulli case and as N((T − 1)·θ̃_a, (T − 1)·σ²_a) in the Gaussian case (σ²_a represents the noise variance). After sampling the parameter θ̃_a, we can sample Σ_{n=1}^{T−1} R̃_{a,n} directly from the known distribution and then use it to compute µ̂_{a,T−1}(ω̃) without sequentially updating the belief. In such cases, a single decision of π^IRS.FH can be made within O(K) operations, similar in complexity to TS.

3.3 IRS.V-ZERO

Under the penalty z^IRS.V-ZERO_t, the DM at time t earns E[r_t(a_{1:t}, ω) | F_{t−1}(a_{1:t−1}, ω)], the expected mean reward that she can infer from the observations prior to time t.
As we defined R_{a,n} to be the reward from the nth pull of arm a (not the pull at time n), the posterior belief associated with each arm is determined only by the number of past pulls of that arm – from the nth pull of arm a, the DM earns µ̂_{a,n−1}(ω), irrespective of the detailed sequence of past actions.
Following this observation, solving the inner problem (*) is equivalent to "finding the optimal allocation (n*_1, n*_2, . . . , n*_K) among T remaining opportunities": omitting ω for brevity, it reduces to

    max_{a_{1:T}∈A^T} { Σ_{t=1}^T µ̂_{a_t, n_{t−1}(a_{1:t−1},a_t)} } = max_{a_{1:T}∈A^T} { Σ_{a=1}^K Σ_{n=1}^{n_T(a_{1:T},a)} µ̂_{a,n−1} } = max_{n_{1:K}∈N_T} { Σ_{a=1}^K S_{a,n_a} },    (15)

where S_{a,n} ≜ Σ_{m=1}^n µ̂_{a,m−1} is the cumulative payoff from the first n pulls of an arm a, and N_T ≜ {(n_1, . . . , n_K) ∈ Z_+^K : Σ_{a=1}^K n_a = T} is the set of all feasible allocations. Once the S_{a,n}'s are computed, this inner problem can be solved within O(KT²) operations by sequentially applying sup-convolution K times. The detailed implementation is provided in Appendix §B.1.
Given an optimal allocation ñ*, the policy π^IRS.V-ZERO needs to select which arm to pull next. In principle, any arm a that was included in the solution of the inner problem, ñ*_a > 0, would be fine, but we suggest a selection rule in which the arm that requires the most pulls is chosen, i.e., argmax_a ñ*_a.

3.4 IRS.V-EMAX

Under perfect information relaxation, the DM perfectly knows not only (i) what she will earn at future times but also (ii) how her belief will evolve as a result of her action sequence.
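Before turning to that second component, the allocation form (15) of IRS.V-ZERO's inner problem deserves a concrete illustration. A minimal sketch of the K successive sup-convolutions (our naming; the table `S[a][n]` of cumulative payoffs S_{a,n} is assumed precomputed, with `S[a][0] == 0`, and only the optimal value is returned, although the argmax allocation can be recovered the same way):

```python
def best_allocation_value(S, T):
    """Inner problem of IRS.V-ZERO, as in (15): maximize sum_a S[a][n_a]
    over allocations with sum_a n_a == T, by K sup-convolutions in O(K*T^2).
    """
    # best[t] = best value achievable spending exactly t pulls on the arms seen so far
    best = [0.0] + [float("-inf")] * T
    for Sa in S:
        new = [float("-inf")] * (T + 1)
        for t in range(T + 1):
            for n in range(min(t, len(Sa) - 1) + 1):
                if best[t - n] > float("-inf"):
                    new[t] = max(new[t], best[t - n] + Sa[n])
        best = new
    return best[T]
```

For example, with two arms, T = 3, and cumulative payoffs `[0, 1, 1.5, 1.8]` and `[0, 0.9, 1.8, 2.7]`, the best split is one pull of the first arm and two of the second, for a value of 2.8.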
The previous algorithms focus on the former component by making the DM adjust the future rewards by conditioning (e.g., E[r_t(a_t)|θ], E[r_t(a_t)|µ̂_{1:K,T−1}] and E[r_t(a_t)|F_{t−1}]). IRS.V-EMAX addresses the second component as well, by charging an additional cost for using the information on her future belief transitions.
Specifically, the penalty function z^IRS.V-EMAX_t is obtained from z^ideal_t in (8) by replacing V*(T, y) with W^TS(T, y), which is a tractable alternative. The use of W^TS(T, y) leads to a simple expression for its conditional expectation: since θ|F_{t−1} is distributed as P(y_{t−1}), we have

    E[W^TS(T − t, y_t) | F_{t−1}] = (T − t) × E[max_a µ_a(θ_a) | F_{t−1}]    (16)
        = (T − t) × E_{θ∼P(y_{t−1})}[max_a µ_a(θ_a)] = W^TS(T − t, y_{t−1}).    (17)

We further observe that, given ω, the future belief y_t(a_{1:t}, ω) depends only on how many times each arm has been pulled, irrespective of the sequence of the pulls; hence, the number of possible future beliefs is O(T^K), not O(K^T). Given the above observations, we can solve the inner problem within O(KT^K) computations by dynamic programming (i.e., by finding a best action at each future belief while iterating over the beliefs in an appropriate order). See §B.2 for details.

4 Analysis

Remark 3 (Single period optimality). When T = 1, all of π^IRS.FH, π^IRS.V-ZERO, and π^IRS.V-EMAX take the optimal action, namely pulling the myopically best arm a* = argmax_a E[µ_a(θ_a)].
Proposition 1 (Asymptotic behavior). Assume that µ_i(θ_i) ≠ µ_j(θ_j) almost surely for any two distinct arms i ≠ j. As T ↗ ∞, the distributions of IRS.FH's and IRS.V-ZERO's actions² converge to that of Thompson sampling: for all a ∈ A,

    lim_{T→∞} P[IRS.FH(T, y) = a] = lim_{T→∞} P[IRS.V-ZERO(T, y) = a] = P[TS(y) = a].    (18)

TS(y), IRS.FH(T, y) and IRS.V-ZERO(T, y) denote the actions taken by the policies π^TS, π^IRS.FH and π^IRS.V-ZERO, respectively, when the remaining time is T and the prior belief is y. These are random variables, since each of these policies uses a randomly sampled outcome ω̃ of its own.
Remark 3 and Proposition 1 state that IRS.FH and IRS.V-ZERO behave like TS during the initial decision epochs, gradually shift toward the myopic scheme, and end up with the optimal decision; in contrast, TS continues to explore throughout. The transition from exploration to exploitation under these IRS policies occurs smoothly, without relying on an auxiliary control parameter. While maintaining the recursive structure of TS, IRS policies take the horizon T into account and naturally balance exploitation and exploration.
² For IRS.V-ZERO, we assume a particular selection rule such that ã* = argmax_a ñ*_a.
Theorem 2 (Monotonicity in performance bounds). IRS.FH and IRS.V-ZERO monotonically improve the performance bound: for any T and y,

    W^TS(T, y) ≥ W^IRS.FH(T, y) ≥ W^IRS.V-ZERO(T, y).    (19)

Note that W^TS(T, y) = E_{θ∼P(y)}[T × max_a µ_a(θ_a)] is the conventional benchmark.
In addition, we have W^IRS.V-EMAX ≥ W^ideal, since W^ideal is the lowest attainable upper bound (Theorem 1). Empirically, we also observe W^IRS.V-ZERO ≥ W^IRS.V-EMAX.
Theorem 3 (Suboptimality gap).
In the Beta-Bernoulli MAB, for any T and y,

W^TS(T, y) − V(π^TS, T, y) ≤ 3K + 2√(log T) × 2√(KT),   (20)
W^IRS.FH(T, y) − V(π^IRS.FH, T, y) ≤ 3K + 2√(log T) × ( 2√(KT) − (1/3)√(T/K) ),   (21)
W^IRS.V-ZERO(T, y) − V(π^IRS.V-ZERO, T, y) ≤ 2K + √(log T) × ( 2√(KT) − (1/3)√(T/K) ).   (22)

We do not have a theoretical guarantee of monotonicity in the actual performance V(π^z, T, y) among IRS policies. Instead, Theorem 3 indirectly shows the improvements in the suboptimality gap, W^z(T, y) − V(π^z, T, y): although all the bounds have the same asymptotic order of O(√(KT log T)), the IRS policies improve the leading coefficient or the additional term.
Theorems 2 and 3 highlight that a better choice of penalty function z_t leads to a tighter performance bound W^z and a better performing policy π^z. Recall that the penalties are designed to penalize the gain of having additional future information. While all IRS algorithms are optimistic, in the sense that the DM makes a decision believing that the informed outcome (ω or ω̃) will be realized, a better penalty function prevents the DM from picking an action that is overly optimized to a particular future realization.

5 Numerical Experiment

We visualize the effectiveness of IRS policies and performance bounds in the case of a Gaussian MAB with five arms (K = 5) with different noise variances. More specifically, each arm a ∈ A has an unknown mean reward θ_a ∼ N(0, 1²) and yields stochastic rewards R_{a,n} ∼ N(θ_a, σ_a²), where σ1 = 0.1, σ2 = 0.4, σ3 = 1.0, σ4 = 4.0 and σ5 = 10.0.
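To make this setup concrete, the sketch below (an illustrative reconstruction, not the authors' code; the helper name `sample_outcome` and the reduced sample size S are our own) samples outcomes for this Gaussian instance and estimates the benchmark W^TS(T, y) = E_{θ∼P(y)}[T × max_a μ_a(θ_a)] by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 500
sigma = np.array([0.1, 0.4, 1.0, 4.0, 10.0])  # per-arm noise standard deviations

def sample_outcome(rng):
    """Sample one outcome omega: the arm means theta_a ~ N(0, 1) and all T
    potential reward realizations R_{a,n} ~ N(theta_a, sigma_a^2) per arm."""
    theta = rng.normal(0.0, 1.0, size=K)
    rewards = rng.normal(theta[:, None], sigma[:, None], size=(K, T))
    return theta, rewards

# Monte Carlo estimate of the TS benchmark W^TS(T, y) = E[T * max_a theta_a]
S = 2000  # reduced from the paper's S = 20,000 for brevity
samples = [sample_outcome(rng)[0].max() for _ in range(S)]
w_ts = T * np.mean(samples)
```

The same sampled outcomes can then be fed to every policy and bound, which is what makes the sample average comparisons paired rather than independent.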
Our experiment includes state-of-the-art algorithms that are particularly suitable in a Bayesian framework: Bayesian upper confidence bound [14] (BAYES-UCB, with a quantile of 1 − 1/t), information-directed sampling [20] (IDS), and optimistic Gittins index [9] (OGI, a one-step look-ahead approximation with a discount factor γ_t = 1 − 1/t). In the simulation, we randomly generate a set of outcomes ω^(1), ..., ω^(S) and measure the performance of each policy π, V(π, T, y), and the performance bounds, W^z(T, y), via sample average approximation across these sampled outcomes (S = 20,000).
Figure 1 plots the regret of policies (solid lines, W^TS(T, y) − V(π, T, y)) and the regret bounds (dashed lines, W^TS(T, y) − W^z(T, y)), measured at the different values of T = 5, 10, ..., 500. Our regret measure W^TS − V(π) = E[ Σ_{t=1}^T max_a μ_a(θ_a) − μ_{a_t^π}(θ_{a_t^π}) ] is equivalent to the conventional Bayesian regret [19], and the measure W^TS − W^z provides a lower bound on the achievable regret, since W^TS − V(π) ≥ W^TS − W^z for any policy π ∈ Π_F due to weak duality. Despite the fact that we cannot compute the Bayesian optimal policy directly, we can infer that its regret curve is located in the shaded region of the plot.
Note that lower regret curves are better, and higher bound curves are better. As we incorporate more complicated IRS algorithms, from TS to IRS.V-ZERO, we observe a clear improvement in both performances and bounds, as predicted by Theorems 2 and 3. In this particular example, it is crucial to incorporate how much we can learn about each of the arms during the remaining time periods, which heavily depends on the noise level σ_a and the time horizon T.
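The sample average approximation of the regret can be sketched as follows for plain TS on this Gaussian instance; this is a simplified reconstruction (smaller T and S than in the experiment, and the function name is ours), not the authors' implementation:

```python
import numpy as np

def thompson_sampling_run(theta, rewards, sigma, T, rng):
    """Simulate one path of Gaussian Thompson sampling under a sampled
    outcome (theta, rewards) and return the pathwise Bayesian regret."""
    K = len(theta)
    post_mean = np.zeros(K)   # N(0, 1) prior on each arm mean
    post_var = np.ones(K)
    pulls = np.zeros(K, dtype=int)
    total = 0.0
    for _ in range(T):
        draw = rng.normal(post_mean, np.sqrt(post_var))  # posterior sample
        a = int(np.argmax(draw))
        r = rewards[a, pulls[a]]          # consume the next realization of arm a
        pulls[a] += 1
        total += theta[a]                 # expected reward of the pulled arm
        # conjugate normal update for arm a (known noise variance sigma[a]^2)
        prec = 1.0 / post_var[a] + 1.0 / sigma[a] ** 2
        post_mean[a] = (post_mean[a] / post_var[a] + r / sigma[a] ** 2) / prec
        post_var[a] = 1.0 / prec
    return T * theta.max() - total

rng = np.random.default_rng(1)
sigma = np.array([0.1, 0.4, 1.0, 4.0, 10.0])
T, S = 100, 200
regrets = []
for _ in range(S):
    theta = rng.normal(0.0, 1.0, size=5)
    rewards = rng.normal(theta[:, None], sigma[:, None], size=(5, T))
    regrets.append(thompson_sampling_run(theta, rewards, sigma, T, rng))
regret_est = np.mean(regrets)  # sample average approximation of W^TS - V(pi^TS)
```

Running the same sampled outcomes through each competing policy, and through the inner problems defining W^z, yields the paired regret and bound curves of Figure 1.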
Accordingly, IRS policies outperform the others, since they more explicitly incorporate the exploitation-exploration trade-off.

Figure 1: Regret plot for Gaussian MAB with different noise variances. The solid lines represent the (Bayesian) regret of policies, W^TS(T, y) − V(π, T, y), and the dashed lines represent the regret bounds that IRS algorithms produce, W^TS(T, y) − W^z(T, y). The lowest achievable regret (W^TS(T, y) − V*(T, y)) should be within the shaded area. The times in the legend represent the average length of time required to simulate each policy for a single problem instance with T = 500.

6 Discussion

We have developed a unified framework providing a principled method of improving TS that does not require any tuning or additional parameters. Despite the fact that this paper focuses on a finite-horizon MAB with independent arms, the general idea of information relaxation sampling is not restricted to this setting: we briefly illustrate how to extend the framework to a broader class of problems.
MAB with unknown time horizon. The framework (penalties, policies, and upper bounds) can naturally incorporate an unknown T within the Bayesian setting: i.e., the horizon T is also a random variable whose prior distribution is known. As a simple case, if T is independent of the DM's actions, we can reformulate the objective function of the inner problem as Σ_{t=1}^∞ γ_t (r_t(a_{1:t}, ω) − z_t(a_{1:t}, ω)), where the discount factor γ_t ≜ P[T ≥ t] is the survival probability, and r_t(·) and z_t(·) are the reward and penalty terms used in the paper.
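Under the independence assumption above, the survival-discounted inner objective can be evaluated as in the sketch below; the function name and the toy horizon distribution are illustrative, not from the paper:

```python
import numpy as np

def survival_discounted_objective(rewards_path, penalties_path, horizon_pmf):
    """Evaluate sum_t gamma_t * (r_t - z_t), where gamma_t = P[T >= t] is the
    survival probability of the random horizon T. `horizon_pmf[k]` is
    P[T = k + 1]; paths are truncated to the support of the horizon."""
    horizon_pmf = np.asarray(horizon_pmf, dtype=float)
    # gamma_t = P[T >= t] = 1 - P[T <= t - 1] for t = 1, 2, ...
    gamma = 1.0 - np.concatenate(([0.0], np.cumsum(horizon_pmf)[:-1]))
    n = min(len(rewards_path), len(gamma))
    r = np.asarray(rewards_path[:n], dtype=float)
    z = np.asarray(penalties_path[:n], dtype=float)
    return float(np.sum(gamma[:n] * (r - z)))

# Toy check: T uniform on {1, 2, 3} gives gamma = (1, 2/3, 1/3), so a
# unit-reward, zero-penalty path has value 1 + 2/3 + 1/3 ~= 2.0
val = survival_discounted_objective([1.0, 1.0, 1.0],
                                    [0.0, 0.0, 0.0],
                                    [1/3, 1/3, 1/3])
```

The inner problem then maximizes this discounted sum over action sequences, exactly as the undiscounted sum is maximized in the fixed-horizon case.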
Alternatively, we can treat the random variable T like the random reward realizations: sample T from its prior distribution while a penalty function (additionally) penalizes the gain from knowing T (one can imagine that the outcome ω now includes the realization of T). Structural results such as weak duality and strong duality will continue to hold.
MAB in more complicated settings. Consider the following examples: (i) a finite-horizon MAB with correlated arms (e.g., R_{a,n} ∼ N(x_a^⊤θ, σ_a²), where θ ∈ R^d is shared across the arms, and x_a ∈ R^d is an arm's feature vector): IRS.V-ZERO can be immediately implemented by adopting the DP algorithm discussed in §B.2. (ii) MAB with delayed reward realization: IRS.FH can be immediately implemented by simulating the DM's learning process in the presence of delay. (iii) MAB with a budget constraint (in which each arm consumes a certain amount of budget and the DM wants to maximize the total reward within a limited budget; see [8]): all IRS algorithms can be implemented by solving a budget-constrained optimization problem instead of a horizon-constrained optimization problem.
In these extensions, we can obtain not only the online decision-making policies but also their performance bounds, as in this paper. Generally speaking, our framework provides a systematic way of improving TS by taking into account the exploitation-exploration trade-off more carefully, particularly in the presence of some constraint that incurs incomplete learning.
The main challenge would be to design a suitable penalty function that is tractable yet captures the problem-specific exploration-exploitation trade-off precisely.

[Figure 1 legend timings: TS (34 ms), Bayes-UCB (88 ms), IRS.FH (128 ms), OGI (829 ms), IDS (2.9 sec), IRS.V-ZERO (7.4 sec).]

References
[1] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, 1985.
[2] Russell N. Bradt, S. M. Johnson, and Samuel Karlin. On sequential designs for maximizing the sum of n observations. Annals of Mathematical Statistics, 27(4):1060–1074, 1956.
[3] David B. Brown and Martin B. Haugh. Information relaxation bounds for infinite horizon Markov decision processes. Operations Research, 65(5):1355–1379, 2017.
[4] David B. Brown, James E. Smith, and Peng Sun. Information relaxations and duality in stochastic dynamic programs. Operations Research, 58(4):785–801, 2010.
[5] Sebastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 638–646, 2013.
[6] M. H. A. Davis and I. Karatzas. A Deterministic Approach to Optimal Stopping. Wiley, 1994.
[7] Vijay V. Desai, Vivek F. Farias, and Ciamac C. Moallemi. Pathwise optimization for optimal stopping problems. Management Science, 58(12):2292–2308, 2012.
[8] Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. Proceedings of the 27th AAAI Conference on Artificial Intelligence, 2013.
[9] Vivek F. Farias and Eli Gutin. Optimistic Gittins indices.
Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3161–3169, 2016.
[10] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41(2):148–177, 1979.
[11] Martin B. Haugh and Leonid Kogan. Pricing American options: A duality approach. Operations Research, 52(2):258–270, 2004.
[12] Martin B. Haugh and Andrew E. B. Lim. Linear-quadratic control and information relaxations. Operations Research Letters, 40:521–528, 2012.
[13] Martin B. Haugh and Chun Wang. Dynamic portfolio execution and information relaxations. SIAM Journal on Financial Mathematics, 5:316–359, 2014.
[14] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 22:592–600, 2012.
[15] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[16] Olivier Marchal and Julyan Arbel. On the sub-Gaussianity of the Beta and Dirichlet distributions. 2017.
[17] R. T. Rockafellar and Roger J.-B. Wets. Scenarios and policy aggregation in optimization under uncertainty. Mathematics of Operations Research, 16(1):119–147, 1991.
[18] L. C. G. Rogers. Monte Carlo valuation of American options. Mathematical Finance, 12(3):271–286, 2002.
[19] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[20] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. Operations Research, 66(1):230–252, 2017.
[21] W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.
Biometrika, 25(3/4):285–294, 1933.