{"title": "Factored Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 2835, "page_last": 2844, "abstract": "We introduce the factored bandits model, which is a framework for learning with limited (bandit) feedback, where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits and, up to constants, matching upper and lower regret bounds for the problem. Furthermore, we show that with a slight modification the proposed algorithm can be applied to utility-based dueling bandits. We obtain an improvement in the additive terms of the regret bound compared to state-of-the-art algorithms (the additive terms are dominating up to time horizons which are exponential in the number of arms).", "full_text": "Factored Bandits

Julian Zimmert
University of Copenhagen
zimmert@di.ku.dk

Yevgeny Seldin
University of Copenhagen
seldin@di.ku.dk

Abstract

We introduce the factored bandits model, which is a framework for learning with limited (bandit) feedback, where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits and, up to constants, matching upper and lower regret bounds for the problem. Furthermore, we show how a slight modification enables the proposed algorithm to be applied to utility-based dueling bandits.
We obtain an improvement in the additive terms of the regret bound compared to state-of-the-art algorithms (the additive terms are dominating up to time horizons that are exponential in the number of arms).

1 Introduction

We introduce factored bandits, which is a bandit learning model where actions can be decomposed into a Cartesian product of atomic actions. As an example, consider an advertising task, where the actions can be decomposed into (1) selection of an advertisement from a pool of advertisements and (2) selection of a location on a web page out of a set of locations where it can be presented. The probability of a click is then a function of the quality of the two actions: the attractiveness of the advertisement and the visibility of the location it was placed at. In order to maximize the reward, the learner has to maximize the quality of actions along each dimension of the problem. Factored bandits generalize the above example to an arbitrary number of atomic actions and arbitrary reward functions satisfying some mild assumptions.

[Figure 1 is a diagram. It places factored bandits (covering stochastic rank-1 bandits and uniformly identifiable actions) among models with weakly constrained rewards, next to the relaxation of combinatorial bandits by Chen et al. (2016); (generalized) linear bandits and combinatorial bandits with $\mathcal{A} = \{0,1\}^d$ appear under explicit reward models; utility-based and Condorcet-winner dueling bandits are shown as related models.]

Figure 1: Relations between factored bandits and other bandit models.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In a nutshell, at every round of a factored bandit game the player selects $L$ atomic actions, $a_1, \ldots, a_L$, each from a corresponding finite set $\mathcal{A}_\ell$ of size $|\mathcal{A}_\ell|$ of possible actions. The player then observes a reward, which is an arbitrary function of $a_1, \ldots, a_L$ satisfying some mild assumptions.
For example, it can be a sum of the qualities of the atomic actions, a product of the qualities, or something else that does not necessarily have an analytical expression. The learner does not have to know the form of the reward function.

Our way of dealing with the combinatorial complexity of the problem is through the introduction of a uniform identifiability assumption, by which the best action along each dimension is uniquely identifiable. A bit more precisely, when looking at a given dimension we call the collection of actions along all other dimensions a reference set. The uniform identifiability assumption states that, in expectation, the best action along a dimension outperforms any other action along the same dimension by a certain margin when both are played with the same reference set, irrespective of the composition of the reference set. This assumption is satisfied, for example, by the reward structure in linear and generalized linear bandits, but it is much weaker than the linearity assumption.

In Figure 1 we sketch the relations between factored bandits and other bandit models. We distinguish between bandits with explicit reward models, such as linear and generalized linear bandits, and bandits with weakly constrained reward models, including factored bandits and some relaxations of combinatorial bandits. A special case of factored bandits are rank-1 bandits [7]. In rank-1 bandits the player selects two actions and the reward is the product of their qualities. Factored bandits generalize this to an arbitrary number of actions and significantly relax the assumption on the form of the reward function.

The relation to other bandit models is a bit more involved. There is an overlap between factored bandits and (generalized) linear bandits [1; 6], but neither is a special case of the other.
When actions are represented by unit vectors, then for (generalized) linear reward functions the models coincide. However, (generalized) linear bandits allow a continuum of actions, whereas factored bandits relax the (generalized) linearity assumption on the reward structure to uniform identifiability.

There is a partial overlap between factored bandits and combinatorial bandits [3]. The action set in combinatorial bandits is a subset of $\{0,1\}^d$. If the action set is unrestricted, i.e. $\mathcal{A} = \{0,1\}^d$, then combinatorial bandits can be seen as factored bandits with just two actions along each of the $d$ dimensions. However, typically in combinatorial bandits the action set is a strict subset of $\{0,1\}^d$ and one of the parameters of interest is the permitted number of non-zero elements. This setting is not covered by factored bandits. While in the classical combinatorial bandits setting the reward structure is linear, there exist relaxations of the model, e.g. Chen et al. [4].

Dueling bandits are not directly related to factored bandits and, therefore, we depict them with faded dashed blocks in Figure 1. While the action set in dueling bandits can be decomposed into a product of the basic action set with itself (one for the first and one for the second action in the duel), the observations in dueling bandits are the identities of the winners rather than rewards. Nevertheless, we show that the proposed algorithm for factored bandits can be applied to utility-based dueling bandits.

The main contributions of the paper can be summarized as follows:

1. We introduce factored bandits and the uniform identifiability assumption.
2. Factored bandits with uniformly identifiable actions are a generalization of rank-1 bandits.
3. We provide an anytime algorithm for playing factored bandits under the uniform identifiability assumption in stochastic environments and analyze its regret.
We also provide a lower bound matching up to constants.

4. Unlike the majority of bandit models, our approach does not require explicit specification or knowledge of the form of the reward function (as long as the uniform identifiability assumption is satisfied). For example, it can be a weighted sum of the qualities of atomic actions (as in linear bandits), a product thereof, or any other function not necessarily known to the algorithm.

5. We show that the algorithm can also be applied to utility-based dueling bandits, where the additive factor in the regret bound is reduced by a multiplicative factor of $K$ compared to state of the art (where $K$ is the number of actions). It should be emphasized that in state-of-the-art regret bounds for utility-based dueling bandits the additive factor is dominating for time horizons below $\Omega(\exp(K))$, whereas in the new result it is only dominant for time horizons up to $O(K)$.

6. Our work provides a unified treatment of two distinct bandit models: rank-1 bandits and utility-based dueling bandits.

The paper is organized in the following way. In Section 2 we introduce the factored bandit model and the uniform identifiability assumption. In Section 3 we provide algorithms for factored bandits and dueling bandits. In Section 4 we analyze the regret of our algorithm and provide matching upper and lower regret bounds. In Section 5 we compare our work empirically and theoretically with prior work. We finish with a discussion in Section 6.

2 Problem Setting

2.1 Factored bandits

We define the game in the following way. We assume that the set of actions $\mathcal{A}$ can be represented as a Cartesian product of atomic actions, $\mathcal{A} = \bigotimes_{\ell=1}^{L} \mathcal{A}_\ell$. We call the elements of $\mathcal{A}_\ell$ atomic arms. For rounds $t = 1, 2, \ldots$
the player chooses an action $A_t \in \mathcal{A}$ and observes a reward $r_t$ drawn according to an unknown probability distribution $p_{A_t}$ (i.e., the game is "stochastic"). We assume that the mean rewards $\mu(a) = \mathbb{E}[r_t \mid A_t = a]$ are bounded in $[-1, 1]$ and that the noise $\eta_t = r_t - \mu(A_t)$ is conditionally 1-sub-Gaussian. Formally, this means that
$$\forall \lambda \in \mathbb{R}: \quad \mathbb{E}\left[e^{\lambda \eta_t} \mid \mathcal{F}_{t-1}\right] \leq \exp\left(\frac{\lambda^2}{2}\right),$$
where $\mathcal{F}_t := \{A_1, r_1, A_2, r_2, \ldots, A_t, r_t\}$ is the filtration defined by the history of the game up to and including round $t$. We denote $a^* = (a_1^*, a_2^*, \ldots, a_L^*) = \operatorname{argmax}_{a \in \mathcal{A}} \mu(a)$.

Definition 1 (uniform identifiability). An atomic set $\mathcal{A}_k$ has a uniformly identifiable best arm $a_k^*$ if and only if
$$\forall a \in \mathcal{A}_k \setminus \{a_k^*\}: \quad \Delta_k(a) := \min_{b \in \bigotimes_{\ell \neq k} \mathcal{A}_\ell} \mu(a_k^*, b) - \mu(a, b) > 0. \quad (1)$$

We assume that all atomic sets have uniformly identifiable best arms. The goal is to minimize the pseudo-regret, which is defined as
$$\mathrm{Reg}_T = \mathbb{E}\left[\sum_{t=1}^{T} \mu(a^*) - \mu(A_t)\right].$$

Due to the generality of the uniform identifiability assumption, we cannot upper bound the instantaneous regret $\mu(a^*) - \mu(A_t)$ in terms of the gaps $\Delta_\ell(a_\ell)$. However, a sequential application of (1) provides a lower bound:
$$\mu(a^*) - \mu(a) = \mu(a^*) - \mu(a_1, a_2^*, \ldots, a_L^*) + \mu(a_1, a_2^*, \ldots, a_L^*) - \mu(a) \geq \Delta_1(a_1) + \mu(a_1, a_2^*, \ldots, a_L^*) - \mu(a) \geq \ldots \geq \sum_{\ell=1}^{L} \Delta_\ell(a_\ell). \quad (2)$$

For the upper bound, let $\kappa$ be a problem-dependent constant such that $\mu(a^*) - \mu(a) \leq \kappa \sum_{\ell=1}^{L} \Delta_\ell(a_\ell)$ holds for all $a$.
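As a sanity check, the gaps of Definition 1 and the constant $\kappa$ can be computed exhaustively on a small instance. The following sketch uses a hypothetical rank-1-style reward (the arm qualities q1, q2 are illustrative, not from the paper):

```python
import itertools
import numpy as np

def atomic_gaps(mu, k):
    """Gaps Delta_k(a) from Definition 1: the minimum advantage of the best
    atomic arm a*_k over arm a, taken over all reference sets b (the
    Cartesian product of the other factors)."""
    best = np.unravel_index(np.argmax(mu), mu.shape)[k]
    m = np.moveaxis(mu, k, 0)  # factor k first; remaining axes enumerate b
    return {a: float(np.min(m[best] - m[a])) for a in range(mu.shape[k]) if a != best}

# Hypothetical L = 2 instance with a rank-1 reward: mu(a1, a2) = q1[a1] * q2[a2].
q1, q2 = np.array([0.9, 0.5, 0.3]), np.array([0.8, 0.6])
mu = np.outer(q1, q2)

gaps = [atomic_gaps(mu, k) for k in range(2)]
assert all(d > 0 for g in gaps for d in g.values())  # uniform identifiability holds

# Smallest kappa with mu(a*) - mu(a) <= kappa * (Delta_1(a_1) + Delta_2(a_2)).
best_action = np.unravel_index(np.argmax(mu), mu.shape)
kappa = max(
    (mu.max() - mu[a]) / sum(gaps[k].get(a[k], 0.0) for k in range(2))
    for a in itertools.product(range(3), range(2)) if a != best_action
)
```

Note that even on this benign rank-1 instance the smallest valid $\kappa$ exceeds 1, because an arm that is optimal in one factor contributes a zero gap to the sum.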
Since the mean rewards are in $[-1, 1]$, the condition is always satisfied by $\kappa = 2(\min_{a,\ell} \Delta_\ell(a_\ell))^{-1}$, and by equation (2), $\kappa$ is always at least 1. The constant $\kappa$ appears in the regret bounds. In the extreme case when $\kappa = 2(\min_{a,\ell} \Delta_\ell(a_\ell))^{-1}$ the regret guarantees are fairly weak. However, in many specific cases mentioned in the previous section, $\kappa$ is typically small or even 1. We emphasize that the algorithms proposed in the paper do not require knowledge of $\kappa$. Thus, the dependence of the regret bounds on $\kappa$ is not a limitation and the algorithms automatically adapt to more favorable environments.

2.2 Dueling bandits

The set of actions in dueling bandits is factored into $\mathcal{A} \times \mathcal{A}$. However, strictly speaking the problem is not a factored bandit problem, because the observations in dueling bandits are not the rewards.¹ When playing two arms, $a$ and $b$, we observe the identity of the winning arm, but the regret is typically defined via the average relative quality of $a$ and $b$ with respect to a "best" arm in $\mathcal{A}$.

The literature distinguishes between different dueling bandit settings. We focus on utility-based dueling bandits [14] and show that they satisfy the uniform identifiability assumption.

In utility-based dueling bandits, it is assumed that each arm has a utility $u(a)$ and that the winning probabilities are defined by $\Pr[a \text{ wins against } b] = \nu(u(a) - u(b))$ for a monotonously increasing link function $\nu$. Let $w(a, b)$ be 1 if $a$ wins against $b$ and 0 if $b$ wins against $a$. Let $a^* := \operatorname{argmax}_{a \in \mathcal{A}} u(a)$ denote the best arm. Then for any arm $b \in \mathcal{A}$ and any $a \in \mathcal{A} \setminus \{a^*\}$, it holds that $\mathbb{E}[w(a^*, b)] - \mathbb{E}[w(a, b)] = \nu(u(a^*) - u(b)) - \nu(u(a) - u(b)) > 0$, which satisfies the uniform identifiability assumption.
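This argument is easy to verify numerically. The sketch below builds the winning-probability matrix for hypothetical utilities and an (assumed, for illustration) sigmoid link, and checks the uniform identifiability condition against every reference arm:

```python
import numpy as np

def win_prob(u, link):
    """P[i, j] = Pr[arm i wins a duel against arm j] in the utility-based
    model: P[i, j] = link(u[i] - u[j]) for a monotonously increasing link."""
    return link(u[:, None] - u[None, :])

u = np.array([0.7, 0.4, 0.1])                 # hypothetical utilities; arm 0 is best
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))  # one possible monotone link
P = win_prob(u, sigmoid)

best = int(np.argmax(u))
# Uniform identifiability: against every reference arm b, the best arm wins
# strictly more often than any other arm a.
for a in range(len(u)):
    if a != best:
        assert np.all(P[best] > P[a])
```

Monotonicity of the link is the only property used: for every reference $b$, $\nu(u(a^*) - u(b)) > \nu(u(a) - u(b))$ whenever $u(a^*) > u(a)$.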
For the rest of the paper we consider the linear link function $\nu(x) = \frac{1+x}{2}$. The regret is then defined by
$$\mathrm{Reg}_T = \mathbb{E}\left[\sum_{t=1}^{T} \frac{u(a^*) - u(A_t)}{2} + \frac{u(a^*) - u(B_t)}{2}\right]. \quad (3)$$

3 Algorithms

Although in theory an asymptotically optimal algorithm for any structured bandit problem was presented in [5], for factored bandits this algorithm not only requires solving an intractable semi-infinite linear program at every round, but it also suffers from additive constants which are exponential in the number of atomic actions $L$. An alternative naive approach could be an adaptation of sparring [16], where each factor runs an independent $K$-armed bandit algorithm and does not observe the atomic arm choices of other factors. The downside of sparring algorithms, both theoretically and practically, is that each algorithm operates under limited information and the rewards become non-i.i.d. from the perspective of each individual factor.

Our Temporary Elimination Algorithm (TEA, Algorithm 1) avoids these downsides. It runs independent instances of the Temporary Elimination Module (TEM, Algorithm 3) in parallel, one per factor of the problem. Each TEM operates on a single atomic set. The TEA is responsible for the synchronization of the TEM instances. Two main ingredients ensure information efficiency. First, we use relative comparisons between arms instead of comparing absolute mean rewards. This cancels out the effect of non-stationary means. The second idea is to use local randomization in order to obtain unbiased estimates of the relative performance without having to actually play each atomic arm with the same reference, which would have led to prohibitive time complexity.

Algorithm 1: Factored Bandit TEA
1: for all ℓ: TEM_ℓ ← new TEM($\mathcal{A}_\ell$)
2: t ← 1
3: for s = 1, 2, ... do
4:   M_s ← max_ℓ |TEM_ℓ.getActiveSet(f(t)⁻¹)|
5:   T_s ← (t, t+1, ..., t+M_s−1)
6:   for ℓ ∈ {1, ..., L} in parallel do
7:     TEM_ℓ.scheduleNext(T_s)
8:   for t ∈ T_s do
9:     r_t ← play((TEM_ℓ.A_t)_{ℓ=1,...,L})
10:  for ℓ ∈ {1, ..., L} in parallel do
11:    TEM_ℓ.feedback((r_{t'})_{t'∈T_s})
12:  t ← t + |T_s|

Algorithm 2: Dueling Bandit TEA
1: TEM ← new TEM($\mathcal{A}$)
2: t ← 1
3: for s = 1, 2, ... do
4:   A_s ← TEM.getActiveSet(f(t)⁻¹)
5:   T_s ← (t, t+1, ..., t+|A_s|−1)
6:   TEM.scheduleNext(T_s)
7:   for b ∈ A_s do
8:     r_t ← play(TEM.A_t, b)
9:     t ← t + 1
10:  TEM.feedback((r_{t'})_{t'∈T_s})

¹ In principle, it is possible to formulate a more general problem that would incorporate both factored bandits and dueling bandits. But such a definition becomes too general and hard to work with. For the sake of clarity we have avoided this path.

The TEM instances run in parallel in externally synchronized phases. Each module selects active arms in getActiveSet(δ), such that the optimal arm is included with high probability. The length of a phase is chosen such that each module can play each potentially optimal arm at least once in every phase. All modules schedule all arms for the phase in scheduleNext. This is done by choosing arms in a round-robin fashion (random choices if not all arms can be played equally often) and ordering them randomly. All scheduled plays are executed and the modules update their statistics through the call of the feedback routine. The modules use slowly increasing lower confidence bounds for the gaps in order to temporarily eliminate arms that are with high probability suboptimal. In all algorithms, we use $f(t) := (t+1)\log^2(t+1)$.

Dueling bandits. For dueling bandits we only use a single instance of TEM.
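The synchronization structure of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the paper's implementation: the TEM is replaced by a stub that never eliminates arms, whereas the real module (Algorithm 3) additionally maintains pairwise statistics and LCB-based temporary elimination; the reward function and arm names are hypothetical.

```python
import random
from math import log

class TEMStub:
    """Stand-in for the Temporary Elimination Module: keeps all arms active
    and schedules them round-robin in a random order over the phase."""
    def __init__(self, arms):
        self.arms = list(arms)
        self.schedule = []

    def get_active_set(self, delta):
        return self.arms  # stub: no elimination

    def schedule_next(self, horizon):
        # Round-robin over active arms, padded with random extra picks,
        # then randomly ordered (the local randomization used by TEA).
        order = self.arms * (horizon // len(self.arms)) \
            + random.sample(self.arms, horizon % len(self.arms))
        random.shuffle(order)
        self.schedule = order

    def next_arm(self):
        return self.schedule.pop()

def f(t):
    return (t + 1) * log(t + 1) ** 2  # confidence schedule f(t) = (t+1) log^2(t+1)

def tea(atomic_sets, play, n_phases=3):
    """Synchronization loop of Factored Bandit TEA, simplified."""
    tems = [TEMStub(a) for a in atomic_sets]
    t, history = 1, []
    for s in range(n_phases):
        # Phase length: largest active set among the modules.
        m = max(len(tem.get_active_set(1 / f(t))) for tem in tems)
        for tem in tems:
            tem.schedule_next(m)
        for _ in range(m):
            action = tuple(tem.next_arm() for tem in tems)
            history.append((action, play(action)))
            t += 1
    return history

# Hypothetical rank-1-style reward: product of atomic qualities plus noise.
q = {"ad1": 0.9, "ad2": 0.4, "top": 0.8, "side": 0.3}
hist = tea([["ad1", "ad2"], ["top", "side"]],
           play=lambda a: q[a[0]] * q[a[1]] + random.gauss(0, 0.1))
```

Each phase plays one joint action per round, so each module observes rewards for its own schedule while the other factors vary; the real TEM turns these into unbiased relative comparisons.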
In each phase the algorithm generates two random permutations of the active set and plays the corresponding actions from the two lists against each other. (The first permutation is generated in Line 6 and the second in Line 7 of Algorithm 2.)

3.1 TEM

The TEM tracks the empirical differences between the rewards of all arms $a_i$ and $a_j$ in $D_{i,j}$. Based on these differences, it computes lower confidence bounds for all gaps. The set $K^*$ contains those arms where all LCB gaps are zero. Additionally, the algorithm keeps track of arms that were never removed from $B$. During a phase, each arm from $K^*$ is played at least once, but only arms in $B$ can be played more than once. This is necessary to keep the additive constants at $M\log(K)$ instead of $MK$.

Algorithm 3: Temporary Elimination Module (TEM) implementation
global: $N_{i,j}$, $D_{i,j}$, $K^*$, $B$

Function initialize(K):
  for all $a_i, a_j \in K$: $N_{i,j}, D_{i,j} \leftarrow 0, 0$
  $B \leftarrow K$

Function getActiveSet(δ):
  if $\exists N_{i,j} = 0$ then
    $K^* \leftarrow K$
  else
    for $a_i \in K$ do
      $\hat\Delta_{\mathrm{LCB}}(a_i) \leftarrow \max_{a_j \neq a_i} \frac{D_{j,i}}{N_{j,i}} - \sqrt{\frac{12\log(2K f(N_{j,i})\delta^{-1})}{N_{j,i}}}$
    $K^* \leftarrow \{a_i \in K \mid \hat\Delta_{\mathrm{LCB}}(a_i) \leq 0\}$
    if $|K^*| = 0$ then $K^* \leftarrow K$
  $B \leftarrow B \cap K^*$
  if $|B| = 0$ then $B \leftarrow K^*$
  return $K^*$

Function scheduleNext(T):
  for $a \in K^*$ do
    $\tilde t \leftarrow$ random unassigned index in $T$
    $A_{\tilde t} \leftarrow a$
  while not all $A_{t_s}, \ldots, A_{t_s+|T|-1}$ assigned do
    for $a \in B$ do
      $\tilde t \leftarrow$ random unassigned index in $T$
      $A_{\tilde t} \leftarrow a$

Function feedback($\{R_t\}_{t = t_s, \ldots, t_s+M_s-1}$):
  for all $a_i$: $N^i_s, R^i_s \leftarrow 0, 0$
  for $t = t_s, \ldots, t_s + M_s - 1$ do
    $R^{A_t}_s \leftarrow R^{A_t}_s + R_t$
    $N^{A_t}_s \leftarrow N^{A_t}_s + 1$
  for $a_i, a_j \in K^*$ do
    $D_{i,j} \leftarrow D_{i,j} + \min\{N^i_s, N^j_s\}\left(\frac{R^i_s}{N^i_s} - \frac{R^j_s}{N^j_s}\right)$
    $N_{i,j} \leftarrow N_{i,j} + \min\{N^i_s, N^j_s\}$

4 Analysis

We start this section with the main theorem, which bounds the number of times the TEM pulls sub-optimal arms. Then we prove upper bounds on the regret for our main algorithms. Finally, we prove a lower bound for factored bandits that shows that our regret bound is tight up to constants.

4.1 Upper bound for the number of sub-optimal pulls by TEM

Theorem 1. For any TEM submodule TEM_ℓ with an arm set of size $K = |\mathcal{A}_\ell|$, running in the TEA algorithm with $M := \max_\ell |\mathcal{A}_\ell|$, and any suboptimal atomic arm $a \neq a^*$, let $N_t(a)$ denote the number of times TEM has played the arm $a$ up to time $t$. Then there exist constants $C(a) \leq M$ for $a \neq a^*$, such that
$$\mathbb{E}[N_t(a)] \leq \frac{120}{\Delta(a)^2}\left(\log(2Kt\log^2(t)) + 4\log\left(\frac{48\log(2Kt\log^2(t))}{\Delta(a)^2}\right)\right) + C(a),$$
where $\sum_{a \neq a^*} C(a) \leq M\log(K) + \frac{5}{2}K$ in the case of factored bandits, and $C(a) \leq \frac{5}{2}$ for dueling bandits.

Proof sketch.
[The complete proof is provided in the Appendix.]

Step 1. We show that the confidence intervals are constructed in such a way that the probability of all confidence intervals holding at all epochs from $s'$ onward is at least $1 - \max_{s \geq s'} f(t_s)^{-1}$. This requires a novel concentration inequality (Lemma 3) for a sum of conditionally $\sigma_s$-sub-Gaussian random variables, where $\sigma_s$ can be dependent on the history. This technique might be useful for other problems as well.

Step 2. We split the number of pulls into pulls that happen in rounds where the confidence intervals hold and those where they fail: $N_t(a) = N^{\mathrm{conf}}_t(a) + N^{\mathrm{fail}}_t(a)$. We can bound the expectation of $N^{\mathrm{fail}}_t(a)$ based on the failure probabilities given by $\Pr[\text{conf. failure at round } s] \leq \frac{1}{f(t_s)}$.

Step 3. We define $s'$ as the last round in which the confidence intervals held and $a$ was not eliminated. We can split $N^{\mathrm{conf}}_t(a) = N^{\mathrm{conf}}_{t_{s'}}(a) + C(a)$ and use the confidence intervals to upper bound $N^{\mathrm{conf}}_{t_{s'}}(a)$. The upper bound on $\sum_a C(a)$ requires special handling of arms that were eliminated once, and carefully separating the cases where confidence intervals never fail and those where they might fail.

4.2 Regret upper bound for Factored Bandit TEA

A regret bound for the Factored Bandit TEA algorithm, Algorithm 1, is provided in the following theorem.

Theorem 2. The pseudo-regret of Algorithm 1 at any time $T$ is bounded by
$$\mathrm{Reg}_T \leq \kappa \sum_{\ell=1}^{L} \sum_{a_\ell \neq a_\ell^*} \frac{120}{\Delta_\ell(a_\ell)}\left(\log(2|\mathcal{A}_\ell| T \log^2(T)) + 4\log\left(\frac{48\log(2|\mathcal{A}_\ell| T \log^2(T))}{\Delta_\ell(a_\ell)^2}\right)\right) + \max_\ell |\mathcal{A}_\ell| \sum_{\ell=1}^{L} \log(|\mathcal{A}_\ell|) + \sum_{\ell=1}^{L} \frac{5}{2}|\mathcal{A}_\ell|.$$

Proof. The design of TEA allows application of Theorem 1 to each instance of TEM.
Using $\mu(a^*) - \mu(a) \leq \kappa \sum_{\ell=1}^{L} \Delta_\ell(a_\ell)$, we have that
$$\mathrm{Reg}_T = \mathbb{E}\left[\sum_{t=1}^{T} \mu(a^*) - \mu(A_t)\right] \leq \kappa \sum_{\ell=1}^{L} \sum_{a_\ell \neq a_\ell^*} \mathbb{E}[N_T(a_\ell)]\Delta_\ell(a_\ell).$$
Applying Theorem 1 to the expected number of pulls and bounding the sums $\sum_a C(a)\Delta(a) \leq \sum_a C(a)$ completes the proof.

4.3 Dueling bandits

A regret bound for the Dueling Bandit TEA algorithm (DBTEA), Algorithm 2, is provided in the following theorem.

Theorem 3. The pseudo-regret of Algorithm 2 for any utility-based dueling bandit problem at any time $T$ (defined in equation (3)) satisfies $\mathrm{Reg}_T \leq O\left(\sum_{a \neq a^*} \frac{\log(T)}{\Delta(a)}\right) + O(K)$.

Proof. At every round, each arm in the active set is played once in position $A$ and once in position $B$ in play($A$, $B$). Denote by $N^A_T(a)$ the number of plays of an arm $a$ in the first position, by $N^B_T(a)$ the number of plays in the second position, and by $N_T(a)$ the total number of plays of the arm. We have
$$\mathrm{Reg}_T = \sum_{a \neq a^*} \mathbb{E}[N_T(a)]\Delta(a) = \sum_{a \neq a^*} \mathbb{E}[N^A_T(a) + N^B_T(a)]\Delta(a) = \sum_{a \neq a^*} 2\,\mathbb{E}[N^A_T(a)]\Delta(a).$$
The proof is completed by applying Theorem 1 to bound $\mathbb{E}[N^A_T(a)]$.

4.4 Lower bound

We show that without additional assumptions the regret bound cannot be improved. The lower bound is based on the following construction. The mean reward of every arm is given by $\mu(a) = \mu(a^*) - \sum_\ell \Delta_\ell(a_\ell)$. The noise is Gaussian with variance 1. In this problem, the regret can be decomposed into a sum over atomic arms of the regret induced by pulling these arms: $\mathrm{Reg}_T = \sum_\ell \sum_{a_\ell \in \mathcal{A}_\ell} \mathbb{E}[N_T(a_\ell)]\Delta_\ell(a_\ell)$. Assume that we only want to minimize the regret induced by a single atomic set $\mathcal{A}_\ell$. Further, assume that $\Delta_k(a)$ for all $k \neq \ell$ are given.
Then the problem is reduced to a regular $K$-armed bandit problem. The asymptotic lower bound for $K$-armed bandits under 1-Gaussian noise goes back to [10]: for any consistent strategy $\theta$, the asymptotic regret is lower bounded by
$$\liminf_{T \to \infty} \frac{\mathrm{Reg}^\theta_T}{\log(T)} \geq \sum_{a \neq a^*} \frac{2}{\Delta(a)}.$$
Due to the regret decomposition, we can apply this bound to every atomic set separately. Therefore, the asymptotic regret in the factored bandit problem is
$$\liminf_{T \to \infty} \frac{\mathrm{Reg}^\theta_T}{\log(T)} \geq \sum_{\ell=1}^{L} \sum_{a_\ell \neq a_\ell^*} \frac{2}{\Delta_\ell(a_\ell)}.$$
This shows that our general upper bound is asymptotically tight up to leading constants and $\kappa$.

κ-gap. We note that there is a problem-dependent gap of $\kappa$ between our upper and lower bounds. Currently we believe that this gap stems from the difference between the information and computational complexity of the problem. Our algorithm operates on each factor of the problem independently of the other factors and is based on the "optimism in the face of uncertainty" principle. It is possible to construct examples in which the optimal strategy requires playing surely sub-optimal arms for the sake of information gain. For example, this kind of construction was used by Lattimore and Szepesvári [11] to show the suboptimality of optimism-based algorithms. Therefore, we believe that removing $\kappa$ from the upper bound is possible, but requires a fundamentally different algorithm design. What is not clear is whether it is possible to remove $\kappa$ without a significant sacrifice of computational complexity.

5 Comparison to Prior Work

5.1 Stochastic rank-1 bandits

Stochastic rank-1 bandits, introduced by Katariya et al. [7], are a special case of factored bandits. The authors published a refined algorithm for Bernoulli rank-1 bandits using KL confidence sets in Katariya et al. [8].
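The regret decomposition that this construction relies on can be checked numerically: with $\mu(a) = \mu(a^*) - \sum_\ell \Delta_\ell(a_\ell)$, the total regret of any play sequence equals the sum of per-factor terms exactly. The gaps below are hypothetical, for illustration only:

```python
import random

# Hypothetical lower-bound instance: gaps per factor, 0 for the best atomic arm.
gaps = [[0.0, 0.3, 0.5], [0.0, 0.2]]   # L = 2 factors
mu_star = 1.0
mu = lambda a: mu_star - sum(g[i] for g, i in zip(gaps, a))  # additive construction

# Play an arbitrary sequence of actions and count atomic pulls N_T(a).
plays = [tuple(random.randrange(len(g)) for g in gaps) for _ in range(1000)]
pulls = [[0] * len(g) for g in gaps]
for a in plays:
    for ell, i in enumerate(a):
        pulls[ell][i] += 1

total_regret = sum(mu_star - mu(a) for a in plays)
# Reg_T = sum over factors and atomic arms of N_T(a) * Delta_ell(a).
decomposed = sum(pulls[ell][i] * gaps[ell][i]
                 for ell in range(2) for i in range(len(gaps[ell])))
assert abs(total_regret - decomposed) < 1e-6
```

Because the decomposition is exact, the Lai–Robbins bound can be applied to each factor in isolation, which yields the summed lower bound above.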
We compare our theoretical results with the first paper because it matches our problem assumptions. In our experiments, we provide a comparison to both the original algorithm and the KL version.

In the stochastic rank-1 problem there are only 2 atomic sets, of sizes $K_1$ and $K_2$. The matrix of expected rewards for each pair of arms is of rank 1. This means that for each $u \in \mathcal{A}_1$ and $v \in \mathcal{A}_2$, there exist $\bar u, \bar v \in [0, 1]$ such that $\mathbb{E}[r(u, v)] = \bar u \cdot \bar v$. The Stochastic rank-1 Elimination algorithm introduced by Katariya et al. is a typical elimination-style algorithm. It requires knowledge of the time horizon and uses phases that increase exponentially in length. In each phase, all arms are played uniformly. At the end of a phase, all arms that are sub-optimal with high probability are eliminated.

Theoretical comparison. It is hard to make a fair comparison of the theoretical bounds because TEA operates under much weaker assumptions. Both algorithms have a regret bound of
$$O\left(\left(\sum_{u \in \mathcal{A}_1 \setminus \{u^*\}} \frac{1}{\Delta_1(u)} + \sum_{v \in \mathcal{A}_2 \setminus \{v^*\}} \frac{1}{\Delta_2(v)}\right)\log(t)\right).$$
The problem-independent multiplicative factors hidden under the $O$ are smaller for TEA, even without considering that rank-1 Elimination requires a doubling trick for anytime applications. However, the problem-dependent factors are in favor of rank-1 Elimination, where the gaps correspond to the mean difference under uniform sampling, $(\bar u^* - \bar u)\sum_{v \in \mathcal{A}_2} \bar v / K_2$. In factored bandits, the gaps are defined as $(\bar u^* - \bar u)\min_{v \in \mathcal{A}_2} \bar v$, which is naturally smaller. The difference stems from different problem assumptions. The stronger assumptions of rank-1 bandits make elimination easier as the number of eliminated suboptimal arms increases. The TEA analysis holds in cases where it becomes harder to identify suboptimal arms after the removal of bad arms.
This may happen when highly suboptimal atomic actions in one factor provide more discriminative information on atomic actions in other factors than close-to-optimal atomic actions in the same factor (this follows the spirit of the illustration of the suboptimality of optimistic algorithms in [11]). We leave it to future work to improve the upper bound of TEA under stronger model assumptions.

In terms of memory and computational complexity, TEA is inferior to regular elimination-style algorithms, because we need to keep track of the relative performances of the arms. That means both the computational and memory complexities are $O(\sum_\ell |\mathcal{A}_\ell|^2)$ per round in the worst case, as opposed to rank-1 Elimination, which only requires $O(|\mathcal{A}_1| + |\mathcal{A}_2|)$.

Empirical comparison. The number of arms is set to 16 in both sets. We always fix $\bar u^* - \bar u = \bar v^* - \bar v = 0.2$. We vary the absolute value of $\bar u^* \bar v^*$.

Figure 2: Comparison of Rank1Elim, Rank1ElimKL, and TEA for K = L = 16. The results are averaged over 20 repetitions of the experiment.

As expected, rank1ElimKL has an advantage when the Bernoulli random variables are strongly biased towards one side. When the bias is close to 1/2, we clearly see the better constants of TEA. In the evaluation we clearly outperform rank-1 Elimination over different parameter settings, and even beat the KL-optimized version if the means are not too close to zero or one. This supports that our algorithm does not only provide a more practical anytime version of elimination, but also improves on constant factors in the regret.
We believe that our algorithm design can be used to improve other elimination-style algorithms as well.

5.2 Dueling bandits: related work

To the best of our knowledge, the proposed Dueling Bandit TEA is the first algorithm that satisfies the following three criteria simultaneously for utility-based dueling bandits:

- It requires no prior knowledge of the time horizon (nor uses the doubling trick or restarts).
- Its pseudo-regret is bounded by $O\left(\sum_{a \neq a^*} \frac{\log(t)}{\Delta(a)}\right)$.
- There are no additive constants that dominate the regret for time horizons $T > O(K)$.

We want to stress the importance of the last point. For all state-of-the-art algorithms known to us, when the number of actions $K$ is moderately large, the additive term is dominating for any realistic time horizon $T$. In particular, Ailon et al. [2] introduce three algorithms for the utility-based dueling bandit problem. The regret of Doubler scales with $O(\log^2(t))$. The regret of MultiSBM has an additive term of order $\sum_{a \neq a^*} \frac{K}{\Delta(a)}$ that is dominating for $T < \Omega(\exp(K))$. The last algorithm, Sparring, has no theoretical analysis.

Algorithms based on the weaker Condorcet winner assumption apply to the utility-based setting, but they all suffer from equally large or even larger additive terms. The RUCB algorithm introduced by Zoghi et al. [17] has an additive term in the bound that is defined as $2D\Delta_{\max}\log(2D)$, for $\Delta_{\max} = \max_{a \neq a^*}\Delta(a)$ and $D > \frac{1}{2}\sum_{a_i \neq a^*}\sum_{a_j \neq a_i} \frac{4\alpha}{\min\{\Delta(a_i)^2, \Delta(a_j)^2\}}$. By unwrapping these definitions, we see that the RUCB regret bound has an additive term of order $2D\Delta_{\max} \geq \sum_{a \neq a^*} \frac{K}{\Delta(a)}$. This is again the dominating term for time horizons $T \leq \Omega(\exp(K))$. The same applies to the RMED algorithm introduced by Komiyama et al. [9], which has an additive term of $O(K^2)$.
(The dependencies on the gaps are hidden behind the $O$-notation.) The D-TS algorithm by Wu and Liu [13], based on Thompson sampling, shows one of the best empirical performances, but its regret bound includes an additive constant of order $O(K^3)$.

Other algorithms known to us, Interleaved Filter [16], Beat the Mean [15], and SAVAGE [12], all require knowledge of the time horizon $T$ in advance.

Empirical comparison. We have used the framework provided by Komiyama et al. [9]. We use the same utility for all sub-optimal arms. In Figure 3, the winning probability of the optimal arm over suboptimal arms is always set to 0.7; we run the experiment for different numbers of arms $K$. TEA outperforms all algorithms besides the RMED variants, as long as the number of arms is sufficiently big. To show that there also exists a regime where the improved constants gain an advantage over RMED, we conducted a second experiment in Figure 4 (in the Appendix), where we set the winning probability to 0.95² and significantly increased the number of arms. The evaluation shows that the additive terms are indeed non-negligible and that Dueling Bandit TEA outperforms all baseline algorithms when the number of arms is sufficiently large.

Figure 3: Comparison of dueling bandit algorithms with identical gaps of 0.4. The results are averaged over 20 repetitions of the experiment.

6 Discussion

We have presented the factored bandits model and the uniform identifiability assumption, which requires no knowledge of the reward model. We presented an algorithm for playing stochastic factored bandits with uniformly identifiable actions and provided matching upper and lower bounds for the problem up to constant factors. Our algorithm and proofs may serve as a template to turn other elimination-style algorithms into improved anytime algorithms.

Factored bandits with uniformly identifiable actions generalize rank-1 bandits.
We have also provided a unified framework for the analysis of factored bandits and utility-based dueling bandits. Furthermore, we improve the additive constants in the regret bound compared to state-of-the-art algorithms for utility-based dueling bandits.
There are multiple potential directions for future research. One example mentioned in the text is the possibility of improving the regret bound when additional restrictions on the form of the reward function are introduced, or of improving the lower bound when algorithms are restricted in computational or memory complexity. Another example is the adversarial version of the problem.

² Smaller gaps show the same behavior but require more arms and more timesteps.

References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] N. Ailon, Z. Karnin, and T. Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.

[3] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[4] W. Chen, Y. Wang, Y. Yuan, and Q. Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.

[5] R. Combes, S. Magureanu, and A. Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1761–1769, 2017.

[6] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.

[7] S. Katariya, B. Kveton, C. Szepesvári, C. Vernade, and Z. Wen. Stochastic rank-1 bandits (long version).
In AISTATS, volume 54 of PMLR, pages 392–401, April 2017.

[8] S. Katariya, B. Kveton, C. Szepesvári, C. Vernade, and Z. Wen. Bernoulli rank-1 bandits for click feedback. In International Joint Conference on Artificial Intelligence, 2017.

[9] J. Komiyama, J. Honda, H. Kashima, and H. Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pages 1141–1154, 2015.

[10] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[11] T. Lattimore and C. Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits (long version). In AISTATS, volume 54 of PMLR, pages 728–737, April 2017.

[12] T. Urvoy, F. Clerot, R. Féraud, and S. Naamane. Generic exploration and k-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 91–99, 2013.

[13] H. Wu and X. Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.

[14] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.

[15] Y. Yue and T. Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 241–248, 2011.

[16] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.

[17] M. Zoghi, S. Whiteson, R. Munos, and M. de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem.
In International Conference on Machine Learning, pages 10–18, 2014.