{"title": "Categorized Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 14422, "page_last": 14432, "abstract": "We introduce a new stochastic multi-armed bandit setting where arms are grouped inside ``ordered'' categories. The motivating example comes from e-commerce, where a customer typically has a greater appetence for items of a specific well-identified but unknown category than any other one. We introduce three concepts of ordering between categories, inspired by stochastic dominance between random variables, which are gradually weaker so that more and more bandit scenarios satisfy at least one of them. We first prove instance-dependent lower bounds on the cumulative regret for each of these models, indicating how the complexity of the bandit problems increases with the generality of the ordering concept  considered. We also provide algorithms  that fully leverage the structure of the model with their associated theoretical guarantees. Finally, we have conducted an  analysis on real data to highlight that those ordered categories actually exist in practice.", "full_text": "Categorized Bandits\n\nMatthieu Jedor\n\nCMLA, ENS Paris-Saclay & Cdiscount\n\njedor@cmla.ens-cachan.fr\n\nJonathan Lou\u00ebdec\n\nCdiscount\n\njonathan.louedec@cdiscount.com\n\nVianney Perchet\n\nCMLA, ENS Paris-Saclay & Criteo AI Lab\n\nperchet@cmla.ens-cachan.fr\n\nAbstract\n\nWe introduce a new stochastic multi-armed bandit setting where arms are grouped\ninside \u201cordered\u201d categories. The motivating example comes from e-commerce,\nwhere a customer typically has a greater appetence for items of a speci\ufb01c well-\nidenti\ufb01ed but unknown category than any other one. We introduce three concepts\nof ordering between categories, inspired by stochastic dominance between random\nvariables, which are gradually weaker so that more and more bandit scenarios\nsatisfy at least one of them. 
We first prove instance-dependent lower bounds on the cumulative regret for each of these models, indicating how the complexity of the bandit problems increases with the generality of the ordering concept considered. We also provide algorithms that fully leverage the structure of the model with their associated theoretical guarantees. Finally, we have conducted an analysis on real data to highlight that those ordered categories actually exist in practice.\n\n1 Introduction\n\nIn the multi-armed bandit problem, an agent has several possible decisions, usually referred to as \u201carms\u201d, and chooses or \u201cpulls\u201d sequentially one of them at each time step. This generates a sequence of rewards and the objective is to maximize their cumulative sum. The performance of a learning algorithm is then evaluated through the \u201cregret\u201d, which is the difference between the cumulative reward of an oracle (that knows the best arm in expectation) and the cumulative reward of the algorithm. There is a clear trade-off arising between gathering information on uncertain arms (by pulling them more often) and using this information (by choosing greedily the best decision so far). This trade-off is usually called \u201cexploration vs exploitation\u201d. Although originally introduced for adaptive clinical trials [37], multi-armed bandits now play an important role in recommender systems [30]. However, the traditional bandit model (see Bubeck and Cesa-Bianchi [6] for more details and variants) must be adapted to specific applications to unleash its full power.\n\nConsider for instance e-commerce. One of the core optimization problems is to decide which products to recommend, or display, to a user landing on a website, with the objective of maximizing the click-through rate or the conversion rate. Arms of recommender systems are the different products that can be displayed. 
The number of products, even if finite, is prohibitively huge as the regret, i.e. the learning cost, typically scales linearly with the number of arms. So agnostic bandit algorithms take too much time to complete their learning phase. Thankfully, there is an inherent structure behind a typical catalogue: products are gathered into well-defined categories. As customers are generally interested in only one or a few of them, it seems possible and profitable to gather information across products to speed up the learning phase and, ultimately, to make more refined recommendations.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOur results We introduce and study the idea of categorized bandits. In this framework, arms are grouped inside known categories and we assume the existence of a partial yet unknown order between categories. We aim at leveraging this additional assumption to reduce the linear dependency in the total number of arms. We present three different partial orders over categories inspired by different notions of stochastic dominance between random variables. We consider gradually weaker notions of ordering in order to cover more and more bandit scenarios. On the other hand, the stronger the assumption, the more \u201cpowerful\u201d the algorithms are, i.e. their regret is smaller. Those assumptions are motivated and justified by real data gathered on the e-commerce website Cdiscount. We first prove asymptotic instance-dependent lower bounds on the cumulative regret for each of these models, with a special emphasis on how the complexity of the bandit problems increases with the generality of the ordering concept considered. 
We then proceed to develop two generic algorithms for the categorized bandit problem that fully leverage the structure of the model; the first one is devised from the principle of optimism in the face of uncertainty [3] while the second one is derived from the Bayesian principle [37]. Finite-time instance-dependent upper bounds on the cumulative regret are provided for the former algorithm. Finally, we conduct numerical experiments on different scenarios to illustrate both finite-time and asymptotic performances of our algorithms compared to algorithms either agnostic to the structure or only taking it partly into account.\n\nRelated works The idea of clustering is not novel in the bandit literature [34, 5, 15, 24, 31], yet these works mainly focus on clustering users based on their preferences. Li et al. [32] extended this work to the clustering of items as well. Katariya et al. [19] considered a problem where the goal is to sort items according to their means into clusters. Similar in spirit are bandit algorithms for low-rank matrix completion [39, 21, 18]. Maillard and Mannor [33] studied a multi-armed bandit problem where arms are partitioned into latent groups. Valko et al. [38] and Koc\u00e1k et al. [22] proposed algorithms where the features of items are derived from a known similarity graph over the items. However, none of these works consider the known structure of categories in which the items are gathered. The model fits in the more general structured stochastic bandit framework, i.e., where the expected rewards of arms can be dependent, see e.g., [28, 13, 2, 25, 35]. More recently, Combes et al. [8] proposed an asymptotically optimal algorithm for structured bandits relying on forced exploration (similarly to [29]) and a tracking mechanism on the number of draws of sub-optimal arms. However, these approaches forcing exploration are too conservative as the linear dependency only disappears asymptotically. 
There exist two other ways to tackle the bandit problem with arms grouped inside categories. The first one could rely on tree search methods, popularized by the celebrated UCT algorithm [23]. Alternative hierarchical algorithms [9] could also be used. The second one could be linear bandits [10, 36, 1] where we introduce a \u201ccategorical\u201d feature that indicates to which category the arm belongs. However, these approaches are also not satisfactory as they do not leverage the full structure of the problem.\n\n2 Model\n\nWe now present the variant of the multi-armed bandit model we consider. As usual, a decision maker sequentially selects (or pulls) an arm at each time step $t \\in \\{1, \\dots, T\\} =: [T]$. As motivated in the introduction, the total number of possible arms can be prohibitively large, but we assume that this large number of arms is grouped in a small number $M$ of categories. For the sake of presentation, we are going to assume that each category has the same number of arms $K$, yet all of our assumptions and results immediately generalize to different numbers of arms. We emphasize again that the $M$ categories of $K$ arms each form a known partition of the set of arms (of cardinality $MK$). At time step $t \\in [T]$, the agent selects a category $C_t$ and an arm $A_t \\in C_t$ in this category. This generates a reward\n\n$$X^{C_t}_{A_t} = \\mu^{C_t}_{A_t} + \\eta_t,$$\n\nwhere $\\eta_t$ is some independent 1-sub-Gaussian white noise and $\\mu^m_k$ is the unknown expected reward of the arm $k$ of category $m$. For notational convenience, we will assume that arms are ordered inside each category, i.e. $\\mu^m_1 > \\mu^m_2 \\geq \\dots \\geq \\mu^m_{K-1} > \\mu^m_K$ for every category $m$, and that category 1 is the best category, with respect to a partial order defined below.1 We stress that, in the partial orders we consider, the maximum of $\\mu^m_k$ over $m$ and $k$ is necessarily $\\mu^1_1$. As in any multi-armed bandit problem, the overall objective of an agent is to maximize her expected cumulative reward until time horizon $T$ or identically, to minimize her expected cumulative regret\n\n$$\\mathbb{E}[R_T] = T \\mu^1_1 - \\mathbb{E}\\Big[\\sum_{t=1}^T \\mu^{C_t}_{A_t}\\Big], \\quad \\text{or equivalently,} \\quad \\mathbb{E}[R_T] = \\sum_{m,k} \\Delta_{m,k}\\, \\mathbb{E}[N^m_k(T)],$$\n\nwhere $\\Delta_{m,k} := \\mu^1_1 - \\mu^m_k$ is the difference, usually called \u201cgap\u201d, between the expected rewards of the best arm and the $k$th arm of category $m$, and $N^m_k(t) := \\sum_{s<t} \\mathbb{1}\\{C_s = m, A_s = k\\}$ denotes the number of times this arm has been pulled up to (not including) time step $t$.\n\n1 To be precise, since the order is only partial, some categories might not be pairwise comparable, but we assume that the optimal category is comparable to, and dominates, all the others.\n\nRelations of dominance The main assumption to leverage is that the set of categories is partially ordered with a unique maximal element. Those partial orders are quite similar to the standard ones induced by stochastic dominance [17, 4] over random variables. We are going to consider three notions of dominance (inducing three different partial orders) that are gradually weaker so that the bandit setting is more and more general. Consequently, the regret should be higher and higher.\n\nDefinition 1. Let $A = \\{\\mu^A_1, \\dots, \\mu^A_K\\} \\subset \\mathbb{R}$ and $B = \\{\\mu^B_1, \\dots, \\mu^B_K\\} \\subset \\mathbb{R}$ be a pair of categories.\n\nGroup-sparse dominance $A$ group-sparsely dominates $B$, denoted by $A \\succeq_s B$, if each element of $A$ is non-negative and at least one is positive, and each element of $B$ is non-positive, i.e., $\\max_{k \\in [K]} \\mu^A_k > \\min_{k \\in [K]} \\mu^A_k \\geq 0 \\geq \\max_{k \\in [K]} \\mu^B_k$.\n\nStrong dominance $A$ strongly dominates $B$, denoted by $A \\succeq_0 B$, if each element of $A$ is bigger than any element of $B$, i.e., $\\min_{k \\in [K]} \\mu^A_k \\geq \\max_{k \\in [K]} \\mu^B_k$.\n\nFirst-order dominance $A$ first-order dominates $B$, denoted by $A \\succeq_1 B$, if $\\sup_{x \\in \\mathbb{R}} F_A(x) - F_B(x) \\leq 0$, where $F_A(x) = \\frac{1}{K} \\sum_{k=1}^K \\mathbb{1}\\{\\mu^A_k \\leq x\\}$ is the cumulative distribution function of a uniform random variable over $A$ (and similarly for $B$).\n\nThe first notion of dominance is inspired by the classical (group-)sparsity concept in machine learning, that already emerged in variants of multi-armed bandits [26, 7]. It is quite a strong assumption as it implies the knowledge of a threshold2 between two categories. The second notion weakens this assumption as the threshold is unknown. The third notion is even weaker. The second and third notions of dominance are similar to the zeroth-order (also called strong) and first-order stochastic dominance between two random variables respectively uniform over $A$ and $B$. Hence, the three concepts of dominance immediately generalize to categories with different numbers of elements, with the very same definitions. Furthermore, one can weaken the dominance even more, e.g. by introducing a second-order variant, but we will not consider it in this paper.\n\nExample To illustrate the concepts of dominance, we have represented, in Figure 1, 3 categories of 3 arms each. It can be easily checked that, for the first-order dominance, CAT 1 $\\succeq_1$ CAT 2 $\\succeq_1$ CAT 3 as, if they have the same number of elements, $A$ first-order dominates $B$ if the $k$th largest element of $A$ is greater than the $k$th largest element of $B$, for any $k$. Moreover, for the strong dominance, CAT 1 $\\succeq_0$ CAT 3 since the worst mean of CAT 1 is higher than the best mean of CAT 3. 
Moreover, if this common value was known, then the dominance would even be group-sparse.\n\n[Figure 1: Illustration of dominances; three categories (CAT 1, CAT 2, CAT 3) of three arms each, with means $\\mu^m_1, \\mu^m_2, \\mu^m_3$ for $m = 1, 2, 3$]\n\nLemma 1. Let $A_1, \\dots, A_M$ be finite categories. If there is a category $A^*$ that dominates all the other ones for any of the partial orders defined above, then $A^*$ contains the maximal element of the union $A_1 \\cup A_2 \\cup \\dots \\cup A_M$. Moreover, if $A$ group-sparsely dominates $B$, then the dominance also holds in the strong sense. Similarly, if $A$ strongly dominates $B$, then the dominance also holds in the first-order sense.\n\n2.1 Empirical evidence of dominance\n\nWe illustrate these assumptions on a real dataset. We have collected the CTR of products in four different categories over one month on the e-commerce website Cdiscount, one of the leading e-commerce companies in France, gathered in Table 2a. CAT 1 to 3 are three of the largest categories3 in terms of revenue while CAT 4 is a smaller category. The following dominances can be highlighted.\n\n2 This threshold is fixed at 0 for convenience, but it could have any value.\n3 For privacy reasons, the exact content of the different categories cannot be revealed.\n\nCAT 1   CAT 2   CAT 3   CAT 4\n0.0133  0.0140  0.0089  0.0069\n0.0114  0.0088  0.0086  0.0063\n0.0108  0.0083  0.0078  0.0053\n0.0107  0.0082  0.0056  0.0051\n0.0096  0.0078  0.0052  0.0051\n0.0095  0.0078  0.0050  0.0044\n0.0088  0.0078  0.0049  0.0042\n0.0086  0.0077  0.0047  0.0041\n0.0084  0.0076  0.0042  0.0040\n0.0080  0.0074  0.0041  0.0038\n\n(a) Click-through rates\n\n(b) Cumulative distribution functions\n\nFigure 2: Illustrations of dominance on a real dataset\n\nStrong dominance CAT 1 strongly dominates CAT 4 as its minimum CTR is 0.008 compared to the maximum CTR of 0.0069 of the other. 
Similarly, CAT 2 strongly dominates CAT 4.\n\nFirst-order dominance CAT 2 first-order dominates CAT 3 as the CTRs in each row of the second column are bigger than those of the third column. This dominance is not strong as 0.0074 is smaller than 0.0089. CAT 3 first-order but not strongly dominates CAT 4.\n\nIncomparable categories CAT 1 and CAT 2 are not comparable with respect to any partial order.\n\nNotice that, had the first item of CAT 2 performed only 5% worse than observed,4 then CAT 1 would have been optimal with respect to the first-order dominance. So even if the dominance assumption is not satisfied during that specific month, assuming it would still give good empirical results. The relations of dominance can be easier to determine based on the representation of the associated cdf of Figure 2b. As the cdf of the random variable uniform on CAT 4 is, pointwise, the biggest one, this means that this category is first-order dominated by all the other ones. Moreover, it reaches 1 while the cdfs of CAT 1 and CAT 2 are still at 0. This implies that the dominance of these two categories is even strong. This analysis motivates and validates our assumption.\n\n3 Lower bounds\n\nIn this section, we provide lower bounds on the regret that any \u201creasonable\u201d algorithm (the precise definition is given below) must incur in a multi-armed bandit problem, where arms are grouped into partially ordered categories (with a dominating one). To simplify the exposition, we assume here that noises are drawn from a Gaussian distribution with unit variance. The algorithms we consider are consistent [27] with respect to a given class of possible bandit problems $\\mathcal{M} = \\{\\mu = (\\mu_1, \\dots, \\mu_{MK}) \\in \\mathbb{R}^{MK}\\}$. 
We recall that an algorithm is consistent with $\\mathcal{M}$ if, for any admissible reward vector $\\mu \\in \\mathcal{M}$ and any parameter $\\alpha \\in (0, 1]$, the regret of that algorithm is asymptotically negligible compared to $T^\\alpha$, i.e., $\\sup_{\\alpha \\in (0,1)} \\limsup_{T \\to \\infty} \\mathbb{E}_\\mu[R_T] / T^\\alpha = 0$. Graves and Lai [16] proved that any algorithm consistent with $\\mathcal{M}$ has a regret scaling at least logarithmically in $T$, with a leading constant $c_\\mu$ depending on $\\mu$ (and $\\mathcal{M}$), i.e., $\\liminf_{T \\to \\infty} \\mathbb{E}_\\mu[R_T] / \\log(T) \\geq c_\\mu$; moreover, $c_\\mu$ is the solution of some auxiliary optimization problem. In our setting, it rewrites as\n\n$$c_\\mu = \\min_{N \\geq 0} \\sum_{m,k} N^m_k \\Delta_{m,k} \\quad \\text{subject to} \\quad \\sum_{m,k} N^m_k (\\mu^m_k - \\lambda^m_k)^2 \\geq 2, \\ \\forall \\lambda \\in \\Lambda(\\mu),$$\n\nwhere $\\Lambda(\\mu) = \\{\\lambda \\in \\mathcal{M};\\ \\mu^1_1 = \\lambda^1_1,\\ \\lambda^1_1 < \\max_{m,k} \\lambda^m_k\\}$. We point out that the assumption of dominance is hidden in the class of bandit problems $\\mathcal{M}$. In the remainder, and with a slight abuse of notation, we are going to call an algorithm consistent with a dominance assumption if it is consistent with the set of all possible vectors of means satisfying this dominance assumption.\n\n4 The CTR of the best item of CAT 2 is so much higher than the second one that we could expect it is actually an outlier, i.e., an artefact of the choice of that specific month and category.\n\n[Figure 2b: cumulative distribution functions $F(x)$ against $x$ for CAT 1, CAT 2, CAT 3 and CAT 4]\n\nGroup-sparse dominance\n\nIn this case, the above optimization problem has a closed-form solution.\n\nTheorem 3.1. An algorithm consistent with the group-sparse dominance satisfies\n\n$$c_\\mu = \\sum_{k=2}^K \\frac{2}{\\Delta_{1,k}}.$$\n\nThe proof of this result (and the subsequent ones) is postponed to the Appendix. 
This lower bound indicates that all arms in the optimal category (and only those) should be pulled a logarithmic number of times, hence the regret should only scale asymptotically linearly in the number of arms in the optimal category instead of linearly with the total number of arms. We want to stress here that Theorem 3.1 might have a misleading interpretation. Although the asymptotic regret scales with $K$ and independently of $M$, the finite-stage minimax regret is still of the order of $\\sqrt{MKT}$, as with usual bandits. This is simply because the lower-bound proof [6] of the standard multi-armed bandit case uses sets of parameters of the form $(0, \\dots, 0, \\varepsilon, 0, \\dots, 0)$ which respect the group-sparse assumption. As a result, the asymptotic lower bound of Theorem 3.1 is hiding some finite-time dependency in $MK$ (possibly of the form of an extra term in $\\sum_{m,k} 1/\\Delta_{m,k}$, yet independent of $\\log(T)$) that non-asymptotic algorithms5 would not be able to remove.\n\nStrong dominance\n\nIn the case of strong dominance, a similar closed-form expression can be stated.\n\nTheorem 3.2. With strong dominance, a consistent algorithm satisfies\n\n$$c_\\mu = \\sum_{k=2}^K \\frac{2}{\\Delta_{1,k}} + \\sum_{m=2}^M \\frac{2}{\\Delta_{m,K}}.$$\n\nThis lower bound indicates that the dominance assumption can be leveraged to replace the asymptotic linear dependency in the total number of arms by a linear dependency in the number of arms of the optimal category plus the number of categories. With $M$ categories of $K$ arms each, the dependency in $MK$ is replaced by $M + K$. However, as before and for the same reasons, the finite-time minimax lower bound will still be of the order $\\sqrt{MKT}$. The lower bound of Theorem 3.2 seems to indicate that an optimal algorithm should be pulling only the arms of the optimal category and the worst arm (not the best!) 
of the other categories, at least asymptotically and logarithmically. Yet again, there is no guarantee that non-asymptotic algorithms can achieve this highly demanding (and rather counter-intuitive) lower bound.\n\nFirst-order dominance There is no simple closed-form expression of $c_\\mu$ with the first-order dominance assumption, see nonetheless Appendix A.3 for some variational expression. However, for the sake of illustration, we provide a closed-form solution for a specific case.\n\nTheorem 3.3. With first-order dominance and $M = K = 2$ and assuming that arms are intertwined, i.e. $\\mu^1_1 > \\mu^2_1 > \\mu^1_2 > \\mu^2_2$, a consistent algorithm satisfies\n\n$$c_\\mu = \\frac{2}{\\Delta_{1,2}} + \\frac{2}{\\Delta_{2,2}} + \\frac{2}{\\Delta_{2,1}} \\left( 1 - \\frac{(\\Delta_{2,2} - \\Delta_{1,2})^2}{(\\Delta_{1,2})^2 + (\\Delta_{2,2})^2} \\right).$$\n\nIt is quite interesting to compare this lower bound to the corresponding ones with group-sparsity where $c_\\mu = \\frac{2}{\\Delta_{1,2}}$, with strong dominance where $c_\\mu = \\frac{2}{\\Delta_{1,2}} + \\frac{2}{\\Delta_{2,2}}$ and without structure at all where $c_\\mu = \\frac{2}{\\Delta_{1,2}} + \\frac{2}{\\Delta_{2,1}} + \\frac{2}{\\Delta_{2,2}}$. Clearly, lower bounds are, as expected, decreasing with additional structure. More interestingly, the first-order lower bound somehow interpolates between the last two by multiplying the term $\\frac{2}{\\Delta_{2,1}}$ by a factor $\\rho \\in (0, 1)$; $\\rho = 0$ corresponding to the stronger assumption of strong dominance and $\\rho = 1$ to the absence of dominance assumption.\n\n4 Algorithms and upper bounds\n\n4.1 Optimism principle\n\nOur first algorithm is based on the principle of optimism in the face of uncertainty and is summarized in Algorithm 1. It behaves in three different ways depending on the number of categories that are called \u201cactive\u201d. The definition of an active category will depend on the assumption of dominance. Formally, let $\\delta \\in (0, 1)$ be a confidence level (fixing the confidence level actually requires that the horizon $T$ is known, but there exist well-understood anytime versions of all these results [12]). At time step $t$, it computes the set of active categories, denoted $A(t, \\delta)$. The three states of Algorithm 1 are then as follows:\n\n1. $|A(t, \\delta)| = 0$: no category is active; the algorithm pulls all arms.\n2. $|A(t, \\delta)| = 1$: only one category is active; the algorithm performs UCB($\\delta$) in it.\n3. $|A(t, \\delta)| > 1$: several categories are active; the algorithm pulls all arms inside those.\n\nWe now detail what we called an active category for each notion of dominance defined previously along with theorems upper bounding the regret of the CATSE algorithm.\n\nAlgorithm 1: CATSE($\\delta$)\nPull each arm once\nwhile $t \\leq T$ do\n  Compute set of active categories $A(t, \\delta)$\n  if $|A(t, \\delta)| = 0$ then\n    Pull all arms\n  else if $|A(t, \\delta)| = 1$ then\n    Perform UCB($\\delta$) in the active category\n  else\n    Pull all arms in active categories\n  end\nend\n\n5 We call an algorithm non-asymptotic if its worst-case regret is of the order of $\\sqrt{MKT}$, maybe up to some additional polynomial dependency in $M$ and $K$. In particular, classical algorithms for structured bandits [8, 29] are only asymptotic.\n\nGroup-sparse dominance Under this assumption, we say a category is active if it has an active arm. Following the idea of sparse bandits [26] or bounded regret [7], we say that the arm $k$ of category $m$ is active if\n\n$$\\hat{\\mu}^m_k(t) := \\frac{\\sum_{s<t;\\ (C_s, A_s) = (m,k)} X^{C_s}_{A_s}}{N^m_k(t)} \\geq 2 \\sqrt{\\frac{\\log N^m_k(t)}{N^m_k(t)}}.$$\n\nThis condition ensures that the number of times an arm with positive mean is non-active is finite in expectation. Similarly, the expected number of times an arm with non-positive mean is active is also finite. 
Those conditions will ensure that the expected number of times a suboptimal category is pulled is also finite. Then, the set of active categories, denoted $A(t)$, is simply\n\n$$A(t) := \\Big\\{ m \\in [M];\\ \\exists\\, k \\in [K],\\ \\hat{\\mu}^m_k(t) \\geq 2 \\sqrt{\\frac{\\log N^m_k(t)}{N^m_k(t)}} \\Big\\}.$$\n\nTheorem 4.1. In the group-sparse dominance setting, the expected regret of CATSE satisfies, with probability at least $1 - 2\\delta K T$,\n\n$$\\mathbb{E}[R_T] \\leq \\sum_{k=2}^K \\frac{8 \\log \\frac{1}{\\delta}}{\\Delta_{1,k}} + \\sum_{m,k} \\Delta_{m,k} + \\frac{40}{(\\mu^1_1)^2} \\log \\frac{16}{(\\mu^1_1)^2} \\sum_{m,k} \\Delta_{m,k} + (M - 1) K \\frac{\\pi^2}{6} \\sum_{m,k} \\Delta_{m,k}.$$\n\nThe first term is the bound of the UCB algorithm while the third term is the regret incurred when the optimal category is non-active and the last term comes from a suboptimal category being active. As a result, CATSE is asymptotically optimal, up to a multiplicative factor. A trick to improve empirically the performance of the algorithm is to replace the round-robin sampling phase (when $|A(t)| = 0$) by choosing an arm with a higher probability the closer it is to being active. This idea was analyzed in [7] with additional assumptions. Yet this can only improve the second term of the regret, which is already constant w.r.t. $T$ (so we chose to not focus on it). For example, a possibility is to pull arm $(m, k)$ at time $t$ with probability\n\n$$p^m_k(t) \\propto \\Big( \\sqrt{\\frac{4 \\log N^m_k(t)}{N^m_k(t)}} - \\hat{\\mu}^m_k(t) \\Big)^{-2}.$$\n\nAnother possible improvement is to eliminate categories in which there exists an arm whose upper bound is less than 0. Again, this only improves a term constant w.r.t. $T$.\n\nStrong dominance In this setting, CATSE will use the information gathered by all arms. 
The overall idea is to construct a confidence region for the mean vector and to eliminate a category as soon as it is clearly dominated by another one. The statistical test to perform in order to determine which categories to eliminate is based on the following alternative characterization of dominance. Let $\\Delta(K) := \\{x \\in \\mathbb{R}^K_+;\\ \\|x\\|_1 = 1\\}$ be the $K$-simplex and $\\mu^m := (\\mu^m_k)_k$ be the vector of means.\n\nProposition 1. $A$ strongly dominates $B$ if and only if $\\forall x \\in \\Delta(K), \\forall y \\in \\Delta(K), \\langle x, \\mu^A \\rangle \\geq \\langle y, \\mu^B \\rangle$.\n\nAt the end of the $p$-th round of the phase of successive elimination of categories, each arm has been pulled $p$ times. A natural estimator of $\\mu^m \\in \\mathbb{R}^K$ is the coordinate-wise empirical average of rewards, i.e., $\\hat{\\mu}^m_k(p) = \\frac{1}{p} \\sum_{r=1}^p X^m_k(r)$, where (with a slight abuse of notation) $X^m_k(r)$ is the reward gathered by the $r$-th pull of arm $k$ of category $m$. We now describe the statistical test run at the end of round $p \\in \\mathbb{N}$; category $n \\in [M]$ is eliminated by category $m \\in [M]$ if it holds that\n\n$$L^+_m(p, \\delta) := \\max_{x \\in \\Delta(K)} \\langle x, \\hat{\\mu}^m(p) \\rangle - \\|x\\|_2\\, \\beta(p, \\delta) > \\min_{y \\in \\Delta(K)} \\langle y, \\hat{\\mu}^n(p) \\rangle + \\|y\\|_2\\, \\beta(p, \\delta) =: L^-_n(p, \\delta), \\quad (1)$$\n\nwhere $\\beta(p, \\delta) = \\sqrt{\\frac{2}{p} \\big( K \\log 2 + \\log \\frac{1}{\\delta} \\big)}$. The set of active categories is then defined as follows\n\n$$A(t, \\delta) = \\big\\{ m \\in [M];\\ \\forall n \\neq m,\\ L^+_n(t, \\delta) \\leq L^-_m(t, \\delta) \\big\\}.$$\n\nTheorem 4.2. In the strong dominance case, the regret of CATSE satisfies, with probability at least $1 - \\delta M T$,\n\n$$R_T \\leq \\sum_{k=2}^K \\frac{8 \\log \\frac{1}{\\delta}}{\\Delta_{1,k}} + \\sum_{m,k} \\Delta_{m,k} + 8 \\Big( K \\log 2 + \\log \\frac{1}{\\delta} \\Big) \\sum_{m=2}^M \\min_{x, y \\in \\Delta(K)} \\Big( \\frac{\\|x\\|_2 + \\|y\\|_2}{\\langle x, \\mu^1 \\rangle - \\langle y, \\mu^m \\rangle} \\Big)^2 \\sum_{k=1}^K \\Delta_{m,k}.$$\n\nFirst-order dominance CATSE will proceed with first-order dominance as with strong dominance; the major difference is the statistical test. Let us first characterize the notion of first-order dominance.\n\nProposition 2. $A$ first-order dominates $B$ if and only if $\\forall x \\in \\Delta(K), \\langle x, \\mu^A \\rangle \\geq \\langle x, \\mu^B \\rangle$.\n\nThe statistical test is then: category $n \\in [M]$ is eliminated by category $m \\in [M]$ at round $p$ if\n\n$$D_{m,n}(p, \\delta) := \\max_{x \\in \\Delta(K)} \\frac{\\langle x, \\hat{\\mu}^m_\\sigma(p) - \\hat{\\mu}^n_\\tau(p) \\rangle}{\\|x\\|_2} > 2 \\gamma(p, \\delta), \\quad (2)$$\n\nwhere $\\hat{\\mu}^m_\\sigma(p)$ and $\\hat{\\mu}^n_\\tau(p)$ represent respectively the reordering of $\\hat{\\mu}^m(p)$ and $\\hat{\\mu}^n(p)$ in decreasing order and $\\gamma(p, \\delta) = \\frac{1}{\\sqrt{2p}} \\Big( \\sqrt{K \\log \\frac{1}{\\delta}} + \\sqrt{1 + (K + 1) \\log K} \\Big)$. We emphasize that the permutation is specific to both a category and a round. This statistical test yields the following set of active categories\n\n$$A(t, \\delta) = \\{ m \\in [M];\\ \\forall n \\neq m,\\ D_{m,n}(t, \\delta) \\leq 2 \\gamma(t, \\delta) \\}.$$\n\nTheorem 4.3. Under the additional assumption that $X^m_k \\in [0, 1]$ for every category $m$ and arm $k$, in the first-order dominance, the regret of CATSE satisfies, with probability at least $1 - \\delta M T$,\n\n$$R_T \\leq \\sum_{k=2}^K \\frac{8 \\log \\frac{1}{\\delta}}{\\Delta_{1,k}} + \\sum_{m,k} \\Delta_{m,k} + 16 \\Big( K \\log \\frac{1}{\\delta} + K \\log K + \\log K + 1 \\Big) \\sum_{m=2}^M \\frac{\\sum_{k=1}^K \\Delta_{m,k}}{\\|\\mu^1 - \\mu^m\\|_2^2}.$$\n\n4.2 Bayesian principle\n\nThe MURPHY SAMPLING (MS) algorithm [20] was originally developed in a pure exploration setting. Conceptually, it is derived from THOMPSON SAMPLING (TS) [37]; the difference is that the sampling respects some inherent structure of the problem. To define MS, we denote by $\\mathcal{F}_t = \\sigma(A_1, X_1, \\dots, A_t, X_t)$ the information available after $t$ steps and $H_d$ the assumption of dominance considered. Let $\\Pi_t = P(\\cdot \\mid \\mathcal{F}_t)$ be the posterior distribution of the mean parameters after $t$ rounds. The algorithm samples, at each time step, from the posterior distribution $\\Pi_{t-1}(\\cdot \\mid H_d)$ and then pulls the best arm, which, by definition, is in the best category sampled at this time step. In comparison, TS would sample from $\\Pi_{t-1}$ without taking into account any structure. To implement this algorithm, we use that independent conjugate priors will produce independent posteriors, making the posterior sampling tractable. The required assumption, i.e. the structure of our problem, is then attained using rejection sampling. We do not provide theoretical guarantees on its regret but we will illustrate empirically on simulated data that it is highly competitive compared to the other algorithms.\n\nAlgorithm 2: MURPHY SAMPLING\nwhile $t \\leq T$ do\n  Sample $\\theta(t) \\sim \\Pi_{t-1}(\\cdot \\mid H_d)$\n  Pull $(C_t, A_t) \\in \\arg\\max_{(m,k)} \\theta^m_k(t)$\nend\n\n5 Experiments\n\nIn this section, we present numerical experiments illustrating the performance of the algorithms we have introduced. We also compare them with two families of algorithms. The first family consists of algorithms for the multi-armed bandit framework, namely UCB [3] and TS [37]; they are agnostic to the structure of the arms. The second family of algorithms is adapted to tree search, namely UCT [23]; they partially take into account the inherent structure. Specifically, they will just use the fact that arms are grouped into categories but not that one category dominates the others. We consider two scenarios for the different dominance hypotheses. In all experiments, rewards are drawn from a Gaussian distribution with unit variance and we report the average regret as a function of time, in log-scale. To implement TS and MS, we pulled each arm once and then sampled using a Gaussian prior. The simulations were run until time horizon 10,000 and results were averaged over 100 independent runs.\n\n(a) In the sparse and strong dominance scenario\n\n(b) In the first-order dominance scenario\n\nFigure 3: Regret of various algorithms as a function of time\n\nGroup-sparse & strong dominance We start by grouping the experiments in the group-sparse and strong dominance setting, as we recall that the only difference between the two concepts is the knowledge of a threshold between the best category and the others. In this first scenario, we analyze a problem with five categories and five arms per category. 
Precisely, in the first category\nthe optimal arm has expected reward 1, and the four suboptimal arms consist of one group of three\n(stochastically) identical arms, each with expected reward 0.5, and one arm with expected reward 0.\nThe four suboptimal categories are identical and are composed of two arms with expected rewards 0\nand \u22121, respectively, and a group of three arms with expected reward \u22120.5. We use the subscripts\n$s$ and $0$ to denote the dominance assumption that the algorithm exploits. CATSE$_s$ and CATSE$_0$\nwere run with $\\delta = \\frac{1}{t}$ and $\\delta = \\frac{1}{Mt}$, respectively. Results are presented in Figure 3a. In the case of\ngroup-sparse dominance, CATSE$_s$ outperforms both UCB and UCT; MS$_s$ asymptotically performs\nas well, yet with a slightly higher regret. Interestingly, UCT performs well in the beginning, thanks to\nthe lack of an exploration phase compared to CATSE$_s$. In the case of strong dominance, MS$_0$ and\nCATSE$_0$ asymptotically perform alike and slightly better than UCT. However, the regret of CATSE$_0$\nis much higher due to its round-robin sampling phase; this can be seen in the beginning, as CATSE$_0$ is\nstill searching for the optimal category. Comparing the two versions of each algorithm, we notice\ntwo points. First, for CATSE, the impact of the potential sampling improvement\nis significant. Second, for MS, the regret in the group-sparse case is slightly worse than in the\nstrong dominance case, even though the group-sparse assumption is stronger. This is simply due to our\nimplementation and the difficulty of the posterior sampling, in particular the rejection sampling phase.\n\nFirst-order dominance Finally, we consider the first-order dominance setting. In this scenario,\nwe consider a problem with five categories and ten arms per category. 
Precisely, in the optimal\ncategory, the best arm has expected reward 5 while the nine suboptimal arms consist of three groups\nof five, three and one arms, with expected rewards 4, 3 and 2, respectively. The four suboptimal\ncategories are composed of two arms with expected rewards 4.5 and 0, respectively, and eight arms\nwith expected reward 3. CATSE was run with $\\delta = \\frac{1}{Mt}$ and the results are presented in Figure 3b.\nOnce again, MS and CATSE outperform the baseline algorithms, and both appear to have the same slope\nasymptotically, with a significant difference between their regrets, again due to the exploration phase\nof CATSE. It is interesting to observe that UCT performed poorly; as noticed in [9], the convergence\ncan be sluggish. Indeed, the main issue occurs when the best arm is underestimated. In that case,\nthe number of times it is pulled is logarithmic in the number of pulls of the optimal category, which is\nitself logarithmic in time, since the second best arm overall is in the suboptimal categories. Hence, it\nwould take an exponential of exponentials number of time steps for the optimal arm to become the best again.\n\n6 Conclusion\n\nTwo problems remain open: the first is a better exploration phase in CATSE, since it heavily\nimpacts the regret and, as noted in [14], ETC algorithms are necessarily suboptimal; the second\nis an upper bound on the regret of the MS algorithm, since it is highly competitive in practice. 
We\nbelieve that it is asymptotically optimal and that it can be applied to other settings of structured bandits.\n\nAcknowledgments\n\nThis work was supported in part by a public grant as part of the Investissement d\u2019avenir project,\nreference ANR-11-LABX-0056-LMH, LabEx LMH, in a joint call with Gaspard Monge Program for\noptimization, operations research and their interactions with data sciences.\n\nReferences\n[1] Yasin Abbasi-Yadkori, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. Improved algorithms for linear\nstochastic bandits. In Advances in Neural Information Processing Systems, pages 2312\u20132320,\n2011.\n\n[2] Naoki Abe and Philip M Long. Associative reinforcement learning using linear probabilistic\nconcepts. In ICML, pages 3\u201311, 1999.\n\n[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed\nbandit problem. Machine learning, 47(2-3):235\u2013256, 2002.\n\n[4] Vijay S Bawa. Optimal rules for ordering uncertain prospects. Journal of Financial Economics,\n2(1):95\u2013121, 1975.\n\n[5] Guy Bresler, George H Chen, and Devavrat Shah. A latent source model for online collaborative\nfiltering. In Advances in Neural Information Processing Systems, pages 3347\u20133355, 2014.\n\n[6] S\u00e9bastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic\nmulti-armed bandit problems. Foundations and Trends\u00ae in Machine Learning, 5(1):1\u2013122,\n2012.\n\n[7] S\u00e9bastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic\nmulti-armed bandits. In Conference on Learning Theory, pages 122\u2013134, 2013.\n\n[8] Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured\nstochastic bandits. In Advances in Neural Information Processing Systems, pages 1761\u20131769,\n2017.\n\n[9] Pierre-Arnaud Coquelin and R\u00e9mi Munos. Bandit algorithms for tree search. 
arXiv preprint\ncs/0703062, 2007.\n\n[10] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under\nbandit feedback. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki,\nFinland, July 9-12, 2008, pages 355\u2013366, 2008. URL\nhttp://colt2008.cs.helsinki.fi/papers/80-Dani.pdf.\n\n[11] Herbert Aron David and Haikady Navada Nagaraja. Order statistics. Wiley, third edition, 2003.\n\n[12] R\u00e9my Degenne and Vianney Perchet. Anytime optimal algorithms in stochastic multi-armed\nbandits. In International Conference on Machine Learning, pages 1587\u20131595, 2016.\n\n[13] R\u00e9my Degenne and Vianney Perchet. Combinatorial semi-bandit with known covariance. In\nAdvances in Neural Information Processing Systems, pages 2972\u20132980, 2016.\n\n[14] Aur\u00e9lien Garivier, Tor Lattimore, and Emilie Kaufmann. On explore-then-commit strategies. In\nAdvances in Neural Information Processing Systems, pages 784\u2013792, 2016.\n\n[15] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International\nConference on Machine Learning, pages 757\u2013765, 2014.\n\n[16] Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws\nin controlled markov chains. SIAM journal on control and optimization, 35(3):715\u2013743, 1997.\n\n[17] Josef Hadar and William R Russell. Rules for ordering uncertain prospects. The American\neconomic review, 59(1):25\u201334, 1969.\n\n[18] Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, and Zheng Wen. Stochastic\nrank-1 bandits. arXiv preprint arXiv:1608.03023, 2016.\n\n[19] Sumeet Katariya, Lalit Jain, Nandana Sengupta, James Evans, and Robert Nowak. Adaptive\nsampling for coarse ranking. arXiv preprint arXiv:1802.07176, 2018.\n\n[20] Emilie Kaufmann, Wouter Koolen, and Aurelien Garivier. Sequential test for the lowest mean:\nFrom thompson to murphy sampling. 
arXiv preprint arXiv:1806.00973, 2018.\n\n[21] Jaya Kawale, Hung H Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient\nthompson sampling for online matrix-factorization recommendation. In Advances in neural\ninformation processing systems, pages 1297\u20131305, 2015.\n\n[22] Tom\u00e1\u0161 Koc\u00e1k, Michal Valko, R\u00e9mi Munos, and Shipra Agrawal. Spectral thompson sampling.\nIn Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.\n\n[23] Levente Kocsis and Csaba Szepesv\u00e1ri. Bandit based monte-carlo planning. In European\nconference on machine learning, pages 282\u2013293. Springer, 2006.\n\n[24] Nathan Korda, Bal\u00e1zs Sz\u00f6r\u00e9nyi, and Li Shuai. Distributed clustering of linear bandits in peer to\npeer networks. In Journal of machine learning research workshop and conference proceedings,\nvolume 48, pages 1301\u20131309. International Machine Learning Society, 2016.\n\n[25] Joon Kwon and Vianney Perchet. Gains and losses are fundamentally different in regret\nminimization: The sparse case. The Journal of Machine Learning Research, 17(1):8106\u20138137,\n2016.\n\n[26] Joon Kwon, Vianney Perchet, and Claire Vernade. Sparse stochastic bandits. In 30th Annual\nConference on Learning Theory - COLT 2017, Amsterdam, Netherlands, July 7-10, 2017, pages\n355\u2013366, 2017.\n\n[27] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules.\nAdvances in applied mathematics, 6(1):4\u201322, 1985.\n\n[28] Tor Lattimore and R\u00e9mi Munos. Bounded regret for finite-armed structured bandits. In Advances\nin Neural Information Processing Systems, pages 550\u2013558, 2014.\n\n[29] Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed\nlinear bandits. In 20th International Conference on Artificial Intelligence and Statistics,\npages 728\u2013737, 2017.\n\n[30] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 
A contextual-bandit approach to\npersonalized news article recommendation. In Proceedings of the 19th international conference\non World wide web, pages 661\u2013670. ACM, 2010.\n\n[31] Shuai Li, Claudio Gentile, and Alexandros Karatzoglou. Graph clustering bandits for\nrecommendation. arXiv preprint arXiv:1605.00596, 2016.\n\n[32] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In\nProceedings of the 39th International ACM SIGIR conference on Research and Development in\nInformation Retrieval, pages 539\u2013548. ACM, 2016.\n\n[33] Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In International Conference on\nMachine Learning, pages 136\u2013144, 2014.\n\n[34] Trong T Nguyen and Hady W Lauw. Dynamic clustering of contextual multi-armed bandits.\nIn Proceedings of the 23rd ACM International Conference on Information and\nKnowledge Management, pages 1959\u20131962. ACM, 2014.\n\n[35] Pierre Perrault, Vianney Perchet, and Michal Valko. Finding the bandit in a graph: Sequential\nsearch-and-stop. In The 22nd International Conference on Artificial Intelligence and Statistics,\npages 1668\u20131677, 2019.\n\n[36] Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of\nOperations Research, 35(2):395\u2013411, 2010.\n\n[37] William R Thompson. On the likelihood that one unknown probability exceeds another in view\nof the evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\n[38] Michal Valko, R\u00e9mi Munos, Branislav Kveton, and Tom\u00e1\u0161 Koc\u00e1k. Spectral bandits for smooth\ngraph functions. In International Conference on Machine Learning, pages 46\u201354, 2014.\n\n[39] Xiaoxue Zhao, Weinan Zhang, and Jun Wang. Interactive collaborative filtering. In Proceedings\nof the 22nd ACM international conference on Information & Knowledge Management, pages\n1411\u20131420. 
ACM, 2013.", "award": [], "sourceid": 8183, "authors": [{"given_name": "Matthieu", "family_name": "Jedor", "institution": "ENS Paris-Saclay & Cdiscount"}, {"given_name": "Vianney", "family_name": "Perchet", "institution": "ENS Paris-Saclay & Criteo AI Lab"}, {"given_name": "Jonathan", "family_name": "Louedec", "institution": "Cdiscount"}]}