{"title": "Contextual Combinatorial Multi-armed Bandits with Volatile Arms and Submodular Reward", "book": "Advances in Neural Information Processing Systems", "page_first": 3247, "page_last": 3256, "abstract": "In this paper, we study the stochastic contextual combinatorial multi-armed bandit (CC-MAB) framework that is tailored for volatile arms and submodular reward functions. CC-MAB inherits properties from both contextual bandit and combinatorial bandit: it aims to select a set of arms in each round based on the side information (a.k.a. context) associated with the arms. By ``volatile arms'', we mean that the available arms to select from in each round may change; and by ``submodular rewards'', we mean that the total reward achieved by selected arms is not a simple sum of individual rewards but demonstrates a feature of diminishing returns determined by the relations between selected arms (e.g. relevance and redundancy). Volatile arms and submodular rewards are often seen in many real-world applications, e.g. recommender systems and crowdsourcing, in which multi-armed bandit (MAB) based strategies are extensively applied. Although there exist works that investigate these issues separately based on standard MAB, jointly considering all these issues in a single MAB problem requires very different algorithm design and regret analysis. Our algorithm CC-MAB provides an online decision-making policy in a contextual and combinatorial bandit setting and effectively addresses the issues raised by volatile arms and submodular reward functions. The proposed algorithm is proved to achieve $O(cT^{\\frac{2\\alpha+D}{3\\alpha + D}}\\log(T))$ regret after a span of $T$ rounds. 
The performance of CC-MAB is evaluated by experiments conducted on a real-world crowdsourcing dataset, and the result shows that our algorithm outperforms the prior art.", "full_text": "Contextual Combinatorial Multi-armed Bandits with Volatile Arms and Submodular Reward\n\nLixing Chen, Jie Xu\nDepartment of Electrical and Computer Engineering\nUniversity of Miami, Coral Gables, FL 33146\n{lx.chen, jiexu}@miami.edu\n\nZhuo Lu\nDepartment of Electrical Engineering\nUniversity of South Florida, Tampa, FL 33620\nzhuolu@usf.edu\n\nAbstract\n\nIn this paper, we study the stochastic contextual combinatorial multi-armed bandit (CC-MAB) framework that is tailored for volatile arms and submodular reward functions. CC-MAB inherits properties from both contextual bandit and combinatorial bandit: it aims to select a set of arms in each round based on the side information (a.k.a. context) associated with the arms. By \u201cvolatile arms\u201d, we mean that the available arms to select from in each round may change; and by \u201csubmodular rewards\u201d, we mean that the total reward achieved by selected arms is not a simple sum of individual rewards but demonstrates a feature of diminishing returns determined by the relations between selected arms (e.g. relevance and redundancy). Volatile arms and submodular rewards are often seen in many real-world applications, e.g. recommender systems and crowdsourcing, in which multi-armed bandit (MAB) based strategies are extensively applied. Although there exist works that investigate these issues separately based on standard MAB, jointly considering all these issues in a single MAB problem requires very different algorithm design and regret analysis. Our algorithm CC-MAB provides an online decision-making policy in a contextual and combinatorial bandit setting and effectively addresses the issues raised by volatile arms and submodular reward functions. 
The proposed algorithm is proved to achieve O(c T^{(2\u03b1+D)/(3\u03b1+D)} log(T)) regret after a span of T rounds. The performance of CC-MAB is evaluated by experiments conducted on a real-world crowdsourcing dataset, and the result shows that our algorithm outperforms the prior art.\n\n1 Introduction\n\nMulti-armed bandit (MAB) problems are among the most fundamental sequential decision problems with an exploration vs. exploitation trade-off. In such problems, a decision maker chooses one of several \u201carms\u201d, and observes a realization from an unknown reward distribution. Each decision is made based on past decisions and observed rewards. The objective is to maximize the expected cumulative reward [4] over some time horizon by balancing exploration (to learn the average reward of different arms) and exploitation (to select arms that have yielded high reward in the past). The performance of a decision policy is measured by the expected regret, defined as the gap between the expected reward achieved by the algorithm and that achieved by an oracle algorithm always selecting the best arm. Contextual bandit [1] is an extension of the standard MAB framework where there is some side information (also called context) associated with each arm that determines the reward distribution. Contextual bandit also aims to maximize the cumulative reward by selecting one arm in each round, but now the contexts can be leveraged to predict the expected reward of arms.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nHowever, in many real-world scenarios, e.g. recommender systems [20] and crowdsourcing [21], a decision maker needs to select multiple arms (e.g., recommended items and crowdsourcing workers) in each time slot. Such MAB problems fall into the category of combinatorial bandit where a set of arms rather than one individual arm are chosen in each round. 
What is more complicated is that\nthe total reward of selected arms is often not a simple sum of the reward of individual arms in the\nset. For example, in a recommender system, recommending a diverse set of items increases the\nchance that a user would like at least one item. As such, recommending multiple redundant items\nwould produce little bene\ufb01t. This notion of diminishing returns due to redundancy is often captured\nformally using submodularity [14].\n\nAnother thorny issue in applying standard MAB frameworks in practice is the assumption on a\nconstant set of arms that are available inde\ufb01nitely. However, due to the inherent dynamic nature of\nmany real-world applications, the arms available in each round may change dynamically over time.\nFor example, potential crowdsourcing workers vary depending on speci\ufb01c tasks, location, and time.\nThe variation of the arm set over time has been considered by MAB variants known as sleeping\nbandit [11] and volatile bandit [3] where arms can become available or unavailable in each round.\nHowever, these works are developed on the standard MAB framework and little effort has been made\nto extend volatile arms to the contextual or combinatorial MAB setting.\n\nIn this paper, we develop a novel online decision-making approach based on contextual and combina-\ntorial bandit to address various challenges caused by volatile arms and submodular reward functions.\nAlthough our algorithm inherits concepts from some existing MAB problems, jointly considering\nthese issues in one MAB framework requires very different algorithm design and regret analysis.\nThe main contribution of this paper is summarized as follows: (i) We propose a contextual combi-\nnatorial multi-armed bandit algorithm (CC-MAB) framework that is compatible with submodular\nreward functions and volatile arms. 
(ii) We rigorously prove the performance guarantee of the proposed CC-MAB, which shows an O(c T^{(2\u03b1+D)/(3\u03b1+D)} log(T)) regret upper bound after playing T rounds. (iii) We evaluate the proposed algorithm on a real-world dataset as a crowdsourcing problem. The result shows that our approach significantly outperforms other existing MAB algorithms.\n\n1.1 Difference from Existing MAB Frameworks\n\nContextual bandit: Contextual bandit considers the scenario where decision makers can observe the context of arms and infer the rewards of other unseen arms. In general, the contextual bandit framework is more applicable than the non-contextual variants, as it is rare that no context is available [15]. Various contextual bandit algorithms have been proposed, e.g. LinUCB [16] and Epoch-Greedy [15] for the stochastic setting, and EXP4 and EXP4.P [2] for the adversarial setting. However, most of these algorithms only allow the decision makers to pull one single arm in each round. Incorporating combinatorial bandit into a context-aware setting is still an under-studied topic.\n\nCombinatorial bandit and submodular reward function: Efforts have been made to generalize the classical MAB problems to combinatorial bandit [6, 8], which allows the decision makers to pull a set of arms in each round. However, most of these works are developed on standard bandit problems and neglect the available context of arms. Moreover, the submodular reward is a special issue often encountered in combinatorial bandit since the total reward of multiple selected arms may depend on the relations between individual arms. There exist works that consider submodular functions in combinatorial bandit [9, 21], but they are for the non-contextual setting. Authors in [5] use a bandit framework to learn the submodular utility function. The most related work is probably [18], which investigates contextual combinatorial bandit with submodular reward. 
However, it considers\na special submodular reward function that depends on a linear combination of individual rewards\nand assumes a constant set of arms which is very different from the volatile arms in our work.\n\nVolatile bandit and Sleeping bandit: The key idea of volatile bandit [3] and sleeping bandit [11] is\nthat the arms may \u201cappear\u201d or \u201cdisappear\u201d in each round. Volatile and sleeping bandit are bene\ufb01-\ncial extensions of standard MAB problems and \ufb01t many practical applications where the available\narms vary over time. While these works only consider volatile arms for standard MAB, our paper\nconsiders volatile arms in the contextual combinatorial MAB with submodular rewards. In addition,\nalthough volatile and sleeping bandit allow available arms to change over time, they still assume that\nthe arms appear in each round come from an already-known \ufb01nite arm set. By contrast, we allow\nin\ufb01nitely many arms in CC-MAB by taking advantage of the H\u00f6lder (Lipschitz) condition on the\ncontext space, which also is the basis of works on Lipschitz bandit [12] in continuum bandit [13].\n\n2\n\n\fThe rest of paper is organized as follows: Section 2 formulates a contextual combinatorial MAB\nproblem (CC-MAB) with volatile arms and submodular functions. Section 3 introduces the algo-\nrithm design and analyzes the regret of CC-MAB. Section 4 evaluates the performance of CC-MAB\non a real-world crowdsourcing problem, followed by the conclusion in Section 5.\n\n2 Preliminaries and Problem Formulation\n\nWe consider a sequential decision-making process for a horizon of T time slots (rounds) and for-\nmulate it as a contextual combinatorial multi-armed bandit problem. Let Mt = {1, . . . , M t} be\nthe set of arms arrived/available in time slot t. Notice that this formulation captures the volatile\narms: arm sets Mt, \u2200 0 < t \u2264 T (and their size) in different time slots can be different from each\nother. 
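The volatility of the arm set described above can be sketched as a per-round simulation. This is purely illustrative: the arrival process and context generator below are hypothetical choices, since the model leaves them arbitrary.

```python
import random

def simulate_round(t, max_arms=10, dim=2):
    """Draw a volatile arm set M_t: both the number of arms |M_t| and their
    contexts may differ from round to round (hypothetical arrival process)."""
    n_arms = random.randint(1, max_arms)  # |M_t| varies across time slots
    # Each arm m carries a context x_m^t in the unit hypercube [0, 1]^D.
    return {m: [random.random() for _ in range(dim)] for m in range(n_arms)}

# Two consecutive rounds generally expose different arm sets and contexts.
random.seed(0)
x1 = simulate_round(1)
x2 = simulate_round(2)
```

In this setting an arm seen in round 1 may never reappear, which is exactly why per-arm counters from standard MAB are not enough.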
For each arrived arm m \u2208 M^t, its context (side information) x^t_m \u2208 X \u225c [0, 1]^D can be observed, where D is the dimension of the observed context vector and X is a bounded context space. Let x^t = {x^t_m}_{m\u2208M^t} be the context set that collects the contexts of all arms in time slot t. The quality (i.e., the reward of choosing an arm individually) of arm m is a random variable drawn from an unknown distribution parametrized by its context x^t_m; we denote this random quality by r(x^t_m) and its expected value by \u00b5(x^t_m) = E[r(x^t_m)]. Let r^t = {r(x^t_m)}_{m\u2208M^t} collect the qualities of arms arrived in time slot t and \u00b5^t = {\u00b5(x^t_m)}_{m\u2208M^t} collect their expected values. Given the available arms M^t to choose from in each time slot, our objective is to pick a subset of arms S^t \u2286 M^t to maximize the total reward. Usually, a decision maker will have a budget B that limits the maximum number of arms that can be selected, i.e., |S^t| \u2264 B, \u2200t. We assume this budget is a constant across time slots. Nevertheless, our formulation can be easily extended to work with time-varying budgets.\n\nAs aforementioned, many reward/payoff functions we encounter in real-world applications are submodular. Therefore, we define a submodular reward function u : 2^{M^t} \u2192 R+ to measure the reward achieved by an arm set. Let u(r^t, S^t) be the reward of selecting the arm set S^t; its value is jointly determined by the qualities of individual arms {r(x^t_m)}_{m\u2208S^t} and the relations between arms that create submodularity. 
The considered submodular reward function is general, and is featured by the diminishing returns property: given the available arm set M and the corresponding quality r in an arbitrary time slot, for all possible arm subsets S \u2286 B \u2286 M and any arm m \u2209 B, we have\n\nu(r, {m} \u222a S) \u2212 u(r, S) \u2265 u(r, {m} \u222a B) \u2212 u(r, B).   (1)\n\nWe denote the marginal reward of an arm m to a set S by \u2206(r, m|S) \u225c u(r, {m} \u222a S) \u2212 u(r, S). Moreover, we also require the reward function to be monotone: for all S \u2286 B, it holds that u(r, S) \u2264 u(r, B). It is assumed that the reward function is revealed at the beginning of each time slot when observing the arrived arms. For example, in the spatial crowdsourcing application [19], the reward function can be determined by the overlapped sensing area of workers (arms); and in diversified information retrieval [22], the reward function can be determined by the topics of articles (arms). Our goal is selecting a subset of arms S^t \u2286 M^t in each time slot t to maximize the expected cumulative reward up to a finite time horizon T:\n\nmax_{S^1,...,S^T} \u2211_{t=1}^{T} E[u(r^t, S^t)]   (2a)\ns.t. |S^t| \u2264 B, S^t \u2286 M^t, \u2200t   (2b)\n\nObviously, the above problem can be decoupled into T subproblems, one for each time slot t, as follows: max_{S^t\u2286M^t, |S^t|\u2264B} E[u(r^t, S^t)]. Let us for now assume that for an arbitrary arm with context x \u2208 X, its expected quality \u00b5(x) = E[r(x)] is known a priori. Since the relations among arms are observed upon their arrival, E[u(r^t, S^t)] can be written as u(E[r^t], S^t) = u(\u00b5^t, S^t). Now, in each time slot t, we need to select a subset S^{*,t}(x^t) that satisfies:\n\nS^{*,t}(x^t) = arg max_{S\u2286M^t, |S|\u2264B} u(\u00b5^t, S)   (3)\n\nClearly, if |M^t| \u2264 B, then we simply select all arms in M^t. 
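The diminishing-returns property (1) and monotonicity can be checked numerically for a simple coverage-style reward, a classic monotone submodular function used here as a toy stand-in for u (the arms and the items they cover are hypothetical, not from the paper):

```python
from itertools import combinations

# Each arm "covers" a set of items; the reward of an arm set is the number
# of distinct items covered (monotone and submodular by construction).
coverage = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"d"}}

def u(S):
    covered = set()
    for m in S:
        covered |= coverage[m]
    return len(covered)

def marginal(m, S):
    # Delta(m | S) = u(S ∪ {m}) - u(S)
    return u(S | {m}) - u(S)

# Verify (1): for every S ⊆ B and every m ∉ B, Delta(m|S) >= Delta(m|B).
arms = set(coverage)
ok = all(
    marginal(m, set(S)) >= marginal(m, set(B))
    for rb in range(len(arms) + 1)
    for B in combinations(arms, rb)
    for rs in range(len(B) + 1)
    for S in combinations(B, rs)
    for m in arms - set(B)
)
```

Adding arm 2 to {1} yields zero marginal reward because item "c" is already covered, which is exactly the redundancy effect the paper models.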
Since S^{*,t}(x^t) is obtained by an omniscient oracle that knows the expected quality of all arrived arms, we call {S^{*,t}(x^t)}_{t=1}^{T} the oracle solution. However, maximizing a submodular function with a cardinality constraint as in (3) is NP-hard. Fortunately, the greedy algorithm [17] (in Algorithm 1) offers a polynomial-time solution and guarantees to achieve no less than (1 \u2212 1/e) of the optimum, as stated in the following lemma:\n\nAlgorithm 1 Greedy Algorithm\n1: Input: arm set M^t, expected reward vector \u00b5^t, submodular reward function u, budget B.\n2: Initialization: S_0 \u2190 \u2205, k \u2190 0;\n3: while k < B do:\n4:   k = k + 1;\n5:   Based on the expected reward \u00b5^t, select m_k = arg max_{m\u2208M^t\\S_{k\u22121}} \u2206(\u00b5^t, {m}|S_{k\u22121});\n6:   S_k = S_{k\u22121} \u222a {m_k};\n7: Return: S^t = S_k\n\nLemma 1. In an arbitrary time slot t, let S^t be the arm set selected by the greedy algorithm and S^{*,t} be the optimal arm set for the problem in (3); then u(\u00b5^t, S^t) \u2265 (1 \u2212 1/e) u(\u00b5^t, S^{*,t}).\n\nThe proof of Lemma 1 is omitted here; see [17] for details. Since, in general, no polynomial-time algorithm achieves a better approximation for submodular function maximization than the greedy algorithm [7], we use it to solve each per-slot subproblem in (3) and define the regret of an algorithm up to slot T against this benchmark as follows:\n\nR(T) = (1 \u2212 1/e) \u00b7 \u2211_{t=1}^{T} E[u(r^t, S^{*,t})] \u2212 \u2211_{t=1}^{T} E[u(r^t, S^t)]   (4)\n\nHere, the expectation is taken with respect to the choices made by a learning algorithm and the distributions of qualities.\n\n3 Contextual Combinatorial MAB\n\nIn practice, the expected quality of an arm is unknown a priori. In this case, the per-slot subproblem cannot be solved as described previously. 
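The greedy selection of Algorithm 1 can be sketched in a few lines; the coverage reward below is a toy stand-in for u (hypothetical data, not from the paper), used only to exercise the loop:

```python
def greedy_select(arms, u, B):
    """Algorithm 1: repeatedly add the arm with the largest marginal reward
    until the budget B (or the arm set) is exhausted. For monotone submodular
    u this achieves at least (1 - 1/e) of the optimum."""
    S = set()
    while len(S) < B and len(S) < len(arms):
        # arg max over remaining arms of the marginal gain Delta(m | S)
        m_star = max(set(arms) - S, key=lambda m: u(S | {m}) - u(S))
        S.add(m_star)
    return S

# Toy run with a coverage reward: the reward of a set is the number of
# distinct items covered by its arms.
coverage = {0: {"a", "b"}, 1: {"b"}, 2: {"c", "d"}}
u = lambda S: len(set().union(*(coverage[m] for m in S))) if S else 0
picked = greedy_select(coverage, u, B=2)
```

With B = 2 the greedy pass covers all four items; arm 1 is skipped because its item "b" is already covered, mirroring how Algorithm 1 avoids redundant arms.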
Therefore, we have to learn the expected qualities of arms over time using a MAB framework. However, learning the quality of volatile arms faces special challenges, especially when a universal arm set U (i.e., M^t \u2286 U holds true for all t) is not well-defined. Let us consider an extreme case where the arms arriving in every slot are completely new (i.e., the universal arm set is infinitely large); then the decision maker is unable to play an arm several times to learn its expected quality as in standard MAB algorithms, and it is meaningless to do so since the arm will not appear again. To tailor our CC-MAB to a general case of volatile arms, we resort to the contextual bandit with similarity information. Basically, we divide the arms into different groups based on their context information and learn the expected quality for each group of arms by assuming that arms with similar context information will have similar qualities. This idea is also used by Lipschitz bandit [12] in continuum bandit to deal with the infinite number of arms. CC-MAB is carefully designed to properly group the volatile arms and to define a control policy that makes a good trade-off between exploration (i.e., to learn the expected qualities of arms) and exploitation (i.e., to use the learned qualities to guide the future arm selection) to achieve a sublinear regret.\n\n3.1 Algorithm Structure\n\nThe pseudo-code of CC-MAB is presented in Algorithm 2. Given the time horizon T, CC-MAB first creates a partition P_T which splits the context space X into (h_T)^D hypercubes of identical size (1/h_T) \u00d7 \u00b7\u00b7\u00b7 \u00d7 (1/h_T). These hypercubes correspond to possible arm groups whose expected qualities need to be estimated. The parameter h_T is a critical variable that determines the performance of CC-MAB. Its value design will be discussed in detail later. 
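The partition step can be sketched as follows: build h_T from the horizon (using the value later fixed in Theorem 1) and map each context vector to the hypercube containing it. The clipping of boundary contexts is an implementation choice of this sketch:

```python
import math

def partition_granularity(T, alpha, D):
    """h_T = ceil(T^(1/(3*alpha + D))), the per-dimension number of cells;
    the context space [0,1]^D is split into (h_T)^D hypercubes of side 1/h_T."""
    return math.ceil(T ** (1.0 / (3 * alpha + D)))

def hypercube_of(x, h_T):
    """Integer index of the hypercube containing context x in [0,1]^D.
    Coordinates equal to 1.0 are clipped into the last cell (a convention
    of this sketch)."""
    return tuple(min(int(xi * h_T), h_T - 1) for xi in x)

h_T = partition_granularity(T=10000, alpha=1.0, D=2)  # ceil(10000^(1/5)) = 7
p = hypercube_of((0.31, 0.99), h_T)
```

Counters and sample-mean estimates are then keyed by these tuple indices, so only hypercubes that actually receive arms need to be materialized.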
Each hypercube p \u2208 P_T keeps (i) a counter C^t(p) to record the number of times that an arm with x \u2208 p is chosen and (ii) a collection E^t(p) of observed qualities realized by selected arms with contexts from hypercube p. Then, the quality of arms with context x \u2208 p can be estimated by the sample mean \u02c6r^t(p) = (1/C^t(p)) \u2211_{r\u2208E^t(p)} r. Note that the collection E^t(p) does not appear in the algorithm, since \u02c6r^t(p) can be computed based on \u02c6r^{t\u22121}(p), C^{t\u22121}(p), and the realized qualities of selected arms in the current slot t.\n\nAlgorithm 2 CC-MAB\n1: Input: T, h_T, K(t), X.\n2: Initialization: context partition P_T; set C^0(p) = 0, \u02c6r(p) = 0, \u2200p \u2208 P_T;\n3: for t = 1, . . . , T do:\n4:   Observe arrived arms M^t and their contexts x^t = (x^t_m)_{m\u2208M^t};\n5:   Find p^t = (p^t_m)_{m\u2208M^t} such that x^t_m \u2208 p^t_m, p^t_m \u2208 P_T, m \u2208 M^t;\n6:   Identify the under-explored hypercubes P^{ue,t} and the arm set M^{ue,t}; let q = |M^{ue,t}|;\n7:   if P^{ue,t} \u2260 \u2205 then:   \u22b2 Exploration\n8:     if q \u2265 B then: S^t \u2190 randomly pick B arms from M^{ue,t};\n9:     else: S^t \u2190 pick the q arms in M^{ue,t} and the other (B \u2212 q) as in (6);\n10:  else: S^t \u2190 pick B arms as in (7);   \u22b2 Exploitation\n11:  for each arm m \u2208 S^t do:\n12:    Observe the quality r_m of arm m;\n13:    Update the reward estimate: \u02c6r(p^t_m) = (\u02c6r(p^t_m) C(p^t_m) + r_m) / (C(p^t_m) + 1);\n14:    Update counters: C(p^t_m) = C(p^t_m) + 1;\n\nIn each time slot t, CC-MAB performs the following steps: the context of arrived arms x^t = {x^t_m}_{m\u2208M^t} is observed. For each context x^t_m, the algorithm determines a hypercube p^t_m \u2208 P_T such that x^t_m \u2208 p^t_m holds. The collection of these hypercubes in slot t is denoted by p^t = {p^t_m}_{m\u2208M^t}. Then the algorithm checks if there exist hypercubes p \u2208 p^t that have not been explored sufficiently often. 
For this purpose, we define the under-explored hypercubes in slot t as:\n\nP^{ue,t}_T \u225c {p \u2208 P_T | \u2203 m \u2208 M^t, x^t_m \u2208 p, C^t(p) \u2264 K(t)}   (5)\n\nwhere K(t) is a deterministic, monotonically increasing control function that needs to be designed by CC-MAB. In addition, we collect the arms that fall in the under-explored hypercubes in M^{ue,t} \u225c {m \u2208 M^t | p^t_m \u2208 P^{ue,t}_T}. Depending on the under-explored arms M^{ue,t} in time slot t, CC-MAB can be either in an exploration phase or in an exploitation phase.\n\nIf the set of under-explored arms is non-empty, i.e. M^{ue,t} \u2260 \u2205, the algorithm enters an exploration phase. Let q = |M^{ue,t}| be the number of under-explored arms. If the set of under-explored arms contains at least B arms, i.e. q \u2265 B, then CC-MAB randomly selects B arms from M^{ue,t}. If the under-explored arm set contains fewer than B elements, i.e., q < B, then CC-MAB selects all q arms from M^{ue,t}. Since the budget B is not fully utilized, the remaining (B \u2212 q) additional arms are selected sequentially using the greedy algorithm by exploiting the estimated qualities \u02c6r^t as follows:\n\nm_k = arg max_{m\u2208M^t\\{M^{ue,t}\u222aS_{k\u22121}}} \u2206(\u02c6r^t, {m}|S_{k\u22121} \u222a M^{ue,t}), k = 1, . . . , (B \u2212 q)   (6)\n\nwhere S_{k\u22121} = {m_i}_{i=1}^{k\u22121}. If the arm defined by (6) is not unique, ties are broken arbitrarily. Note that by this procedure, even in exploration phases, the algorithm exploits whenever the number of under-explored arms is smaller than the budget.\n\nIf the set of under-explored arms is empty, i.e., M^{ue,t} = \u2205, the algorithm enters an exploitation phase. It selects all B arms based on estimated qualities using the greedy algorithm:\n\nm_k = arg max_{m\u2208M^t\\S_{k\u22121}} \u2206(\u02c6r^t, {m}|S_{k\u22121}), k = 1, . . . , B   (7)\n\nAfter selecting the arm set, CC-MAB observes the qualities realized by the selected arms and then updates the estimated quality and the counter of each hypercube in p^t.\n\nIt remains to design the input parameter h_T and the control policy K(t) in order to achieve a regret sublinear in the time horizon T, i.e., R(T) = O(T^\u03b3) with \u03b3 < 1, such that CC-MAB guarantees an asymptotically optimal performance since lim_{T\u2192\u221e} R(T)/T = 0 holds.\n\n3.2 Parameter Design and Regret Analysis\n\nIn this section, we design the algorithm parameters h_T and K(t) and give a corresponding upper bound for the regret incurred by CC-MAB. The regret analysis is carried out based on the natural assumption that the expected qualities of arms are similar if they have similar contexts. This assumption is formalized by the H\u00f6lder condition as follows:\n\nAssumption 1 (H\u00f6lder Condition). There exist L > 0, \u03b1 > 0 such that for any two contexts x, x\u2032 \u2208 X, it holds that\n\n|\u00b5(x) \u2212 \u00b5(x\u2032)| \u2264 L ||x \u2212 x\u2032||^\u03b1   (8)\n\nwhere || \u00b7 || denotes the Euclidean norm in R^D.\n\nAssumption 1 is needed for the regret analysis, but it should be noted that CC-MAB can also be applied if this assumption does not hold. However, a regret bound might not be guaranteed in this case. Now, we set h_T = \u2308T^{1/(3\u03b1+D)}\u2309 (where D is the dimension of the context space) for the context space partition, and K(t) = t^{2\u03b1/(3\u03b1+D)} log(t) in each time slot t for identifying the under-explored hypercubes and arms. Then, we have the following sublinear regret upper bound for CC-MAB:\n\nTheorem 1 (Regret Upper Bound). Let K(t) = t^{2\u03b1/(3\u03b1+D)} log(t) and h_T = \u2308T^{1/(3\u03b1+D)}\u2309. If CC-MAB is run with these parameters and the H\u00f6lder condition holds true, the regret R(T) is bounded by\n\nR(T) \u2264 (1 \u2212 1/e) \u00b7 B r_max 2^D (log(T) T^{(2\u03b1+D)/(3\u03b1+D)} + T^{D/(3\u03b1+D)}) + (1 \u2212 1/e) \u00b7 B^2 r_max C(M^max, B) \u03c0^2/3 + (3 B L D^{\u03b1/2} + (2 B r_max + 2 B L D^{\u03b1/2}) / ((2\u03b1+D)/(3\u03b1+D))) T^{(2\u03b1+D)/(3\u03b1+D)}.\n\nThe leading order of the regret R(T) is O(c T^{(2\u03b1+D)/(3\u03b1+D)} log(T)), where c = (1 \u2212 1/e) B r_max 2^D.\n\nProof. See Appendix A in the supplemental file.\n\nThe upper bound in Theorem 1 is valid for any finite time horizon, thereby providing a bound on the performance loss for any finite T. This can be used to characterize the convergence speed of the proposed algorithm. The leading order of the regret upper bound R(T) mainly depends on the context dimension D. The role of D here is similar to the role of the number of arms in standard MAB algorithms, e.g. UCB1 [4]. Since the arm set considered in CC-MAB is infinitely large, CC-MAB splits these arms into different groups (i.e., hypercubes) based on their context information. The constant (1 \u2212 1/e) B r_max 2^D grows exponentially with the context dimension D, and hence the regret tends to be high when the context space is large. However, a learner may apply dimension reduction techniques, e.g., feature selection, based on empirical experience to cut down the context space. In addition, the form of the regret upper bound for CC-MAB is very different from that of many existing MAB algorithms, e.g. contextual/combinatorial/volatile MAB, which are developed on a finite arm set. The continuum-armed bandit (CAB), considering a continuum arm set X = [0, 1], provides a regret upper bound O(c T^{2/3} log^{1/3}(T)) with fixed discretization [13]. 
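To make the leading exponent (2\u03b1+D)/(3\u03b1+D) of Theorem 1 concrete, a quick computation with illustrative parameter values (the specific \u03b1 and D below are arbitrary choices, not from the paper):

```python
def regret_exponent(alpha, D):
    """Leading exponent of the O(c * T^((2*alpha+D)/(3*alpha+D)) * log T)
    bound from Theorem 1."""
    return (2 * alpha + D) / (3 * alpha + D)

# The exponent is strictly below 1 for any alpha > 0, so the bound is
# sublinear; it approaches 2/3 for very smooth rewards (large alpha) and
# approaches 1 as the context dimension D grows.
e_smooth = regret_exponent(alpha=10.0, D=1)  # 21/31, close to 2/3
e_highD = regret_exponent(alpha=1.0, D=50)   # 52/53, nearly linear
```

This mirrors the discussion above: large D pushes the bound toward linear regret, which is the curse of dimensionality of the hypercube partition.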
If CC-MAB is run with a one-dimensional context space and the parameter \u03b1 large enough, its regret upper bound reduces to c T^{2/3} log(T), which is slightly looser than that of the CAB.\n\nWe note that the regret bound, although sublinear in T, is loose when the budget B is close to M^max (the maximum number of arms arriving in each time slot). Consider the special case of B = M^max: CC-MAB is then identical to the oracle algorithm, i.e., it chooses all arms arriving in each time slot and hence the regret is 0. It is intuitive that when the budget B is large, learning is not very much needed, and hence the more challenging regime is when the budget B is small. Furthermore, if we know the specific stochastic pattern of the arm arrivals, a sharper regret bound can be derived. For example, consider a case where the numbers of available arms in each time slot {M^t}_{t=1}^{T} (M^t = |M^t|) are i.i.d. random variables and the probability \u03b2 = E[Pr(B < M^t)], where \u03b2 \u2208 [0, 1], is revealed. Then, the regret upper bound of CC-MAB can be derived in the following corollary.\n\nCorollary 1. Suppose Assumption 1 and E[Pr(B < M^t)] = \u03b2 hold true, and CC-MAB is run with the parameters defined in Theorem 1. Let R_ub be the regret upper bound defined in Theorem 1; then the regret R(T) is bounded by R(T) \u2264 \u03b2 R_ub.\n\nCompared to Theorem 1, the regret is scaled by the parameter \u03b2 in this case. 
This is due to the fact that with probability (1 \u2212 \u03b2) the budget is larger than the number of arrived arms, and hence both CC-MAB and the oracle algorithm select all the arrived arms and no regret is incurred.\n\n3.3 Complexity Analysis\n\nThe computational complexity of CC-MAB is mainly determined by the counters C(p) and the estimated context-specific qualities \u02c6r(p) kept by the learner for each hypercube p \u2208 P_T. If CC-MAB is run with the parameters in Theorem 1, the number of hypercubes is (h_T)^D = \u2308T^{1/(3\u03b1+D)}\u2309^D. Hence, the required memory is sublinear in the time horizon T. However, this also means that when T \u2192 \u221e, the algorithm would require infinite memory. Fortunately, in practical implementations, a learner only needs to keep the counters of hypercubes p to which at least one of the appeared arms' context vectors belongs. Hence the required number of counters that have to be kept is actually much smaller than the analytical requirement. Moreover, the learner may choose to stop splitting the context space after a certain level of granularity so that the number of hypercubes is bounded.\n\n4 Experiments\n\n4.1 Experiment setting\n\nWe evaluate the performance of CC-MAB in a crowdsourcing application based on the data published by Yelp^1. The dataset provides abundant real-world traces for emulating spatial crowdsourcing tasks where Yelp users are assigned tasks to review local businesses. The dataset contains 61,184 businesses, 366,715 users and 1,569,264 reviews.\n\n1) Arms and Context: We divide the time span of the dataset into daily instances, and every day (each time slot) a set of users are selected to review businesses. In each time slot, a user can choose one business from a set of reachable businesses to review. Therefore, each user-business pair is an arm in our crowdsourcing problem. 
Note that the available users vary across time, and the businesses a user can review also change depending on the user's location. This means that the available arms in each round are different. In addition, each user has side information including the number of fans, the number of received votes, and the years a user was elite, which can be used as the arm context.\n\n2) System Reward: When soliciting reviews on businesses, the decision maker expects more businesses covered under the budget constraint. Therefore, the individual users' marginal qualities are maximized if each of them reviews a distinct business. In the case that a business is reviewed more than once, the value of subsequent reviews will be discounted. In order to capture this property, we use the Dixit-Stiglitz preference model [23] to calculate the total reward. Let r_ij denote the j-th review on the i-th business. The quality of each review record (arm) is calculated based on the text length of the comment and the number of votes it received (r_ij = L_ij N^votes_ij, where L_ij is the normalized text length and N^votes_ij is the vote count). Figure 1 shows the distribution of arm quality over the number-of-fans context. We see that the reviews from users with a larger number of fans tend to have a higher quality. Figure 2 further depicts the expected quality of user groups in each hypercube on two context dimensions, number of years as elite and number of fans, which are used to define the context space in the experiment. 
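The per-review quality r_ij = L_ij * N^votes_ij described above can be computed as follows; the normalization of the text length is not specified in the text, so the cap-based scheme below is a hypothetical choice:

```python
def review_quality(text_len, votes, max_len=5000):
    """Quality of review j on business i: normalized text length L_ij times
    the vote count N_ij. The max_len normalizer (capping long reviews) is an
    assumption of this sketch, not taken from the paper."""
    L = min(text_len, max_len) / max_len  # normalized text length in [0, 1]
    return L * votes

# A 1000-character review with 5 votes under a 5000-character cap.
q = review_quality(text_len=1000, votes=5)  # 0.2 * 5 = 1.0
```

These per-review qualities are the arm qualities that CC-MAB estimates per hypercube before applying the submodular aggregation over businesses.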
We see that the review quality is closely related to the users' context.\n\nFigure 1: Arm quality distribution on the fans context (x-axis: number of fans, 0~25 / 26~50 / 51~75 / 75~100 / >100). Figure 2: Expected quality for hypercubes (axes: number of years as elite, 0~3 / 3~6 / 6~9 / >9; number of fans as in Figure 1).\n\nFor each business i, its reward is given by the Dixit-Stiglitz model:\n\nu_i = (\u2211_j (r_ij)^p)^{1/p}, p \u2265 1.   (9)\n\n^1 Yelp dataset challenge: www.yelp.com/dataset/challenge\n\nand the total reward is the sum of the rewards of all businesses: u = \u2211_i u_i. It can be easily verified that the given reward function is submodular for any value of the parameter p \u2265 1. We compare CC-MAB with the following benchmarks:\n\n1) Oracle: Oracle knows precisely the expected quality of each arm. In each time slot, Oracle chooses B arms using the greedy algorithm presented in Algorithm 1.\n2) k-LinUCB: LinUCB [16] is a contextual bandit algorithm which recommends exactly one arm in each round. To select a set of k users, we repeat the LinUCB algorithm k times in each round. By sequentially removing selected arms, we ensure that the k arms returned by k-LinUCB are distinct in each round. Notice that k-LinUCB is unaware of the submodular reward when selecting users.\n3) UCB: The UCB algorithm [4] is a classical MAB algorithm (non-contextual and non-combinatorial) that achieves a logarithmic regret bound. Similar to k-LinUCB, we repeat UCB k times to select multiple users in each round.\n4) CC-MAB-NS: CC-MAB-NS (Non-Submodular) is a variant of the proposed CC-MAB in which the submodularity of the reward function is not considered. 
It simply selects B (or (B \u2212 k)) arms with the\nhighest quality during exploitation (or semi-exploitation).\n5) Random: The Random algorithm picks B arms randomly from the available arms in each round.\n\n4.2 Results and Discussions\n\nFigure 3 shows the cumulative system rewards achieved by CC-MAB and 5 benchmark algorithms.\nAs expected, Oracle has the highest cumulative reward and gives an upper bound to other algo-\nrithms. Among other algorithms, we see that the context-aware algorithms CC-MAB, k-LinUCB\nand CC-MAB-NS outperform UCB and Random algorithm. This indicates that exploiting the con-\ntext information of arms helps to better learn the quality of arms. Further, it can be observed that the\nCC-MAB achieves a close-to-oracle performance while k-LinUCB and CC-MAB-NS incur obvious\nreward loss since they do not consider the submodularity of the reward function.\n\n104\n\nCC-MAB\nOracle\nk-LinUCB\nUCB\nCC-MAB-NS\nRandom\n\nd\nr\na\nw\ne\nr\n \n\ne\nv\ni\nt\n\nl\n\na\nu\nm\nu\nC\n\n18\n\n16\n\n14\n\n12\n\n10\n\n8\n\n6\n\n4\n\n2\n\n0\n\n0\n\n40\n\n80\nTime slot t\n\n120\n\n160\n\n200\n\ns\nd\nn\nu\no\nr\n \n\n0\n0\n2\n\n \n\nn\n\ni\n \n\nd\nr\na\nw\ne\nr\n \ne\nv\ni\nt\n\nl\n\na\nu\nm\nu\nC\n\n105\n\n1.8\n\n1.6\n\n1.4\n\n1.2\n\n1\n\n0.8\n\n0.6\n\n0.4\n\n0.2\n\nCC-MAB\nOracle\nk-LinUCB\nUCB\nCC-MAB-NS\nRandom\n\n10\n\n20\n\n30\n\n40\n\n50\n60\nBudget B\n\n70\n\n80\n\n90\n\n100\n\nFigure 3: Comparison of cumulative rewards.\n\nFigure 4: Cumulative rewards over budgets.\n\nFigure 4 shows the cumulative rewards achieved by 6 algorithms in 200 rounds under different\nbudgets. In general, all the algorithms achieve a higher cumulative reward with a larger budget since\nmore arms can be selected. It is worth noticing that the cumulative rewards obtained by Oracle and\nCC-MAB become saturated at budget B = 50 which is much smaller compared to that of other\nalgorithms that become saturated after B = 90. 
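The budgeted greedy selection performed by Oracle (Algorithm 1 in the paper) can be sketched as follows; this is our own minimal sketch, and the arm qualities, business assignments, and the choice p = 3 are illustrative assumptions rather than values from the experiment:

```python
def total_reward(selected, p=3.0):
    # u = sum over businesses i of (sum_j r_ij^p)^(1/p),
    # where each selected arm is a (business, quality) pair
    by_business = {}
    for business, quality in selected:
        by_business.setdefault(business, []).append(quality)
    return sum(sum(r ** p for r in rs) ** (1.0 / p)
               for rs in by_business.values())

def greedy_select(arms, budget, p=3.0):
    # Pick `budget` arms one by one, each time taking the arm with
    # the largest marginal gain in total reward.
    selected, remaining = [], list(arms)
    for _ in range(min(budget, len(remaining))):
        base = total_reward(selected, p)
        best = max(remaining, key=lambda a: total_reward(selected + [a], p) - base)
        selected.append(best)
        remaining.remove(best)
    return selected

# Two high-quality reviews of business 0 vs. covering business 1 as well:
arms = [(0, 0.9), (0, 0.8), (1, 0.5)]
picked = greedy_select(arms, budget=2)
assert (1, 0.5) in picked  # covering a new business beats a duplicate review
```

Because the reward is submodular, this greedy procedure enjoys the classical (1 − 1/e) approximation guarantee for budget-constrained maximization [17].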
The early saturation of Oracle and CC-MAB occurs because, by accounting for the submodularity of the reward function, they efficiently identify the most beneficial arms, so the arms that are left out offer only a small marginal reward. Note that, in our experiment, the maximum number of arms in each round is set to 100, and hence all algorithms have the same performance at B = 100. In addition, we see that CC-MAB achieves close-to-oracle performance at all budget levels. In Figure 5, we further depict the regret of CC-MAB across different budgets. It shows that the regret decreases as the budget increases. This seems to contradict Theorem 1, where the regret upper bound grows with the budget B. However, the regret upper bound is proved for an arbitrary submodular function and an arbitrarily large arm set in each round. In a setting where the maximum number of arms arriving in each round is fixed, increasing the budget reduces the chance that beneficial arms are left out.

Figure 5: Regret of CC-MAB over budgets. Figure 6: Impact of submodularity.

We also analyze the impact of submodularity on our algorithm. Figure 6 shows the cumulative reward achieved by CC-MAB and CC-MAB-NS under different levels of submodularity, where the level of submodularity is determined by the parameter p in the reward function (9). A larger p indicates stronger submodularity. When p = 1, the reward function becomes non-submodular, so CC-MAB and CC-MAB-NS achieve the same cumulative reward. Moreover, at this point, the achieved reward is maximized since the diminishing returns of submodularity disappear. When the reward function becomes submodular (p > 1), CC-MAB outperforms CC-MAB-NS, and the performance loss incurred by CC-MAB-NS grows as the submodularity becomes stronger.

Other extended simulations are given in Appendix B of the supplemental file.

5 Conclusion

We presented a framework called contextual combinatorial multi-armed bandit that accommodates the combinatorial nature of contextual arms. An efficient algorithm, CC-MAB, was proposed, tailored to volatile arms and submodular reward functions. We rigorously proved that the regret upper bound of the proposed algorithm is sublinear in the time horizon T. Experiments on real-world crowdsourcing data demonstrated that our algorithm learns to explore and exploit arms' rewards by considering the context information and the submodularity of the reward function, and hence improves the cumulative reward compared with many existing MAB algorithms. CC-MAB currently creates static context partitions during initialization, which may be inappropriate in certain cases; a meaningful extension is to generate appropriate partitions dynamically over time based on the distribution of arriving arms in the context space.

6 Acknowledgment

L. Chen and J. Xu's work is supported in part by the Army Research Office under Grant W911NF-18-1-0343. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

[1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs.
Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[3] Z. Bnaya, R. Puzis, R. Stern, and A. Felner. Volatile multi-armed bandits for guaranteed targeted social crawling. In AAAI (Late-Breaking Developments), 2013.

[4] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[5] L. Chen, A. Krause, and A. Karbasi. Interactive submodular bandit. In Advances in Neural Information Processing Systems, pages 141–152, 2017.

[6] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.

[7] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998.

[8] Y. Gai, B. Krishnamachari, and R. Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking (TON), 20(5):1466–1478, 2012.

[9] E. Hazan and S. Kale. Online submodular minimization. Journal of Machine Learning Research, 13(Oct):2903–2922, 2012.

[10] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[11] R. Kleinberg, A. Niculescu-Mizil, and Y. Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.

[12] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 681–690. ACM, 2008.

[13] R. D. Kleinberg.
Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704, 2005.

[14] A. Krause and C. Guestrin. Beyond convexity: Submodularity in machine learning. ICML Tutorials, 2008.

[15] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.

[16] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

[17] G. L. Nemhauser and L. A. Wolsey. Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research, 3(3):177–188, 1978.

[18] L. Qin, S. Chen, and X. Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469. SIAM, 2014.

[19] G. Radanovic, A. Singla, A. Krause, and B. Faltings. Information gathering with peers: Submodular optimization with peer-prediction constraints. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI'18), 2018.

[20] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008.

[21] P. Yang, N. Zhang, S. Zhang, K. Yang, L. Yu, and X. Shen. Identifying the most valuable workers in fog-assisted spatial crowdsourcing. IEEE Internet of Things Journal, 4(5):1193–1203, 2017.

[22] Y. Yue and C. Guestrin. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems, pages 2483–2491, 2011.

[23] Y. Zhang and M.
van der Schaar. Information production and link formation in social computing systems. IEEE Journal on Selected Areas in Communications, 30(11):2136–2145, 2012.