{"title": "Combinatorial Multi-Armed Bandit with General Reward Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1659, "page_last": 1667, "abstract": "In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve $O(\\log T)$ distribution-dependent regret and $\\tilde{O}(\\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. We apply our results to the $K$-MAX problem and expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\\tilde{O}(\\sqrt T)$ bound on the $(1-\\epsilon)$-approximation regret of its online problem, for any $\\epsilon>0$.", "full_text": "Combinatorial Multi-Armed Bandit with General Reward Functions

Wei Chen∗  Wei Hu†  Fu Li‡  Jian Li§  Yu Liu¶  Pinyan Lu‖

Abstract

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. 
Our framework enables a much larger class of reward functions such as the max() function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve O(log T) distribution-dependent regret and Õ(√T) distribution-independent regret, where T is the time horizon. We apply our results to the K-MAX problem and expected utility maximization problems. In particular, for K-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first Õ(√T) bound on the (1 − ε)-approximation regret of its online problem, for any ε > 0.

1 Introduction

Stochastic multi-armed bandit (MAB) is a classical online learning problem typically specified as a player against m machines or arms. Each arm, when pulled, generates a random reward following an unknown distribution. The task of the player is to select one arm to pull in each round based on the historical rewards she has collected, and the goal is to collect as much cumulative reward as possible over multiple rounds. In this paper, unless otherwise specified, we use MAB to refer to stochastic MAB. The MAB problem demonstrates the key tradeoff between exploration and exploitation: whether the player should stick to the choice that performs the best so far, or should try some less explored alternatives that may provide better rewards. 
The performance measure of an MAB strategy is its cumulative regret, which is defined as the difference between the cumulative reward obtained by always playing the arm with the largest expected reward and the cumulative reward achieved by the learning strategy. MAB and its variants have been extensively studied in the literature, with classical results such as tight Θ(log T) distribution-dependent and Θ(√T) distribution-independent upper and lower bounds on the regret in T rounds [19, 2, 1].

An important extension to the classical MAB problem is combinatorial multi-armed bandit (CMAB). In CMAB, the player selects not just one arm in each round, but a subset of arms or a combinatorial

∗Microsoft Research, email: weic@microsoft.com. The authors are listed in alphabetical order.
†Princeton University, email: huwei@cs.princeton.edu.
‡The University of Texas at Austin, email: fuli.theory.research@gmail.com.
§Tsinghua University, email: lapordge@gmail.com.
¶Tsinghua University, email: liuyujyyz@gmail.com.
‖Shanghai University of Finance and Economics, email: lu.pinyan@mail.shufe.edu.cn.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

object in general, referred to as a super arm, which collectively provides a random reward to the player. The reward depends on the outcomes from the selected arms. The player may observe partial feedback from the selected arms to help her in decision making. CMAB has wide applications in online advertising, online recommendation, wireless routing, dynamic channel allocation, etc., because in all these settings the action unit is a combinatorial object (e.g. a set of advertisements, a set of recommended items, a route in a wireless network, and an allocation between channels and users), and the reward depends on unknown stochastic behaviors (e.g. 
users' click-through behaviors, wireless transmission quality, etc.). Therefore CMAB has attracted a lot of attention in online learning research in recent years [12, 8, 22, 15, 7, 16, 18, 17, 23, 9].

Most of these studies focus on linear reward functions, for which the expected reward for playing a super arm is a linear combination of the expected outcomes from the constituent base arms. Even the studies that do generalize to nonlinear reward functions typically still assume that the expected reward for choosing a super arm is a function of the expected outcomes from the constituent base arms in this super arm [8, 17]. However, many natural reward functions do not satisfy this property. For example, for the function max(), which takes a group of variables and outputs the maximum one among them, its expectation depends on the full distributions of the input random variables, not just their means. The function max() and its variants underlie many applications. As an illustrative example, we consider the following scenario in auctions: the auctioneer is repeatedly selling an item to m bidders; in each round the auctioneer selects K bidders to bid; each of the K bidders independently draws her bid from her private valuation distribution and submits the bid; the auctioneer uses the first-price auction to determine the winner and collects the largest bid as the payment.1 The goal of the auctioneer is to gain as high cumulative payments as possible. We refer to this problem as the K-MAX bandit problem, which cannot be effectively solved in the existing CMAB framework.

Beyond the K-MAX problem, many expected utility maximization (EUM) problems are studied in the stochastic optimization literature [27, 20, 21, 4]. The problem can be formulated as maximizing E[u(∑_{i∈S} Xi)] among all feasible sets S, where the Xi's are independent random variables and u(·) is a utility function. 
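The core point above, that the expectation of max() depends on the full input distributions rather than only their means, is easy to verify numerically. Below is a small illustrative sketch (not from the paper): two pairs of arms with identical means 0.5 yield different values of E[max], so no mean-only estimator can distinguish them.

```python
import random

def mc_expect_max(samplers, n=200_000, seed=0):
    """Monte Carlo estimate of E[max_i X_i] for independent arms,
    where each sampler draws one outcome given an RNG."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += max(s(rng) for s in samplers)
    return total / n

def bern(rng):   # Bernoulli(0.5) arm: mean 0.5
    return 1.0 if rng.random() < 0.5 else 0.0

def const(rng):  # deterministic arm: mean 0.5
    return 0.5

# Same means, different distributions, different expected maxima:
# E[max of two Bernoulli(0.5)] = 1 - 0.25 = 0.75, while
# E[max of two constant-0.5 arms] = 0.5.
print(mc_expect_max([bern, bern]))    # ≈ 0.75
print(mc_expect_max([const, const]))  # = 0.5
```

This is exactly why UCB-style estimates of the means alone cannot drive the K-MAX or EUM objectives.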
For example, Xi could be the random delay of edge ei in a routing graph, S a routing path in the graph, and the objective to maximize the utility obtained from the routing path; typically the shorter the delay, the larger the utility. The utility function u(·) is typically nonlinear, to model risk-averse or risk-prone behaviors of users (e.g. a concave utility function is often used to model risk-averse behaviors). The nonlinear utility function makes the objective function much more complicated: in particular, it is no longer a function of the means of the underlying random variables Xi. When the distributions of the Xi's are unknown, we can turn EUM into an online learning problem in which the distributions of the Xi's need to be learned over time from online feedback, and we want to maximize the cumulative reward in the learning process. Again, this is not covered by the existing CMAB framework, since learning only the means of the Xi's is not enough.

In this paper, we generalize the existing CMAB framework with semi-bandit feedback to handle general reward functions, where the expected reward for playing a super arm may depend on more than just the means of the base arms, and the outcome distribution of a base arm can be arbitrary. This generalization is non-trivial, because almost all previous works on CMAB rely on estimating the expected outcomes from base arms, while in our case, we need an estimation method and an analytical tool that deal with the whole distribution, not just its mean. To this end, we turn the problem into estimating the cumulative distribution function (CDF) of each arm's outcome distribution. 
We use a stochastically dominant confidence bound (SDCB) to obtain a distribution that stochastically dominates the true distribution with high probability, and hence we also name our algorithm SDCB. We are able to show O(log T) distribution-dependent and Õ(√T) distribution-independent regret bounds in T rounds. Furthermore, we propose a more efficient algorithm called Lazy-SDCB, which first executes a discretization step and then applies SDCB to the discretized problem. We show that Lazy-SDCB also achieves an Õ(√T) distribution-independent regret bound. Our regret bounds are tight with respect to their dependence on T (up to a logarithmic factor for the distribution-independent bounds). To make our scheme work, we make a few reasonable assumptions, including boundedness, monotonicity and Lipschitz-continuity2 of the reward function, and independence among base arms. We apply our algorithms to the K-MAX and EUM problems, and provide efficient solutions with concrete regret bounds. Along the way, we also provide the first polynomial-time approximation

1We understand that the first-price auction is not truthful, but this example is only for illustrative purposes for the max() function.

2The Lipschitz-continuity assumption is only made for Lazy-SDCB. 
See Section 4.

scheme (PTAS) for the offline K-MAX problem, which is formulated as maximizing E[max_{i∈S} Xi] subject to a cardinality constraint |S| ≤ K, where the Xi's are independent nonnegative random variables.

To summarize, our contributions include: (a) generalizing the CMAB framework to allow a general reward function whose expectation may depend on the entire distributions of the input random variables; (b) proposing the SDCB algorithm to achieve efficient learning in this framework with near-optimal regret bounds, even for arbitrary outcome distributions; (c) giving the first PTAS for the offline K-MAX problem. Our general framework treats any offline stochastic optimization algorithm as an oracle, and effectively integrates it into the online learning framework.

Related Work. As already mentioned, most relevant to our work are studies on CMAB frameworks, among which [12, 16, 18, 9] focus on linear reward functions while [8, 17] look into nonlinear reward functions. In particular, Chen et al. [8] look at general nonlinear reward functions and Kveton et al. [17] consider specific nonlinear reward functions in a conjunctive or disjunctive form, but both papers require that the expected reward of playing a super arm is determined by the expected outcomes from base arms.

The only work in combinatorial bandits we are aware of that does not require the above assumption on the expected reward is [15], which is based on a general Thompson sampling framework. However, they assume that the joint distribution of base arm outcomes is from a known parametric family with a known likelihood function, and only the parameters are unknown. They also assume the parameter space to be finite. In contrast, our general case is non-parametric, where we allow arbitrary bounded distributions. 
Although in our known finite support case the distribution can be parametrized by probabilities on all supported points, our parameter space is continuous. Moreover, it is unclear how to efficiently compute posteriors in their algorithm, and their regret bounds depend on complicated problem-dependent coefficients which may be very large for many combinatorial problems. They also provide a result on the K-MAX problem, but they only consider Bernoulli outcomes from base arms, which is much simpler than our case where general distributions are allowed.

There are extensive studies on the classical MAB problem, for which we refer to the survey by Bubeck and Cesa-Bianchi [5]. There are also some studies on adversarial combinatorial bandits, e.g. [26, 6]. Although that setting bears conceptual similarities with stochastic CMAB, the techniques used are different.

Expected utility maximization (EUM) encompasses a large class of stochastic optimization problems and has been well studied (e.g. [27, 20, 21, 4]). To the best of our knowledge, we are the first to study the online learning version of these problems, and we provide a general solution that systematically addresses all of them as long as an offline (approximation) algorithm is available. The K-MAX problem may be traced back to [13], where Goel et al. provide a constant-factor approximation algorithm for a generalized version in which the objective is to choose a subset S of cost at most K and maximize the expectation of a certain knapsack profit.

2 Setup and Notation

Problem Formulation. We model a combinatorial multi-armed bandit (CMAB) problem as a tuple (E, F, D, R), where E = [m] = {1, 2, . . . , m} is a set of m (base) arms, F ⊆ 2^E is a set of subsets of E, D is a probability distribution over [0, 1]^m, and R is a reward function defined on [0, 1]^m × F. The arms produce stochastic outcomes X = (X1, X2, . . .
, Xm) drawn from distribution D, where the i-th entry Xi is the outcome from the i-th arm. Each feasible subset of arms S ∈ F is called a super arm. Under a realization of outcomes x = (x1, . . . , xm), the player receives a reward R(x, S) when she chooses the super arm S to play. Without loss of generality, we assume the reward value to be nonnegative. Let K = max_{S∈F} |S| be the maximum size of any super arm.

Let X^(1), X^(2), . . . be an i.i.d. sequence of random vectors drawn from D, where X^(t) = (X^(t)_1, . . . , X^(t)_m) is the outcome vector generated in the t-th round. In the t-th round, the player chooses a super arm St ∈ F to play, and then the outcomes from all arms in St, i.e., {X^(t)_i | i ∈ St}, are revealed to the player. According to the definition of the reward function, the reward value in the t-th round is R(X^(t), St). The expected reward for choosing a super arm S in any round is denoted by rD(S) = E_{X∼D}[R(X, S)].

We also assume that for a fixed super arm S ∈ F, the reward R(x, S) only depends on the revealed outcomes xS = (xi)_{i∈S}. Therefore, we can alternatively express R(x, S) as RS(xS), where RS is a function defined on [0, 1]^S.3

A learning algorithm A for the CMAB problem selects which super arm to play in each round based on the revealed outcomes in all previous rounds. Let S^A_t be the super arm selected by A in the t-th round.4 The goal is to maximize the expected cumulative reward in T rounds, which is E[∑_{t=1}^T R(X^(t), S^A_t)] = ∑_{t=1}^T E[rD(S^A_t)]. Note that when the underlying distribution D is known, the optimal algorithm A∗ chooses the optimal super arm S∗ = argmax_{S∈F} {rD(S)} in every round. 
The quality of an algorithm A is measured by its regret in T rounds, which is the difference between the expected cumulative reward of the optimal algorithm A∗ and that of A:

Reg^A_D(T) = T · rD(S∗) − ∑_{t=1}^T E[rD(S^A_t)].

For some CMAB problem instances, the optimal super arm S∗ may be computationally hard to find even when the distribution D is known, but efficient approximation algorithms may exist, i.e., an α-approximate (0 < α ≤ 1) solution S′ ∈ F which satisfies rD(S′) ≥ α · max_{S∈F} {rD(S)} can be efficiently found given D as input. We will shortly provide the exact formulation of our requirement on such an α-approximation computation oracle. In such cases, it is not fair to compare a CMAB algorithm A with the optimal algorithm A∗ which always chooses the optimal super arm S∗. Instead, we define the α-approximation regret of an algorithm A as

Reg^A_{D,α}(T) = T · α · rD(S∗) − ∑_{t=1}^T E[rD(S^A_t)].

As mentioned, almost all previous work on CMAB requires that the expected reward rD(S) of a super arm S depend only on the expectation vector μ = (μ1, . . . , μm) of the outcomes, where μi = E_{X∼D}[Xi]. This is a strong restriction that cannot be satisfied by a general nonlinear function RS and a general distribution D. The main motivation of this work is to remove this restriction.

Assumptions. Throughout this paper, we make several assumptions on the outcome distribution D and the reward function R.

Assumption 1 (Independent outcomes from arms). The outcomes from all m arms are mutually independent, i.e., for X ∼ D, X1, X2, . . . , Xm are mutually independent. 
We write D as D = D1 × D2 × ··· × Dm, where Di is the distribution of Xi.

We remark that the above independence assumption is also made in past studies on the offline EUM and K-MAX problems [27, 20, 21, 4, 13], so it is not an extra assumption for the online learning case.

Assumption 2 (Bounded reward value). There exists M > 0 such that for any x ∈ [0, 1]^m and any S ∈ F, we have 0 ≤ R(x, S) ≤ M.

Assumption 3 (Monotone reward function). If two vectors x, x′ ∈ [0, 1]^m satisfy xi ≤ x′i (∀i ∈ [m]), then for any S ∈ F, we have R(x, S) ≤ R(x′, S).

Computation Oracle for Discrete Distributions with Finite Supports. We require that there exists an α-approximation computation oracle (0 < α ≤ 1) for maximizing rD(S) when each Di (i ∈ [m]) has a finite support. In this case, Di can be fully described by a finite set of numbers, i.e., its support {v_{i,1}, v_{i,2}, . . . , v_{i,s_i}} and the values of its cumulative distribution function (CDF) Fi on the supported points: Fi(v_{i,j}) = Pr_{Xi∼Di}[Xi ≤ v_{i,j}] (j ∈ [s_i]). 
The oracle takes such a representation of D as input, and outputs a super arm S′ = Oracle(D) ∈ F such that rD(S′) ≥ α · max_{S∈F} {rD(S)}.

3 SDCB Algorithm

3[0, 1]^S is isomorphic to [0, 1]^{|S|}; the coordinates in [0, 1]^S are indexed by elements of S.
4Note that S^A_t may be random due to the random outcomes in previous rounds and the possible randomness used by A.

Algorithm 1 SDCB (Stochastically dominant confidence bound)
1: Throughout the algorithm, for each arm i ∈ [m], maintain: (i) a counter Ti which stores the number of times arm i has been played so far, and (ii) the empirical distribution D̂i of the observed outcomes from arm i so far, represented by its CDF F̂i
2: // Initialization
3: for i = 1 to m do
4:   // Action in the i-th round
5:   Play a super arm Si that contains arm i
6:   Update Tj and F̂j for each j ∈ Si
7: end for
8: for t = m + 1, m + 2, . . . do
9:   // Action in the t-th round
10:  For each i ∈ [m], let D̲i be a distribution whose CDF F̲i is

       F̲i(x) = max{ F̂i(x) − √(3 ln t / (2Ti)), 0 }  for 0 ≤ x < 1,  and  F̲i(1) = 1

11:  Play the super arm St ← Oracle(D̲), where D̲ = D̲1 × D̲2 × ··· × D̲m
12:  Update Tj and F̂j for each j ∈ St
13: end for

We present our algorithm stochastically dominant confidence bound (SDCB) in Algorithm 1. Throughout the algorithm, we store, in a variable Ti, the number of times the outcomes from arm i have been observed so far. We also maintain the empirical distribution D̂i of the observed outcomes from arm i so far, which can be represented by its CDF F̂i: for x ∈ [0, 1], the value of F̂i(x) is just the fraction of the observed outcomes from arm i that are no larger than x. 
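The per-arm bookkeeping and the lowered CDF of line 10 can be sketched as follows. This is an illustrative sketch, not the paper's code; the class name and method names are ours, and the radius √(3 ln t / (2Ti)) is taken directly from Algorithm 1.

```python
import bisect
import math

class ArmEstimator:
    """Per-arm state for SDCB: observed samples, empirical CDF F̂_i,
    and the lowered (stochastically dominant) CDF fed to the oracle."""

    def __init__(self):
        self.samples = []   # sorted observed outcomes in [0, 1]
        self.T = 0          # number of observations of this arm

    def observe(self, x):
        bisect.insort(self.samples, x)
        self.T += 1

    def empirical_cdf(self, x):
        # fraction of observed outcomes that are <= x
        return bisect.bisect_right(self.samples, x) / self.T

    def dominant_cdf(self, x, t):
        # Line 10 of Algorithm 1: lower the empirical CDF by the
        # radius sqrt(3 ln t / (2 T_i)), clip at 0, and force the
        # CDF to 1 at the right endpoint x = 1.
        if x >= 1:
            return 1.0
        radius = math.sqrt(3 * math.log(t) / (2 * self.T))
        return max(self.empirical_cdf(x) - radius, 0.0)
```

Lowering the CDF pushes probability mass toward larger outcomes, which is what makes the resulting distribution first-order stochastically dominate the empirical one.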
Note that F̂i is always a step function which has "jumps" at the points that are observed outcomes from arm i. Therefore it suffices to store these discrete points as well as the values of F̂i at these points in order to store the whole function F̂i. Similarly, the later computation of the stochastically dominant CDF F̲i (line 10) only requires computation at these points, and the input to the offline oracle only needs to provide these points and the corresponding CDF values (line 11).

The algorithm starts with m initialization rounds in which each arm is played at least once5 (lines 2-7). In the t-th round (t > m), the algorithm consists of three steps. First, it calculates for each i ∈ [m] a distribution D̲i whose CDF F̲i is obtained by lowering the CDF F̂i (line 10). The second step is to call the α-approximation oracle with the newly constructed distribution D̲ = D̲1 × ··· × D̲m as input (line 11); thus the super arm St output by the oracle satisfies rD̲(St) ≥ α · max_{S∈F} {rD̲(S)}. Finally, the algorithm chooses the super arm St to play, observes the outcomes from all arms in St, and updates the Tj's and F̂j's accordingly for each j ∈ St.

The idea behind our algorithm is the optimism in the face of uncertainty principle, which is the key principle behind UCB-type algorithms. Our algorithm ensures that with high probability we have F̲i(x) ≤ Fi(x) simultaneously for all i ∈ [m] and all x ∈ [0, 1], where Fi is the CDF of the outcome distribution Di. This means that each D̲i has first-order stochastic dominance over Di.6 Then from the monotonicity property of R(x, S) (Assumption 3) we know that rD̲(S) ≥ rD(S) holds for all S ∈ F with high probability. Therefore D̲ provides an "optimistic" estimation of the expected reward from each super arm.

Regret Bounds. 
We prove O(log T) distribution-dependent and O(√(T log T)) distribution-independent upper bounds on the regret of SDCB (Algorithm 1).

5Without loss of generality, we assume that each arm i ∈ [m] is contained in at least one super arm.
6We remark that while F̲i(x) is a numerical lower confidence bound on Fi(x) for all x ∈ [0, 1], at the distribution level, D̲i serves as a "stochastically dominant (upper) confidence bound" on Di.

We call a super arm S bad if rD(S) < α · rD(S∗). For each super arm S, we define

Δ_S = max{α · rD(S∗) − rD(S), 0}.

Let F_B = {S ∈ F | Δ_S > 0}, which is the set of all bad super arms. Let E_B ⊆ [m] be the set of arms that are contained in at least one bad super arm. For each i ∈ E_B, we define

Δ_{i,min} = min{Δ_S | S ∈ F_B, i ∈ S}.

Recall that M is an upper bound on the reward value (Assumption 2) and K = max_{S∈F} |S|.

Theorem 1. A distribution-dependent upper bound on the α-approximation regret of SDCB (Algorithm 1) in T rounds is

∑_{i∈E_B} (2136 / Δ_{i,min}) M²K ln T + (π²/3 + 1) αMm,

and a distribution-independent upper bound is

93M √(mKT ln T) + (π²/3 + 1) αMm.

The proof of Theorem 1 is given in the supplementary material. The main idea is to reduce our analysis on general reward functions satisfying Assumptions 1-3 to the one in [18] that deals with the summation reward function R(x, S) = ∑_{i∈S} xi. Our analysis relies on the Dvoretzky-Kiefer-Wolfowitz inequality [10, 24], which gives a uniform concentration bound on the empirical CDF of a distribution.

Applying Our Algorithm to the Previous CMAB Framework. 
Although our focus is on general reward functions, we note that when SDCB is applied to the previous CMAB framework, where the expected reward depends only on the means of the random variables, it can achieve the same regret bounds as the previous combinatorial upper confidence bound (CUCB) algorithm in [8, 18].

Let μi = E_{X∼D}[Xi] be arm i's mean outcome. In each round CUCB calculates (for each arm i) an upper confidence bound μ̄i on μi, with the essential property that μi ≤ μ̄i ≤ μi + Λi holds with high probability, for some Λi > 0. In SDCB, we use D̲i as a stochastically dominant confidence bound on Di. We can show that μi ≤ E_{Yi∼D̲i}[Yi] ≤ μi + Λi holds with high probability, with the same interval length Λi as in CUCB. (The proof is given in the supplementary material.) Hence, the analysis in [8, 18] can be applied to SDCB, resulting in the same regret bounds. We further remark that in this case we do not need the three assumptions stated in Section 2 (in particular the independence assumption on the Xi's): the summation reward case works just as in [18], and the nonlinear reward case relies on the properties of monotonicity and bounded smoothness used in [8].

4 Improved SDCB Algorithm by Discretization

In Section 3, we have shown that our algorithm SDCB achieves near-optimal regret bounds. However, the algorithm might suffer from large running time and memory usage. Note that, in the t-th round, an arm i might have been observed t − 1 times already, and it is possible that all the observed values from arm i are different (e.g., when arm i's outcome distribution Di is continuous). In such a case, it takes Θ(t) space to store the empirical CDF F̂i of the observed outcomes from arm i, and both calculating the stochastically dominant CDF F̲i and updating F̂i take Θ(t) time. 
Therefore, the worst-case space usage of SDCB in T rounds is Θ(T), and the worst-case running time is Θ(T²) (ignoring the dependence on m and K); here we do not count the time and space used by the offline computation oracle.

In this section, we propose an improved algorithm, Lazy-SDCB, which reduces the worst-case memory usage and running time to O(√T) and O(T^{3/2}), respectively, while preserving the O(√(T log T)) distribution-independent regret bound. To this end, we need an additional assumption on the reward function:

Assumption 4 (Lipschitz-continuous reward function). There exists C > 0 such that for any S ∈ F and any x, x′ ∈ [0, 1]^m, we have |R(x, S) − R(x′, S)| ≤ C‖xS − x′S‖₁, where ‖xS − x′S‖₁ = ∑_{i∈S} |xi − x′i|.

Algorithm 2 Lazy-SDCB with known time horizon
Input: time horizon T
1: s ← ⌈√T⌉
2: Ij ← [0, 1/s] for j = 1, and Ij ← ((j−1)/s, j/s] for j = 2, . . . , s
3: Invoke SDCB (Algorithm 1) for T rounds, with the following change: whenever observing an outcome x (from any arm), find j ∈ [s] such that x ∈ Ij, and regard this outcome as j/s

Algorithm 3 Lazy-SDCB without knowing the time horizon
1: q ← ⌈log₂ m⌉
2: In rounds 1, 2, . . . , 2^q, invoke Algorithm 2 with input T = 2^q
3: for k = q, q + 1, q + 2, . . . do
4:   In rounds 2^k + 1, 2^k + 2, . . . , 2^{k+1}, invoke Algorithm 2 with input T = 2^k
5: end for

We first describe the algorithm when the time horizon T is known in advance. The algorithm is summarized in Algorithm 2. We perform a discretization on the distribution D = D1 × ··· × Dm to obtain a discrete distribution D̃ = D̃1 × ··· × D̃m such that (i) for X̃ ∼ D̃, X̃1, . . .
, X̃m are also mutually independent, and (ii) every D̃i is supported on a set of equally-spaced values {1/s, 2/s, . . . , 1}, where s is set to be ⌈√T⌉. Specifically, we partition [0, 1] into s intervals: I1 = [0, 1/s], I2 = (1/s, 2/s], . . . , I_{s−1} = ((s−2)/s, (s−1)/s], Is = ((s−1)/s, 1], and define D̃i by

Pr_{X̃i∼D̃i}[X̃i = j/s] = Pr_{Xi∼Di}[Xi ∈ Ij],  j = 1, . . . , s.

For the CMAB problem ([m], F, D, R), our algorithm "pretends" that the outcomes are drawn from D̃ instead of D, by replacing any outcome x ∈ Ij with j/s (∀j ∈ [s]), and then applies SDCB to the problem ([m], F, D̃, R). Since each D̃i has the known support {1/s, 2/s, . . . , 1}, the algorithm only needs to maintain the number of occurrences of each support value in order to obtain the empirical CDF of all the observed outcomes from arm i. Therefore, all the operations in a round can be done using O(s) = O(√T) time and space, and the total time and space used by Lazy-SDCB are O(T^{3/2}) and O(√T), respectively.

The discretization parameter s in Algorithm 2 depends on the time horizon T, which is why Algorithm 2 has to know T in advance. We can use the doubling trick to avoid this dependency. We present such an algorithm (without knowing T) in Algorithm 3. It is easy to see that Algorithm 3 has the same asymptotic time and space usage as Algorithm 2.

Regret Bounds. We show that both Algorithm 2 and Algorithm 3 achieve O(√(T log T)) distribution-independent regret bounds. The full proofs are given in the supplementary material. Recall that C is the coefficient in the Lipschitz condition in Assumption 4.

Theorem 2. Suppose the time horizon T is known in advance. 
Then the α-approximation regret of Algorithm 2 in T rounds is at most

93M √(mKT ln T) + 2CK √T + (π²/3 + 1) αMm.

Proof Sketch. The regret consists of two parts: (i) the regret for the discretized CMAB problem ([m], F, D̃, R), and (ii) the error due to discretization. We directly apply Theorem 1 to the first part. For the second part, a key step is to show |rD(S) − rD̃(S)| ≤ CK/s for all S ∈ F (see the supplementary material).

Theorem 3. For any time horizon T ≥ 2, the α-approximation regret of Algorithm 3 in T rounds is at most

318M √(mKT ln T) + 7CK √T + 10αMm ln T.

5 Applications

We describe the K-MAX problem and the class of expected utility maximization problems as applications of our general CMAB framework.

The K-MAX Problem. In this problem, the player is allowed to select at most K arms from the set of m arms in each round, and the reward is the maximum one among the outcomes from the selected arms. In other words, the set of feasible super arms is F = {S ⊆ [m] : |S| ≤ K}, and the reward function is R(x, S) = max_{i∈S} xi. It is easy to verify that this reward function satisfies Assumptions 2, 3 and 4 with M = C = 1.

Now we consider the corresponding offline K-MAX problem of selecting at most K arms from m independent arms, with the largest expected reward. A result in [14] implies that finding the exact optimal solution is NP-hard, so we resort to approximation algorithms. We can show, using submodularity, that a simple greedy algorithm achieves a (1 − 1/e)-approximation. Furthermore, we give the first PTAS for this problem. 
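The greedy (1 − 1/e)-approximation mentioned above can be sketched as follows for finite-support arms. This is an illustrative sketch (not the paper's PTAS, and the function names are ours): since E[max_{i∈S} Xi] is monotone submodular in S, greedily adding the arm with the largest marginal gain gives the (1 − 1/e) guarantee; for independent arms with explicitly given finite supports, E[max] can be computed exactly because the CDF of the maximum is the product of the per-arm CDFs.

```python
def cdf(dist, v):
    """CDF at v of a finite-support distribution given as {value: prob}."""
    return sum(p for x, p in dist.items() if x <= v)

def expected_max(dists, S):
    """Exact E[max_{i in S} X_i] for independent finite-support arms."""
    if not S:
        return 0.0
    grid = sorted({v for i in S for v in dists[i]})
    exp, prev = 0.0, 0.0
    for v in grid:
        f = 1.0
        for i in S:
            f *= cdf(dists[i], v)  # CDF of the max = product of CDFs
        exp += v * (f - prev)      # f - prev = P[max = v]
        prev = f
    return exp

def greedy_kmax(dists, K):
    """Greedy (1 - 1/e)-approximation for offline K-MAX: E[max_{i in S} X_i]
    is monotone submodular, so repeatedly add the best marginal arm."""
    S = []
    for _ in range(min(K, len(dists))):
        rest = [i for i in range(len(dists)) if i not in S]
        S.append(max(rest, key=lambda i: expected_max(dists, S + [i])))
    return S
```

For example, with a Bernoulli(0.5) arm, a constant-0.5 arm, and a Bernoulli(0.1) arm, greedy with K = 2 picks the first two, whose expected maximum is 0.75.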
Our PTAS can be generalized to constraints other than the cardinality constraint |S| ≤ K, including s-t simple paths, matchings, knapsacks, etc. The algorithms and corresponding proofs are given in the supplementary material.

Theorem 4. There exists a PTAS for the offline K-MAX problem. In other words, for any constant ε > 0, there is a polynomial-time (1 − ε)-approximation algorithm for the offline K-MAX problem.

We can thus apply our SDCB algorithm to the K-MAX bandit problem and obtain O(log T) distribution-dependent and Õ(√T) distribution-independent regret bounds according to Theorem 1, or apply Lazy-SDCB to get the Õ(√T) distribution-independent bound according to Theorem 2 or 3. Streeter and Golovin [26] study an online submodular maximization problem in the oblivious adversary model. In particular, their result covers the stochastic K-MAX bandit problem as a special case, and an O(K√(mT log m)) upper bound on the (1 − 1/e)-approximation regret can be shown. While the techniques in [26] can only give a bound on the (1 − 1/e)-approximation regret for K-MAX, we can obtain the first Õ(√T) bound on the (1 − ε)-approximation regret for any constant ε > 0, using our PTAS as the offline oracle. Even when we use the simple greedy algorithm as the oracle, our experiments show that SDCB performs significantly better than the algorithm in [26] (see the supplementary material).

Expected Utility Maximization. Our framework can also be applied to reward functions of the form R(x, S) = u(∑_{i∈S} xi), where u(·) is an increasing utility function. The corresponding offline problem is to maximize the expected utility E[u(∑_{i∈S} xi)] subject to a feasibility constraint S ∈ F. Note that if u is nonlinear, the expected utility may not be a function of the means of the arms in S.
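A tiny numerical illustration (our own, with a hypothetical helper name, not from the paper) of why mean estimates alone do not suffice: two outcome distributions with the identical mean 0.5 yield different expected utilities under a concave, risk-averse u.

```python
import math

def expected_utility(dist, u):
    """E[u(X)] for a discrete distribution given as a {value: probability} dict."""
    return sum(p * u(v) for v, p in dist.items())

u = math.sqrt  # an increasing concave utility function (risk-averse)

sure = {0.5: 1.0}              # deterministic outcome 0.5
risky = {0.0: 0.5, 1.0: 0.5}   # fair coin between 0 and 1; same mean 0.5

print(expected_utility(sure, u))   # sqrt(0.5) ~ 0.707
print(expected_utility(risky, u))  # 0.5 * sqrt(1.0) = 0.5
```

An algorithm that estimates only each arm's mean, such as standard UCB, cannot distinguish these two arms, whereas the estimated distributions maintained by SDCB can.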
Following the celebrated von Neumann-Morgenstern expected utility theorem, nonlinear utility functions have been extensively used to capture risk-averse or risk-prone behaviors in economics (see, e.g., [11]), while linear utility functions correspond to risk-neutrality. Li and Deshpande [20] obtain a PTAS for the expected utility maximization (EUM) problem for several classes of utility functions (including, for example, increasing concave functions, which typically indicate risk-averseness) and a large class of feasibility constraints (including cardinality constraints, s-t simple paths, matchings, and knapsacks). Similar results for other utility functions and feasibility constraints can be found in [27, 21, 4]. In the online problem, we can apply our algorithms, using their PTASs as the offline oracle. Again, we obtain the first tight bounds on the (1 − ε)-approximation regret for any ε > 0 for the class of online EUM problems.

Acknowledgments

Wei Chen was supported in part by the National Natural Science Foundation of China (Grant No. 61433014). Jian Li and Yu Liu were supported in part by the National Basic Research Program of China grants 2015CB358700, 2011CBA00300, 2011CBA00301, and the National NSFC grants 61033001, 61361136003. The authors would like to thank Tor Lattimore for referring us to the DKW inequality.

References

[1] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[3] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[4] Anand Bhalgat and Sanjeev Khanna. A utility equivalence theorem for concave functions.
In IPCO, pages 126–137. Springer, 2014.

[5] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[6] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[7] Shouyuan Chen, Tian Lin, Irwin King, Michael R. Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In NIPS, 2014.

[8] Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1–33, 2016.

[9] Richard Combes, M. Sadegh Talebi, Alexandre Proutiere, and Marc Lelarge. Combinatorial bandits revisited. In NIPS, 2015.

[10] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.

[11] P. C. Fishburn. The Foundations of Expected Utility. Dordrecht: Reidel, 1982.

[12] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012.

[13] Ashish Goel, Sudipto Guha, and Kamesh Munagala. Asking the right questions: Model-driven optimization using probes. In PODS, pages 203–212. ACM, 2006.

[14] Ashish Goel, Sudipto Guha, and Kamesh Munagala. How to probe for an extreme value. ACM Transactions on Algorithms (TALG), 7(1):12:1–12:20, 2010.

[15] Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems.
In ICML, pages 100–108, 2014.

[16] Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Brian Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In UAI, pages 420–429, 2014.

[17] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Combinatorial cascading bandits. In NIPS, 2015.

[18] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In AISTATS, pages 535–543, 2015.

[19] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[20] Jian Li and Amol Deshpande. Maximizing expected utility for stochastic combinatorial optimization problems. In FOCS, pages 797–806, 2011.

[21] Jian Li and Wen Yuan. Stochastic combinatorial optimization via Poisson approximation. In STOC, pages 971–980, 2013.

[22] Tian Lin, Bruno Abrahao, Robert Kleinberg, John Lui, and Wei Chen. Combinatorial partial monitoring game with linear feedback and its applications. In ICML, pages 901–909, 2014.

[23] Tian Lin, Jian Li, and Wei Chen. Stochastic online greedy learning with semi-bandit feedbacks. In NIPS, 2015.

[24] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.

[25] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions – I. Mathematical Programming, 14(1):265–294, 1978.

[26] Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008.

[27] Jiajin Yu and Shabbir Ahmed. Maximizing expected utility over a knapsack constraint.
Operations Research Letters, 44(2):180–185, 2016.