{"title": "Combinatorial Cascading Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1450, "page_last": 1458, "abstract": "We propose combinatorial cascading bandits, a class of partial monitoring problems where at each step a learning agent chooses a tuple of ground items subject to constraints and receives a reward if and only if the weights of all chosen items are one. The weights of the items are binary, stochastic, and drawn independently of each other. The agent observes the index of the first chosen item whose weight is zero. This observation model arises in network routing, for instance, where the learning agent may only observe the first link in the routing path which is down, and blocks the path. We propose a UCB-like algorithm for solving our problems, CombCascade; and prove gap-dependent and gap-free upper bounds on its n-step regret. Our proofs build on recent work in stochastic combinatorial semi-bandits but also address two novel challenges of our setting, a non-linear reward function and partial observability. We evaluate CombCascade on two real-world problems and show that it performs well even when our modeling assumptions are violated. We also demonstrate that our setting requires a new learning algorithm.", "full_text": "Combinatorial Cascading Bandits\n\nBranislav Kveton\nAdobe Research\nSan Jose, CA\n\nkveton@adobe.com\n\nZheng Wen\nYahoo Labs\n\nSunnyvale, CA\n\nzhengwen@yahoo-inc.com\n\nAzin Ashkan\n\nTechnicolor Research\n\nLos Altos, CA\n\nazin.ashkan@technicolor.com\n\nCsaba Szepesv\u00b4ari\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nszepesva@cs.ualberta.ca\n\nAbstract\n\nWe propose combinatorial cascading bandits, a class of partial monitoring prob-\nlems where at each step a learning agent chooses a tuple of ground items subject\nto constraints and receives a reward if and only if the weights of all chosen items\nare one. 
The weights of the items are binary, stochastic, and drawn independently\nof each other. The agent observes the index of the \ufb01rst chosen item whose weight\nis zero. This observation model arises in network routing, for instance, where the\nlearning agent may only observe the \ufb01rst link in the routing path which is down,\nand blocks the path. We propose a UCB-like algorithm for solving our problems,\nCombCascade; and prove gap-dependent and gap-free upper bounds on its n-step\nregret. Our proofs build on recent work in stochastic combinatorial semi-bandits\nbut also address two novel challenges of our setting, a non-linear reward function\nand partial observability. We evaluate CombCascade on two real-world problems\nand show that it performs well even when our modeling assumptions are violated.\nWe also demonstrate that our setting requires a new learning algorithm.\n\n1\n\nIntroduction\n\nCombinatorial optimization [16] has many real-world applications. In this work, we study a class of\ncombinatorial optimization problems with a binary objective function that returns one if and only if\nthe weights of all chosen items are one. The weights of the items are binary, stochastic, and drawn\nindependently of each other. Many popular optimization problems can be formulated in our setting.\nNetwork routing is a problem of choosing a routing path in a computer network that maximizes the\nprobability that all links in the chosen path are up. Recommendation is a problem of choosing a list\nof items that minimizes the probability that none of the recommended items are attractive. Both of\nthese problems are closely related and can be solved using similar techniques (Section 2.3).\nCombinatorial cascading bandits are a novel framework for online learning of the aforementioned\nproblems where the distribution over the weights of items is unknown. Our goal is to maximize the\nexpected cumulative reward of a learning agent in n steps. 
Our learning problem is challenging for\ntwo main reasons. First, the reward function is non-linear in the weights of chosen items. Second,\nwe only observe the index of the \ufb01rst chosen item with a zero weight. This kind of feedback arises\nfrequently in network routing, for instance, where the learning agent may only observe the \ufb01rst link\nin the routing path which is down, and blocks the path. This feedback model was recently proposed\nin the so-called cascading bandits [10]. The main difference in our work is that the feasible set can\nbe arbitrary. The feasible set in cascading bandits is a uniform matroid.\n\n1\n\n\fStochastic online learning with combinatorial actions has been previously studied with semi-bandit\nfeedback and a linear reward function [8, 11, 12], and its monotone transformation [5]. Established\nalgorithms for multi-armed bandits, such as UCB1 [3], KL-UCB [9], and Thompson sampling [18, 2];\ncan be usually easily adapted to stochastic combinatorial semi-bandits. However, it is non-trivial to\nshow that the algorithms are statistically ef\ufb01cient, in the sense that their regret matches some lower\nbound. Kveton et al. [12] recently showed this for CombUCB1, a form of UCB1. Our analysis builds\non this recent advance but also addresses two novel challenges of our problem, a non-linear reward\nfunction and partial observability. These challenges cannot be addressed straightforwardly based on\nKveton et al. [12, 10].\nWe make multiple contributions. In Section 2, we de\ufb01ne the online learning problem of combinato-\nrial cascading bandits and propose CombCascade, a variant of UCB1, for solving it. CombCascade\nis computationally ef\ufb01cient on any feasible set where a linear function can be optimized ef\ufb01ciently.\nA minor-looking improvement to the UCB1 upper con\ufb01dence bound, which exploits the fact that the\nexpected weights of items are bounded by one, is necessary in our analysis. 
In Section 3, we derive\ngap-dependent and gap-free upper bounds on the regret of CombCascade, and discuss the tightness\nof these bounds. In Section 4, we evaluate CombCascade on two practical problems and show that\nthe algorithm performs well even when our modeling assumptions are violated. We also show that\nCombUCB1 [8, 12] cannot solve some instances of our problem, which highlights the need for a new\nlearning algorithm.\n\n2 Combinatorial Cascading Bandits\n\nThis section introduces our learning problem, its applications, and our proposed algorithm. We\ndiscuss the computational complexity of the algorithm and then introduce the so-called disjunctive\nvariant of our problem. We denote random variables by boldface letters. The cardinality of set A is\n|A| and we assume that $\\min \\emptyset = +\\infty$. The binary and operation is denoted by $\\wedge$, and the binary or\nis $\\vee$.\n\n2.1 Setting\n\nWe model our online learning problem as a combinatorial cascading bandit. A combinatorial\ncascading bandit is a tuple $B = (E, P, \\Theta)$, where $E = \\{1, \\dots, L\\}$ is a finite set of L ground items, P\nis a probability distribution over a binary hypercube $\\{0, 1\\}^E$, $\\Theta \\subseteq \\Pi^*(E)$, and:\n\n$$\\Pi^*(E) = \\{(a_1, \\dots, a_k) : k \\geq 1, \\; a_1, \\dots, a_k \\in E, \\; a_i \\neq a_j \\text{ for any } i \\neq j\\}$$\n\nis the set of all tuples of distinct items from E. We refer to $\\Theta$ as the feasible set and to $A \\in \\Theta$ as a\nfeasible solution. We abuse our notation and also treat A as the set of items in solution A. Without\nloss of generality, we assume that the feasible set $\\Theta$ covers the ground set, $E = \\bigcup \\Theta$.\nLet $(w_t)_{t=1}^n$ be an i.i.d. sequence of n weights drawn from distribution P, where $w_t \\in \\{0, 1\\}^E$. At\ntime t, the learning agent chooses solution $A_t = (a_1^t, \\dots, a_{|A_t|}^t) \\in \\Theta$ based on its past observations\nand then receives a binary reward:\n\n$$r_t = \\min_{e \\in A_t} w_t(e) = \\bigwedge_{e \\in A_t} w_t(e)$$\n\nas a response to this choice. 
The reward is one if and only if the weights of all items in $A_t$ are one.\nThe key step in our solution and its analysis is that the reward can be expressed as $r_t = f(A_t, w_t)$,\nwhere $f : \\Theta \\times [0, 1]^E \\to [0, 1]$ is a reward function, which is defined as:\n\n$$f(A, w) = \\prod_{e \\in A} w(e), \\quad A \\in \\Theta, \\quad w \\in [0, 1]^E.$$\n\nAt the end of time t, the agent observes the index of the first item in $A_t$ whose weight is zero, and\n$+\\infty$ if such an item does not exist. We denote this feedback by $O_t$ and define it as:\n\n$$O_t = \\min \\{1 \\leq k \\leq |A_t| : w_t(a_k^t) = 0\\}.$$\n\nNote that $O_t$ fully determines the weights of the first $\\min\\{O_t, |A_t|\\}$ items in $A_t$. In particular:\n\n$$w_t(a_k^t) = 1\\{k < O_t\\}, \\quad k = 1, \\dots, \\min\\{O_t, |A_t|\\}. \\tag{1}$$\n\nAccordingly, we say that item e is observed at time t if $e = a_k^t$ for some $1 \\leq k \\leq \\min\\{O_t, |A_t|\\}$.\nNote that the order of items in $A_t$ affects the feedback $O_t$ but not the reward $r_t$. This differentiates\nour problem from combinatorial semi-bandits.\nThe goal of our learning agent is to maximize its expected cumulative reward. This is equivalent to\nminimizing the expected cumulative regret in n steps:\n\n$$R(n) = \\mathbb{E}\\left[\\sum_{t=1}^n R(A_t, w_t)\\right],$$\n\nwhere $R(A_t, w_t) = f(A^*, w_t) - f(A_t, w_t)$ is the instantaneous stochastic regret of the agent at\ntime t and $A^* = \\arg\\max_{A \\in \\Theta} \\mathbb{E}[f(A, w)]$ is the optimal solution in hindsight of knowing P. For\nsimplicity of exposition, we assume that $A^*$, as a set, is unique.\nA major assumption, which simplifies our optimization problem and its learning, is that\nthe distribution P is factored:\n\n$$P(w) = \\prod_{e \\in E} P_e(w(e)), \\tag{2}$$\n\nwhere $P_e$ is a Bernoulli distribution with mean $\\bar{w}(e)$. We borrow this assumption from the work of\nKveton et al. [10] and it is critical to our results. We would face computational difficulties without\nit. 
Under this assumption, the expected reward of solution $A \\in \\Theta$, the probability that the weight of\neach item in A is one, can be written as $\\mathbb{E}[f(A, w)] = f(A, \\bar{w})$, and depends only on the expected\nweights of individual items in A. It follows that:\n\n$$A^* = \\arg\\max_{A \\in \\Theta} f(A, \\bar{w}).$$\n\nIn Section 4, we experiment with two problems that violate our independence assumption. We also\ndiscuss implications of this violation.\nSeveral interesting online learning problems can be formulated as combinatorial cascading bandits.\nConsider the problem of learning routing paths in Simple Mail Transfer Protocol (SMTP) that\nmaximize the probability of e-mail delivery. The ground set in this problem consists of all links in the network\nand the feasible set consists of all routing paths. At time t, the learning agent chooses routing path $A_t$ and\nobserves if the e-mail is delivered. If the e-mail is not delivered, the agent observes the first link in\nthe routing path which is down. This kind of information is available in SMTP. The weight of item\ne at time t is an indicator of link e being up at time t. The independence assumption in (2) requires\nthat all links fail independently. This assumption is common in the existing network routing models\n[6]. We return to the problem of network routing in Section 4.2.\n\n2.2 CombCascade Algorithm\n\nOur proposed algorithm, CombCascade, is described in Algorithm 1. This algorithm belongs to the\nfamily of UCB algorithms. At time t, CombCascade operates in three stages. First, it computes the\nupper confidence bounds (UCBs) $U_t \\in [0, 1]^E$ on the expected weights of all items in E. 
The UCB\nof item e at time t is defined as:\n\n$$U_t(e) = \\min \\{\\hat{w}_{T_{t-1}(e)}(e) + c_{t-1, T_{t-1}(e)}, \\; 1\\}, \\tag{3}$$\n\nwhere $\\hat{w}_s(e)$ is the average of s observed weights of item e, $T_t(e)$ is the number of times that item e\nis observed in t steps, and $c_{t,s} = \\sqrt{(1.5 \\log t) / s}$ is the radius of a confidence interval around $\\hat{w}_s(e)$\nafter t steps such that $\\bar{w}(e) \\in [\\hat{w}_s(e) - c_{t,s}, \\hat{w}_s(e) + c_{t,s}]$ holds with a high probability. After the\nUCBs are computed, CombCascade chooses the optimal solution with respect to these UCBs:\n\n$$A_t = \\arg\\max_{A \\in \\Theta} f(A, U_t).$$\n\nFinally, CombCascade observes $O_t$ and updates its estimates of the expected weights based on the\nweights of the observed items in (1), for all items $a_k^t$ such that $k \\leq O_t$.\nFor simplicity of exposition, we assume that CombCascade is initialized by one sample $w_0 \\sim P$. If\n$w_0$ is unavailable, we can formulate the problem of obtaining $w_0$ as an optimization problem on $\\Theta$\nwith a linear objective [12]. The initialization procedure of Kveton et al. [12] tracks observed items\nand adaptively chooses solutions with the maximum number of unobserved items. This approach is\ncomputationally efficient on any feasible set $\\Theta$ where a linear function can be optimized efficiently.\nCombCascade has two attractive properties. First, the algorithm is computationally efficient, in the\nsense that $A_t = \\arg\\max_{A \\in \\Theta} \\sum_{e \\in A} \\log(U_t(e))$ is the problem of maximizing a linear function on\n$\\Theta$. This problem can be solved efficiently for various feasible sets $\\Theta$, such as matroids, matchings,\nand paths. Second, CombCascade is sample efficient because the UCB of solution A, $f(A, U_t)$, is a\nproduct of the UCBs of all items in A, which are estimated separately. The regret of CombCascade\ndoes not depend on $|\\Theta|$ and is polynomial in all other quantities of interest.\n\nAlgorithm 1 CombCascade for combinatorial cascading bandits.\n\n// Initialization\nObserve $w_0 \\sim P$\n$\\forall e \\in E : T_0(e) \\leftarrow 1$\n$\\forall e \\in E : \\hat{w}_1(e) \\leftarrow w_0(e)$\n\nfor all $t = 1, \\dots, n$ do\n    // Compute UCBs\n    $\\forall e \\in E : U_t(e) = \\min \\{\\hat{w}_{T_{t-1}(e)}(e) + c_{t-1, T_{t-1}(e)}, \\; 1\\}$\n\n    // Solve the optimization problem and get feedback\n    $A_t \\leftarrow \\arg\\max_{A \\in \\Theta} f(A, U_t)$\n    Observe $O_t \\in \\{1, \\dots, |A_t|, +\\infty\\}$\n\n    // Update statistics\n    $\\forall e \\in E : T_t(e) \\leftarrow T_{t-1}(e)$\n    for all $k = 1, \\dots, \\min\\{O_t, |A_t|\\}$ do\n        $e \\leftarrow a_k^t$\n        $T_t(e) \\leftarrow T_t(e) + 1$\n        $\\hat{w}_{T_t(e)}(e) \\leftarrow \\dfrac{T_{t-1}(e) \\hat{w}_{T_{t-1}(e)}(e) + 1\\{k < O_t\\}}{T_t(e)}$\n\n
2.3 Disjunctive Objective\n\nOur reward model is conjunctive: the reward is one if and only if the weights of all chosen items are\none. A natural alternative is a disjunctive model $r_t = \\max_{e \\in A_t} w_t(e) = \\bigvee_{e \\in A_t} w_t(e)$, where the reward\nis one if the weight of any item in $A_t$ is one. This model arises in recommender systems, where the\nrecommender is rewarded when the user is satisfied with any recommended item. The feedback $O_t$\nis the index of the first item in $A_t$ whose weight is one, as in cascading bandits [10].\nLet $f_\\vee : \\Theta \\times [0, 1]^E \\to [0, 1]$ be a reward function, which is defined as $f_\\vee(A, w) = 1 - \\prod_{e \\in A} (1 - w(e))$. Then under the independence assumption in (2), $\\mathbb{E}[f_\\vee(A, w)] = f_\\vee(A, \\bar{w})$ and:\n\n$$A^* = \\arg\\max_{A \\in \\Theta} f_\\vee(A, \\bar{w}) = \\arg\\min_{A \\in \\Theta} \\prod_{e \\in A} (1 - \\bar{w}(e)) = \\arg\\min_{A \\in \\Theta} f(A, 1 - \\bar{w}).$$\n\nTherefore, $A^*$ can be learned by a variant of CombCascade where the observations are $1 - w_t$ and\neach UCB $U_t(e)$ is substituted with a lower confidence bound (LCB) on $1 - \\bar{w}(e)$:\n\n$$L_t(e) = \\max \\{1 - \\hat{w}_{T_{t-1}(e)}(e) - c_{t-1, T_{t-1}(e)}, \\; 0\\}.$$\n\nLet $R(A_t, w_t) = f(A_t, 1 - w_t) - f(A^*, 1 - w_t)$ be the instantaneous stochastic regret at time t.\nThen we can bound the regret of CombCascade as in Theorems 1 and 2. 
The only difference is that\n$\\Delta_{e,\\min}$ and $f^*$ are redefined as:\n\n$$\\Delta_{e,\\min} = \\min_{A \\in \\Theta : e \\in A, \\Delta_A > 0} f(A, 1 - \\bar{w}) - f(A^*, 1 - \\bar{w}), \\qquad f^* = f(A^*, 1 - \\bar{w}).$$\n\n3 Analysis\n\nWe prove gap-dependent and gap-free upper bounds on the regret of CombCascade in Section 3.1.\nWe discuss these bounds in Section 3.2.\n\n3.1 Upper Bounds\n\nWe define the suboptimality gap of solution $A = (a_1, \\dots, a_{|A|})$ as $\\Delta_A = f(A^*, \\bar{w}) - f(A, \\bar{w})$ and\nthe probability that all items in A are observed as $p_A = \\prod_{k=1}^{|A|-1} \\bar{w}(a_k)$. For convenience, we define\nshorthands $f^* = f(A^*, \\bar{w})$ and $p^* = p_{A^*}$. Let $\\tilde{E} = E \\setminus A^*$ be the set of suboptimal items, the items\nthat are not in $A^*$. Then the minimum gap associated with suboptimal item $e \\in \\tilde{E}$ is:\n\n$$\\Delta_{e,\\min} = f(A^*, \\bar{w}) - \\max_{A \\in \\Theta : e \\in A, \\Delta_A > 0} f(A, \\bar{w}).$$\n\nLet $K = \\max \\{|A| : A \\in \\Theta\\}$ be the maximum number of items in any solution and $f^* > 0$. Then\nthe regret of CombCascade is bounded as follows.\nTheorem 1. The regret of CombCascade is bounded as:\n\n$$R(n) \\leq \\frac{K}{f^*} \\sum_{e \\in \\tilde{E}} \\frac{4272}{\\Delta_{e,\\min}} \\log n + \\frac{\\pi^2}{3} L.$$\n\nProof. The proof is in Appendix A. The main idea is to reduce our analysis to that of CombUCB1 in\nstochastic combinatorial semi-bandits [12]. This reduction is challenging for two reasons. First, our\nreward function is non-linear in the weights of chosen items. Second, we only observe some of the\nchosen items.\nOur analysis can be trivially reduced to semi-bandits by conditioning on the event of observing all\nitems. In particular, let $H_t = (A_1, O_1, \\dots, A_{t-1}, O_{t-1}, A_t)$ be the history of CombCascade up to\nchoosing solution $A_t$, the first $t - 1$ observations and t actions. Then we can express the expected\nregret at time t conditioned on $H_t$ as:\n\n$$\\mathbb{E}[R(A_t, w_t) \\mid H_t] = \\mathbb{E}[\\Delta_{A_t} (1 / p_{A_t}) 1\\{\\Delta_{A_t} > 0, \\; O_t \\geq |A_t|\\} \\mid H_t]$$\n\nand analyze our problem under the assumption that all items in $A_t$ are observed. 
This reduction is\nproblematic because the probability $p_{A_t}$ can be low, and as a result we get a loose regret bound.\nWe address this issue by formalizing the following insight into our problem. When $f(A, \\bar{w}) \\ll f^*$,\nCombCascade can distinguish A from $A^*$ without learning the expected weights of all items in A.\nIn particular, CombCascade acts implicitly on the prefixes of suboptimal solutions, and we choose\nthem in our analysis such that the probability of observing all items in the prefixes is \u201cclose\u201d to $f^*$,\nand the gaps are \u201cclose\u201d to those of the original solutions.\nLemma 1. Let $A = (a_1, \\dots, a_{|A|}) \\in \\Theta$ be a feasible solution and $B_k = (a_1, \\dots, a_k)$ be a prefix of\n$k \\leq |A|$ items of A. Then k can be set such that $\\Delta_{B_k} \\geq \\frac{1}{2} \\Delta_A$ and $p_{B_k} \\geq \\frac{1}{2} f^*$.\nThen we count the number of times that the prefixes can be chosen instead of $A^*$ when all items in\nthe prefixes are observed. The last remaining issue is that $f(A, U_t)$ is non-linear in the confidence\nradii of the items in A. Therefore, we bound it from above based on the following lemma.\nLemma 2. Let $0 \\leq p_1, \\dots, p_K \\leq 1$ and $u_1, \\dots, u_K \\geq 0$. Then:\n\n$$\\prod_{k=1}^K \\min \\{p_k + u_k, 1\\} \\leq \\prod_{k=1}^K p_k + \\sum_{k=1}^K u_k.$$\n\nThis bound is tight when $p_1, \\dots, p_K = 1$ and $u_1, \\dots, u_K = 0$.\nThe rest of our analysis is along the lines of Theorem 5 in Kveton et al. [12]. We can achieve linear\ndependency on K, in exchange for a multiplicative factor of 534 in our upper bound.\n\nWe also prove the following gap-free bound.\n\nTheorem 2. The regret of CombCascade is bounded as:\n\n$$R(n) \\leq 131 \\sqrt{\\frac{K L n \\log n}{f^*}} + \\frac{\\pi^2}{3} L.$$\n\nProof. The proof is in Appendix B. The key idea is to decompose the regret of CombCascade into\ntwo parts, where the gaps $\\Delta_{A_t}$ are at most $\\epsilon$ and larger than $\\epsilon$. 
We analyze each part separately and\nthen set $\\epsilon$ to get the desired result.\n\n3.2 Discussion\n\nIn Section 3.1, we prove two upper bounds on the n-step regret of CombCascade:\n\n$$\\text{Theorem 1: } O(K L (1/f^*) (1/\\Delta) \\log n), \\qquad \\text{Theorem 2: } O(\\sqrt{K L (1/f^*) n \\log n}),$$\n\nwhere $\\Delta = \\min_{e \\in \\tilde{E}} \\Delta_{e,\\min}$. These bounds do not depend on the total number of feasible solutions\n$|\\Theta|$ and are polynomial in any other quantity of interest. The bounds match, up to $O(1/f^*)$ factors,\nthe upper bounds of CombUCB1 in stochastic combinatorial semi-bandits [12]. Since CombCascade\nreceives less feedback than CombUCB1, this is rather surprising. The upper bounds\nof Kveton et al. [12] are known to be tight up to polylogarithmic factors. We believe that our upper\nbounds are also tight in the setting similar to Kveton et al. [12], where the expected weight of each\nitem is close to 1 and likely to be observed.\nThe assumption that $f^*$ is large is often reasonable. In network routing, the optimal routing path is\nlikely to be reliable. In recommender systems, the optimal recommended list often does not satisfy\na reasonably large fraction of users.\n\n[Figure 1 plots: three panels, one per setting $\\bar{w} = (0.4, 0.4, 0.2, 0.2)$, $\\bar{w} = (0.4, 0.4, 0.9, 0.1)$, and $\\bar{w} = (0.4, 0.4, 0.3, 0.3)$; x-axis: step n (up to 10k); y-axis: regret; curves: CombCascade and CombUCB1.]\n\nFigure 1: The regret of CombCascade and CombUCB1 in the synthetic experiment (Section 4.1). The\nresults are averaged over 100 runs.\n\n4 Experiments\n\nWe evaluate CombCascade in three experiments. 
In Section 4.1, we compare it to CombUCB1 [12],\na state-of-the-art algorithm for stochastic combinatorial semi-bandits with a linear reward function.\nThis experiment shows that CombUCB1 cannot solve all instances of our problem, which highlights\nthe need for a new learning algorithm. It also shows the limitations of CombCascade. We evaluate\nCombCascade on two real-world problems in Sections 4.2 and 4.3.\n\n4.1 Synthetic\n\nIn the first experiment, we compare CombCascade to CombUCB1 [12] on a synthetic problem. This\nproblem is a combinatorial cascading bandit with L = 4 items and $\\Theta = \\{(1, 2), (3, 4)\\}$. CombUCB1\nis a popular algorithm for stochastic combinatorial semi-bandits with a linear reward function. We\napproximate $\\max_{A \\in \\Theta} f(A, w)$ by $\\min_{A \\in \\Theta} \\sum_{e \\in A} (1 - w(e))$. This approximation is motivated by\nthe fact that $f(A, w) = \\prod_{e \\in A} w(e) \\approx 1 - \\sum_{e \\in A} (1 - w(e))$ as $\\min_{e \\in E} w(e) \\to 1$. We update the\nestimates of $\\bar{w}$ in CombUCB1 as in CombCascade, based on the weights of the observed items in (1).\nWe experiment with three different settings of $\\bar{w}$ and report our results in Figure 1. The settings of\n$\\bar{w}$ are reported in our plots. We assume that $w_t(e)$ are distributed independently, except for the last\nplot where $w_t(3) = w_t(4)$. Our plots represent three common scenarios that we encountered in our\nexperiments. In the first plot, $\\arg\\max_{A \\in \\Theta} f(A, \\bar{w}) = \\arg\\min_{A \\in \\Theta} \\sum_{e \\in A} (1 - \\bar{w}(e))$. In this case,\nboth CombCascade and CombUCB1 can learn $A^*$. The regret of CombCascade is slightly lower than\nthat of CombUCB1. In the second plot, $\\arg\\max_{A \\in \\Theta} f(A, \\bar{w}) \\neq \\arg\\min_{A \\in \\Theta} \\sum_{e \\in A} (1 - \\bar{w}(e))$. In\nthis case, CombUCB1 cannot learn $A^*$ and therefore suffers linear regret. In the third plot, we violate\nour modeling assumptions. 
Perhaps surprisingly, CombCascade can still learn the optimal solution\n$A^*$, although it suffers higher regret than CombUCB1.\n\n4.2 Network Routing\n\nIn the second experiment, we evaluate CombCascade on a problem of network routing. We\nexperiment with six networks from the RocketFuel dataset [17], which are described in Figure 2a.\nOur learning problem is formulated as follows. The ground set E consists of the links in the network. The\nfeasible set $\\Theta$ consists of all paths in the network. At time t, we generate a random pair of starting and end\nnodes, and the learning agent chooses a routing path between these nodes. The goal of the agent is\nto maximize the probability that all links in the path are up. The feedback is the index of the first\nlink in the path which is down. The weight of link e at time t, $w_t(e)$, is an indicator of link e being\nup at time t. We model $w_t(e)$ as an independent Bernoulli random variable $w_t(e) \\sim B(\\bar{w}(e))$ with\nmean $\\bar{w}(e) = 0.7 + 0.2 \\, \\mathrm{local}(e)$, where local(e) is an indicator of link e being local. We say that\nthe link is local when its expected latency is at most 1 millisecond. About a half of the links in our\nnetworks are local. To summarize, the local links are up with probability 0.9; and are more reliable\nthan the global links, which are up only with probability 0.7.\n\n(a)\nNetwork | Nodes | Links\n1221 | 108 | 153\n1239 | 315 | 972\n1755 | 87 | 161\n3257 | 161 | 328\n3967 | 79 | 147\n6461 | 141 | 374\n\n(b) [Plots: two panels, one for networks 1221, 1755, and 3967, and one for networks 1239, 3257, and 6461; x-axis: step n (up to 300k); y-axis: regret.]\n\nFigure 2: a. The description of six networks from our network routing experiment (Section 4.2). b.\nThe n-step regret of CombCascade in these networks. The results are averaged over 50 runs.\n\nOur results are reported in Figure 2b. 
We observe that the n-step regret of CombCascade flattens as\ntime n increases. This means that CombCascade learns near-optimal policies in all networks.\n\n4.3 Diverse Recommendations\n\nIn our last experiment, we evaluate CombCascade on a problem of diverse recommendations. This\nproblem is motivated by on-demand media streaming services like Netflix, which often recommend\ngroups of movies, such as \u201cPopular on Netflix\u201d and \u201cDramas\u201d. We experiment with the MovieLens\ndataset [13] from March 2015. The dataset contains 138k people who assigned 20M ratings to 27k\nmovies between January 1995 and March 2015.\nOur learning problem is formulated as follows. The ground set E consists of 200 movies from our dataset:\n25 most rated animated movies, 75 random animated movies, 25 most rated non-animated movies,\nand 75 random non-animated movies. The feasible set $\\Theta$ consists of all K-permutations of E where K/2\nmovies are animated. The weight of item e at time t, $w_t(e)$, indicates that item e attracts the user at\ntime t. We assume that $w_t(e) = 1$ if and only if the user rated item e in our dataset. This indicates\nthat the user watched movie e at some point in time, perhaps because the movie was attractive. The\nuser at time t is drawn randomly from our pool of users. The goal of the learning agent is to learn a\nlist of items $A^* = \\arg\\max_{A \\in \\Theta} \\mathbb{E}[f_\\vee(A, w)]$ that maximizes the probability that at least one item\nis attractive. The feedback is the index of the first attractive item in the list (Section 2.3). We would\nlike to point out that our modeling assumptions are violated in this experiment. In particular, $w_t(e)$\nare correlated across items e because the users do not rate movies independently. The result is that\n$A^* \\neq \\arg\\max_{A \\in \\Theta} f_\\vee(A, \\bar{w})$. It is NP-hard to compute $A^*$. However, $\\mathbb{E}[f_\\vee(A, w)]$ is submodular\nand monotone in A, and therefore a $(1 - 1/e)$ approximation to $A^*$ can be computed greedily. 
We\ndenote this approximation by $A^*$ and show it for K = 8 in Figure 3a.\nOur results are reported in Figure 3b. Similarly to Figure 2b, the n-step regret of CombCascade is\na concave function of time n for all studied K. This indicates that CombCascade solutions improve\nover time. We note that the regret does not flatten as in Figure 2b. The reason is that CombCascade\ndoes not learn $A^*$. Nevertheless, it performs well and we expect comparably good performance in\nother domains where our modeling assumptions are not satisfied. Our current theory cannot explain\nthis behavior and we leave it for future work.\n\n(a)\nMovie title | Animation\nPulp Fiction | No\nForrest Gump | No\nIndependence Day | No\nShawshank Redemption | No\nToy Story | Yes\nShrek | Yes\nWho Framed Roger Rabbit? | Yes\nAladdin | Yes\n\n(b) [Plot: one curve each for K = 8, K = 12, and K = 16; x-axis: step n (up to 100k); y-axis: regret.]\n\nFigure 3: a. The optimal list of 8 movies in the diverse recommendations experiment (Section 4.3).\nb. The n-step regret of CombCascade in this experiment. The results are averaged over 50 runs.\n\n5 Related Work\n\nOur work generalizes cascading bandits of Kveton et al. [10] to arbitrary combinatorial constraints.\nThe feasible set in cascading bandits is a uniform matroid, any list of K items out of L is feasible.\nOur generalization significantly expands the applicability of the original model and we demonstrate\nthis on two novel real-world problems (Section 4). Our work also extends stochastic combinatorial\nsemi-bandits with a linear reward function [8, 11, 12] to the cascade model of feedback. A similar\nmodel to cascading bandits was recently studied by Combes et al. [7].\nOur generalization is significant for two reasons. First, CombCascade is a novel learning algorithm.\nCombUCB1 [12] chooses solutions with the largest sum of the UCBs. CascadeUCB1 [10] chooses K\nitems out of L with the largest UCBs. 
CombCascade chooses solutions with the largest product of\nthe UCBs. All three algorithms can find the optimal solution in cascading bandits. However, when\nthe feasible set is not a matroid, it is critical to maximize the product of the UCBs. CombUCB1 may\nlearn a suboptimal solution in this setting and we illustrate this in Section 4.1.\nSecond, our analysis is novel. The proof of Theorem 1 is different from those of Theorems 2 and 3\nin Kveton et al. [10]. These proofs are based on counting the number of times that each suboptimal\nitem is chosen instead of any optimal item. They can only be applied to special feasible sets, such as\nmatroids, because they require that the items in the feasible solutions are exchangeable. We build on\nthe recent work of Kveton et al. [12] to achieve linear dependency on K in Theorem 1. The rest of\nour analysis is novel.\nOur problem is a partial monitoring problem where some of the chosen items may be unobserved.\nAgrawal et al. [1] and Bartok et al. [4] studied partial monitoring problems and proposed learning\nalgorithms for solving them. These algorithms are impractical in our setting. As an example, if we\nformulate our problem as in Bartok et al. [4], we get $|\\Theta|$ actions and $2^L$ unobserved outcomes; and\nthe learning algorithm reasons over $|\\Theta|^2$ pairs of actions and requires $O(2^L)$ space. Lin et al. [15]\nalso studied combinatorial partial monitoring. Their feedback is a linear function of the weights of\nchosen items. Our feedback is a non-linear function of the weights.\nOur reward function is non-linear in unknown parameters. Chen et al. [5] studied stochastic\ncombinatorial semi-bandits with a non-linear reward function, which is a known monotone function of an\nunknown linear function. The feedback in Chen et al. [5] is semi-bandit, which is more informative\nthan in our work. Le et al. 
[14] studied a network optimization problem where the reward function\nis a non-linear function of observations.\n\n6 Conclusions\n\nWe propose combinatorial cascading bandits, a class of stochastic partial monitoring problems that\ncan model many practical problems, such as learning of a routing path in an unreliable communica-\ntion network that maximizes the probability of packet delivery, and learning to recommend a list of\nattractive items. We propose a practical UCB-like algorithm for our problems, CombCascade, and\nprove upper bounds on its regret. We evaluate CombCascade on two real-world problems and show\nthat it performs well even when our modeling assumptions are violated.\nOur results and analysis apply to any combinatorial action set, and therefore are quite general. The\nstrongest assumption in our work is that the weights of items are distributed independently of each\nother. This assumption is critical and hard to eliminate (Section 2.1). Nevertheless, it can be easily\nrelaxed to conditional independence given the features of items, along the lines of Wen et al. [19].\nWe leave this for future work. From the theoretical point of view, we want to derive a lower bound\non the n-step regret in combinatorial cascading bandits, and show that the factor of f\u21e4 in Theorems\n1 and 2 is intrinsic.\n\n8\n\n\fReferences\n[1] Rajeev Agrawal, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically\nef\ufb01cient adaptive allocation schemes for controlled i.i.d. processes: Finite parameter space.\nIEEE Transactions on Automatic Control, 34(3):258\u2013267, 1989.\n\n[2] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit\nproblem. In Proceeding of the 25th Annual Conference on Learning Theory, pages 39.1\u201339.26,\n2012.\n\n[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed\n\nbandit problem. 
Machine Learning, 47:235\u2013256, 2002.\n\n[4] Gabor Bartok, Navid Zolghadr, and Csaba Szepesvari. An adaptive algorithm for finite stochastic\npartial monitoring. In Proceedings of the 29th International Conference on Machine Learning,\n2012.\n\n[5] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework,\nresults and applications. In Proceedings of the 30th International Conference on Machine\nLearning, pages 151\u2013159, 2013.\n\n[6] Baek-Young Choi, Sue Moon, Zhi-Li Zhang, Konstantina Papagiannaki, and Christophe Diot.\nAnalysis of point-to-point packet delay in an operational network. In Proceedings of the 23rd\nAnnual Joint Conference of the IEEE Computer and Communications Societies, 2004.\n\n[7] Richard Combes, Stefan Magureanu, Alexandre Proutiere, and Cyrille Laroche. Learning to\nrank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS\nInternational Conference on Measurement and Modeling of Computer Systems, 2015.\n\n[8] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with\nunknown variables: Multi-armed bandits with linear rewards and individual observations.\nIEEE/ACM Transactions on Networking, 20(5):1466\u20131478, 2012.\n\n[9] Aurelien Garivier and Olivier Cappe. The KL-UCB algorithm for bounded stochastic bandits\nand beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359\u2013376,\n2011.\n\n[10] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning\nto rank in the cascade model. In Proceedings of the 32nd International Conference on\nMachine Learning, 2015.\n\n[11] Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Brian Eriksson. Matroid\nbandits: Fast combinatorial optimization with learning.
In Proceedings of the 30th Conference\non Uncertainty in Artificial Intelligence, pages 420\u2013429, 2014.\n\n[12] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for\nstochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on\nArtificial Intelligence and Statistics, 2015.\n\n[13] Shyong Lam and Jon Herlocker. MovieLens Dataset. http://grouplens.org/datasets/movielens/,\n2015.\n\n[14] Thanh Le, Csaba Szepesvari, and Rong Zheng. Sequential learning for multi-channel wireless\nnetwork monitoring with channel switching costs. IEEE Transactions on Signal Processing,\n62(22):5919\u20135929, 2014.\n\n[15] Tian Lin, Bruno Abrahao, Robert Kleinberg, John Lui, and Wei Chen. Combinatorial partial\nmonitoring game with linear feedback and its applications. In Proceedings of the 31st\nInternational Conference on Machine Learning, pages 901\u2013909, 2014.\n\n[16] Christos Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization. Dover Publications,\nMineola, NY, 1998.\n\n[17] Neil Spring, Ratul Mahajan, and David Wetherall. Measuring ISP topologies with Rocketfuel.\nIEEE/ACM Transactions on Networking, 12(1):2\u201316, 2004.\n\n[18] William R. Thompson. On the likelihood that one unknown probability exceeds another in\nview of the evidence of two samples. Biometrika, 25(3-4):285\u2013294, 1933.\n\n[19] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial\nsemi-bandits.
In Proceedings of the 32nd International Conference on Machine Learning,\n2015.\n", "award": [], "sourceid": 880, "authors": [{"given_name": "Branislav", "family_name": "Kveton", "institution": "Adobe Research"}, {"given_name": "Zheng", "family_name": "Wen", "institution": "Yahoo Labs"}, {"given_name": "Azin", "family_name": "Ashkan", "institution": "Technicolor Research"}, {"given_name": "Csaba", "family_name": "Szepesvari", "institution": "University of Alberta"}]}