{"title": "Position-based Multiple-play Bandit Problem with Unknown Position Bias", "book": "Advances in Neural Information Processing Systems", "page_first": 4998, "page_last": 5008, "abstract": "Motivated by online advertising, we study a multiple-play multi-armed bandit problem with position bias that involves several slots and the latter slots yield fewer rewards. We characterize the hardness of the problem by deriving an asymptotic regret bound. We propose the Permutation Minimum Empirical Divergence (PMED) algorithm and derive its asymptotically optimal regret bound. Because of the uncertainty of the position bias, the optimal algorithm for such a problem requires non-convex optimizations that are different from usual partial monitoring and semi-bandit problems. We propose a cutting-plane method and related bi-convex relaxation for these optimizations by using auxiliary variables.", "full_text": "Position-based Multiple-play Bandit Problem with\n\nUnknown Position Bias\n\nJunpei Komiyama\n\nThe University of Tokyo\njunpei@komiyama.info\n\nJunya Honda\n\nThe University of Tokyo / RIKEN\nhonda@stat.t.u-tokyo.ac.jp\n\nAkiko Takeda\n\nThe Institute of Statistical Mathematics / RIKEN\n\natakeda@ism.ac.jp\n\nAbstract\n\nMotivated by online advertising, we study a multiple-play multi-armed bandit\nproblem with position bias that involves several slots and the latter slots yield\nfewer rewards. We characterize the hardness of the problem by deriving an asymp-\ntotic regret bound. We propose the Permutation Minimum Empirical Divergence\n(PMED) algorithm and derive its asymptotically optimal regret bound. Because\nof the uncertainty of the position bias, the optimal algorithm for such a problem\nrequires non-convex optimizations that are different from usual partial monitor-\ning and semi-bandit problems. 
We propose a cutting-plane method and related bi-convex relaxation for these optimizations by using auxiliary variables.

1 Introduction

One of the most important industries related to computer science is online advertising. In the United States, 72.5 billion dollars was spent on online advertising [19] in 2016. Most online advertising is viewed on web pages during Internet browsing. A website owner has a set of possible advertisements (ads): some of them are more attractive than others, and the owner would like to maximize the attention of visiting users. One of the observable metrics of user attention is the number of clicks on the ads. By considering each ad (resp. click) to be an arm (resp. reward) and assuming only one slot is available for advertisements, the maximization of clicks boils down to the so-called multi-armed bandit problem, where the arm with the largest expected reward is sought.

When two or more ad slots are available on the web page, the problem becomes a multiple-play multi-armed bandit problem. Several variants of the multiple-play bandit problem and its extension called the semi-bandit problem have been considered in the literature. Arguably, the simplest is one assuming that an ad receives equal clicks regardless of its position [2, 24]. In practice, ads receive fewer clicks when they are placed at bottom slots; this is the so-called position bias.

A well-known model that explains position bias is the cascade model [23], which assumes that the users' attention goes from top to bottom until they lose interest. While this model explains position bias in early positions well [10], a drawback of the cascade model when it is applied to the bandit setting [26] is that the order of the allocated ads does not affect the reward, which is not very natural. To resolve this issue, Combes et al. [8] introduced a weight for each slot that corresponds to the reward obtained by clicking on that slot.
However, no principled way of defining the weight has been described.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

An extension of the cascade model, called the dependent click model (DCM) [14], addresses these issues by admitting multiple clicks of a user. In DCM, each slot is associated with the probability that the user loses interest in the following ads if the current ad is interesting. While the algorithm in Katariya et al. [21] cleverly exploits this structure, it still depends on the cascade assumption, and as a result it discards some of the feedback on the latter slots, which reduces the efficiency of the algorithm. Moreover, the reward in DCM does not exactly correspond to the number of clicks.

Lagrée et al. [27] studied a position-based model (PBM), where each slot has its own discount factor on the number of clicks. PBM takes the order of the shown ads into consideration. However, the algorithms proposed in Lagrée et al. [27] are "half-online" in the sense that the value of an ad is adaptively estimated, whereas the values of the slots are estimated by using an off-line dataset. Such an off-line computation is not very handy, since the click trend varies depending on the day and hour [1]. Moreover, a significant portion of online advertisements is sold via ad networks [34]. As a result, advertisers have to deal with thousands of web pages on which their ads are shown. Taking these aspects into consideration, pre-computing the position bias for each web page limits the use of these algorithms.

To address this issue, we provide a way to allocate advertisements in a fully online manner by considering "PBM under Uncertainty of position bias" (PBMU).
One of the challenges when the uncertainty of a position-based factor is taken into account is that, when some ad appears to have a small click-through rate (CTR, the probability of a click) in some slot, we cannot directly attribute this to either the arm or the slot. In this sense, several combinations of ads and slots need to be examined to estimate both the ad-based and position-based model parameters.

Note also that an extension of the non-stochastic bandit approach [3] to multiple play, such as the ordered slate model [20], is general enough to deal with PBMU. However, algorithms based on the non-stochastic approach do not always perform well, in compensation for their generality. Another extension of multi-armed bandit problems is the partial monitoring problem [31, 4], which admits the case in which the parameters are not directly observable. However, partial monitoring is inefficient at solving bandit problems: a K-armed bandit problem with binary rewards corresponds to a partial monitoring problem with 2^K possible outcomes. As a result, the existing partial monitoring algorithms, such as the ones in [33, 25], are not practical even for a moderate number of arms. Besides, the computation of a feasible solution in PBMU requires non-convex optimizations, as we will see in Section 5. This implies that PBMU cannot directly be converted into partial monitoring, where such a non-convex optimization does not appear [25].

The contributions of this paper are as follows: First, we study the position-based bandit model with uncertainty (PBMU) and derive a regret lower bound (Section 3). Second, we propose an algorithm that efficiently utilizes feedback (Section 4). One of the challenges in the multiple-play bandit problem is that there is an exponentially large number of possible sequences of arms to allocate at each round.
We reduce the number of candidates by using a bipartite matching algorithm that runs in time polynomial in the number of arms. The performance of the proposed algorithm is verified in Section 6. Third, a slightly modified version of the algorithm is analyzed in Section 7. This algorithm has a regret upper bound that matches the lower bound. Finally, we reveal that the lower bound is related to a linear optimization problem with an infinite number of constraints. Such an optimization problem appears in many versions of the bandit problem [9, 25, 12]. We propose an optimization method that reduces it to a finite-constraint linear optimization based on a version of the cutting-plane method (Section 5). Related non-convex optimizations that are characteristic of PBMU are solved by using bi-convex relaxation. Such optimization methods are of interest in solving even larger classes of bandit problems.

2 Problem Setup

Let K be the number of arms (ads) and L < K be the number of slots. Each arm i ∈ [K] = {1, 2, ..., K} is associated with a distinct parameter θ*_i ∈ (0, 1), and each slot l ∈ [L] is associated with a parameter κ*_l ∈ (0, 1]. At each round t = 1, 2, ..., T, the system selects L arms I(t) = (I_1(t), ..., I_L(t)) and receives a corresponding binary reward (click or non-click) for each slot. The reward of the l-th slot is i.i.d. drawn from a Bernoulli distribution Ber(μ*_{I_l(t),l}), where μ*_{i,l} = θ*_i κ*_l. Although the slot-based parameters are unknown, it is natural that ads receive more clicks when they are placed at early slots: we assume κ*_1 > κ*_2 > ··· > κ*_L > 0, and this order is known. Note that this model is redundant: a model with μ*_{i,l} = θ*_i κ*_l is equivalent to the model with μ*_{i,l} = (θ*_i / κ_1)(κ*_l κ_1). Therefore, without loss of generality, we assume κ_1 = 1. In summary, this model involves K + L parameters {θ*_i}_{i∈[K]} and {κ*_l}_{l∈[L]}, and the number of rounds T. The parameters except for κ_1 = 1 are unknown to the system. Let N_{i,l}(t) be the number of rounds before the t-th round at which arm i was in slot l (i.e., N_{i,l}(t) = Σ_{t'=1}^{t−1} 1{i = I_l(t')}, where 1{E} is 1 if E holds and 0 otherwise). In the following, we abbreviate arm i in slot l to "pair (i, l)". Let μ̂_{i,l}(t) be the empirical mean of the reward of pair (i, l) after the first t − 1 rounds.

The goal of the system is to maximize the cumulative reward by using some sophisticated algorithm. Without loss of generality, we can assume θ*_1 > θ*_2 > θ*_3 > ··· > θ*_K. The algorithm cannot exploit this ordering. In this model, allocating arms with larger expected rewards to earlier slots increases the expected reward: as a result, allocating arms 1, 2, ..., L to slots 1, 2, ..., L maximizes the expected reward. A quantity called (pseudo-)regret is defined as

Reg(T) = Σ_{t=1}^{T} ( Σ_{i∈[L]} (θ*_i − θ*_{I_i(t)}) κ*_i ),

and E[Reg(T)] is used for evaluating the performance of an algorithm. Let Δ_{i,l} = θ*_l κ*_l − θ*_i κ*_l. The regret can alternatively be represented as Reg(T) = Σ_{(i,l)∈[K]×[L]} Δ_{i,l} N_{i,l}(T). The regret increases unless I(t) = (1, 2, ..., L).

3 Regret Lower Bound

Here, we derive an asymptotic regret lower bound when T → ∞. In the context of the standard multi-armed bandit problem, Lai and Robbins [28] derived a regret lower bound for strongly consistent algorithms, and it has been followed by many extensions, such as the one for multi-parameter distributions [6] and the ones for Markov decision processes [13, 7]. Intuitively, a strongly consistent algorithm is "uniformly good" in the sense that it works well with any set of model parameters. Their result was extended to the multiple-play [2] and PBM [27] cases. We further extend it to the case of PBMU.

Let T_all = {(θ'_1, ..., θ'_K) ∈ (0, 1)^K} and K_all = {(κ'_1, ..., κ'_L) : 1 = κ'_1 > κ'_2 > ··· > κ'_L > 0} be the sets of all possible values of the parameters of the arms and slots, respectively. Let (1), ..., (K) be a permutation of 1, ..., K and T_{(1),...,(L)} be the subset of T_all such that the i-th best arm is (i). Namely,

T_{(1),...,(L)} = { (θ'_1, ..., θ'_K) ∈ (0, 1)^K : θ'_{(1)} > θ'_{(2)} > ··· > θ'_{(L)}, ∀i ∉ {(1),...,(L)}: θ'_i < θ'_{(L)} },

and T^c_{(1),...,(L)} = T_all \ T_{(1),...,(L)}. An algorithm is strongly consistent if E[Reg(T)] = o(T^a) for any a > 0 given any instance of the bandit problem with its parameters {θ'_i}_{i∈[K]} ∈ T_all, {κ'_l} ∈ K_all. The following lemma, whose proof is in Appendix F, lower-bounds the number of draws of the pairs of arms and slots.

Lemma 1. (Lower bound on the number of draws) The following inequality holds for N_{i,l}(T) of a strongly consistent algorithm:

∀{θ'_i} ∈ T^c_{1,...,L}, {κ'_l} ∈ K_all :  Σ_{(i,l)∈[K]×[L]} E[N_{i,l}(T)] d_KL(θ*_i κ*_l, θ'_i κ'_l) ≥ log T − o(log T),

where d_KL(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) is the KL divergence between two Bernoulli distributions.

Such a divergence-based bound appears in many stochastic bandit problems. However, unlike other bandit problems, the argument inside the KL divergence is a product of parameters θ'_i κ'_l: while d_KL(·, θ'_i κ'_l) is convex in the product θ'_i κ'_l, it is not convex in the parameter space {θ'_i}, {κ'_l}. Therefore, finding a set of parameters that minimizes Σ_{i,l} d_KL(μ_{i,l}, θ'_i κ'_l) is a non-convex problem, which makes PBMU difficult.

Furthermore, we can formalize the regret lower bound as follows. Let

Q = { {q_{i,l}} ∈ [0, ∞)^{[K]×[K]} : ∀i∈[K−1] Σ_{l∈[K]} q_{i,l} = Σ_{l∈[K]} q_{i+1,l}, ∀l∈[K−1] Σ_{i∈[K]} q_{i,l} = Σ_{i∈[K]} q_{i,l+1} }.

Intuitively, {q_{i,l}} for l ≤ L corresponds to the draw of arm i in slot l, and {q_{i,l}} for l > L corresponds to the non-draw of arm i, as we will see later. The following quantities characterize the minimum amount of exploration for consistency:

R_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}) = { {q_{i,l}} ∈ Q : inf_{{θ'_i} ∈ T^c_{(1),...,(L)}, {κ'_l} ∈ K_all : ∀i∈[L] θ'_i κ'_i = θ_i κ_i}  Σ_{(i,l)∈[K]×[L]: i≠(l)} q_{i,l} d_KL(μ_{i,l}, θ'_i κ'_l) ≥ 1 },   (1)

C*_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}) = inf_{{q_{i,l}} ∈ R_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l})}  Σ_{(i,l)∈[K]×[L]} Δ_{i,l} q_{i,l}.

Equality (1) states that drawing each pair (i, l) for N_{i,l} = q_{i,l} log T times suffices to reduce the risk that the true parameter is {θ'_i}, {κ'_l} for any parameters {θ'_i}, {κ'_l} such that {θ'_i} ∈ T^c_{(1),...,(L)} and θ'_i κ'_i = θ_i κ_i for any i ∈ [L]. Note that the constraint θ'_i κ'_i = θ_i κ_i corresponds to the fact that drawing an optimal list of arms does not increase the regret: intuitively, this corresponds to the fact that the true parameter of the best arm is obtained for free in the regret lower bound of the standard bandit problem¹. Moreover, let the set of optimal solutions be

R*_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}) = { {q_{i,l}} ∈ R_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}) : Σ_{(i,l)∈[K]×[L]} Δ_{i,l} q_{i,l} = C*_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}) }.   (2)

The value C*_{1,...,L} log T is the minimum possible regret such that the minimum divergence of {θ*_i}, {κ*_l} from any {θ'_i}, {κ'_l} is larger than log T.
Using Lemma 1 yields the following regret lower bound, whose proof is also in Appendix F.

Theorem 2. The regret of a strongly consistent algorithm is lower bounded as follows:

E[Reg(T)] ≥ C*_{1,...,L}({μ*_{i,l}}, {θ*_i}, {κ*_l}) log T − o(log T).

Remark 3. N_{i,l} = (log T)/d_KL(θ*_i κ*_l, θ*_j κ*_l) for j = min(i − 1, L) satisfies the conditions in Lemma 1, which means that the regret lower bound in Theorem 2 is O(K log T/Δ) = O(K log T), where Δ = min_{i≠j, l≠m} |θ*_i − θ*_j| |κ*_l − κ*_m|.

4 Algorithm

Our algorithm, called Permutation Minimum Empirical Divergence (PMED), is closely related to the optimization we discussed in Section 3.

4.1 PMED Algorithm

We denote a list of L arms that are drawn at each round as an L-allocation. For example, (3, 2, 1, 5) is a 4-allocation, which corresponds to allocating arms 3, 2, 1, 5 to slots 1, 2, 3, 4, respectively. Like the Deterministic Minimum Empirical Divergence (DMED) algorithm [17] for the single-play multi-armed bandit problem, Algorithm 1 selects arms by using a loop. L_C = L_C(t) is the set of L-allocations in the current loop, and L_N = L_N(t) is the set of L-allocations that are to be drawn in the next loop. Note that |L_N| ≥ 1 always holds at the end of each loop so that at least one element is put into L_C.

¹The infimum should take parameters θ'_i κ'_i ≠ θ_i κ_i into consideration. However, such parameters can be removed without increasing regret, and thus the infimum over θ'_i κ'_i = θ_i κ_i suffices. This can be understood because the regret bound of the standard K-armed bandit problem with expectation μ_i of each arm is Σ_{i=2}^{K} (log T)/d_KL(μ_i, μ_1): arm 1 is drawn without increasing regret, and thus the estimation of μ_1 can be arbitrarily accurate. In our case, placing arms 1, ..., L into slots 1, ..., L does not increase the regret, and thus the estimation of the product parameter θ_i κ_i for each i ∈ [L] is very accurate.

Algorithm 1 PMED and PMED-Hinge Algorithms
1: Input: α > 0; β > 0 (for PMED-Hinge); f(n) = γ/√n with γ > 0 (for PMED-Hinge).
2: L_N ← ∅. L_C ← {v^mod_1, ..., v^mod_K}.
3: while t ≤ T do
4:   for each v^mod_m : m ∈ [K] do
5:     If there exists some pair (i, l) ∈ v^mod_m such that N_{i,l}(t) < α√(log t), then put v^mod_m into L_N.
6:   end for
7:   Compute the MLE {θ̂_i(t)}_{i=1}^{K}, {κ̂_l(t)}_{l=1}^{L} =
       argmin_{{θ_i},{κ_l}} Σ_{(i,l)∈[K]×[L]} N_{i,l}(t) d_KL(μ̂_{i,l}(t), θ_i κ_l)   (PMED)
       argmin_{{θ_i},{κ_l}} Σ_{(i,l)∈[K]×[L]} N_{i,l}(t) (d_KL(μ̂_{i,l}(t), θ_i κ_l) − f(N_{i,l}(t)))_+   (PMED-Hinge)
8:   if Algorithm is PMED-Hinge then
9:     If |θ̂_i(t) − θ̂_j(t)| < β/(log log t) for some i ≠ j or |κ̂_l(t) − κ̂_m(t)| < β/(log log t) for some l ≠ m, then put all of v^mod_1, ..., v^mod_K into L_N.
10:    If ∪_{(i,l)∈[K]×[L]} {d_KL(μ̂_{i,l}(t), θ̂_i(t)κ̂_l(t)) > f(N_{i,l}(t))} holds, then put all of v^mod_1, ..., v^mod_K into L_N.
11:  end if
12:  Compute {q_{i,l}} ∈ R*_{1̂(t),...,L̂(t)}({μ̂_{i,l}(t)}, {θ̂_i(t)}, {κ̂_l(t)})   (PMED)
       or {q_{i,l}} ∈ R^{*,H}_{1̂(t),...,L̂(t)}({μ̂_{i,l}(t)}, {θ̂_i(t)}, {κ̂_l(t)}, {f(N_{i,l}(t))})   (PMED-Hinge)
13:  Ñ_{i,l} ← q_{i,l} log t for each (i, l) ∈ [K] × [K].
14:  Decompose Ñ_{i,l} = Σ_v c^req_v e_v, where each e_v is a permutation matrix and c^req_v > 0, by using Algorithm 2.
15:  r_{i,l} ← N_{i,l}(t).
16:  for each permutation matrix e_v do
17:    c^aff_v ← min(c^req_v, max_c {c > 0 : min_{(i,l)∈[K]×[L]} (r_{i,l} − c e_{v,i,l}) ≥ 0}).
18:    Let (v_1, ..., v_L) be the L-allocation corresponding to e_v. If c^aff_v < c^req_v and there exists a pair (v_l, l) that is in none of the L-allocations in L_N, then put (v_1, ..., v_L) into L_N.
19:    r_{i,l} ← r_{i,l} − c^aff_v e_{v,i,l}.
20:  end for
21:  Select I(t) ∈ L_C in an arbitrary fixed order. L_C ← L_C \ {I(t)}.
22:  Put (1̂(t), ..., L̂(t)) into L_N.
23:  If L_C = ∅ then L_C ← L_N, L_N ← ∅.
24: end while

There are three lines where L-allocations are put into L_N without duplication: Lines 5, 18, and 22. We explain each of these lines below.

Line 5 is a uniform exploration over all pairs (i, l). For m ∈ [K], let v^mod_m be the L-allocation (1 + mod_K(m), 1 + mod_K(1 + m), ..., 1 + mod_K(L + m − 1)), where mod_K(x) is the minimum non-negative integer among {x − cK : c ∈ N}. From the definition of v^mod_m, any pair (i, l) ∈ [K] × [L] belongs to exactly one of v^mod_1, ..., v^mod_K. If some pair (i, l) is not allocated α√(log t) times, a corresponding L-allocation is put into L_N. This exploration stabilizes the estimators.

Line 18 and related routines are based on the optimal amount of exploration.
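As a concrete illustration of the cyclic construction of v^mod_m in Line 5, the following sketch (our own 0-indexed variant; the name `v_mod` is ours, not from the paper) generates the K allocations and checks the covering property:

```python
def v_mod(m, K, L):
    # Slot j+1 receives arm 1 + (m + j) mod K; a 0-indexed variant of v^mod_m.
    return tuple(1 + (m + j) % K for j in range(L))

K, L = 5, 2
covered = set()
for m in range(K):
    for slot, arm in enumerate(v_mod(m, K, L), start=1):
        covered.add((arm, slot))
# Every (arm, slot) pair in [K] x [L] appears in exactly one allocation:
# K allocations x L slots = K*L pairs, and they are all distinct.
assert covered == {(i, l) for i in range(1, K + 1) for l in range(1, L + 1)}
```

Because each of the K allocations shifts the cyclic order by one, the K·L allocated pairs are pairwise distinct and jointly cover [K] × [L].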
{Ñ_{i,l}}_{i∈[K], l∈[K]} is calculated by plugging the maximum likelihood estimator (MLE) ({θ̂_i}_{i∈[K]}, {κ̂_l}_{l∈[L]}) into the optimization problem of (2). As {Ñ_{i,l}} is a set of K × K variables², the algorithm needs to convert it into a set of L-allocations to put them into L_N. This is done by decomposing it into a set of permutation matrices, which we will explain in Section 4.2.

Line 22 is for exploitation: if no pair is put into L_N by Line 5 or Line 18 and L_C is empty, then Line 22 puts the arms (1̂(t), ..., L̂(t)) with the top-L largest {θ̂_i(t)} (with ties broken arbitrarily) into L_N.

²K × K is not a typo of K × L: {q_{i,l}} and {Ñ_{i,l}} are sets of K² variables.

Algorithm 2 Permutation Matrix Decomposition
1: Input: N_{i,l}.
2: N̄_{i,l} ← N_{i,l}.
3: while N̄_{i,l} > 0 for some (i, l) ∈ [K] × [K] do
4:   Find a permutation matrix e_v such that, for any (i, l), e_{v,i,l} = 1 ⇒ N̄_{i,l} > 0.
5:   c^req_v ← max{c > 0 : min_{(i,l)∈[K]×[K]} (N̄_{i,l} − c e_{v,i,l}) ≥ 0}.
6:   N̄_{i,l} ← N̄_{i,l} − c^req_v e_{v,i,l} for each (i, l) ∈ [K] × [K].
7: end while
8: Output {c^req_v, e_v}.

Figure 1: A permutation matrix with K = 4, where the (i, l) entry is 1 for (i, l) ∈ {(1, 1), (2, 3), (3, 2), (4, 4)}. If L = 2, this matrix corresponds to allocating arm 1 in slot 1 and arm 3 in slot 2.

4.2 Permutation Matrix and Allocation Strategy

In this section, we discuss the way to convert {Ñ_{i,l}} = {q_{i,l} log t}, the estimated optimal amount of exploration, into L-allocations. A permutation matrix is a square matrix that has exactly one entry of 1 in each row and each column and 0s elsewhere (Figure 1, left). There are K! permutation matrices, since they correspond to orderings of K elements.
Therefore, even though {q_{i,l}} can obviously be decomposed into a linear combination of permutation matrices, it is not clear how to compute such a decomposition without enumerating the set of all permutation matrices, which is exponentially large in K. Algorithm 2 solves this problem: let N̄_{i,l} be a temporary variable that is initialized to Ñ_{i,l} at the beginning. In each iteration, it subtracts a scalar multiple of a permutation matrix e_v whose entries e_{v,i,l} of value 1 correspond to N̄_{i,l} > 0 (Line 6 in Algorithm 2). This boils down to finding a perfect matching in a bipartite graph where the left (resp. right) nodes correspond to rows (resp. columns) and an edge is spanned between nodes i and l if N̄_{i,l} > 0. Although a naive greedy method fails in such a matching problem (c.f., Appendix A), a maximum matching in a bipartite graph can be computed by the Hopcroft–Karp algorithm [18] in O(K^2.5) time, and Theorem 4 below ensures that the maximum matching is always perfect:

Theorem 4. (Existence of a perfect matching) For any {N̄_{i,l}}_{(i,l)∈[K]×[K]} such that N̄_{i,l} ≥ 0, N̄_{i,l} > 0 for some (i, l), and the sums of each row and column are equal, there exists a permutation matrix e_v such that ∀(i, l) ∈ [K] × [K]: e_{v,i,l} = 1 ⇒ N̄_{i,l} > 0.

The proof of Theorem 4 is in Appendix E. Each subtraction increases the number of 0 entries in N̄_{i,l} (Line 5 in Algorithm 2); Algorithm 2 runs in O(K^4.5) time by computing at most O(K^2) perfect-matching sub-problems, and as a result it decomposes Ñ_{i,l} into a positive linear combination of permutation matrices.
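The decomposition loop can be sketched as follows (a minimal illustration of ours, not the paper's implementation: it uses a simple augmenting-path bipartite matching in place of Hopcroft–Karp, which is asymptotically slower but easier to follow; the names `decompose` and `perfect_matching` are ours):

```python
def _augment(row, adj, seen, match_to):
    # Try to match `row` to some column via an augmenting path (Kuhn's algorithm).
    for col in adj[row]:
        if col not in seen:
            seen.add(col)
            if match_to[col] is None or _augment(match_to[col], adj, seen, match_to):
                match_to[col] = row
                return True
    return False

def perfect_matching(adj, K):
    """Return perm with perm[row] = col, or None if no perfect matching exists.
    (For matrices with equal row and column sums, existence is guaranteed.)"""
    match_to = [None] * K          # column -> row
    for row in range(K):
        if not _augment(row, adj, set(), match_to):
            return None
    perm = [None] * K
    for col, row in enumerate(match_to):
        perm[row] = col
    return perm

def decompose(N, eps=1e-9):
    """Write the K x K nonnegative matrix N (equal row/column sums) as a
    positive combination sum_v c_v * P_v of permutation matrices."""
    K = len(N)
    Nbar = [list(row) for row in N]
    result = []
    while max(max(row) for row in Nbar) > eps:
        # Edges of the bipartite graph: (i, l) with a strictly positive entry.
        adj = [[l for l in range(K) if Nbar[i][l] > eps] for i in range(K)]
        perm = perfect_matching(adj, K)
        c = min(Nbar[i][perm[i]] for i in range(K))   # coefficient c^req_v
        for i in range(K):
            Nbar[i][perm[i]] -= c
        result.append((c, perm))
    return result
```

Each subtraction zeroes at least one entry of the working matrix, so the loop terminates after at most K² matchings, mirroring the argument in the text.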
The main algorithm checks whether each of the entries of the permutation matrices is sufficiently explored (Line 18 in Algorithm 1), and draws an L-allocation corresponding to a permutation matrix (Figure 1, right) if it is under-explored.

5 Optimizations

This section discusses two optimizations that appear in Algorithm 1, namely, the MLE computation (Line 7) and the computation of the optimal solution (Line 12).

The MLE (Line 7) is the solution of a bi-convex optimization: the optimization of {θ_i} (resp. {κ_l}) is convex when we view {κ_l} (resp. {θ_i}) as a constant. Therefore, off-the-shelf tools for optimizing convex functions (e.g., Newton's method) are applicable to alternately optimizing {θ_i} and {κ_l}. Assuming that each convex optimization yields an optimal value, such an alternate optimization monotonically decreases the objective function and thus converges. Note that a local minimum obtained by bi-convex optimization is not always a global minimum, due to its non-convex nature.

Although the computation of the optimal solution (Line 12) involves {θ'_i} and {κ'_l}, the constraint eliminates the latter variables as κ'_i = θ̂_i(t)κ̂_i(t)/θ'_i. This optimization is a linear semi-infinite program (LSIP) on {q_{i,l}}, which is a linear program (LP) with an infinite set of linear constraints parameterized by {θ'_i}. Algorithm 3 is the cutting-plane method with a pessimistic oracle [29] that boils the LSIP down to finite-constraint LPs. At each iteration s, it adds a new constraint {θ^(s)_i} ∈ T^c_{1̂(t),...,L̂(t)} that is "hardest" in the sense that it minimizes the sum of divergences (Line 4 in Algorithm 3). The following theorem guarantees the convergence of the algorithm when the exactly hardest constraint is found.

Algorithm 3 Cutting-plane method for obtaining {q_{i,l}} on Line 12 of Algorithm 1
1: Input: the number of iterations S, nominal constraint {θ^(0)_i} ∈ T^c_{1̂(t),...,L̂(t)}.
2: for s = 1, 2, ..., S do
3:   Find q^(s)_{i,l} ← argmin_{{q_{i,l}}∈Q} Σ_{(i,l)∈[K]×[L]} Δ_{i,l} q_{i,l} such that Σ_{(i,l)∈[K]×[L]: i≠l̂(t)} q_{i,l} d_KL(μ̂_{i,l}(t), θ'_i θ̂_l(t)κ̂_l(t)/θ'_l) ≥ 1 for all {θ'_i} ∈ {{θ^(0)_i}, ..., {θ^(s−1)_i}}.
4:   Find {θ^(s)_i} ← argmin_{{θ'_i}} Σ_{(i,l)∈[K]×[L]} q^(s)_{i,l} d_KL(μ̂_{i,l}(t), θ'_i θ̂_l(t)κ̂_l(t)/θ'_l).
5: end for

Theorem 5. (Convergence of the cutting-plane method, Mutapcic and Boyd [29, Section 5.2]) Assume that there exists a constant C such that the constraint f({θ'_i}) = Σ_{(i,l)∈[K]×[L]} q^(s)_{i,l} d_KL(μ̂_{i,l}(t), θ'_i θ̂_l(t)κ̂_l(t)/θ'_l) is Lipschitz continuous, i.e., |f({θ^(1)_i}) − f({θ^(2)_i})| ≤ C ||{θ^(1)_i} − {θ^(2)_i}||, where the norm ||·|| is any Lp norm. Then, Algorithm 3 converges to its optimal solution as S → ∞.

Although the Lipschitz continuity assumption does not hold in general, because d_KL(p, q) approaches infinity when q is close to 0 or 1, by restricting q to some region [ε, 1 − ε] Lipschitz continuity can be guaranteed for some C = C(ε). Theorem 5 also assumes the availability of an exact solution to the hardest constraint, which is generally hard to obtain, since this objective is non-convex in nature. Still, we can obtain a fair solution for the following reasons. First, although the space T^c_{1̂(t),...,L̂(t)} is not convex, it suffices to consider each of the convex subspaces {{θ'_i} ∈ (0, 1)^K : θ'_{1̂(t)} ≥ ··· ≥ θ'_{X̂(t)} = θ'_l ≥ ··· ≥ θ'_{L̂(t)}}, where X = min(L, l − 1), for each l ∈ [K] \ {1} separately, because the hardest constraint is always in one of these subspaces (which follows from the convexity of the objective function). Second, the following bi-convex relaxation can be used: let η'_1, ..., η'_L be auxiliary variables that correspond to 1/θ'_1, ..., 1/θ'_L. Namely, we optimize the relaxed objective function Σ_{(i,l)∈[K]×[L]} q^(s)_{i,l} d_KL(μ̂_{i,l}(t), θ'_i η'_l θ̂_l(t)κ̂_l(t)) + φ Σ_{i∈[L]} (θ'_i η'_i − 1)², where φ > 0 is a penalty parameter. Convexity of the KL divergence implies that this objective is a bi-convex function of {θ'_i} and {η'_l}, and thus an alternate optimization is effective. Setting φ → ∞ induces a solution in which η'_i is equal to 1/θ'_i ([30, Theorem 17.1]). Our algorithm starts with a small value of φ and then gradually increases φ.

6 Experiment

To evaluate the empirical performance of the proposed algorithms, we conducted computer simulations with synthetic and real-world datasets. The compared algorithms are MP-TS [24], dcmKL-UCB [21], PBM-PIE [27], and PMED (proposed in this paper). MP-TS is an algorithm based on Thompson sampling [32] that ignores position bias: it draws the top-L arms on the basis of posterior sampling, and the posterior is calculated without considering position bias. DcmKL-UCB is a KL-UCB [11] based algorithm that works under the DCM assumption. PBM-PIE is an algorithm that allocates the top-(L − 1) slots greedily and allocates the L-th arm based on the KL-UCB bound. Note that PBM-PIE requires an estimate of {κ*_l}; here, a bi-convex optimization is used to estimate it³. We did not test PBM-TS [27], which is another algorithm for PBM, mainly because its regret bound has not been derived yet. However, its regret appears to be asymptotically optimal when {κ*_l} are known (Figure 1(a) in Lagrée et al. [27]), and thus it does not explore sufficiently when there is uncertainty in the position bias. We set α = 10 for PMED. We used the Gurobi LP solver⁴ for solving the LPs. To speed up the computation, we skipped the bi-convex and LP optimizations in most rounds with large t and used the result of the last computation. We used Newton's method (resp. a gradient method) for computing the MLE (resp. the hardest constraint) in Algorithm 3.

Figure 2: Regret-round log-log plots of algorithms. (a) Synthetic. (b) Real-world (Tencent).

Synthetic data: This simulation was designed to check the consistency of the algorithms, and it involved 5 arms with (θ_1, . . . 
, \u03b85) = (0.95, 0.8, 0.65, 0.5, 0.35), and 2 slots with (\u03ba1, \u03ba2) = (1, 0.6).\nThe experimental results are shown on the left of Figure 2. The results are averaged over 100 runs. LB\nis the simulated value of the regret lower bound in Section 3. While the regret of PMED converges,\nthe other algorithms suffer a 100 times or larger regret than LB at T = 107, which implies that these\nalgorithms are not consistent under our model.\nReal-world data: Following the existing work [24, 27], we used the KDD Cup 2012 track 2 dataset\n[22] that involves session logs of soso.com, a search engine owned by Tencent. Each of the 150M\nlines from the log contains the user ID, the query, an ad, and a slot in {1, 2, 3} at which the ad was\ndisplayed and a binary reward indicated (click/no-click). Following Lagr\u00e9e et al. [27], we obtained\nmajor 8 queries. Using the click logs of the queries, the CTRs and position bias were estimated in\norder to maximize the likelihood by using bi-convex optimization in Section 4. Note that, the number\nof arms and parameters are slightly different from the ones reported previously [27]. For the sake\nof completeness, we show the parameters in Appendix C. We conducted 100 runs for each queries,\nand the right \ufb01gure in Figure 2 shows the averaged regret over 8 queries. Although the gap between\nPMED and existing algorithms are not drastic compared with synthetic parameters, the existing\nalgorithms suffer larger regret than PMED.\n\n7 Analysis\n\nAlthough the authors conjecture that PMED is optimal, it is hard to analyze it directly. The technically\nhardest part arises from the case in which the divergence of each action is small but not yet fully\nconverged. To circumvent these dif\ufb01culty, we devised a modi\ufb01ed algorithm called PMED-Hinge\n(Algorithm 1) that involves extra exploration. 
In particular, we modify the optimization problem as follows. Let

  R^H_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}, {δ_{i,l}}) = { {q_{i,l}} ∈ Q : inf_{ {θ'_i} ∈ T^c_{(1),...,(L)}, {κ'_l} ∈ K_all : ∀ l ∈ [L], d_KL(μ_{(l),l}, θ'_{(l)} κ'_l) ≤ δ_{(l),l} } Σ_{(i,l) ∈ [K]×[L] : i ≠ (l)} q_{i,l} (d_KL(μ_{i,l}, θ'_i κ'_l) − δ_{i,l})_+ ≥ 1 },

where (x)_+ = max(x, 0). Moreover, let

  C^{*,H}_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}, {δ_{i,l}}) = inf_{ {q_{i,l}} ∈ R^H_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}, {δ_{i,l}}) } Σ_{(i,l) ∈ [K]×[L]} Δ_{i,l} q_{i,l},

the optimal solution of which is

  R^{*,H}_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}, {δ_{i,l}}) = { {q_{i,l}} ∈ R^H_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}, {δ_{i,l}}) : Σ_{(i,l) ∈ [K]×[L]} Δ_{i,l} q_{i,l} = C^{*,H}_{(1),...,(L)}({μ_{i,l}}, {θ_i}, {κ_l}, {δ_{i,l}}) }.

The necessity of the additional terms in PMED-Hinge is discussed in Appendix B. The following theorem, whose proof is given in Appendix G, derives a regret upper bound that matches the lower bound in Theorem 2.

Theorem 6. (Asymptotic optimality of PMED-Hinge) Let the solution of the optimal exploration R^{*,H}_{1,...,L}({μ_{i,l}}, {θ_i}, {κ_l}, {δ_{i,l}}) restricted to l ≤ L be unique at ({μ*_{i,l}}, {θ*_i}, {κ*_l}, {0}). For any α > 0, β > 0, and γ > 0, the regret of PMED-Hinge is bounded as:

  E[Reg(T)] ≤ C*_{1,...,L}({μ*_{i,l}}, {θ*_i}, {κ*_l}) log T + o(log T).

Note that the assumption on the uniqueness of the solution in Theorem 6 is required only to achieve the optimal coefficient of the log T factor; it is not very difficult to derive an O(log T) regret even when the uniqueness condition is not satisfied. Although our regret bound is not finite-time, the only asymptotic part of the analysis comes from the optimal constant in front of the log T term (Lemma 11 in the Appendix), and it is not very hard to derive an O(log T) finite-time regret bound.

8 Conclusion

By providing a regret lower bound and an algorithm with a matching regret bound, we gave the first complete characterization of the position-based multiple-play multi-armed bandit problem in which both the quality of the arms and the discount factors of the slots are unknown. We also provided a way to compute the optimization problems related to the algorithm, which is of independent interest and is potentially applicable to other bandit problems.

Acknowledgements

The authors gratefully acknowledge Kohei Komiyama for discussions on permutation matrices and sincerely thank the anonymous reviewers for their useful comments. This work was supported in part by JSPS KAKENHI Grant Numbers 17K12736, 16H00881, and 15K00031, and an Inamori Foundation Research Grant.

References

[1] D. Agarwal, B.-C. Chen, and P. Elango. Spatio-temporal models for estimating click-through rate. In WWW, pages 21–30, 2009.

[2] V. Anantharam, P. Varaiya, and J. Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: i.i.d. rewards.
IEEE Transactions on Automatic Control, 32(11):968–976, 1987.

[3] P. Auer, Y. Freund, and R. E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 2002.

[4] G. Bartók, D. P. Foster, D. Pál, A. Rakhlin, and C. Szepesvári. Partial monitoring - classification, regret bounds, and algorithms. Math. Oper. Res., 39(4):967–997, 2014.

[5] S. Bubeck. Bandits Games and Clustering Foundations. PhD thesis, Université des Sciences et Technologie de Lille - Lille I, June 2010.

[6] A. Burnetas and M. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.

[7] A. Burnetas and M. Katehakis. Optimal adaptive policies for Markov decision processes. Math. Oper. Res., 22(1):222–255, Feb. 1997.

[8] R. Combes, S. Magureanu, A. Proutière, and C. Laroche. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS, pages 231–244, 2015.

[9] R. Combes, M. S. Talebi, A. Proutière, and M. Lelarge. Combinatorial bandits revisited. In NIPS, pages 2116–2124, 2015.

[10] N. Craswell, O. Zoeter, M. J. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In WSDM, pages 87–94, 2008.

[11] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, pages 359–376, 2011.

[12] A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In COLT, pages 998–1027, 2016.

[13] T. L. Graves and T. L. Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.

[14] F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search. In WSDM, pages 124–131, 2009.

[15] P. Hall.
On representatives of subsets. Journal of the London Mathematical Society, s1-10(1):26–30, 1935.

[16] W. W. Hogan. Point-to-set maps in mathematical programming. SIAM Review, 15(3):591–603, 1973.

[17] J. Honda and A. Takemura. An Asymptotically Optimal Bandit Algorithm for Bounded Support Models. In COLT, pages 67–79, 2010.

[18] J. E. Hopcroft and R. M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.

[19] Interactive Advertising Bureau. IAB internet advertising revenue report - 2016 full year results, 2017.

[20] S. Kale, L. Reyzin, and R. E. Schapire. Non-stochastic bandit slate problems. In NIPS, pages 1054–1062, 2010.

[21] S. Katariya, B. Kveton, C. Szepesvári, and Z. Wen. DCM bandits: Learning to rank with multiple clicks. In ICML, pages 1215–1224, 2016.

[22] KDD cup 2012 track 2, 2012.

[23] D. Kempe and M. Mahdian. A cascade model for externalities in sponsored search. In WINE, pages 585–596, 2008.

[24] J. Komiyama, J. Honda, and H. Nakagawa. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In ICML, pages 1152–1161, 2015.

[25] J. Komiyama, J. Honda, and H. Nakagawa. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. In NIPS, pages 1792–1800, 2015.

[26] B. Kveton, C. Szepesvári, Z. Wen, and A. Ashkan. Cascading bandits: Learning to rank in the cascade model. In ICML, pages 767–776, 2015.

[27] P. Lagrée, C. Vernade, and O. Cappé. Multiple-play bandits in the position-based model. In NIPS, pages 1597–1605, 2016.

[28] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[29] A. Mutapcic and S. P. Boyd.
Cutting-set methods for robust convex optimization with pessimizing oracles. Optimization Methods and Software, 24(3):381–406, 2009.

[30] J. Nocedal and S. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer New York, 2nd edition, 2006.

[31] A. Piccolboni and C. Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In COLT 2001 and EuroCOLT 2001, pages 208–223, 2001.

[32] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.

[33] H. P. Vanchinathan, G. Bartók, and A. Krause. Efficient partial monitoring with prior information. In NIPS, pages 1691–1699, 2014.

[34] S. Yuan, J. Wang, and X. Zhao. Real-time bidding for online advertising: Measurement and analysis. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising, ADKDD '13, pages 3:1–3:8. ACM, 2013.