{"title": "Double Thompson Sampling for Dueling Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 657, "abstract": "In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandit problems. As its name suggests, D-TS selects both the first and the second candidates according to Thompson Sampling. Specifically, D-TS maintains a posterior distribution for the preference matrix, and chooses the pair of arms for comparison according to two sets of samples independently drawn from the posterior distribution. This simple algorithm applies to general Copeland dueling bandits, including Condorcet dueling bandits as its special case. For general Copeland dueling bandits, we show that D-TS achieves $O(K^2 \\log T)$ regret. Moreover, using a back substitution argument, we refine the regret to $O(K \\log T + K^2 \\log \\log T)$ in Condorcet dueling bandits and many practical Copeland dueling bandits. In addition, we propose an enhancement of D-TS, referred to as D-TS+, that reduces the regret by carefully breaking ties. Experiments based on both synthetic and real-world data demonstrate that D-TS and D-TS$^+$ significantly improve the overall performance, in terms of regret and robustness.", "full_text": "Double Thompson Sampling for Dueling Bandits\n\nHuasen Wu\n\nUniversity of California, Davis\n\nhswu@ucdavis.edu\n\nXin Liu\n\nUniversity of California, Davis\n\nxinliu@ucdavis.edu\n\nAbstract\n\nIn this paper, we propose a Double Thompson Sampling (D-TS) algorithm for\ndueling bandit problems. As its name suggests, D-TS selects both the \ufb01rst and the\nsecond candidates according to Thompson Sampling. Speci\ufb01cally, D-TS maintains\na posterior distribution for the preference matrix, and chooses the pair of arms for\ncomparison according to two sets of samples independently drawn from the poste-\nrior distribution. 
This simple algorithm applies to general Copeland dueling bandits, including Condorcet dueling bandits as a special case. For general Copeland dueling bandits, we show that D-TS achieves $O(K^2 \log T)$ regret. Moreover, using a back substitution argument, we refine the regret to $O(K \log T + K^2 \log\log T)$ in Condorcet dueling bandits and most practical Copeland dueling bandits. In addition, we propose an enhancement of D-TS, referred to as D-TS+, to reduce the regret in practice by carefully breaking ties. Experiments based on both synthetic and real-world data demonstrate that D-TS and D-TS+ significantly improve the overall performance, in terms of regret and robustness.

1 Introduction

The dueling bandit problem [1] is a variant of the classical multi-armed bandit (MAB) problem, where the feedback comes in the form of pairwise comparison. This model has attracted much attention as it can be applied in many systems such as information retrieval (IR) [2, 3], where user preferences are easier to obtain and typically more stable. Most earlier work [1, 4, 5] focuses on Condorcet dueling bandits, where there exists an arm, referred to as the Condorcet winner, that beats all other arms. Recent work [6, 7] turns to the more general and practical case of Copeland winner(s), i.e., the arm (or arms) that beats the most other arms. Existing algorithms are mainly generalized from traditional MAB algorithms along two lines: 1) UCB (Upper Confidence Bound)-type algorithms, such as RUCB [4] and CCB [6]; and 2) MED (Minimum Empirical Divergence)-type algorithms, such as RMED [5] and CW-RMED/ECW-RMED [7].

In traditional MAB, an alternative effective solution is Thompson Sampling (TS) [8]. Its principle is to choose the action that maximizes the expected reward according to a randomly drawn belief. TS has been successfully applied in traditional MAB [9, 10, 11, 12] and other online learning problems [13, 14].
In particular, empirical studies in [9] show that TS not only achieves lower regret than other algorithms in practice, but is also more robust as a randomized algorithm.

In the wake of the success of TS in these online learning problems, a natural question is whether and how TS can be applied to dueling bandits to further improve the performance. However, it is challenging to apply the standard TS framework to dueling bandits, because not all comparisons provide information about the system statistics. Specifically, a good learning algorithm for dueling bandits will eventually compare the winner against itself. However, comparing one arm against itself does not provide any statistical information, which is critical in TS for updating the posterior distribution. Thus, TS needs to be adjusted so that 1) comparing the winners against themselves is allowed, but 2) getting trapped in comparing a non-winner arm against itself is avoided.

In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandits, including both Condorcet dueling bandits and general Copeland dueling bandits. As its name suggests, D-TS typically selects both the first and the second candidates according to samples independently drawn from the posterior distribution. D-TS also utilizes the idea of confidence bounds to eliminate likely non-winner arms, and thus avoids getting trapped in suboptimal comparisons. Compared to prior studies on dueling bandits, D-TS has both practical and theoretical advantages.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

First, the double sampling structure of D-TS better suits the nature of dueling bandits. Launching two independent rounds of sampling provides us the opportunity to select the same arm in both rounds and thus to compare the winners against themselves.
This double sampling structure also leads to more extensive utilization of TS (e.g., compared to RCS [3]), and significantly reduces the regret. In addition, this simple framework applies to general Copeland dueling bandits and achieves lower regret than existing algorithms such as CCB [6]. Moreover, as a randomized algorithm, D-TS is more robust in practice.

Second, this double sampling structure enables us to obtain theoretical bounds for the regret of D-TS. As noted in the traditional MAB literature [10, 15], theoretical analysis of TS is usually more difficult than that of UCB-type algorithms. The analysis in dueling bandits is even more challenging because the selection of arms involves more factors and the two selected arms may be correlated. To address this issue, our D-TS algorithm draws the two sets of samples independently. Because their distributions are fully captured by historic comparison results, when the first candidate is fixed, the comparison between it and all other arms is similar to a traditional MAB, and thus we can borrow ideas from the traditional MAB literature. Using the properties of TS and confidence bounds, we show that D-TS achieves $O(K^2 \log T)$ regret for a general $K$-armed Copeland dueling bandit. More interestingly, the property that the sample distribution depends only on historic comparison results (but not on $t$) enables us to refine the regret using a back substitution argument, where we show that D-TS achieves $O(K \log T + K^2 \log\log T)$ regret in Condorcet dueling bandits and many practical Copeland dueling bandits.

Based on this analysis, we further refine the tie-breaking criterion in D-TS and propose its enhancement, called D-TS+.
D-TS+ achieves the same theoretical regret bound as D-TS, but performs better in practice, especially when there are multiple winners.

In summary, the main contributions of this paper are as follows:

• We propose the D-TS algorithm and its enhancement D-TS+ for general Copeland dueling bandits. The double sampling structure suits the nature of dueling bandits and leads to more extensive usage of TS, which significantly reduces the regret.

• We obtain theoretical regret bounds for D-TS and D-TS+. For general Copeland dueling bandits, we show that D-TS and D-TS+ achieve $O(K^2 \log T)$ regret. In Condorcet dueling bandits and most practical Copeland dueling bandits, we further refine the regret bound to $O(K \log T + K^2 \log\log T)$ using a back substitution argument.

• We evaluate the D-TS and D-TS+ algorithms through experiments based on both synthetic and real-world data. The results show that D-TS and D-TS+ significantly improve the overall performance, in terms of regret and robustness, compared to existing algorithms.

2 Related Work

Early dueling bandit algorithms study finite-horizon settings, using "explore-then-exploit" approaches such as IF [1], BTM [16], and SAVAGE [17]. For infinite-horizon settings, recent work has generalized traditional MAB algorithms to dueling bandits along two lines. First, RUCB [4] and CCB [6] are generalizations of UCB for Condorcet and general Copeland dueling bandits, respectively. In addition, [18] reduces dueling bandits to traditional MAB, which is then solved by UCB-type algorithms called MultiSBM and Sparring.
Second, [5] and [7] extend the MED algorithm to dueling bandits, where they present a lower bound on the regret and propose corresponding optimal algorithms, including RMED for Condorcet dueling bandits [5], and CW-RMED and its computationally efficient version ECW-RMED for general Copeland dueling bandits [7]. Different from such existing work, we study algorithms for dueling bandits from the perspective of TS, which typically achieves lower regret and is more robust in practice.

Dating back to 1933, TS [8] is one of the earliest algorithms for the exploration/exploitation tradeoff. Nowadays, it has been applied in many variants of MAB [11, 12, 13] and other more complex problems, e.g., [14], due to its simplicity, good performance, and robustness [9]. Theoretical analysis of TS is much more difficult. Only recently, [10] proposed a logarithmic bound for the standard frequentist expected regret, whose constant factor is further improved in [15]. Moreover, [19, 20] derive bounds for its Bayesian expected regret through information-theoretic analysis.

TS has been preliminarily considered for dueling bandits [3, 21]. In particular, recent work [3] proposes a Relative Confidence Sampling (RCS) algorithm that combines TS with RUCB [4] for Condorcet dueling bandits. Under RCS, the first arm is selected by TS while the second arm is selected according to RUCB. Empirical studies demonstrate the performance improvement of using RCS in practice, but no theoretical bounds on the regret are provided.

3 System Model

We consider a dueling bandit problem with $K$ ($K \ge 2$) arms, denoted by $\mathcal{A} = \{1, 2, \ldots, K\}$. At each time-slot $t > 0$, a pair of arms $(a_t^{(1)}, a_t^{(2)})$ is displayed to a user and a noisy comparison outcome $w_t$ is obtained, where $w_t = 1$ if the user prefers $a_t^{(1)}$ to $a_t^{(2)}$, and $w_t = 2$ otherwise.
We assume the user preference is stationary over time and the distribution of comparison outcomes is characterized by the preference matrix $P = [p_{ij}]_{K\times K}$, where $p_{ij}$ is the probability that the user prefers arm $i$ to arm $j$, i.e., $p_{ij} = \mathbb{P}\{i \succ j\}$, $i, j = 1, 2, \ldots, K$. We assume that the displaying order does not affect the preference, and hence, $p_{ij} + p_{ji} = 1$ and $p_{ii} = 1/2$. We say that arm $i$ beats arm $j$ if $p_{ij} > 1/2$.

We study the general Copeland dueling bandits, where the Copeland winner is defined as the arm (or arms) that maximizes the number of other arms it beats [6, 7]. Specifically, the Copeland score of arm $i$ is defined as $\sum_{j \neq i} \mathbb{1}(p_{ij} > 1/2)$, and the normalized Copeland score is defined as $\zeta_i = \frac{1}{K-1}\sum_{j \neq i} \mathbb{1}(p_{ij} > 1/2)$, where $\mathbb{1}(\cdot)$ is the indicator function. Let $\zeta^*$ be the highest normalized Copeland score, i.e., $\zeta^* = \max_{1 \le i \le K} \zeta_i$. Then the Copeland winner is defined as the arm (or arms) with the highest normalized Copeland score, i.e., $\mathcal{C}^* = \{i : 1 \le i \le K, \zeta_i = \zeta^*\}$. Note that the Condorcet winner is a special case of Copeland winner with $\zeta^* = 1$.

A dueling bandit algorithm $\Gamma$ decides which pair of arms to compare depending on the historic observations. Specifically, define a filtration $\mathcal{H}_{t-1}$ as the history before $t$, i.e., $\mathcal{H}_{t-1} = \{a_\tau^{(1)}, a_\tau^{(2)}, w_\tau, \tau = 1, 2, \ldots, t-1\}$. Then a dueling bandit algorithm $\Gamma$ is a function that maps $\mathcal{H}_{t-1}$ to $(a_t^{(1)}, a_t^{(2)})$, i.e., $(a_t^{(1)}, a_t^{(2)}) = \Gamma(\mathcal{H}_{t-1})$. The performance of a dueling bandit algorithm $\Gamma$ is measured by its expected cumulative regret, which is defined as

$$R_\Gamma(T) = \zeta^* T - \frac{1}{2}\sum_{t=1}^{T} \mathbb{E}\big[\zeta_{a_t^{(1)}} + \zeta_{a_t^{(2)}}\big]. \qquad (1)$$

The objective of $\Gamma$ is then to minimize $R_\Gamma(T)$.
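The quantities above are straightforward to compute from a preference matrix; the following minimal NumPy sketch (function names are ours, not from the paper) illustrates the normalized Copeland score and the per-comparison regret of Eq. (1):

```python
import numpy as np

def copeland_scores(P):
    """Normalized Copeland scores: zeta_i = (1/(K-1)) * #{j != i : p_ij > 1/2}."""
    K = P.shape[0]
    beats = (P > 0.5).astype(float)
    np.fill_diagonal(beats, 0.0)  # p_ii = 1/2, so an arm never beats itself
    return beats.sum(axis=1) / (K - 1)

def regret_increment(P, a1, a2):
    """One-slot regret zeta* - (zeta_{a1} + zeta_{a2}) / 2 from Eq. (1)."""
    zeta = copeland_scores(P)
    return zeta.max() - (zeta[a1] + zeta[a2]) / 2

# Toy 3-armed example: arm 0 beats arms 1 and 2, so it is the Condorcet winner.
P = np.array([[0.5, 0.6, 0.7],
              [0.4, 0.5, 0.8],
              [0.3, 0.2, 0.5]])
print(copeland_scores(P))         # zeta = [1.0, 0.5, 0.0]
print(regret_increment(P, 0, 0))  # comparing the winner with itself incurs no regret
```

Note that comparing the winner against itself incurs zero regret, which is exactly why a good dueling bandit algorithm must be able to select the same arm twice.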
As pointed out in [6], the results can be adapted to other regret definitions, because the above definition bounds the number of suboptimal comparisons.

4 Double Thompson Sampling

4.1 D-TS Algorithm

We present the D-TS algorithm for Copeland dueling bandits, as described in Algorithm 1 (the time index $t$ is omitted in the pseudocode for brevity). As its name suggests, the basic idea of D-TS is to select both the first and the second candidates by TS. For each pair $(i, j)$ with $i \neq j$, we assume a beta prior distribution for its preference probability $p_{ij}$. These distributions are updated according to the comparison results $B_{ij}(t-1)$ and $B_{ji}(t-1)$, where $B_{ij}(t-1)$ (resp. $B_{ji}(t-1)$) is the number of time-slots when arm $i$ (resp. $j$) beats arm $j$ (resp. $i$) before $t$. D-TS selects the two candidates by sampling from the posterior distributions.

Specifically, at each time-slot $t$, the D-TS algorithm consists of two phases that select the first and the second candidates, respectively. When choosing the first candidate $a_t^{(1)}$, we first use the RUCB [4] of $p_{ij}$ to eliminate the arms that are unlikely to be the Copeland winner, resulting in a candidate set $\mathcal{C}_t$ (Lines 4 to 6). The algorithm then samples $\theta_{ij}^{(1)}(t)$ from the posterior beta distribution, and the first candidate $a_t^{(1)}$ is chosen by "majority voting", i.e., the arm within $\mathcal{C}_t$ that beats the most arms according to $\theta_{ij}^{(1)}(t)$ will be selected (Lines 7 to 11). The ties are broken randomly here for simplicity and will be refined later in Section 4.3.
A similar idea is applied to select the second candidate $a_t^{(2)}$: new samples $\theta_{i a_t^{(1)}}^{(2)}(t)$ are generated, and the arm with the largest $\theta_{i a_t^{(1)}}^{(2)}(t)$ among all arms with $l_{i a_t^{(1)}}(t) \le 1/2$ is selected as the second candidate (Lines 13 to 14).

The double sampling structure of D-TS is designed based on the nature of dueling bandits, i.e., at each time-slot, two arms are needed for comparison. Unlike RCS [3], D-TS selects both candidates using TS. This leads to more extensive utilization of TS and thus achieves much lower regret. Moreover, the two sets of samples are independently distributed, following the same posterior that is determined only by the comparison statistics $B_{ij}(t-1)$ and $B_{ji}(t-1)$. This property enables us to obtain an $O(K^2 \log T)$ regret bound and to further refine it by a back substitution argument, as discussed later.

Algorithm 1 D-TS for Copeland Dueling Bandits
1: Init: $B \leftarrow 0_{K \times K}$; // $B_{ij}$ is the number of time-slots that the user prefers arm $i$ to $j$.
2: for $t = 1$ to $T$ do
3:   // Phase 1: Choose the first candidate $a^{(1)}$
4:   $U := [u_{ij}]$, $L := [l_{ij}]$, where $u_{ij} = \frac{B_{ij}}{B_{ij}+B_{ji}} + \sqrt{\frac{\alpha \log t}{B_{ij}+B_{ji}}}$ and $l_{ij} = \frac{B_{ij}}{B_{ij}+B_{ji}} - \sqrt{\frac{\alpha \log t}{B_{ij}+B_{ji}}}$ if $i \neq j$, and $u_{ii} = l_{ii} = 1/2$, $\forall i$; // $x/0 := 1$ for any $x$.
5:   $\hat\zeta_i \leftarrow \frac{1}{K-1}\sum_{j \neq i} \mathbb{1}(u_{ij} > 1/2)$; // Upper bound of the normalized Copeland score.
6:   $\mathcal{C} \leftarrow \{i : \hat\zeta_i = \max_j \hat\zeta_j\}$;
7:   for $i, j = 1, \ldots, K$ with $i < j$ do
8:     Sample $\theta_{ij}^{(1)} \sim \mathrm{Beta}(B_{ij} + 1, B_{ji} + 1)$;
9:     $\theta_{ji}^{(1)} \leftarrow 1 - \theta_{ij}^{(1)}$;
10:  end for
11:  $a^{(1)} \leftarrow \arg\max_{i \in \mathcal{C}} \sum_{j \neq i} \mathbb{1}(\theta_{ij}^{(1)} > 1/2)$; // Choosing from $\mathcal{C}$ to eliminate likely non-winner arms; ties are broken randomly.
12:  // Phase 2: Choose the second candidate $a^{(2)}$
13:  Sample $\theta_{i a^{(1)}}^{(2)} \sim \mathrm{Beta}(B_{i a^{(1)}} + 1, B_{a^{(1)} i} + 1)$ for all $i \neq a^{(1)}$, and let $\theta_{a^{(1)} a^{(1)}}^{(2)} = 1/2$;
14:  $a^{(2)} \leftarrow \arg\max_{i: l_{i a^{(1)}} \le 1/2} \theta_{i a^{(1)}}^{(2)}$; // Choosing only from uncertain pairs.
15:  // Compare and Update
16:  Compare pair $(a^{(1)}, a^{(2)})$ and observe the result $w$;
17:  Update $B$: $B_{a^{(1)} a^{(2)}} \leftarrow B_{a^{(1)} a^{(2)}} + 1$ if $w = 1$, or $B_{a^{(2)} a^{(1)}} \leftarrow B_{a^{(2)} a^{(1)}} + 1$ if $w = 2$;
18: end for

We also note that RUCB-based elimination (Lines 4 to 6) and RLCB (Relative Lower Confidence Bound)-based elimination (Line 14) are essential in D-TS. Without these eliminations, the algorithm may get trapped in suboptimal comparisons. Consider one extreme case in Condorcet dueling bandits¹: assume arm 1 is the Condorcet winner with $p_{1j} = 0.501$ for all $j > 1$, and arm 2 is not the Condorcet winner, but with $p_{2j} = 1$ for all $j > 2$. Then for a larger $K$ (e.g., $K > 4$), without RUCB-based elimination, the algorithm may get trapped in $a_t^{(1)} = 2$ for a long time, because arm 2 is likely to receive a higher score than arm 1. This issue can be addressed by RUCB-based elimination as follows: when chosen as the first candidate, arm 2 has a great probability of being compared with arm 1; after sufficient comparisons with arm 1, arm 2 will have $u_{21}(t) < 1/2$ with high probability; then arm 2 is likely to be eliminated because arm 1 has $\hat\zeta_1(t) = 1 > \hat\zeta_2(t)$ with high probability. Similarly, RLCB-based elimination (Line 14, where we restrict to the arms with $l_{i a_t^{(1)}}(t) \le 1/2$) is important, especially for non-Condorcet dueling bandits. Specifically, $l_{i a_t^{(1)}}(t) > 1/2$ indicates that arm $i$ beats $a_t^{(1)}$ with high probability.
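Putting the two phases together, one D-TS time-slot can be sketched in a few lines of NumPy (function and variable names are ours; the paper's $x/0 := 1$ convention is replaced by giving never-compared pairs an infinite confidence radius, which has the same effect on Lines 5 and 14):

```python
import numpy as np

rng = np.random.default_rng(0)

def dts_step(B, t, alpha=0.51):
    """One time-slot of D-TS (Algorithm 1): return the pair (a1, a2) to compare.

    B[i, j] = number of past time-slots in which arm i beat arm j.
    """
    K = B.shape[0]
    N = B + B.T
    safe_n = np.maximum(N, 1)
    mean = np.where(N > 0, B / safe_n, 0.5)
    # Never-compared pairs get an infinite radius: u_ij > 1/2 and l_ij <= 1/2,
    # mimicking the effect of the x/0 := 1 convention in the paper.
    rad = np.where(N > 0, np.sqrt(alpha * np.log(t) / safe_n), np.inf)
    U, L = mean + rad, mean - rad
    np.fill_diagonal(U, 0.5)
    np.fill_diagonal(L, 0.5)

    # Phase 1: RUCB-based elimination, then "majority voting" on one sample set.
    zeta_hat = (U > 0.5).sum(axis=1) / (K - 1)   # diagonal is exactly 0.5, never counted
    C = np.flatnonzero(zeta_hat == zeta_hat.max())
    theta1 = np.full((K, K), 0.5)
    for i in range(K):
        for j in range(i + 1, K):
            theta1[i, j] = rng.beta(B[i, j] + 1, B[j, i] + 1)
            theta1[j, i] = 1.0 - theta1[i, j]
    votes = (theta1 > 0.5).sum(axis=1)
    tied = C[votes[C] == votes[C].max()]
    a1 = int(rng.choice(tied))                   # ties broken uniformly at random

    # Phase 2: an independent sample set against a1, restricted by the RLCB.
    theta2 = np.array([rng.beta(B[i, a1] + 1, B[a1, i] + 1) if i != a1 else 0.5
                       for i in range(K)])
    uncertain = np.flatnonzero(L[:, a1] <= 0.5)  # always contains a1 itself
    a2 = int(uncertain[np.argmax(theta2[uncertain])])
    return a1, a2

def dts_update(B, a1, a2, w):
    """Record outcome w (1 if a1 won, 2 if a2 won) of comparing (a1, a2)."""
    if w == 1:
        B[a1, a2] += 1
    else:
        B[a2, a1] += 1
```

Because the "uncertain" set always contains $a_1$ itself (its lower bound is pinned at $1/2$), the sketch can output $(a_1, a_1)$, i.e., compare a winner against itself, which is the behavior the double sampling structure is designed to permit.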
Thus, comparing $a_t^{(1)}$ with arm $i$ brings little information gain, and such comparisons should be eliminated to minimize the regret.

¹A Borda winner may be more appropriate in this special case [22], and we mainly use it to illustrate the dilemma.

4.2 Regret Analysis

Before conducting the regret analysis, we first introduce certain notations that will be used later.
Gap to 1/2: In dueling bandits, an important benchmark for $p_{ij}$ is $1/2$, and thus we let $\Delta_{ij}$ be the gap between $p_{ij}$ and $1/2$, i.e., $\Delta_{ij} = |p_{ij} - 1/2|$.
Number of comparisons: Under D-TS, $(i, j)$ can be compared in the form of $(a_t^{(1)}, a_t^{(2)}) = (i, j)$ or $(a_t^{(1)}, a_t^{(2)}) = (j, i)$. We consider these two cases separately and define the following counters: $N_{ij}^{(1)}(t) = \sum_{\tau=1}^{t} \mathbb{1}(a_\tau^{(1)} = i, a_\tau^{(2)} = j)$ and $N_{ij}^{(2)}(t) = \sum_{\tau=1}^{t} \mathbb{1}(a_\tau^{(1)} = j, a_\tau^{(2)} = i)$. Then the total number of comparisons is $N_{ij}(t) = N_{ij}^{(1)}(t) + N_{ij}^{(2)}(t)$ for $i \neq j$, and $N_{ii}(t) = N_{ii}^{(1)}(t) = N_{ii}^{(2)}(t)$ for $i = j$.

4.2.1 $O(K^2 \log T)$ Regret

To obtain theoretical bounds for the regret of D-TS, we make the following assumption:
Assumption 1: The preference probability $p_{ij} \neq 1/2$ for any $i \neq j$.
Under Assumption 1, we present the first result for D-TS in general Copeland dueling bandits:

Proposition 1. When applying D-TS with $\alpha > 0.5$ in a Copeland dueling bandit with a preference matrix $P = [p_{ij}]_{K\times K}$ satisfying Assumption 1, its regret is bounded as:

$$R_{\mathrm{D\text{-}TS}}(T) \le \sum_{i \neq j: p_{ij} < 1/2}\Big[\frac{4\alpha \log T}{\Delta_{ij}^2} + (1+\epsilon)\frac{\log T}{D(p_{ij}\|1/2)}\Big] + O\Big(\frac{K^2}{\epsilon^2}\Big), \qquad (2)$$

where $\epsilon > 0$ is an arbitrary constant, and $D(p\|q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$ is the KL divergence.

The summation in Eq. (2) is conducted over all pairs $(i, j)$ with $p_{ij} < 1/2$. Thus, Proposition 1 states that D-TS achieves $O(K^2 \log T)$ regret in Copeland dueling bandits. To the best of our knowledge, this is the first theoretical bound for TS in dueling bandits. The scaling behavior of this bound with respect to $T$ is order optimal, since a lower bound $\Omega(\log T)$ has been shown in [7]. The refinement of the scaling behavior with respect to $K$ will be discussed later.

Proving Proposition 1 requires bounding the number of comparisons for all pairs $(i, j)$ with $i \notin \mathcal{C}^*$ or $j \notin \mathcal{C}^*$. When fixing the first candidate as $a_t^{(1)} = i$, the selection of the second candidate $a_t^{(2)}$ is similar to a traditional $K$-armed bandit problem with expected utilities $p_{ji}$ ($j = 1, 2, \ldots, K$). However, the analysis is more complex here, since different arms are eliminated differently depending on the value of $p_{ji}$. We prove Proposition 1 through Lemmas 1 to 3, which bound the number of comparisons for all suboptimal pairs $(i, j)$ under the different scenarios $p_{ji} < 1/2$, $p_{ji} > 1/2$, and $p_{ji} = 1/2$ (i.e., $j = i \notin \mathcal{C}^*$), respectively.

Lemma 1. Under D-TS, for an arbitrary constant $\epsilon > 0$ and any pair $(i, j)$ with $p_{ji} < 1/2$, we have

$$\mathbb{E}[N_{ij}^{(1)}(T)] \le (1+\epsilon)\frac{\log T}{D(p_{ji}\|1/2)} + O\Big(\frac{1}{\epsilon^2}\Big). \qquad (3)$$

Proof. We can prove this lemma by viewing the comparison between the first candidate arm $i$ and its inferiors as a traditional MAB.
In fact, it may be even simpler than that in [15] because under D-TS, arm $j$ with $p_{ji} < 1/2$ is competing with arm $i$ itself, whose $p_{ii} = 1/2$ is known and fixed. Then we can bound $\mathbb{E}[N_{ij}^{(1)}(T)]$ using the techniques in [15]. Details can be found in Appendix B.1.

Lemma 2. Under D-TS with $\alpha > 0.5$, for any pair $(i, j)$ with $p_{ji} > 1/2$, we have

$$\mathbb{E}[N_{ij}^{(1)}(T)] \le \frac{4\alpha \log T}{\Delta_{ji}^2} + O(1). \qquad (4)$$

Proof. We note that when $a_t^{(1)} = i$, arm $j$ can be selected as $a_t^{(2)}$ only when its RLCB satisfies $l_{ji}(t) \le 1/2$. Then we can bound $\mathbb{E}[N_{ij}^{(1)}(T)]$ by $O(\frac{4\alpha \log T}{\Delta_{ji}^2})$ similarly to the analysis of traditional UCB algorithms [23]. Details can be found in Appendix B.2.

Lemma 3. Under D-TS, for any arm $i \notin \mathcal{C}^*$, we have

$$\mathbb{E}[N_{ii}(T)] \le O(K) + \sum_{k: p_{ki} > 1/2} \Theta\Big(\frac{1}{\Delta_{ki}^2} + \frac{1}{\Delta_{ki}^2 D(1/2\|p_{ki})} + \frac{1}{\Delta_{ki}^4}\Big) = O(K). \qquad (5)$$

Before proving Lemma 3, we present an important property for $\hat\zeta^*(t) := \max_{1 \le i \le K} \hat\zeta_i(t)$. Recall that $\zeta^*$ is the maximum normalized Copeland score. Using the concentration property of RUCB (Lemma 6 in Appendix A), the following lemma shows that $\hat\zeta^*(t)$ is indeed a UCB of $\zeta^*$.

Lemma 4. For any $\alpha > 0.5$ and $t > 0$, $\mathbb{P}\{\hat\zeta^*(t) \ge \zeta^*\} \ge 1 - K\big[\frac{\log t}{\log(\alpha+1/2)} + 1\big] t^{-\frac{2\alpha}{\alpha+1/2}}$.

We now return to the proof of Lemma 3. To prove Lemma 3, we consider the cases of $\hat\zeta^*(t) < \zeta^*$ and $\hat\zeta^*(t) \ge \zeta^*$. The former case $\hat\zeta^*(t) < \zeta^*$ can be bounded by Lemma 4.
For the latter case, we note that when $\hat\zeta^*(t) \ge \zeta^*$, the event $(a_t^{(1)}, a_t^{(2)}) = (i, i)$ occurs only if: a) there exists at least one $k \in \mathcal{A}$ with $p_{ki} > 1/2$ such that $l_{ki}(t) \le 1/2$; and b) $\theta_{ki}^{(2)}(t) \le 1/2$ for all $k$ with $l_{ki}(t) \le 1/2$. In this case, we can bound the probability of $(a_t^{(1)}, a_t^{(2)}) = (i, i)$ by that of $(a_t^{(1)}, a_t^{(2)}) = (i, k)$, for $k$ with $p_{ki} > 1/2$ but $l_{ki}(t) \le 1/2$, where the coefficient decays exponentially. Then we can bound $\mathbb{E}[N_{ii}(T)]$ by $O(1)$ similarly to [15]. Details of the proof can be found in Appendix B.4.

The conclusion of Proposition 1 then follows by combining Lemmas 1 to 3.

4.2.2 Regret Bound Refinement

In this section, we refine the regret bound for D-TS and reduce its scaling factor with respect to the number of arms $K$. We sort the arms for each $i \notin \mathcal{C}^*$ in descending order of $p_{ji}$, and let $(\sigma_i(1), \sigma_i(2), \ldots, \sigma_i(K))$ be a permutation of $(1, 2, \ldots, K)$ such that $p_{\sigma_i(1),i} \ge p_{\sigma_i(2),i} \ge \ldots \ge p_{\sigma_i(K),i}$. In addition, for a Copeland winner $i^* \in \mathcal{C}^*$, let $L_C = \sum_{j=1}^{K} \mathbb{1}(p_{ji^*} > 1/2)$ be the number of arms that beat arm $i^*$. To refine the regret, we introduce an additional no-tie assumption:
Assumption 2: For each arm $i \notin \mathcal{C}^*$, $p_{\sigma_i(L_C+1),i} > p_{\sigma_i(j),i}$ for all $j > L_C + 1$.
We present a refined regret bound for D-TS as follows:

Theorem 1. When applying D-TS with $\alpha > 0.5$ in a Copeland dueling bandit with a preference matrix $P = [p_{ij}]_{K\times K}$ satisfying Assumptions 1 and 2, its regret is bounded as:

$$R_{\mathrm{D\text{-}TS}}(T) \le \sum_{i \in \mathcal{C}^*}\Big[\sum_{j: p_{ji} > 1/2}\frac{4\alpha \log T}{\Delta_{ji}^2} + \sum_{j: p_{ji} < 1/2}(1+\epsilon)\frac{\log T}{D(p_{ji}\|1/2)}\Big] + \sum_{i \notin \mathcal{C}^*}\sum_{j=1}^{L_C+1}\frac{4\alpha \log T}{\Delta_{\sigma_i(j),i}^2} + \beta(1+\epsilon)^2 \sum_{i \notin \mathcal{C}^*}\sum_{j=L_C+2}^{K}\frac{\log\log T}{D(p_{\sigma_i(j),i}\|p_{\sigma_i(L_C+1),i})} + O(K^3) + O\Big(\frac{K^2}{\epsilon^2}\Big), \qquad (6)$$

where $\beta > 2$ and $\epsilon > 0$ are constants, and $D(\cdot\|\cdot)$ is the KL divergence.

In (6), the first term corresponds to the regret when the first candidate $a_t^{(1)}$ is a winner, and is $O(K|\mathcal{C}^*|\log T)$. The second term corresponds to the comparisons between a non-winner arm and its first $L_C + 1$ superiors, which is bounded by $O(K(L_C+1)\log T)$. The remaining terms correspond to the comparisons between a non-winner arm and the remaining arms, and are bounded by $O(K^2 \log\log T)$. As demonstrated in [6], $L_C$ is relatively small compared to $K$ and can be viewed as a constant. Thus, the total regret is bounded as $R_{\mathrm{D\text{-}TS}}(T) = O(K\log T + K^2\log\log T)$. In particular, this asymptotic trend can be easily seen for Condorcet dueling bandits, where $L_C = 0$.

Comparing Eq. (6) with Eq. (2), we can see that the difference lies in the third and fourth terms of (6), which refine the regret of comparing a suboptimal arm with its last $(K - L_C - 1)$ inferiors into $O(\log\log T)$. Thus, to prove Theorem 1, it suffices to show the following additional lemma:

Lemma 5. Under Assumptions 1 and 2, for any suboptimal arm $i \notin \mathcal{C}^*$ and $j > L_C + 1$, we have

$$\mathbb{E}[N_{i\sigma_i(j)}^{(1)}(T)] \le \frac{\beta(1+\epsilon)^2\log\log T}{D(p_{\sigma_i(j),i}\|p_{\sigma_i(L_C+1),i})} + O(K) + O\Big(\frac{1}{\epsilon^2}\Big), \qquad (7)$$

where $\beta > 2$ and $\epsilon > 0$ are constants.

Proof. We prove this lemma using a back substitution argument. The intuition is that when fixing the first candidate as $a_t^{(1)} = i$, the comparison between $a_t^{(1)}$ and the other arms is similar to a traditional MAB with expected utilities $p_{ji}$ ($1 \le j \le K$). Let $N_i^{(1)}(T) = \sum_{t=1}^{T} \mathbb{1}(a_t^{(1)} = i)$ be the number of time-slots when this type of MAB is played.
Using the fact that the distribution of the samples depends only on the historic comparison results (but not on $t$), we can show $\mathbb{E}[N_{i,\sigma_i(j)}^{(1)}(T) \,|\, N_i^{(1)}(T)] = O(\log N_i^{(1)}(T))$, which holds for any $N_i^{(1)}(T)$. We have shown that $\mathbb{E}[N_i^{(1)}(T)] = O(K\log T)$ for any $i \notin \mathcal{C}^*$ when proving Proposition 1. Then, substituting the bound of $\mathbb{E}[N_i^{(1)}(T)]$ back and using the concavity of the $\log(\cdot)$ function, we have $\mathbb{E}[N_{i,\sigma_i(j)}^{(1)}(T)] = \mathbb{E}\big[\mathbb{E}[N_{i,\sigma_i(j)}^{(1)}(T) \,|\, N_i^{(1)}(T)]\big] \le O(\log \mathbb{E}[N_i^{(1)}(T)]) = O(\log\log T + \log K)$. Details can be found in Appendix C.1.

4.3 Further Improvement: D-TS+

D-TS is a TS framework for dueling bandits, and its performance can be improved by refining certain components of it. In this section, we propose an enhanced version of D-TS, referred to as D-TS+, that carefully breaks the ties to reduce the regret.

Note that by randomly breaking the ties (Line 11 in Algorithm 1), D-TS tends to explore all potential winners. This may be desirable in certain applications such as restaurant recommendation, where users may not want to stick to a single winner. However, because of this, the regret of D-TS scales with the number of winners $|\mathcal{C}^*|$, as shown in Theorem 1. To further reduce the regret, we can break the ties according to estimated regret.

Specifically, with samples $\theta_{ij}^{(1)}(t)$, the normalized Copeland score for each arm $i$ can be estimated as $\tilde\zeta_i(t) = \frac{1}{K-1}\sum_{j \neq i}\mathbb{1}(\theta_{ij}^{(1)}(t) > 1/2)$. Then the maximum normalized Copeland score is $\tilde\zeta^*(t) = \max_i \tilde\zeta_i(t)$, and the loss of comparing arm $i$ and arm $j$ is $\tilde r_{ij}(t) = \tilde\zeta^*(t) - \frac{1}{2}\big[\tilde\zeta_i(t) + \tilde\zeta_j(t)\big]$. For $p_{ij} \neq 1/2$, we need about $\Theta\big(\frac{\log T}{D(p_{ij}\|1/2)}\big)$ time-slots to distinguish it from $1/2$ [5]. Thus, when choosing $i$ as the first candidate, the regret of comparing it with all other arms can be estimated by $\tilde R_i^{(1)}(t) = \sum_{j: \theta_{ij}^{(1)}(t) \neq 1/2} \tilde r_{ij}(t)/D(\theta_{ij}^{(1)}(t)\|1/2)$. We propose the following D-TS+ algorithm that breaks the ties to minimize $\tilde R_i^{(1)}(t)$.

D-TS+: Implement the same operations as D-TS, except that the selection of the first candidate (Line 11 in Algorithm 1) is replaced by the following two steps:

$$\mathcal{A}^{(1)} \leftarrow \Big\{i \in \mathcal{C} : \textstyle\sum_{j \neq i}\mathbb{1}(\theta_{ij}^{(1)} > 1/2) = \max_{i' \in \mathcal{C}} \sum_{j \neq i'}\mathbb{1}(\theta_{i'j}^{(1)} > 1/2)\Big\}; \qquad a^{(1)} \leftarrow \arg\min_{i \in \mathcal{A}^{(1)}} \tilde R_i^{(1)};$$

D-TS+ only changes the tie-breaking criterion in selecting the first candidate. Thus, the regret bound of D-TS directly applies to D-TS+:

Corollary 1. The regret of D-TS+, $R_{\mathrm{D\text{-}TS^+}}(T)$, satisfies inequality (6) under Assumptions 1 and 2.

Corollary 1 provides an upper bound for the regret of D-TS+. In practice, however, D-TS+ performs better than D-TS in scenarios with multiple winners, as we can see in Section 5 and Appendix D. Our conjecture is that with this regret-minimization criterion, the D-TS+ algorithm tends to focus on one of the winners (if there is no tie in terms of expected regret), and thus reduces the first term in (6) from $O(K|\mathcal{C}^*|\log T)$ to $O(K\log T)$. The proof of this conjecture requires properties of the evolution of the statistics for all arms and of the majority-voting results based on the Thompson samples, and is complex. This is left as part of our future work.

In the above D-TS+ algorithm, we only consider the regret of choosing $i$ as the first candidate. From Theorem 1, we know that comparing other arms with their superiors will also result in $\Theta(\log T)$ regret.
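The two tie-breaking steps above can be sketched as follows (a simplified illustration with our own names: `theta1` is the Phase-1 sample matrix with the convention $\theta_{ii}^{(1)} = 1/2$, and `candidates` is the RUCB-surviving set $\mathcal{C}$):

```python
import math
import numpy as np

def kl(p, q):
    """Bernoulli KL divergence D(p || q), for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def dts_plus_first_candidate(theta1, candidates):
    """D-TS+ tie-breaking: among candidates with the most sample-wise wins,
    pick the arm minimizing the estimated exploration regret R~_i."""
    K = theta1.shape[0]
    votes = (theta1 > 0.5).sum(axis=1)   # diagonal is 0.5, so it never counts
    zeta = votes / (K - 1)               # estimated normalized Copeland scores
    zeta_star = zeta.max()
    best = [i for i in candidates if votes[i] == max(votes[c] for c in candidates)]

    def est_regret(i):
        # Sum over j of (estimated pair loss) / (distinguishability of p_ij from 1/2).
        return sum((zeta_star - (zeta[i] + zeta[j]) / 2) / kl(theta1[i, j], 0.5)
                   for j in range(K) if j != i and theta1[i, j] != 0.5)

    return min(best, key=est_regret)
```

Since the Beta samples are almost surely different from $1/2$, the guard `theta1[i, j] != 0.5` only excludes the diagonal in practice, matching the restriction $\theta_{ij}^{(1)}(t) \neq 1/2$ in the definition of $\tilde R_i^{(1)}(t)$.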
Thus, although the current D-TS+ algorithm performs well in most practical scenarios, one may further improve its performance by taking these additional comparisons into account in $\tilde R_i^{(1)}(t)$.

5 Experiments

To evaluate the proposed D-TS and D-TS+ algorithms, we run experiments based on synthetic and real-world data. Here we present the results for experiments based on the Microsoft Learning to Rank (MSLR) dataset [24], which provides the relevance of queries and ranked documents. Based on this dataset, [6] derives a preference matrix for 136 rankers, where each ranker is a function that maps a user's query to a document ranking and can be viewed as one arm in dueling bandits. We use the two 5-armed submatrices in [6], one for a Condorcet dueling bandit and the other for a non-Condorcet dueling bandit. More experiments and discussions can be found in Appendix D².

We compare D-TS and D-TS+ with the following algorithms: BTM [16], SAVAGE [17], Sparring [18], RUCB [4], RCS [3], CCB [6], SCB [6], RMED1 [5], and ECW-RMED [7]. For BTM, we set the relaxation factor $\gamma = 1.3$ as in [16]. For algorithms using RUCB and RLCB, including D-TS and D-TS+, we set the scale factor $\alpha = 0.51$. For RMED1, we use the same settings as in [5], and for ECW-RMED, the same settings as in [7]. For the "explore-then-exploit" algorithms, BTM and SAVAGE, each point is obtained by resetting the time horizon to the corresponding value. The results are averaged over 500 independent experiments, where in each experiment, the arms are randomly shuffled to prevent algorithms from exploiting special structures of the preference matrix.

In Condorcet dueling bandits, our D-TS and D-TS+ algorithms achieve almost the same performance, and both perform much better than existing algorithms, as shown in Fig. 1(a).
In particular, compared with RCS, we can see that the full utilization of TS in D-TS and D-TS+ significantly reduces the regret.

Source code is available at https://github.com/HuasenWu/DuelingBandits.

Figure 1: Regret in the MSLR dataset: (a) K = 5, Condorcet; (b) K = 5, non-Condorcet. In (b), there are 3 Copeland winners with normalized Copeland score ζ* = 3/4.

Figure 2: Standard deviation (STD) of regret for T = 10^6 (normalized by R_ECW-RMED(T)).

Compared with RMED1 and ECW-RMED, our D-TS and D-TS+ algorithms also perform better. [5] has shown that RMED1 is optimal in Condorcet dueling bandits, not only in the sense of the asymptotic order, but also in the coefficients of the regret bound. The simulation results show that D-TS and D-TS+ not only achieve a slope similar to RMED1/ECW-RMED, but also converge faster to the asymptotic regime and thus achieve much lower regret. This inspires us to further refine the regret bounds for D-TS and D-TS+ in the future.

In non-Condorcet dueling bandits, as shown in Fig. 1(b), D-TS and D-TS+ significantly reduce the regret compared to the UCB-type algorithm CCB (e.g., the regret of D-TS+ is less than 10% of that of CCB). Compared with ECW-RMED, D-TS achieves higher regret, mainly because it randomly explores all Copeland winners due to its random tie-breaking rule. With the regret-minimization tie-breaking rule, D-TS+ further reduces the regret and outperforms ECW-RMED on this dataset.

Moreover, as randomized algorithms, D-TS and D-TS+ are more robust to the preference probabilities. As shown in Fig. 2, D-TS and D-TS+ have much smaller regret STD than ECW-RMED in the non-Condorcet dataset, where certain preference probabilities (for different arms) are close to 1/2. In particular, the STD of regret for ECW-RMED is almost 200% of its mean value, while it is only 13.16% for D-TS+.
In addition, as shown in Appendix D.2.3, D-TS and D-TS+ are also robust to delayed feedback, which is typically batched and provided periodically in practice.

Overall, D-TS and D-TS+ significantly outperform all existing algorithms, with the exception of ECW-RMED. Compared to ECW-RMED, D-TS+ achieves much lower regret in the Condorcet case, lower or comparable regret in the non-Condorcet case, and much better robustness in terms of regret STD and delayed feedback. Thus, the simplicity, good performance, and robustness of D-TS and D-TS+ make them attractive algorithms in practice.

6 Conclusions and Future Work

In this paper, we study TS algorithms for dueling bandits. We propose a D-TS algorithm and its enhanced version D-TS+ for general Copeland dueling bandits, including Condorcet dueling bandits as a special case. Our study reveals desirable properties of D-TS and D-TS+ from both theoretical and practical perspectives. Theoretically, we show that the regret of D-TS and D-TS+ is bounded by O(K^2 log T) in general Copeland dueling bandits, and can be refined to O(K log T + K^2 log log T) in Condorcet dueling bandits and most practical Copeland dueling bandits. Practically, experimental results demonstrate that these simple algorithms achieve significantly better overall performance than existing algorithms: D-TS and D-TS+ typically achieve much lower regret in practice and are robust to many practical factors, such as the structure of the preference matrix and feedback delay.

Although logarithmic regret bounds have been obtained for D-TS and D-TS+, our analysis relies heavily on the properties of RUCB/RLCB, and the regret bounds are likely loose. In fact, we see from experiments that RUCB-based elimination seldom occurs under most practical settings. We will further refine the regret bounds by investigating the properties of TS-based majority voting. Moreover, results from recent work such as [7] may be leveraged to improve TS algorithms.
Last, it is also an interesting future direction to study D-TS-type algorithms for dueling bandits with other definitions of winners.

Acknowledgements: This research was supported in part by NSF Grants CCF-1423542, CNS-1457060, and CNS-1547461. The authors would like to thank Prof. R. Srikant (UIUC), Prof. Shipra Agrawal (Columbia University), Masrour Zoghi (University of Amsterdam), and Dr. Junpei Komiyama (University of Tokyo) for their helpful discussions and suggestions.

References

[1] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
[2] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In International Conference on Machine Learning (ICML), pages 1201–1208, 2009.
[3] M. Zoghi, S. A. Whiteson, M. De Rijke, and R. Munos. Relative confidence sampling for efficient on-line ranker evaluation. In ACM International Conference on Web Search and Data Mining, pages 73–82, 2014.
[4] M. Zoghi, S. Whiteson, R. Munos, and M. D. Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In International Conference on Machine Learning (ICML), pages 10–18, 2014.
[5] J. Komiyama, J. Honda, H. Kashima, and H. Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of Conference on Learning Theory, 2015.
[6] M. Zoghi, Z. S. Karnin, S. Whiteson, and M. de Rijke. Copeland dueling bandits.
In Advances in Neural Information Processing Systems, pages 307–315, 2015.
[7] J. Komiyama, J. Honda, and H. Nakagawa. Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm. In International Conference on Machine Learning (ICML), 2016.
[8] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.
[9] O. Chapelle and L. Li. An empirical evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[10] S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. In Conference on Learning Theory (COLT), 2012.
[11] J. Komiyama, J. Honda, and H. Nakagawa. Optimal regret analysis of Thompson Sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning (ICML), 2015.
[12] Y. Xia, H. Li, T. Qin, N. Yu, and T.-Y. Liu. Thompson sampling for budgeted multi-armed bandits. In International Joint Conference on Artificial Intelligence, 2015.
[13] A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In International Conference on Machine Learning (ICML), pages 100–108, 2014.
[14] A. Gopalan and S. Mannor. Thompson sampling for learning parameterized Markov decision processes. In Proceedings of Conference on Learning Theory, pages 861–898, 2015.
[15] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson Sampling. In International Conference on Artificial Intelligence and Statistics, pages 99–107, 2013.
[16] Y. Yue and T. Joachims. Beat the mean bandit. In International Conference on Machine Learning (ICML), pages 241–248, 2011.
[17] T. Urvoy, F. Clerot, R. Féraud, and S. Naamane.
Generic exploration and k-armed voting bandits. In International Conference on Machine Learning (ICML), pages 91–99, 2013.
[18] N. Ailon, Z. Karnin, and T. Joachims. Reducing dueling bandits to cardinal bandits. In Proceedings of The 31st International Conference on Machine Learning, pages 856–864, 2014.
[19] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson Sampling. arXiv preprint arXiv:1403.5341, 2014.
[20] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[21] N. Welsh. Thompson sampling for the dueling bandits problem. In Large-Scale Online Learning and Decision Making (LSOLDM) Workshop, 2012. Available at http://videolectures.net/lsoldm2012_welsh_bandits_problem/.
[22] K. Jamieson, S. Katariya, A. Deshpande, and R. Nowak. Sparse dueling bandits. In Conference on Learning Theory (COLT), 2015.
[23] S. Bubeck. Bandits games and clustering foundations. PhD thesis, Université des Sciences et Technologie de Lille - Lille I, 2010.
[24] Microsoft Research. Microsoft Learning to Rank Datasets. http://research.microsoft.com/en-us/projects/mslr/, 2010.