{"title": "Online Rank Elicitation for Plackett-Luce: A Dueling Bandits Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 604, "page_last": 612, "abstract": "We study the problem of online rank elicitation, assuming that rankings of a set of alternatives obey the Plackett-Luce distribution. Following the setting of the dueling bandits problem, the learner is allowed to query pairwise comparisons between alternatives, i.e., to sample pairwise marginals of the distribution in an online fashion. Using this information, the learner seeks to reliably predict the most probable ranking (or top-alternative). Our approach is based on constructing a surrogate probability distribution over rankings based on a sorting procedure, for which the pairwise marginals provably coincide with the marginals of the Plackett-Luce distribution. In addition to a formal performance and complexity analysis, we present first experimental studies.", "full_text": "Online Rank Elicitation for Plackett-Luce:\n\nA Dueling Bandits Approach\n\nBal\u00b4azs Sz\u00a8or\u00b4enyi\n\nTechnion, Haifa, Israel /\n\nMTA-SZTE Research Group on\nArti\ufb01cial Intelligence, Hungary\n\nszorenyibalazs@gmail.com\n\nR\u00b4obert Busa-Fekete, Adil Paul, Eyke H\u00a8ullermeier\n\nDepartment of Computer Science\n\nUniversity of Paderborn\n\nPaderborn, Germany\n\n{busarobi,adil.paul,eyke}@upb.de\nAbstract\n\nWe study the problem of online rank elicitation, assuming that rankings of a set\nof alternatives obey the Plackett-Luce distribution. Following the setting of the\ndueling bandits problem, the learner is allowed to query pairwise comparisons\nbetween alternatives, i.e., to sample pairwise marginals of the distribution in an\nonline fashion. Using this information, the learner seeks to reliably predict the\nmost probable ranking (or top-alternative). 
Our approach is based on constructing a surrogate probability distribution over rankings based on a sorting procedure, for which the pairwise marginals provably coincide with the marginals of the Plackett-Luce distribution. In addition to a formal performance and complexity analysis, we present first experimental studies.

1 Introduction

Several variants of learning-to-rank problems have recently been studied in an online setting, with preferences over alternatives given in the form of stochastic pairwise comparisons [6]. Typically, the learner is allowed to select (presumably most informative) alternatives in an active way; by analogy to multi-armed bandits, where single alternatives are chosen instead of pairs, this is also referred to as the dueling bandits problem [28].

Methods for online ranking can mainly be distinguished with regard to the assumptions they make about the probabilities p_{i,j} that, in a direct comparison between two alternatives i and j, the former is preferred over the latter. If these probabilities are not constrained at all, a complexity that grows quadratically in the number M of alternatives is essentially unavoidable [27, 8, 9]. Yet, by exploiting (stochastic) transitivity properties, which are quite natural in a ranking context, it is possible to devise algorithms with better performance guarantees, typically of the order M log M [29, 28, 7].

The idea of exploiting transitivity in preference-based online learning establishes a natural connection to sorting algorithms. Naively, for example, one could simply apply an efficient sorting algorithm such as MergeSort as an active sampling scheme, thereby producing a random order of the alternatives. What can we say about the optimality of such an order?
The problem is that the probability distribution (on rankings) induced by the sorting algorithm may not be well attuned to the original preference relation (i.e., the probabilities p_{i,j}).

In this paper, we therefore combine a sorting algorithm, namely QuickSort [15], and a stochastic preference model that harmonize well with each other, in a technical sense to be detailed later on. This harmony was first presented in [1], and our main contribution is to show how it can be exploited for online rank elicitation. More specifically, we assume that pairwise comparisons obey the marginals of a Plackett-Luce model [24, 19], a widely used parametric distribution over rankings (cf. Section 5). Despite the quadratic worst-case complexity of QuickSort, we succeed in developing a budgeted version of it (presented in Section 6) with a complexity of O(M log M). While only returning partial orderings, this version allows us to devise PAC-style algorithms that find, respectively, a close-to-optimal item (Section 7) and a close-to-optimal ranking of all items (Section 8), both with high probability.

2 Related Work

Several studies have recently focused on preference-based versions of the multi-armed bandit setup, also known as dueling bandits [28, 6, 30], where the online learner is only able to compare arms in a pairwise manner. The outcome of the pairwise comparisons essentially informs the learner about pairwise preferences, i.e., whether or not an option is preferred to another one. A first group of papers, including [28, 29], assumes the probability distributions of pairwise comparisons to possess certain regularity properties, such as strong stochastic transitivity.
A second group does not make assumptions of that kind; instead, a target ("ground-truth") ranking is derived from the pairwise preferences, for example using the Copeland, Borda count, or Random Walk procedures [9, 8, 27]. Our work is obviously closer to the first group of methods. In particular, the study presented in this paper is related to [7], which investigates a similar setup for the Mallows model.

There are several approaches to estimating the parameters of the Plackett-Luce (PL) model, including standard statistical methods such as likelihood estimation [17] and Bayesian parameter estimation [14]. Pairwise marginals are also used in [26], in connection with the method-of-moments approach; nevertheless, the authors assume that full rankings are observed from a PL model.

Algorithms for noisy sorting [2, 3, 12] assume a total order over the items, and that the comparisons are representative of that order (if i precedes j, then the probability of option i being preferred to j is larger than some constant strictly greater than 1/2). In [25], the data is assumed to consist of pairwise comparisons generated by a Bradley-Terry model; however, the comparisons are not chosen actively but according to some fixed probability distribution.

Pure exploration algorithms for the stochastic multi-armed bandit problem sample the arms a certain number of times (not necessarily known in advance), and then output a recommendation, such as the best arm or the m best arms [4, 11, 5, 13]. While our algorithms can be viewed as pure exploration strategies, too, we do not assume that numerical feedback can be generated for individual options; instead, our feedback is qualitative and refers to pairs of options.

3 Notation

A set of alternatives/options/items to be ranked is denoted by I. To keep the presentation simple, we assume that items are identified by natural numbers, so I = [M] = {1, ..., M}.
A ranking is a bijection r on I, which can also be represented as a vector r = (r_1, ..., r_M) = (r(1), ..., r(M)), where r_j = r(j) is the rank of the j-th item. The set of rankings can be identified with the symmetric group S_M of degree M. Each ranking r naturally defines an associated ordering o = (o_1, ..., o_M) ∈ S_M of the items, namely the inverse o = r^{-1} defined by o_{r(j)} = j for all j ∈ [M].

For a permutation r, we write r(i, j) for the permutation in which r_i and r_j, the ranks of items i and j, are exchanged. We denote by L(r_i = j) = {r ∈ S_M | r_i = j} the subset of permutations for which the rank of item i is j, and by L(r_j > r_i) = {r ∈ S_M | r_j > r_i} those for which the rank of j is higher than the rank of i, that is, item i is preferred to j, written i ≻ j. We write i ≻_r j to indicate that i is preferred to j with respect to ranking r.

We assume S_M to be equipped with a probability distribution P : S_M → [0, 1]; thus, for each ranking r, we denote by P(r) the probability to observe this ranking. Moreover, for each pair of items i and j, we denote by

p_{i,j} = P(i ≻ j) = Σ_{r ∈ L(r_j > r_i)} P(r)    (1)

the probability that i is preferred to j (in a ranking randomly drawn according to P). These pairwise probabilities are called the pairwise marginals of the ranking distribution P. We denote the matrix composed of the values p_{i,j} by P = [p_{i,j}]_{1 ≤ i,j ≤ M}.

4 Preference-based Approximations

Our learning problem essentially consists of making good predictions about properties of P. Concretely, we consider two different goals of the learner, depending on whether the application calls for the prediction of a single item or a full ranking of items. In the first problem, which we call PAC-Item or simply PACI, the goal is to find an item that is almost as good as the optimal one, with optimality referring to the Condorcet winner.
An item i* is a Condorcet winner if p_{i*,i} > 1/2 for all i ≠ i*. Then, we call an item j a PAC-item if it is beaten by the Condorcet winner with at most an ε-margin: |p_{i*,j} − 1/2| < ε. This setting coincides with those considered in [29, 28]. Obviously, it requires the existence of a Condorcet winner, which is indeed guaranteed in our approach, thanks to the assumption of a Plackett-Luce model.

The second problem, called AMPR, is defined as finding the most probable ranking [7], that is, r* = argmax_{r ∈ S_M} P(r). This problem is especially challenging for ranking distributions in which the order of two items is hard to elicit (because many entries of P are close to 1/2). Therefore, we again relax the goal of the learner and only require it to find a ranking r with the following property: there is no pair of items 1 ≤ i, j ≤ M such that r*_i < r*_j, r_i > r_j, and p_{i,j} > 1/2 + ε. Put in words, the ranking r is allowed to differ from r* only on those items whose pairwise probabilities are close to 1/2. Any ranking r satisfying this property is called an approximately most probable ranking (AMPR).

Both goals are meant to be achieved with probability at least 1 − δ, for some δ > 0. Our learner operates in an online setting. In each iteration, it is allowed to gather information by asking for a single pairwise comparison between two items, or, using the dueling bandits jargon, to pull two arms. Thus, it selects two items i and j, and then observes either the preference i ≻ j or j ≻ i; the former occurs with probability p_{i,j} as defined in (1), the latter with probability p_{j,i} = 1 − p_{i,j}. Based on this observation, the learner updates its estimates and decides either to continue the learning process or to terminate and return its prediction.
What we are mainly interested in is the sample complexity of the learner, that is, the number of pairwise comparisons it queries prior to termination.

Before tackling the problems introduced above, we need some additional notation. The pair of items chosen by the learner in the t-th comparison is denoted (i_t, j_t), where i_t < j_t, and the feedback received is defined as o_t = 1 if i_t ≻ j_t and o_t = 0 if j_t ≻ i_t. The set of steps among the first t iterations in which the learner decides to compare items i and j is denoted by I^t_{i,j} = {ℓ ∈ [t] | (i_ℓ, j_ℓ) = (i, j)}, and the size of this set by n^t_{i,j} = #I^t_{i,j}.(1) The proportion of "wins" of item i against item j up to iteration t is then given by p̂^t_{i,j} = (1/n^t_{i,j}) Σ_{ℓ ∈ I^t_{i,j}} o_ℓ. Since our samples are independent and identically distributed (i.i.d.), the relative frequency p̂^t_{i,j} is a reasonable estimate of the pairwise probability (1).

5 The Plackett-Luce Model

The Plackett-Luce (PL) model is a widely used probability distribution on rankings [24, 19]. It is parameterized by a "skill" vector v = (v_1, ..., v_M) ∈ R^M_+ and mimics the successive construction of a ranking by selecting items position by position, each time choosing one of the remaining items i with a probability proportional to its skill v_i. Thus, with o = r^{-1}, the probability of a ranking r is

P(r | v) = Π_{i=1}^{M} v_{o_i} / (v_{o_i} + v_{o_{i+1}} + ... + v_{o_M}).    (2)

As an appealing property of the PL model, we note that the marginal probabilities (1) are very easy to calculate [21], as they are simply given by

p_{i,j} = v_i / (v_i + v_j).    (3)

Likewise, the most probable ranking r* can be obtained quite easily, simply by sorting the items according to their skill parameters, that is, r*_i < r*_j iff v_i > v_j.
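For concreteness, the stagewise construction (2) and the marginal formula (3) can be sketched as follows; the function names are ours, introduced only for illustration:

```python
import random

def sample_pl_ranking(v, rng=random):
    """Draw an ordering (best item first) from a PL model with skill
    vector v, following the stagewise construction in (2): at each
    position, pick a remaining item with probability proportional to v_i."""
    items = list(range(len(v)))
    ordering = []
    while items:
        total = sum(v[i] for i in items)
        u = rng.random() * total
        acc = 0.0
        for idx, i in enumerate(items):
            acc += v[i]
            if u <= acc:
                ordering.append(items.pop(idx))
                break
    return ordering

def pl_marginal(v, i, j):
    """Pairwise marginal p_{i,j} = v_i / (v_i + v_j), cf. (3)."""
    return v[i] / (v[i] + v[j])
```

Sorting the items by decreasing v_i recovers the mode r*; the empirical frequency with which i precedes j in sampled rankings should approach the marginal (3).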
Moreover, the PL model satisfies strong stochastic transitivity, i.e., p_{i,k} ≥ max(p_{i,j}, p_{j,k}) whenever p_{i,j} ≥ 1/2 and p_{j,k} ≥ 1/2 [18].

6 Ranking Distributions based on Sorting

In the classical sorting literature, the outcome of pairwise comparisons is deterministic and determined by an underlying total order of the items, namely the order the sorting algorithm seeks to find. Now, if the pairwise comparisons are stochastic, the sorting algorithm can still be run; however, the result it returns is a random ranking. Interestingly, this is another way to define a probability distribution over the rankings: P(r) = P(r | P) is the probability that r is returned by the algorithm if the stochastic comparisons are specified by P. Obviously, this view is closely connected to the problem of noisy sorting (see the related work section).

(1) We omit the index t if there is no danger of confusion.

Algorithm 1 BQS(A, B)
Require: A, the set to be sorted, and a budget B
Ensure: (r, B''), where B'' is the remaining budget, and r is the (partial) order that was constructed based on B − B'' samples

In a recent work by Ailon [1], the well-known QuickSort algorithm is investigated in a stochastic setting, where the pairwise comparisons are drawn from the pairwise marginals of the Plackett-Luce model. Several interesting properties are shown about the ranking distribution based on QuickSort, notably the property of pairwise stability. We denote the QuickSort-based ranking distribution by P_QS(· | P), where the matrix P contains the marginals (3) of the Plackett-Luce model. Then, it can be shown that P_QS(· | P) obeys the property of pairwise stability, which means that it preserves the marginals, although the distributions themselves might not be identical, i.e., P_QS(· | P) ≠ P(· | v).

Theorem 1 (Theorem 4.1 in [1]). Let P be given by the pairwise marginals (3), i.e., p_{i,j} = v_i/(v_i + v_j).
Then, p_{i,j} = P_QS(i ≻ j | P) = Σ_{r ∈ L(r_j > r_i)} P_QS(r | P).

One drawback of the QuickSort algorithm is its complexity: to generate a random ranking, it makes O(M^2) comparisons in the worst case. Next, we shall introduce a budgeted version of the QuickSort algorithm, which terminates if the algorithm compares too many pairs, namely, more than O(M log M). Upon termination, the modified QuickSort algorithm only returns a partial order. Nevertheless, we will show that it still preserves the pairwise stability property.

6.1 The Budgeted QuickSort-based Algorithm

Algorithm 1 shows a budgeted version of the QuickSort-based random ranking generation process described in the previous section. It works in a way quite similar to the standard QuickSort-based algorithm, with the notable difference of terminating as soon as the number of pairwise comparisons exceeds the budget B, which is a parameter assumed as an input. Obviously, the BQS algorithm run with A = [M] and B = ∞ (or B > M^2) recovers the original QuickSort-based sampling algorithm as a special case.

A run of BQS(A, ∞) can be represented quite naturally as a random tree τ: the root is labeled [M], and whenever a call to BQS(A, B) initiates a recursive call BQS(A', B'), a child node with label A' is added to the node with label A. Note that each such tree determines a ranking, denoted r_τ, in a natural way.

The random ranking generated by BQS(A, ∞) for some subset A ⊆ [M] was analyzed by Ailon [1], who showed that it gives back the same marginals as the original Plackett-Luce model (as recalled in Theorem 1). Now, for B > 0, denote by τ_B the tree the algorithm would have returned for the budget B instead of ∞.
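A minimal sketch of the BQS sampling scheme, under the PL marginals (3); representing the partial order as an ordered list of "buckets" of mutually incomparable items, and recursing on the winners' side first, are simplifications of ours, not part of Algorithm 1:

```python
import random

def bqs(A, B, v, rng=random):
    """Budgeted QuickSort-based sampling, sketched: compare every item of A
    against a random pivot using the PL marginal p_{j,i} = v_j/(v_j+v_i),
    recurse with the leftover budget.  Returns (buckets, budget_left),
    where `buckets` lists groups of items from best to worst; items inside
    one bucket were left incomparable when the budget ran out."""
    if len(A) <= 1:
        return ([list(A)] if A else []), max(B, 0)
    if B <= 0:
        return [list(A)], 0          # budget exhausted: whole set stays unordered
    pivot = rng.choice(list(A))
    better, worse = [], []
    for j in A:
        if j == pivot:
            continue
        # one stochastic comparison of j against the pivot
        if rng.random() < v[j] / (v[j] + v[pivot]):
            better.append(j)
        else:
            worse.append(j)
    b_left = B - (len(A) - 1)        # |A| - 1 comparisons spent at this node
    top, b_left = bqs(better, b_left, v, rng)
    bottom, b_left = bqs(worse, b_left, v, rng)
    return top + [[pivot]] + bottom, b_left
```

With an unlimited budget every bucket ends up a singleton, recovering the plain QuickSort-based sampler; with budget 0 the whole input is returned as a single incomparable bucket.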
(2) Additionally, let T_B denote the set of all possible outcomes of τ_B, and, for two distinct indices i and j, let T_B^{i,j} denote the set of all trees T ∈ T_B in which i and j are incomparable in the associated ranking (i.e., some leaf of T is labelled by a superset of {i, j}).

The main result of this section is that BQS does not introduce any bias in the marginals (3), i.e., Theorem 1 also holds for the budgeted version of BQS.

Proposition 2. For any B > 0, any set A ⊆ I and any indices i, j ∈ A, the partial order r = r_{τ_B} generated by BQS(A, B) satisfies P(i ≻_r j | τ_B ∈ T_B \ T_B^{i,j}) = v_i/(v_i + v_j).

That is, whenever two items i and j are comparable by the partial ranking r generated by BQS, i ≻_r j holds with probability exactly v_i/(v_i + v_j). The basic idea of the proof (deferred to the appendix) is to show that, conditioned on the event that i and j are incomparable by r, i ≻_r j would have been obtained with probability v_i/(v_i + v_j) in case the execution of BQS had been continued (see Claim 6). The result then follows by combining this with Theorem 1.

(2) Put differently, τ is obtained from τ_B by continuing the execution of BQS ignoring the stopping criterion B ≤ 0.

Algorithm 1 BQS(A, B), steps:
1: initialize r to be the empty partial order over A
2: if B ≤ 0 or |A| ≤ 1 then return (r, 0)
3: pick an element i ∈ A uniformly at random
4: for all j ∈ A \ {i} do
5:   draw a random sample o_{i,j} according to the PL marginal (3)
6:   update r accordingly
7: A_0 = {j ∈ A | j ≠ i & o_{i,j} = 0}
8: A_1 = {j ∈ A | j ≠ i & o_{i,j} = 1}
9: (r', B') = BQS(A_0, B − |A| + 1)
10: (r'', B'') = BQS(A_1, B')
11: update r based on r' and r''
12: return (r, B'')

7 The PAC-Item Problem and its Analysis

Our algorithm for finding the PAC item is based on the sorting-based sampling technique described in the previous section. The pseudocode of the algorithm, called PLPAC, is shown in Algorithm 2. In each iteration, we generate a ranking, which is partial (line 6), and translate this ranking into pairwise comparisons that are used to update the estimates of the pairwise marginals. Based on these estimates, we apply a simple elimination strategy, which consists of eliminating an item i if it is significantly beaten by another item j, that is, p̂_{i,j} + c_{i,j} < 1/2 (lines 9-11). Finally, the algorithm terminates when it finds a PAC-item for which, by definition, |p_{i*,i} − 1/2| < ε. To identify an item i as a PAC-item, it is enough to guarantee that i is not beaten by any j ∈ A with a margin bigger than ε, that is, p_{i,j} > 1/2 − ε for all j ∈ A. This sufficient condition is implemented in line 12. Since we only have empirical estimates of the p_{i,j} values, the test of the condition does of course also take the confidence intervals into account.

Algorithm 2 PLPAC(δ, ε)
1: for i, j = 1 → M do
2:   p̂_{i,j} = 0                                        ▷ P̂ = [p̂_{i,j}]_{M×M}
3:   n_{i,j} = 0                                         ▷ N = [n_{i,j}]_{M×M}
4: set A = {1, ..., M}                                   ▷ initialization
5: repeat
6:   r = BQS(A, a − 1) where a = #A                      ▷ sorting-based random ranking
7:   update the entries of P̂ and N corresponding to A based on r
8:   set c_{i,j} = sqrt( log(4 M^2 n^2_{i,j} / δ) / (2 n_{i,j}) ) for all i ≠ j
9:   for (i, j ∈ A) ∧ (i ≠ j) do
10:    if p̂_{i,j} + c_{i,j} < 1/2 then
11:      A = A \ {i}                                     ▷ discard
12:  C = {i ∈ A | (∀ j ∈ A \ {i}) p̂_{i,j} − c_{i,j} > 1/2 − ε}
13: until #C ≥ 1
14: return C

Note that v_i = v_j, i ≠ j, implies p_{i,j} = 1/2. In this case, it is not possible to decide whether p_{i,j} is above 1/2 or not on the basis of a finite number of pairwise comparisons.
The ε-relaxation of the goal to be achieved provides a convenient way to circumvent this problem.

7.1 Sample Complexity Analysis of PLPAC

First, let r_t denote the (partial) ordering produced by BQS in the t-th iteration. Note that each of these (partial) orderings defines a bucket order: the indices are partitioned into different classes (buckets) in such a way that no pair of items is comparable within one class, but pairs from different classes are; thus, if i and i' belong to some class and j and j' belong to some other class, then either i ≻_{r_t} j and i' ≻_{r_t} j', or j ≻_{r_t} i and j' ≻_{r_t} i'. More specifically, the BQS algorithm with budget a − 1 (line 6) always results in a bucket order containing only two buckets, since no recursive call is carried out with this budget. One can then show that the optimal arm i* and an arbitrary arm i (≠ i*) fall into different buckets "often enough". This observation allows us to upper-bound the number of pairwise comparisons taken by PLPAC with high probability. The proof of the next theorem is deferred to Appendix B.

Theorem 3. Set ∆_i = (1/2) max{ε, p_{i*,i} − 1/2} = (1/2) max{ε, (v_{i*} − v_i)/(2(v_{i*} + v_i))} for each index i ≠ i*. With probability at least 1 − δ, after O( max_{i ≠ i*} (1/∆_i^2) log (M/(δ ∆_i)) ) calls for BQS with budget M − 1, PLPAC terminates and outputs an ε-optimal arm. Therefore, the total number of samples is O( M max_{i ≠ i*} (1/∆_i^2) log (M/(δ ∆_i)) ).

In Theorem 3, the dependence on M is of order M log M. It is easy to show that Ω(M log M) is a lower bound; therefore, our result is optimal from this point of view.

Our model assumptions based on the PL model imply some regularity properties for the pairwise marginals, such as strong stochastic transitivity and the stochastic triangle inequality (see Appendix A of [28] for the proof). Therefore, the INTERLEAVED FILTER [28] and BEAT THE MEAN [29] algorithms can be directly applied in our online framework.
Both algorithms achieve a similar sample complexity of order M log M. Yet, our experimental study in Section 9.1 clearly shows that, provided our model assumptions on the pairwise marginals are valid, PLPAC outperforms both algorithms in terms of empirical sample complexity.

8 The AMPR Problem and its Analysis

For strictly more than two elements, the sorting-based surrogate distribution and the PL distribution are in general not identical, although their mode rankings coincide [1]. The mode r* of a PL model is the ranking that sorts the items in decreasing order of their skill values: r_i < r_j iff v_i > v_j for any i ≠ j. Moreover, since v_i > v_j implies p_{i,j} > 1/2, sorting based on the Copeland score b_i = #{1 ≤ j ≤ M | (i ≠ j) ∧ (p_{i,j} > 1/2)} yields a most probable ranking r*.

Our algorithm is based on estimating the Copeland scores of the items. Its pseudocode is shown in Algorithm 3 in Appendix C. As a first step, it generates rankings based on sorting, which are used to update the pairwise probability estimates P̂. Then, it computes a lower bound b̲_i and an upper bound b̄_i for each of the scores b_i. The lower bound is given as b̲_i = #{j ∈ [M] \ {i} | p̂_{i,j} − c > 1/2}, which is the number of items that are beaten by item i based on the current empirical estimates of the pairwise marginals. Similarly, the upper bound is given as b̄_i = b̲_i + s_i, where s_i = #{j ∈ [M] \ {i} | 1/2 ∈ [p̂_{i,j} − c, p̂_{i,j} + c]}. Obviously, s_i is the number of pairs for which, based on the current empirical estimates, it cannot be decided whether p_{i,j} is above or below 1/2.

As an important observation, note that there is no need to generate a full ranking based on sorting in every case, because if [b̲_i, b̄_i] ∩ [b̲_j, b̄_j] = ∅, then we already know the order of items i and j with respect to r*.
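The Copeland score bounds described above can be sketched as follows; the uniform confidence radius c and the list-based return format are illustrative simplifications of ours:

```python
def copeland_bounds(p_hat, c):
    """For each item i, compute a pessimistic and an optimistic Copeland
    score from the empirical marginals p_hat (an M x M matrix) and a
    common confidence radius c.  The lower bound counts items that i
    surely beats (p_hat[i][j] - c > 1/2); the slack s_i counts pairs whose
    interval [p_hat - c, p_hat + c] still straddles 1/2."""
    M = len(p_hat)
    lower, upper = [], []
    for i in range(M):
        b_i = sum(1 for j in range(M)
                  if j != i and p_hat[i][j] - c > 0.5)
        s_i = sum(1 for j in range(M)
                  if j != i and p_hat[i][j] - c <= 0.5 <= p_hat[i][j] + c)
        lower.append(b_i)
        upper.append(b_i + s_i)
    return lower, upper
```

Two items whose score intervals [lower[i], upper[i]] and [lower[j], upper[j]] are disjoint are already ordered relative to r* and need not be compared further; shrinking c tightens the intervals.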
Motivated by this observation, consider the interval graph G = ([M], E) based on the intervals [b̲_i, b̄_i], where E = {(i, j) ∈ [M]^2 | [b̲_i, b̄_i] ∩ [b̲_j, b̄_j] ≠ ∅}. Denote the connected components of this graph by C_1, ..., C_k ⊆ [M]. Obviously, if two items belong to different components, then they do not need to be compared anymore. Therefore, it is enough to call the sorting-based sampling with the connected components.

Finally, the algorithm terminates if the goal is achieved (line 20). More specifically, it terminates if there is no pair of items i and j for which the ordering with respect to r* has not yet been elicited, i.e., [b̲_i, b̄_i] ∩ [b̲_j, b̄_j] ≠ ∅, and whose pairwise probability is close to 1/2, i.e., |p_{i,j} − 1/2| < ε.

8.1 Sample Complexity Analysis of PLPAC-AMPR

Denote by q_M the expected number of comparisons of the (standard) QuickSort algorithm on M elements, namely q_M = 2M log M + O(log M) (see, e.g., [22]). Thanks to the concentration property of the performance of the QuickSort algorithm, there is no pair of items that falls into the same bucket "too often" in the bucket order output by BQS. This observation allows us to upper-bound the number of pairwise comparisons taken by PLPAC-AMPR with high probability. The proof of the next theorem is deferred to Appendix D.

Theorem 4. Set ∆'_{(i)} = (1/2) max{ε, (v_{(i+1)} − v_{(i)})/(2(v_{(i+1)} + v_{(i)}))} for each 1 ≤ i ≤ M − 1, where v_{(i)} denotes the i-th largest skill parameter. With probability at least 1 − δ, after O( max_{1 ≤ i ≤ M−1} (1/(∆'_{(i)})^2) log (M/(δ ∆'_{(i)})) ) calls for BQS with budget (3/2) q_M, the algorithm PLPAC-AMPR terminates and outputs an approximately most probable ranking. Therefore, the total number of samples is O( (M log M) max_{1 ≤ i ≤ M−1} (1/(∆'_{(i)})^2) log (M/(δ ∆'_{(i)})) ).

Remark 5. The RankCentrality algorithm proposed in [23] converts the empirical pairwise marginals P̂ into a row-stochastic matrix Q̂. Then, considering Q̂ as the transition matrix of a Markov chain, it ranks the items based on its stationary distribution. In [25], the authors show that if the pairwise marginals obey a PL distribution, this algorithm produces the mode of this distribution if the sample size is sufficiently large. In their setup, the learning algorithm has no influence on the selection of the pairs to be compared; instead, comparisons are sampled using a fixed underlying distribution over the pairs. For any sampling distribution, their PAC bound is of order at least M^3, whereas our sample complexity bound in Theorem 4 is of order M log^2 M.

9 Experiments

Our approach strongly exploits the assumption of a data generating process that can be modeled by means of a PL distribution. The experimental studies presented in this section are mainly aimed at showing that it is doing so successfully, namely, that it has advantages compared to other approaches in situations where this model assumption is indeed valid. To this end, we work with synthetic data. Nevertheless, in order to get an idea of the robustness of our algorithm toward violation of the model assumptions, some first experiments on real data are presented in Appendix I.(3)

9.1 The PAC-Item Problem

We compared our PLPAC algorithm with other preference-based algorithms applicable in our setting, namely INTERLEAVED FILTER (IF) [28], BEAT THE MEAN (BTM) [29] and MALLOWSMPI [7]. While each of these algorithms follows a successive elimination strategy and discards items one by one, they differ with regard to the sampling strategy they follow. Since the time horizon must be given in advance for IF, we ran it with T ∈ {100, 1000, 10000}, subsequently referred to as IF(T). The BTM algorithm can be accommodated into our setup as is (see Algorithm 3 in [29]).
The MALLOWSMPI algorithm assumes a Mallows model [20] instead of PL as the underlying probability distribution over rankings, and it seeks to find the Condorcet winner; it can be applied in our setting, too, since a Condorcet winner does exist for PL. Since the baseline methods, with the exception of BTM, are not able to handle an ε-approximation, we ran our algorithm with ε = 0 (and made sure that v_i ≠ v_j for all 1 ≤ i ≠ j ≤ M).

Figure 1: The sample complexity for M ∈ {5, 10, 15}, δ = 0.1, ε = 0 (panels: c = 0, c = 2, c = 5; x-axis: number of arms, y-axis: sample complexity; curves: PLPAC, IF(100), IF(1000), IF(10000), BTM, MallowsMPI). The results are averaged over 100 repetitions.

We tested the learning algorithms by setting the parameters of PL to v_i = 1/(c + i) with c ∈ {0, 1, 2, 3, 5}. The parameter c controls the complexity of the rank elicitation task, since the gaps between the pairwise probabilities and 1/2 are of the form |p_{i,j} − 1/2| = |1/(1 + (i + c)/(j + c)) − 1/2|, which converges to zero as c → ∞. We evaluated the algorithms on this test case with varying numbers of items M ∈ {5, 10, 15} and various values of the parameter c, and plotted the sample complexities, that is, the number of pairwise comparisons taken by the algorithms prior to termination. The results are shown in Figure 1 (only for c ∈ {0, 2, 5}; the remaining plots are deferred to Appendix E).
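The synthetic test case v_i = 1/(c + i) and its pairwise gaps can be reproduced as follows; the helper names are ours:

```python
def make_instance(M, c):
    """Skill parameters v_i = 1/(c + i) of the synthetic PL instance
    used in the experiments (items indexed 1..M, matching the text)."""
    return {i: 1.0 / (c + i) for i in range(1, M + 1)}

def gap(v, i, j):
    """|p_{i,j} - 1/2| for the PL marginal p_{i,j} = v_i / (v_i + v_j)."""
    p = v[i] / (v[i] + v[j])
    return abs(p - 0.5)
```

Plugging in v_i = 1/(c + i) gives p_{i,j} = (c + j)/(2c + i + j), so the minimal adjacent gap is gap(v, M−1, M) = 1/(2(2c + 2M − 1)), which shrinks as either M or c grows; this is what makes the larger instances harder.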
As can be seen, the PLPAC algorithm significantly outperforms the baseline methods when the pairwise comparisons match the model assumption, namely, when they are drawn from the marginals of a PL distribution. MALLOWSMPI achieves a performance that is slightly worse than PLPAC for M = 5, and its performance is among the worst ones for M = 15. This can be explained by the elimination strategy of MALLOWSMPI, which heavily relies on the existence of a gap min_{i ≠ j} |p_{i,j} − 1/2| > 0 between all pairwise probabilities and 1/2; in our test case, the minimal gap p_{M−1,M} − 1/2 = 1/(2(2c + 2M − 1)) > 0 becomes smaller with increasing M and c. The poor performance of BTM for large c and M can be explained by the same argument.

9.2 The AMPR Problem

Since the RankCentrality algorithm produces the most probable ranking if the pairwise marginals obey a PL distribution and the sample size is sufficiently large (cf. Remark 5), it was taken as a baseline. Using the same test case as before, input data of various sizes was generated for RankCentrality based on uniform sampling of the pairs to be compared. Its performance is shown by the black lines in Figure 2 (the results for c ∈ {1, 3, 4} are again deferred to Appendix F). The accuracy in a single run of the algorithm is 1 if the output of RankCentrality is identical to the most probable ranking, and 0 otherwise; this accuracy was averaged over 100 runs.

(3) In addition, we conducted some experiments to assess the impact of the parameter ε and to test our algorithms based on Clopper-Pearson confidence intervals.
These experiments are deferred to Appendices H and G due to lack of space.

Figure 2: Sample complexity for finding the approximately most probable ranking (AMPR) with parameters M ∈ {5, 10, 15}, δ = 0.05, ε = 0 (panels: c = 0, c = 2, c = 5; x-axis: sample size, y-axis: fraction of runs recovering the optimal ranking; curves: RankCentrality and PLPAC-AMPR for M = 5, 10, 15). The results are averaged over 100 repetitions.

We also ran our PLPAC-AMPR algorithm and determined the number of pairwise comparisons it takes prior to termination. The horizontal lines in Figure 2 show the empirical sample complexity achieved by PLPAC-AMPR with ε = 0. In accordance with Theorem 4, the accuracy of PLPAC-AMPR was always significantly higher than 1 − δ (actually equal to 1 in almost every case).

As can be seen, RankCentrality slightly outperforms PLPAC-AMPR in terms of sample complexity, that is, it achieves an accuracy of 1 for a smaller number of pairwise comparisons. Keep in mind, however, that PLPAC-AMPR only terminates when its output is correct with probability at least 1 − δ. Moreover, it computes the confidence intervals for the statistics it uses based on the Chernoff-Hoeffding bound, which is known to be very conservative.
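A sketch of the RankCentrality construction described in Remark 5; the normalizer d = M − 1 and plain power iteration are illustrative choices of ours, not prescribed by [23]:

```python
def rank_centrality(p_hat, iters=10000):
    """Turn the empirical marginal matrix p_hat into a row-stochastic
    transition matrix (from state i, move to j with probability
    proportional to the chance that j beats i) and rank the items by the
    stationary distribution, approximated here by power iteration."""
    M = len(p_hat)
    d = M - 1  # normalizer keeping every row a probability distribution
    Q = [[0.0] * M for _ in range(M)]
    for i in range(M):
        off = 0.0
        for j in range(M):
            if i != j:
                Q[i][j] = p_hat[j][i] / d  # step towards the winner j
                off += Q[i][j]
        Q[i][i] = 1.0 - off
    pi = [1.0 / M] * M
    for _ in range(iters):
        pi = [sum(pi[i] * Q[i][j] for i in range(M)) for j in range(M)]
    return sorted(range(M), key=lambda i: -pi[i])  # best item first
```

For exact PL marginals p_{i,j} = v_i/(v_i + v_j), the chain satisfies detailed balance with stationary weights proportional to v, so the returned order is the mode of the PL model, consistent with the guarantee recalled in Remark 5.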
As opposed to this, RankCentrality is an offline algorithm that comes with no performance guarantee if the sample size is not sufficiently large (see Remark 5). Therefore, it is not surprising that, asymptotically, its empirical sample complexity shows a better behavior than the complexity of our online learner.
As a final remark, ranking distributions can in principle be defined based on any sorting algorithm, for example MergeSort. However, to the best of our knowledge, pairwise stability has not yet been shown for any sorting algorithm other than QuickSort. We empirically tested the MergeSort algorithm in our experimental study, simply by using it in place of budgeted QuickSort in the PLPAC-AMPR algorithm. We found MergeSort inappropriate for the PL model, since the accuracy of PLPAC-AMPR, when used with MergeSort instead of QuickSort, drops drastically on complex tasks; for details, see Appendix J. The question of pairwise stability of different sorting algorithms for various ranking distributions, such as the Mallows model, is an interesting research avenue to be explored.

10 Conclusion and Future Work
In this paper, we studied different problems of online rank elicitation based on pairwise comparisons under the assumption of a Plackett-Luce model. Taking advantage of this assumption, our idea is to construct a surrogate probability distribution over rankings based on a sorting procedure, namely QuickSort, for which the pairwise marginals provably coincide with the marginals of the PL distribution. In this way, we manage to exploit the (stochastic) transitivity properties of PL, which is at the origin of the efficiency of our approach, together with the idea of replacing the original QuickSort with a budgeted version of this algorithm.
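The basic idea of such a budgeted QuickSort can be sketched as follows (an illustrative sketch with simplified bookkeeping, not our actual implementation), with each comparison drawn from the PL pairwise marginals p_{i,j} = v_i/(v_i + v_j):

```python
import random

def pl_compare(i, j, v, rng):
    # PL pairwise marginal: P(i is preferred to j) = v_i / (v_i + v_j)
    return rng.random() < v[i] / (v[i] + v[j])

def budgeted_quicksort(items, v, budget, rng):
    """Randomized QuickSort driven by noisy PL comparisons. `budget` is a
    mutable one-element list counting the remaining comparisons; a block
    that cannot be partitioned within the budget is left unsorted."""
    if len(items) <= 1 or budget[0] < len(items) - 1:
        return list(items)
    budget[0] -= len(items) - 1        # partitioning costs |items| - 1 comparisons
    pivot = rng.choice(items)
    left, right = [], []
    for x in items:
        if x == pivot:
            continue
        (left if pl_compare(x, pivot, v, rng) else right).append(x)
    return (budgeted_quicksort(left, v, budget, rng) + [pivot]
            + budgeted_quicksort(right, v, budget, rng))

rng = random.Random(0)
v = [16.0, 8.0, 4.0, 2.0, 1.0]         # hypothetical PL weights, item 0 strongest
ranking = budgeted_quicksort(list(range(5)), v, [100], rng)
print(ranking)                          # a random full ranking of the 5 items;
                                        # stronger items tend to come first
```

Setting the budget large enough recovers plain randomized QuickSort, whereas a small budget yields an only partially sorted sequence at reduced sampling cost.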
In addition to a formal performance and complexity analysis of our algorithms, we also presented first experimental studies showing the effectiveness of our approach.
Needless to say, in addition to the problems studied in this paper, there are many other interesting problems that can be tackled within the preference-based framework of online learning. For example, going beyond a single item or ranking, we may look for a good estimate P̂ of the entire distribution P, for example, an estimate with small Kullback-Leibler divergence: KL(P, P̂) < ε. With regard to the use of sorting algorithms, another interesting open question is the following: Is there any sorting algorithm with a worst-case complexity of order M log M that preserves the marginal probabilities? This question might be difficult to answer since, as we conjecture, the MergeSort and the InsertionSort algorithms, which are both well-known algorithms with an M log M complexity, do not satisfy this property.
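To make the KL criterion concrete: the PL probability of a complete ranking factorizes over successive choices, so for small M both P and an estimate P̂ can be evaluated and compared by brute-force enumeration. A minimal sketch (hypothetical helper names; for illustration only):

```python
import math
from itertools import permutations

def pl_prob(r, v):
    """Plackett-Luce probability of ranking r (best item first):
    P(r) = prod_k v[r_k] / (v[r_k] + v[r_{k+1}] + ...)."""
    p = 1.0
    for k in range(len(r)):
        p *= v[r[k]] / sum(v[i] for i in r[k:])
    return p

def kl_pl(v_true, v_est):
    """KL(P, P_hat) between two PL models, by enumerating all rankings."""
    return sum(pl_prob(r, v_true) * math.log(pl_prob(r, v_true) / pl_prob(r, v_est))
               for r in permutations(range(len(v_true))))

v = [3.0, 2.0, 1.0]                     # hypothetical PL weights
total = sum(pl_prob(r, v) for r in permutations(range(3)))
print(round(total, 10))                 # the probabilities sum to 1
print(kl_pl(v, v))                      # zero for a perfect estimate
```

For larger M such enumeration is infeasible, which is part of what makes the distribution-estimation problem an interesting direction for future work.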