{"title": "Label Ranking with Partial Abstention based on Thresholded Probabilistic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2501, "page_last": 2509, "abstract": "Several machine learning methods allow for abstaining from uncertain predictions. While being common for settings like conventional classification, abstention has been studied much less in learning to rank. We address abstention for the label ranking setting, allowing the learner to declare certain pairs of labels as being incomparable and, thus, to predict partial instead of total orders. In our method, such predictions are produced via thresholding the probabilities of pairwise preferences between labels, as induced by a predicted probability distribution on the set of all rankings. We formally analyze this approach for the Mallows and the Plackett-Luce model, showing that it produces proper partial orders as predictions and characterizing the expressiveness of the induced class of partial orders. These theoretical results are complemented by experiments demonstrating the practical usefulness of the approach.", "full_text": "Label Ranking with Partial Abstention based on\n\nThresholded Probabilistic Models\n\nWeiwei Cheng\n\nMathematics and Computer Science\n\nPhilipps-Universit\u00a8at Marburg\n\nMarburg, Germany\n\nEyke H\u00a8ullermeier\n\nMathematics and Computer Science\n\nPhilipps-Universit\u00a8at Marburg\n\nMarburg, Germany\n\ncheng@mathematik.uni-marburg.de\n\neyke@mathematik.uni-marburg.de\n\nWillem Waegeman\n\nMathematical Modeling, Statistics and\n\nBioinformatics, Ghent University\n\nGhent, Belgium\n\nVolkmar Welker\n\nMathematics and Computer Science\n\nPhilipps-Universit\u00a8at Marburg\n\nMarburg, Germany\n\nwillem.waegeman@ugent.be\n\nwelker@mathematik.uni-marburg.de\n\nAbstract\n\nSeveral machine learning methods allow for abstaining from uncertain predic-\ntions. 
While being common for settings like conventional classi\ufb01cation, absten-\ntion has been studied much less in learning to rank. We address abstention for the\nlabel ranking setting, allowing the learner to declare certain pairs of labels as being\nincomparable and, thus, to predict partial instead of total orders. In our method,\nsuch predictions are produced via thresholding the probabilities of pairwise pref-\nerences between labels, as induced by a predicted probability distribution on the\nset of all rankings. We formally analyze this approach for the Mallows and the\nPlackett-Luce model, showing that it produces proper partial orders as predictions\nand characterizing the expressiveness of the induced class of partial orders. These\ntheoretical results are complemented by experiments demonstrating the practical\nusefulness of the approach.\n\n1\n\nIntroduction\n\nIn machine learning, the notion of \u201cabstention\u201d commonly refers to the possibility of refusing a\nprediction in cases of uncertainty. In classi\ufb01cation with a reject option, for example, a classi\ufb01er\nmay abstain from a class prediction if making no decision is considered less harmful than making an\nunreliable and hence potentially false decision [7, 1]. The same idea could be used in the context of\nranking, too, where a reject option appears to be even more interesting than in classi\ufb01cation. While\na conventional classi\ufb01er has only two choices, namely to predict a class or to abstain, a ranker can\nabstain to some degree: The order relation predicted can be more or less complete, ranging from a\ntotal order to the empty relation in which all alternatives are declared incomparable.\nOur focus is on so-called label ranking problems [16, 10], to be introduced more formally in Sec-\ntion 2 below. 
Label ranking has a strong relationship with the standard setting of multi-class classi-\n\ufb01cation, but each instance is now associated with a complete ranking of all labels instead of a single\nlabel. Typical examples, which also highlight the need for abstention, include the ranking of can-\ndidates for a given job and the ranking of products for a given customer. In such applications, it is\ndesirable to avoid the expression of unreliable or unwarranted preferences. Thus, if a ranker cannot\nreliably decide whether a \ufb01rst label should precede a second one or the other way around, it should\nabstain from this decision and instead declare these alternatives as being incomparable. Abstaining\nin a consistent way, the relation thus produced should form a partial order [6].\n\n1\n\n\fIn Section 4, we propose and analyze a new approach for abstention in label ranking that builds\non existing work on partial orders in areas like decision theory, probability theory and discrete\nmathematics. We predict partial orders by thresholding parameterized probability distributions on\nrankings, using the Plackett-Luce and the Mallows model. Roughly speaking, this approach is able\nto avoid certain inconsistencies of a previous approach to label ranking with abstention [6], to be\ndiscussed in Section 3. By making stronger model assumptions, our approach simpli\ufb01es the con-\nstruction of consistent partial order relations. In fact, it obeys a number of appealing theoretical\nproperties. Apart from assuring proper partial orders as predictions, it allows for an exact character-\nization of the expressivity of a class of thresholded probability distributions in terms of the number\nof partial orders that can be produced. 
The proposal and formal analysis of this approach constitute our main contributions. Last but not least, as will be shown in Section 5, the theoretical advantages of our approach in comparison with [6] are also reflected in practical improvements.\n\n2 Label Ranking with Abstention\n\nIn the setting of label ranking, each instance x from an instance space X is associated with a total order of a fixed set of class labels Y = {y1, . . . , yM}, that is, a complete, transitive, and antisymmetric relation ≻ on Y, where yi ≻ yj indicates that yi precedes yj in the order. Since a ranking can be considered as a special type of preference relation, we shall also say that yi ≻ yj indicates a preference for yi over yj (given the instance x).\n\nFormally, a total order ≻ can be identified with a permutation π of the set [M] = {1, . . . , M}, such that π(i) is the position of yi in the order. We denote the class of permutations of [M] (the symmetric group of order M) by Ω. Moreover, we identify ≻ with the mapping (relation) R : Y^2 → {0, 1} such that R(i, j) = 1 if yi ≻ yj and R(i, j) = 0 otherwise.\n\nThe goal in label ranking is to learn a "label ranker" in the form of an X → Ω mapping. As training data, a label ranker uses a set of instances xn (n ∈ [N]), together with preference information in the form of pairwise comparisons yi ≻ yj of some (but not necessarily all) labels in Y, suggesting that instance xn prefers label yi to yj.\n\nThe prediction accuracy of a label ranker is assessed by comparing the true ranking π with the prediction π̂, using a distance measure D on rankings. 
Among the most commonly used measures is the Kendall distance, which is defined by the number of inversions, that is, pairs {i, j} ⊂ [M] such that sign(π(i) − π(j)) ≠ sign(π̂(i) − π̂(j)). Besides, the sum of squared rank distances, ∑_{i=1}^{M} (π(i) − π̂(i))^2, is often used as an alternative distance; it is closely connected to Spearman's rank correlation (Spearman's rho), which is an affine transformation of this number to the interval [−1, +1].\n\nMotivated by the idea of a reject option in classification, Cheng et al. [6] introduced a variant of the above setting in which the label ranker is allowed to partially abstain from a prediction. More specifically, it is allowed to make predictions in the form of a partial order Q instead of a total order R: If Q(i, j) = Q(j, i) = 0, the ranker abstains on the label pair (yi, yj) and instead declares these labels as being incomparable. Abstaining in a consistent way, Q should still be antisymmetric and transitive, hence a partial order relation. Note that a prediction Q can be associated with a confidence set, i.e., a subset of Ω supposed to cover the true ranking π, namely the set of all linear extensions of this partial order: C(Q) = {π ∈ Ω | Q(i, j) = 1 ⇒ (π(i) < π(j)) for all i, j ∈ [M]}.\n\n3 Previous Work\n\nDespite a considerable amount of work on ranking in general and learning to rank in particular, the literature on partial rankings is relatively sparse. Worth mentioning is work on a specific type of partial orders, namely linear orders of unsorted or tied subsets (partitions, bucket orders) [13, 17]. However, apart from the restriction to this type of order relation, the problems addressed in these works are quite different from our goals. 
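The two distance measures introduced in Section 2 can be computed in a few lines. The following sketch is illustrative code, not the authors' implementation; a ranking is represented as a position vector pi, with pi[i] the position of label y_{i+1}:

```python
def kendall_distance(pi, sigma):
    """Number of inversions: pairs {i, j} with
    sign(pi[i] - pi[j]) != sign(sigma[i] - sigma[j])."""
    M = len(pi)
    return sum((pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
               for i in range(M) for j in range(i + 1, M))

def squared_rank_distance(pi, sigma):
    """Sum of squared rank differences over all labels."""
    return sum((p - s) ** 2 for p, s in zip(pi, sigma))

def spearman_rho(pi, sigma):
    """The affine transformation of the squared rank distance
    to the interval [-1, +1] (Spearman's rho)."""
    M = len(pi)
    return 1 - 6 * squared_rank_distance(pi, sigma) / (M * (M ** 2 - 1))

pi    = [1, 2, 3, 4]   # positions of y1..y4 (identity ranking)
sigma = [1, 2, 4, 3]   # the same ranking with y3 and y4 swapped
print(kendall_distance(pi, sigma))   # 1
print(spearman_rho(pi, sigma))       # 0.8
```

With a single adjacent swap, exactly one pair is discordant, and the squared rank distance of 2 maps to a rank correlation of 0.8.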
The authors in [17] specifically address computational aspects that arise when working with distributions on partially ranked data, while [13] seeks to discover an underlying bucket order from pairwise precedence information between the items.\n\n2\n\n\fMore concretely, in the context of the label ranking problem, the aforementioned work [6] is the only one so far that addresses the more general problem of learning to predict partial orders. This method consists of two main steps and can be considered as a pairwise approach in the sense that, as a point of departure for a prediction, a valued preference relation P : Y^2 → [0, 1] is produced, where P(i, j) is interpreted as a measure of support of the pairwise preference yi ≻ yj. Support is commonly interpreted in terms of probability, hence P is assumed to be reciprocal: P(i, j) = 1 − P(j, i) for all i, j ∈ [M].\n\nThen, in a second step, a partial order Q is derived from P via thresholding:\n\nQ(i, j) = ⟦P(i, j) > q⟧ = 1 if P(i, j) > q, and 0 otherwise,   (1)\n\nwhere 1/2 ≤ q < 1 is a threshold. Thus, the idea is to predict only those pairwise preferences that are sufficiently likely, while abstaining on pairs (i, j) for which P(i, j) ≈ 1/2.\n\nThe first step of deriving the relation P is realized by means of an ensemble learning technique: Training an ensemble of standard label rankers, each of which provides a prediction in the form of a total order, P(i, j) is defined by the fraction of ensemble members voting for yi ≻ yj. Other possibilities are of course conceivable, and indeed, the only important point to notice here is that the preference degrees P(i, j) are essentially independent of each other. Or, stated differently, they do not guarantee any specific properties of the relation P except being reciprocal. 
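The thresholding step (1) is a one-liner, but an unconstrained reciprocal relation P does not guarantee a usable result for every threshold. The following sketch (with hypothetical preference degrees, not data from the paper) exhibits a reciprocal P whose thresholded relation is cyclic at q = 1/2:

```python
import numpy as np

def threshold(P, q):
    """Eq. (1): Q(i, j) = 1 iff P(i, j) > q, with 1/2 <= q < 1."""
    return (P > q).astype(int)

def is_transitive(Q):
    """Q(i,j) and Q(j,k) must imply Q(i,k) for all triples."""
    M = len(Q)
    return all(not (Q[i][j] and Q[j][k]) or Q[i][k]
               for i in range(M) for j in range(M) for k in range(M))

def has_cycle(Q):
    """Q has a directed cycle iff some power of its adjacency matrix
    (up to Q^M) has a nonzero diagonal entry."""
    M = len(Q)
    R = np.array(Q)
    for _ in range(M):
        if np.trace(R) > 0:
            return True
        R = (R @ np.array(Q) > 0).astype(int)
    return False

# A reciprocal relation (P[j, i] = 1 - P[i, j]) as an ensemble of
# independent voters might produce it -- hypothetical values:
P = np.array([[0.5, 0.6, 0.4],
              [0.4, 0.5, 0.6],
              [0.6, 0.4, 0.5]])
print(has_cycle(threshold(P, 0.5)))       # True: q = 1/2 is infeasible here
print(is_transitive(threshold(P, 0.65)))  # True (the empty relation)
```

Here the voters support y1 ≻ y2, y2 ≻ y3 and y3 ≻ y1 with probability 0.6 each, so thresholding at q = 1/2 produces a 3-cycle; only the larger threshold yields a (trivially transitive) relation.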
In particular, P does not necessarily obey any type of transitivity property.\n\nFor the relation Q derived from P via thresholding, this has two important consequences: First, if the threshold q is not large enough, then Q may have cycles. Thus, not all thresholds in [1/2, 1) are actually feasible. In particular, if q = 1/2 cannot be chosen, this also implies that the method may not be able to predict a total order as a special case. Second, even if Q does not have cycles, it is not guaranteed to be transitive.\n\nTo overcome these problems, the authors devise an algorithm (of complexity O(|Y|^3)) that finds the smallest feasible threshold qmin, namely the threshold that guarantees Q(i, j) = ⟦P(i, j) > q⟧ to be cycle-free for each threshold q ∈ [qmin, 1). Then, since Q may still be non-transitive, it is "repaired" in a second step by replacing it with its transitive closure [23].\n\n4 Predicting Partial Orders based on Probabilistic Models\n\nIn order to tackle the problems of the approach in [6], our idea is to restrict the relation P in (1) so as to exclude the possibility of cycles and intransitivity from the very beginning, thereby avoiding the need for a post-reparation of a prediction. To this end, we take advantage of methods for label ranking that produce (parameterized) probability distributions over Ω as predictions. 
Our main theoretical result is to show that thresholding pairwise preferences induced by such distributions, apart from being semantically meaningful due to their clear probabilistic interpretation, yields preference relations with the desired properties, that is, partial order relations Q.\n\n4.1 Probabilistic Models\n\nDifferent types of probability models for rank data have been studied in the statistical literature [11, 20], including the Mallows model and the Plackett-Luce (PL) model as the most popular representatives of the class of distance-based and stagewise models, respectively. Both models have recently attracted attention in machine learning [14, 15, 22, 21, 18] and, in particular, have been used in the context of label ranking.\n\nA label ranking method that produces predictions expressed in terms of the Mallows model is proposed in [5]. The standard Mallows model\n\nP(π | θ, π0) = exp(−θ D(π, π0)) / φ(θ)   (2)\n\nis determined by two parameters: The ranking π0 ∈ Ω is the location parameter (mode, center ranking) and θ ≥ 0 is a spread parameter. Moreover, D is a distance measure on rankings, and φ = φ(θ) is a normalization factor that depends on the spread (but, provided the right-invariance of D, not on π0).\n\n3\n\n\fObviously, the Mallows model assigns the maximum probability to the center ranking π0. The larger the distance D(π, π0), the smaller the probability of π becomes. The spread parameter θ determines how quickly the probability decreases, i.e., how peaked the distribution is around π0. For θ = 0, the uniform distribution is obtained, while for θ → ∞, the distribution converges to the one-point distribution that assigns probability 1 to π0 and 0 to all other rankings.\n\nAlternatively, the Plackett-Luce (PL) model was used in [4]. 
This model is specified by a parameter vector v = (v1, v2, . . . , vM) ∈ R_+^M:\n\nP(π | v) = ∏_{i=1}^{M} v_{π^{-1}(i)} / (v_{π^{-1}(i)} + v_{π^{-1}(i+1)} + . . . + v_{π^{-1}(M)})   (3)\n\nIt is a generalization of the well-known Bradley-Terry model for the pairwise comparison of alternatives, which specifies the probability that "a wins against b" in terms of P(a ≻ b) = va / (va + vb). Obviously, the larger va in comparison to vb, the higher the probability that a is chosen. Likewise, the larger the parameter vi in (3) in comparison to the parameters vj, j ≠ i, the higher the probability that yi appears on a top rank.\n\n4.2 Thresholded Relations are Partial Orders\n\nGiven a probability distribution P on the set of rankings Ω, the probability of a pairwise preference yi ≻ yj (and hence the corresponding entry in the preference relation P) is obtained through marginalization:\n\nP(i, j) = P(yi ≻ yj) = ∑_{π ∈ E(i,j)} P(π),   (4)\n\nwhere E(i, j) denotes the set of linear extensions of the incomplete ranking yi ≻ yj, i.e., the set of all rankings π ∈ Ω with π(i) < π(j). We start by stating a necessary and sufficient condition on P(i, j) for the threshold relation (1) to result in a (strict) partial order, i.e., an antisymmetric, irreflexive and transitive relation.\n\nLemma 1. Let P be a reciprocal relation and let Q be given by (1). Then Q defines a strict partial order relation for all q ∈ [1/2, 1) if and only if P satisfies partial stochastic transitivity, i.e., P(i, j) > 1/2 and P(j, k) > 1/2 implies P(i, k) ≥ min(P(i, j), P(j, k)) for each triple (i, j, k) ∈ [M]^3.\n\nThis lemma was first proven by Fishburn [12], together with a number of other characterizations of subclasses of strict partial orders by means of transitivity properties on P(i, j). 
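For small M, the models (2) and (3) and the marginalization (4) above can be implemented directly by enumerating Ω. The following sketch uses hypothetical parameter values, and computes the normalization φ(θ) by brute force:

```python
from itertools import permutations
from math import exp

def kendall(pi, sigma):
    """Kendall distance between two rankings given as position vectors."""
    M = len(pi)
    return sum((pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
               for i in range(M) for j in range(i + 1, M))

def mallows(pi, pi0, theta, space):
    """Mallows probability, Eq. (2); phi(theta) by enumeration."""
    phi = sum(exp(-theta * kendall(s, pi0)) for s in space)
    return exp(-theta * kendall(pi, pi0)) / phi

def plackett_luce(pi, v):
    """PL probability, Eq. (3); pi is a position vector."""
    order = sorted(range(len(pi)), key=lambda i: pi[i])  # labels by rank
    p = 1.0
    for r in range(len(order)):
        p *= v[order[r]] / sum(v[i] for i in order[r:])
    return p

def marginal(prob, i, j, space):
    """P(i, j) = P(y_i > y_j) via marginalization, Eq. (4)."""
    return sum(prob(pi) for pi in space if pi[i] < pi[j])

M = 3
space = [list(s) for s in permutations(range(1, M + 1))]
v = [2.0, 1.0, 0.5]  # hypothetical PL weights
print(marginal(lambda s: plackett_luce(s, v), 0, 1, space))  # v1/(v1+v2) = 2/3
```

The printed value illustrates that the pairwise marginal of the PL model coincides with the Bradley-Terry probability va / (va + vb) for the corresponding pair.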
For example, replac-\ning partial stochastic transitivity by interval stochastic transitivity (now a condition on quadruples\ninstead of triplets) leads to a characterization of interval orders, a subclass of strict partial orders; a\npartial order Q on [M ]2 is called an interval order if each i \u2208 [M ] can be associated with an interval\n(li, ui) \u2282 R such that Q(i, j) = 1 \u21d4 ui \u2264 lj.\nOur main theoretical results below state that thresholding (4) yields a strict partial order relation Q,\nboth for the PL and the Mallows model. Thus, we can guarantee that a strict partial order relation\ncan be predicted by simple thresholding, and without the need for any further reparation. Moreover,\nthe whole spectrum of threshold parameters q \u2208 [1/2, 1) can be used.\nTheorem 1. Let P in (4) be the PL model (3). Moreover, let Q be given by the threshold relation\n(1). Then Q de\ufb01nes a strict partial order relation for all q \u2208 [1/2, 1).\nTheorem 2. Let P in (4) be the Mallows model (2), with a distance D having the so-called trans-\nposition property. Moreover, let Q be given by the threshold relation (1). Then Q de\ufb01nes a strict\npartial order relation for all q \u2208 [1/2, 1).\n\nTheorem 1 directly follows from the strong stochastic transitivity of the PL model [19]. The proof\nof Theorem 2 is slightly more complicated and given below. Moreover, the result for Mallows is less\ngeneral in the sense that D must obey the transposition property. Actually, however, this property is\nnot very restrictive and indeed satis\ufb01ed by most of the commonly used distance measures, including\nthe Kendall distance (see, e.g., [9]). In the following, we always assume that the distance D in the\nMallows model (2) satis\ufb01es this property.\n\n4\n\n\fDe\ufb01nition 1. 
A distance D on Ω is said to have the transposition property, if the following holds: Let π and π′ be rankings and let (i, j) be an inversion (i.e., i < j and (π(i) − π(j))(π′(i) − π′(j)) < 0). Let π″ ∈ Ω be constructed from π′ by swapping yi and yj, that is, π″(i) = π′(j), π″(j) = π′(i) and π″(m) = π′(m) for all m ∈ [M] \\ {i, j}. Then, D(π, π″) ≤ D(π, π′).\n\nLemma 2. If yi precedes yj in the center ranking π0 in (2), then P(yi ≻ yj) ≥ 1/2. Moreover, if P(yi ≻ yj) > q ≥ 1/2, then yi precedes yj in the center ranking π0.\n\nProof. For every ranking π ∈ Ω, let b(π) = π if yi precedes yj in π; otherwise, b(π) is defined by swapping yi and yj in π. Obviously, b(·) defines a bijection between E(i, j) and E(j, i). Moreover, since D has the transposition property, D(b(π), π0) ≤ D(π, π0) for all π ∈ Ω. Therefore, according to the Mallows model, P(b(π)) ≥ P(π), and hence\n\nP(yi ≻ yj) = ∑_{π ∈ E(i,j)} P(π) ≥ ∑_{π ∈ E(i,j)} P(b^{-1}(π)) = ∑_{π ∈ E(j,i)} P(π) = P(yj ≻ yi).\n\nSince, moreover, P(yi ≻ yj) = 1 − P(yj ≻ yi), it follows that P(yi ≻ yj) ≥ 1/2. The second part immediately follows from the first one by way of contradiction: If yj preceded yi, then P(yj ≻ yi) ≥ 1/2, and therefore P(yi ≻ yj) = 1 − P(yj ≻ yi) ≤ 1/2 ≤ q.\n\nLemma 3. If yi precedes yj and yj precedes yk in the center ranking π0 in (2), then P(yi ≻ yk) ≥ max(P(yi ≻ yj), P(yj ≻ yk)).\n\nProof. 
We show that P(yi ≻ yk) ≥ P(yi ≻ yj). The second inequality P(yi ≻ yk) ≥ P(yj ≻ yk) is shown analogously. Let E(i, j, k) denote the set of linear extensions of yi ≻ yj ≻ yk, i.e., the set of rankings π ∈ Ω in which yi precedes yj and yj precedes yk. Now, for every π ∈ E(k, j, i), define b(π) by first swapping yk and yj and then yk and yi in π. Obviously, b(·) defines a bijection between E(k, j, i) and E(j, i, k). Moreover, due to the transposition property, D(b(π), π0) ≤ D(π, π0), and therefore P(b(π)) ≥ P(π) under the Mallows model. Consequently, since E(i, j) = E(i, j, k) ∪ E(i, k, j) ∪ E(k, i, j) and E(i, k) = E(i, k, j) ∪ E(i, j, k) ∪ E(j, i, k), it follows that\n\nP(yi ≻ yk) − P(yi ≻ yj) = ∑_{π ∈ E(i,k)\\E(i,j)} P(π) − ∑_{π ∈ E(i,j)\\E(i,k)} P(π) = ∑_{π ∈ E(j,i,k)} P(π) − ∑_{π ∈ E(k,j,i)} P(π) = ∑_{π ∈ E(k,j,i)} (P(b(π)) − P(π)) ≥ 0.\n\nLemmas 2 and 3 immediately imply the following lemma.\n\nLemma 4. The relation P derived via P(i, j) = P(yi ≻ yj) from the Mallows model satisfies the following property (closely related to strong stochastic transitivity): If P(i, j) > q and P(j, k) > q, then P(i, k) ≥ max(P(i, j), P(j, k)), for all q ≥ 1/2 and all i, j, k ∈ [M].\n\nProof of Theorem 2. Since P(yi ≻ yj) = 1 − P(yj ≻ yi), it obviously follows that Q(yi, yj) = 1 implies Q(yj, yi) = 0. Moreover, Lemma 4 implies that Q is transitive. Consequently, Q defines a proper partial order relation.\n\nThis establishes that a strict partial order relation can be predicted by simple thresholding, without any further reparation, over the whole spectrum of threshold parameters q ∈ [1/2, 1). 
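For small M, the guarantees of Theorems 1 and 2 can be verified exhaustively: compute the marginals (4) under either model, threshold at various q, and check the partial-order axioms. This is a brute-force sketch with hypothetical parameters (enumeration over all M! rankings), not the authors' implementation:

```python
from itertools import permutations
from math import exp

# Rankings are tuples listing label indices in order of preference,
# e.g. (2, 0, 1) means y3 > y1 > y2.

def pl_prob(order, v):
    """Plackett-Luce probability of a ranking, Eq. (3)."""
    p = 1.0
    for r in range(len(order)):
        p *= v[order[r]] / sum(v[i] for i in order[r:])
    return p

def n_inversions(order, center):
    """Kendall distance in this representation."""
    pos_o = {lab: r for r, lab in enumerate(order)}
    pos_c = {lab: r for r, lab in enumerate(center)}
    labs = list(order)
    return sum((pos_o[a] - pos_o[b]) * (pos_c[a] - pos_c[b]) < 0
               for i, a in enumerate(labs) for b in labs[i + 1:])

def marginals(prob, M):
    """Pairwise marginals P(i, j) = P(y_i > y_j), Eq. (4)."""
    P = [[0.0] * M for _ in range(M)]
    for order in permutations(range(M)):
        p = prob(order)
        for r, a in enumerate(order):
            for b in order[r + 1:]:
                P[a][b] += p
    return P

def is_strict_partial_order(Q):
    M = len(Q)
    antisym = all(not (Q[i][j] and Q[j][i]) for i in range(M) for j in range(M))
    trans = all(not (Q[i][j] and Q[j][k]) or Q[i][k]
                for i in range(M) for j in range(M) for k in range(M))
    return antisym and trans

M, theta, v = 4, 0.7, [3.0, 2.0, 1.5, 1.0]   # hypothetical parameters
center = tuple(range(M))
Z = sum(exp(-theta * n_inversions(s, center)) for s in permutations(range(M)))
models = [marginals(lambda s: pl_prob(s, v), M),
          marginals(lambda s: exp(-theta * n_inversions(s, center)) / Z, M)]
for P in models:
    for q in [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]:
        Q = [[int(P[i][j] > q) for j in range(M)] for i in range(M)]
        assert is_strict_partial_order(Q)
print("thresholded relations are strict partial orders for all q tested")
```

The assertions pass for both the PL marginals and the Mallows marginals with the Kendall distance, in line with Theorems 1 and 2; an arbitrary reciprocal P would fail such a check in general.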
As an aside, we mention that strict partial orders can also be\nproduced by thresholding other probabilistic preference learning models. All pairwise preference\nmodels based on utility scores satisfy strong stochastic transitivity. This includes traditional sta-\ntistical models such as the Thurstone Case 5 model [25] and the Bradley-Terry model [3], as well\nas modern learning models such as [8, 2]. These models are usually not applied in label ranking\nsettings, however.\n\n4.3 Expressivity of the Model Classes\n\nSo far, we have shown that predictions produced by thresholding probability distributions on rank-\nings are proper partial orders. Roughly speaking, this is accomplished by restricting P in (1) to\nspeci\ufb01c valued preference relations (namely marginals (4) of the Mallows or the PL model), in con-\ntrast to the approach of [6], where P can be any (reciprocal) relation. From a learning point of\nview, one may wonder to what extent this restriction limits the expressivity of the underlying model\nclass. This expressivity is naturally de\ufb01ned in terms of the number of different partial orders (up to\n\n5\n\n\fisomorphism) that can be represented in the form of a threshold relation (1). Interestingly, we can\nshow that, in this sense, the approach based on PL is much more expressive than the one based on\nthe Mallows model.\nTheorem 3. Let QM denote the set of different partial orders (up to isomorphism) that can be\nrepresented as a threshold relation Q de\ufb01ned by (1), where P is derived according to (4) from the\nMallows model (2) with D the Kendall distance. Then |QM| = M.\n\nProof. By the right invariance of D, different choices of \u03c00 lead to the same set of isomorphism\nclasses QM . Hence we may assume that \u03c00 is the identity. 
By Theorem 6.3 in [20], the (M × M)-matrix with entries P(i, j) is a Toeplitz matrix, i.e., P(i, j) = P(i + 1, j + 1) for all i, j ∈ [M − 1], with entries strictly increasing along rows, i.e., P(i, j) < P(i, j + 1) for 1 ≤ i < j < M. Thus, by Theorem 2, thresholding leads to M different partial orders.\n\nMore specifically, the partial orders in QM have a very simple structure that is purely rank-dependent: The first structure is the total order π = π0. The second structure is obtained by removing all preferences between all direct neighbors, i.e., labels yi and yj with adjacent ranks (|π(i) − π(j)| = 1). The third structure is obtained from the second one by removing all preferences between 2-neighbors, i.e., labels yi and yj with |π(i) − π(j)| = 2, and so forth.\n\nThe cardinality of QM increases for distance measures D other than Kendall (like Spearman's rho or footrule), mainly since in general the matrix with entries P(i, j) is no longer Toeplitz. However, for some measures, including the two just mentioned, the matrix will still be symmetric with respect to the antidiagonal, i.e., P(i, j) = P(M + 1 − i, M + 1 − j) for j > i, and have entries increasing along rows. While the exact counting of QM appears to be very difficult in such cases, an argument similar to the one used in the proof of the next result shows that |QM| is bounded by the number of symmetric Dyck paths and hence |QM| ≤ (M choose ⌊M/2⌋) (see Ch. 7 [24]). It is a simple consequence of Theorem 4 below that exponentially more orders can be produced based on the PL model.\n\nLemma 5. 
For fixed q ∈ (1/2, 1) and a set A of subsets of [M], the following are equivalent:\n\n(i) The set A is the set of maximal antichains of a partial order induced by (4) on [M] for some v1 > ··· > vM > 0.\n\n(ii) The set A is a set of mutually incomparable intervals that cover [M].\n\nProof. The fact that (i) implies (ii) is a simple calculation. Now assume (ii). For any interval {a, a + 1, . . . , b} ∈ A we must have vc/(vc + vd) ≤ q for any c, d ∈ {a, a + 1, . . . , b} for which c < d. From va ≥ vc > vd ≥ vb it follows that\n\nva/(va + vb) = 1/(1 + vb/va) ≥ 1/(1 + vd/vc) = vc/(vc + vd).\n\nThus, it suffices to show that there are real numbers v1 > ··· > vM > 0 such that va/(va + vb) ≤ q for any {a, a + 1, . . . , b} ∈ A and vc/(vc + vd) > q for any c < d which are not contained in an antichain from A. We proceed by induction on M.\n\nThe induction base M = 1 is trivial. Assume M ≥ 2. Since all elements of A are intervals and any two intervals are mutually incomparable, it follows that M is contained in exactly one set from A, possibly the singleton {M}. Let A′ be the set A without the unique interval {a, a + 1, . . . , M} containing M. Then A′ is a set of intervals that cover a proper subset [M′] of [M] and fulfill the assumptions of (ii) for [M′]. Hence by induction there is a choice of real numbers v1 > ··· > vM′ > 0 such that the set of maximal antichains of the order on [M′] induced by (4) is exactly A′.\n\nWe consider two cases: (i) a = M′ + 1. Then, by the considerations above, we need to choose numbers vM′ > va > va+1 > ··· > vM > 0 such that va/(va + vM) ≤ q and vd/(vd + ve) > q for d ≤ M′ and a = M′ + 1 ≤ e ≤ M; the latter implies vM′/(vM′ + va) > q. But those are easily checked to exist. (ii) a ≤ M′. 
Since M′ is contained in at least one set from A′ and since this set is not contained in {a, a + 1, . . . , M}, it follows that q ≥ va−1/(va−1 + vM′) > va/(va + vM′). In particular, (1 − q)va < qvM′. Now choose vM′+1 > vM′+2 > ··· > vM > 0 such that qvM′ > qvM′+1 > ··· > qvM > va(1 − q). Note that here q > 1/2 is essential. Then one checks that all desired inequalities are fulfilled.\n\n6\n\n\fTheorem 4. Let QPL denote the set of different partial orders (up to isomorphism) that can be represented as a threshold relation Q defined by (1), where P is derived according to (4) from the PL model (3). For any given threshold q ∈ [1/2, 1), the cardinality of this set is given by the Mth Catalan number:\n\n|QPL| = (1/(M + 1)) (2M choose M).\n\nSketch of Proof. Without loss of generality, we can assume the parameters of the PL model to satisfy\n\nv1 > ··· > vM > 0.   (5)\n\nConsider the (M × M)-matrix with entries P(i, j). By (5), the main diagonal of this matrix is strictly increasing along rows and columns. From the set {(i, j) | 0 ≤ i ≤ M + 1, 0 ≤ i − 1 ≤ j ≤ M}, we remove those (i, j), 1 ≤ i < j ≤ M, for which P(i, j) is above the given threshold. As a picture in the plane, this yields a shape whose upper right boundary can be identified with a Dyck path: a path on integer points consisting of 2M moves (1, 0), (0, 1) from position (1, 0) to (M + 1, M) and staying weakly above the (i + 1, i)-diagonal. It is immediate that each path uniquely determines its partial order. 
Moreover, it is well-known that these Dyck paths are counted by the M th Catalan\nnumber.\nIn order to verify that any Dyck path is induced by a suitable choice of parameters, one establishes a\nbijection between Dyck paths from (1, 0) to (M +1, M ) and maximal sets of mutually incomparable\nintervals (in the natural order) in [M ]. To this end, consider for a Dyck path a peak at position (i, j),\ni.e., a point on the path where a (1, 0) move is followed by a (0, 1) move. Then we must have j \u2265 i,\nand we identify this peak with the interval {i, i + 1, . . . , j}. It is a simple yet tedious task to check\nthat assigning to a Dyck path the set of intervals associated to its peaks is indeed a bijection to the\nset of maximal sets of mutually incomparable intervals in [M ]. Again, it is easy to verify that the set\nof intervals associated to a Dyck path is the set of maximal antichains of the partial order determined\nby the Dyck path. Now, the assertion follows from Lemma 5.\nAgain, using Lemma 5, one checks that (5) implies that partial orders induced by (4) in the PL\nmodel have a unique labeling up to poset automorphism. Hence our count is a count of isomorphism\nclasses.\nWe note that, from the above proof, it follows that the partial orders in QP L are the so called\nsemiorders. We refer the reader to Ch. 8 \u00a72 [26] for more details. Indeed, the \ufb01rst part of the proof\nof Theorem 4 resembles the proof of Ch. 8 (2.11) [26]. Moreover, we remark that QM is not only\nsmaller in size than QP L, but the former is indeed strictly included in the latter: QM \u2282 QP L. 
This can easily be seen by defining the weights vi of the PL model as vi = 2^{M−i} (i ∈ [M]), in which case the matrix with entries P(i, j) = 2^{j−i} / (1 + 2^{j−i}) is Toeplitz.\n\nFinally, given that we have been able to derive explicit (combinatorial) expressions for |QM| and |QPL|, it might be interesting to note that, somewhat surprisingly at first sight, no such expression exists for the total number of partial orders on M elements.\n\n5 Experiments\n\nWe complement our theoretical results by an empirical study, in which we compare the different approaches on the SUSHI data set,1 a standard benchmark for preference learning. Based on a food-chain survey, this data set contains complete rankings of 10 types of sushi provided by 5000 customers, where each customer is characterized by 11 numeric features.\n\nOur evaluation is done by measuring the tradeoff between correctness and completeness achieved by varying the threshold q in (1). More concretely, we use the measures that were proposed in [6]: correctness is measured by the gamma rank correlation between the true ranking and the predicted partial order (with values in [−1, +1]), and completeness is defined by one minus the fraction of pairwise comparisons on which the model abstains. Obviously, the two criteria are conflicting: increasing completeness typically comes along with reducing correctness and vice versa, at least if the learner is effective in the sense of abstaining from those decisions that are indeed most uncertain.\n\n1Available online at http://www.kamishima.net/sushi\n\n7\n\n\fFigure 1: Tradeoff between completeness and correctness for the SUSHI label ranking data set: Existing pairwise method (direct) versus the probabilistic approach based on the PL model and Mallows model with Spearman's rho (MS) and Kendall (MK) as distance measure. 
The \ufb01gure on\nthe right corresponds to the original data set with rankings of size 10, while and the \ufb01gure on the left\nshows results for rankings of size 6.\n\nWe compare the original method of [6] with our new proposal, calling the former direct, because\nthe pairwise preference degrees on which we threshold are estimated directly, and the latter derived,\nbecause these degrees are derived from probability distributions on \u2126. As a label ranker, we used\nthe instance-based approach of [5] with a neighborhood size of 50. We conducted a 10-fold cross\nvalidation and averaged the completeness/correctness curves over all test instances. Due to compu-\ntational reasons, we restricted the experiments with the Mallows model to a reduced data set with\nonly six labels, namely the \ufb01rst six of the ten sushis. (The aforementioned label ranker is based\non an instance-wise maximum likelihood estimation of the probability distribution P on \u2126; in the\ncase of the Mallows model, this involves the estimation of the center ranking \u03c00, which is done by\nsearching the discrete space \u2126, that is, a space of size |M !|.)\nThe experimental results are summarized in Figure 1. The main conclusion that can be drawn from\nthese results is that, as expected, our probabilistic approach does indeed achieve a better tradeoff\nbetween completeness and correctness than the original one, especially in the sense that it spans\na wider range of values for the former. Indeed, with the direct approach, it is not possible to go\nbeyond a completeness of around 0.4, whereas our probabilistic methods always allow for predicting\ncomplete rankings (i.e., to achieve a completeness of 1). Besides, we observe that the tradeoff curves\nof our new methods are even lifted toward a higher level of correctness. 
Among the probabilistic models, the PL model performs particularly well, although the differences are rather small.
Similar results are obtained on a number of other benchmark data sets for label ranking. These results can be found in the supplementary material.

6 Summary and Conclusions

The idea of producing predictions in the form of a partial order by thresholding a (valued) pairwise preference relation is meaningful in the sense that a learner abstains on the most unreliable comparisons. While this idea was first realized in [6] in an ad-hoc manner, we put it on a firm mathematical grounding that guarantees consistency and, via variation of the threshold, allows for exploiting the whole spectrum between a complete ranking and an empty relation.
Both variants of our probabilistic approach, the one based on the Mallows model and the other on the PL model, are theoretically sound, semantically meaningful, and show strong performance in first experimental studies. The PL model may appear especially appealing due to its expressivity, and is also advantageous from a computational perspective.
An interesting question to be addressed in future work concerns the possibility of improving this model further, namely by increasing its expressivity while still assuring consistency. In fact, the transitivity properties guaranteed by PL seem to be stronger than what is strictly needed. In this regard, we plan to study models based on the notion of Luce-decomposability [20], which include PL as a special case.

References

[1] P.L. Bartlett and M.H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008.
[2] E.V. Bonilla, S. Guo, and S. Sanner. Gaussian process preference elicitation.
In Proc. NIPS–2010, pages 262–270, Vancouver, Canada, 2010. MIT Press.
[3] R. Bradley and M. Terry. Rank analysis of incomplete block designs. I: The method of paired comparisons. Biometrika, 39:324–345, 1952.
[4] W. Cheng, K. Dembczyński, and E. Hüllermeier. Label ranking based on the Plackett-Luce model. In Proc. ICML–2010, pages 215–222, Haifa, Israel, 2010. Omnipress.
[5] W. Cheng, J. Hühn, and E. Hüllermeier. Decision tree and instance-based learning for label ranking. In Proc. ICML–2009, pages 161–168, Montreal, Canada, 2009. Omnipress.
[6] W. Cheng, M. Rademaker, B. De Baets, and E. Hüllermeier. Predicting partial orders: Ranking with abstention. In Proc. ECML/PKDD–2010, pages 215–230, Barcelona, Spain, 2010. Springer.
[7] C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.
[8] W. Chu and Z. Ghahramani. Preference learning with Gaussian processes. In Proc. ICML–2005, pages 137–144, Bonn, Germany, 2005. ACM.
[9] D. Critchlow, M. Fligner, and J. Verducci. Probability models on rankings. Journal of Mathematical Psychology, 35:294–318, 1991.
[10] O. Dekel, C.D. Manning, and Y. Singer. Log-linear models for label ranking. In Proc. NIPS–2003, Vancouver, Canada, 2003. MIT Press.
[11] P. Diaconis. Group representations in probability and statistics, volume 11 of Lecture Notes–Monograph Series. Institute of Mathematical Statistics, Hayward, CA, 1988.
[12] P.C. Fishburn. Binary choice probabilities: on the varieties of stochastic transitivity. Journal of Mathematical Psychology, 10:321–352, 1973.
[13] A. Gionis, H. Mannila, K. Puolamäki, and A. Ukkonen. Algorithms for discovering bucket orders from data. In Proc. KDD–2006, pages 561–566, Philadelphia, USA, 2006. ACM.
[14] I.C. Gormley and T.B. Murphy.
A latent space model for rank data. In Proc. ICML–2006, pages 90–102, Pittsburgh, USA, 2006. Springer.
[15] J. Guiver and E. Snelson. Bayesian inference for Plackett-Luce ranking models. In Proc. ICML–2009, pages 377–384, Montreal, Canada, 2009. Omnipress.
[16] S. Har-Peled, D. Roth, and D. Zimak. Constraint classification: a new approach to multiclass classification. In Proc. ALT–2002, pages 365–379, Lübeck, Germany, 2002. Springer.
[17] G. Lebanon and Y. Mao. Nonparametric modeling of partially ranked data. Journal of Machine Learning Research, 9:2401–2429, 2008.
[18] T. Lu and C. Boutilier. Learning Mallows models with pairwise preferences. In Proc. ICML–2011, pages 145–152, Bellevue, USA, 2011. Omnipress.
[19] R. Luce and P. Suppes. Handbook of Mathematical Psychology, chapter Preference, Utility and Subjective Probability, pages 249–410. Wiley, 1965.
[20] J. Marden. Analyzing and Modeling Rank Data. Chapman and Hall, 1995.
[21] M. Meila and H. Chen. Dirichlet process mixtures of generalized Mallows models. In Proc. UAI–2010, pages 358–367, Catalina Island, USA, 2010. AUAI Press.
[22] T. Qin, X. Geng, and T.Y. Liu. A new probabilistic model for rank aggregation. In Proc. NIPS–2010, pages 1948–1956, Vancouver, Canada, 2010. Curran Associates.
[23] M. Rademaker and B. De Baets. A threshold for majority in the context of aggregating partial order relations. In Proc. WCCI–2010, pages 1–4, Barcelona, Spain, 2010. IEEE.
[24] R.P. Stanley. Enumerative Combinatorics, Vol. 2. Cambridge University Press, 1999.
[25] L. Thurstone. A law of comparative judgment. Psychological Review, 79:281–299, 1927.
[26] W.T. Trotter. Combinatorics and Partially Ordered Sets: Dimension Theory.
The Johns Hopkins University Press, 1992.