{"title": "A New Probabilistic Model for Rank Aggregation", "book": "Advances in Neural Information Processing Systems", "page_first": 1948, "page_last": 1956, "abstract": "This paper is concerned with rank aggregation, which aims to combine multiple input rankings to get a better ranking. A popular approach to rank aggregation is based on probabilistic models on permutations, e.g., the Luce model and the Mallows model. However, these models have their limitations in either poor expressiveness or high computational complexity. To avoid these limitations, in this paper, we propose a new model, which is defined with a coset-permutation distance, and models the generation of a permutation as a stagewise process. We refer to the new model as coset-permutation distance based stagewise (CPS) model. The CPS model has rich expressiveness and can therefore be used in versatile applications, because many different permutation distances can be used to induce the coset-permutation distance. The complexity of the CPS model is low because of the stagewise decomposition of the permutation probability and the efficient computation of most coset-permutation distances. We apply the CPS model to supervised rank aggregation, derive the learning and inference algorithms, and empirically study their effectiveness and efficiency. Experiments on public datasets show that the derived algorithms based on the CPS model can achieve state-of-the-art ranking accuracy, and are much more efficient than previous algorithms.", "full_text": "A New Probabilistic Model for Rank Aggregation\n\nTao Qin\n\nMicrosoft Research Asia\n\ntaoqin@microsoft.com\n\nXiubo Geng\n\nChinese Academy of Sciences\nxiubogeng@gmail.com\n\nTie-Yan Liu\n\nMicrosoft Research Asia\n\ntyliu@microsoft.com\n\nAbstract\n\nThis paper is concerned with rank aggregation, which aims to combine multiple\ninput rankings to get a better ranking. 
A popular approach to rank aggregation is based on probabilistic models on permutations, e.g., the Luce model and the Mallows model. However, these models are limited by either poor expressiveness or high computational complexity. To avoid these limitations, we propose a new model, which is defined with a coset-permutation distance and models the generation of a permutation as a stagewise process. We refer to the new model as the coset-permutation distance based stagewise (CPS) model. The CPS model has rich expressiveness and can therefore be used in versatile applications, because many different permutation distances can be used to induce the coset-permutation distance. The complexity of the CPS model is low because of the stagewise decomposition of the permutation probability and the efficient computation of most coset-permutation distances. We apply the CPS model to supervised rank aggregation, derive the learning and inference algorithms, and empirically study their effectiveness and efficiency. Experiments on public datasets show that the derived algorithms based on the CPS model can achieve state-of-the-art ranking accuracy, and are much more efficient than previous algorithms.\n\n1 Introduction\n\nRank aggregation aims at combining multiple rankings of objects to generate a better ranking. It is a key problem in many applications. For example, in meta search [1], when users issue a query, the query is sent to several search engines and the rankings given by them are aggregated to generate more comprehensive ranking results.\n\nGiven the underlying correspondence between ranking and permutation, probabilistic models on permutations, which originated in statistics [19, 5, 4], have been widely applied to solve the problems of rank aggregation. 
Among different models, the Mallows model [15, 6] and the Luce model [14, 18] are the most popular ones.\n\nThe Mallows model is a distance-based model, which defines the probability of a permutation according to its distance to a location permutation. Because many permutation distances are applicable, the Mallows model has very rich expressiveness, and can therefore potentially be used in many different applications. Its weakness lies in its high computational complexity. In many cases, it requires a time complexity of $O(n!)$ to compute the probability of a single permutation of $n$ objects. This is clearly intractable when we need to rank a large number of objects in real applications.\n\nThe Luce model is a stagewise model, which decomposes the process of generating a permutation of $n$ objects into $n$ sequential stages. At the $k$-th stage, an object is selected and assigned to position $k$ according to a probability based on the scores of the unassigned objects. The product of the selection probabilities at all the stages defines the probability of the permutation. The Luce model is highly efficient (with a polynomial time complexity) due to the decomposition. The expressiveness of the Luce model, however, is limited because it is defined on the scores of individual objects and cannot leverage versatile distance measures between permutations.\n\nIn this paper, we propose a new probabilistic model on permutations, which inherits the advantages of both the Luce model and the Mallows model and avoids their limitations. We refer to the model as the coset-permutation distance based stagewise (CPS) model. Unlike the Mallows model, the CPS model is a stagewise model. It decomposes the generative process of a permutation $\\pi$ into sequential stages, which makes efficient computation possible. At the $k$-th stage, an object is selected and assigned to position $k$ with a certain probability. 
Unlike the Luce model, the CPS model defines the selection probability based on the distance between a location permutation $\\sigma$ and the right coset of $\\pi$ (referred to as the coset-permutation distance) at each stage. In this sense, it is also a distance-based model. Because many different permutation distances can be used to induce the coset-permutation distance, the CPS model also has rich expressiveness. Furthermore, the coset-permutation distances induced by many popular permutation distances can be computed with polynomial time complexity, which further ensures the efficiency of the CPS model.\n\nWe then apply the CPS model to supervised rank aggregation and derive corresponding algorithms for learning and inference of the model. Experiments on public datasets show that the CPS model based algorithms can achieve state-of-the-art ranking accuracy, and are much more efficient than baseline methods based on previous probabilistic models.\n\n2 Background\n\n2.1 Rank Aggregation\n\nThere are mainly two kinds of rank aggregation, i.e., score-based rank aggregation [17, 16] and order-based rank aggregation [2, 7, 3]. In the former, objects in the input rankings are associated with scores, while in the latter, only the order information of these objects is available. In this work, we focus on order-based rank aggregation, because it is more popular in real applications [7], and score-based rank aggregation can be easily converted to order-based rank aggregation by ignoring the additional score information [7].\n\nEarly methods for rank aggregation are heuristic based. For example, BordaCount [2, 7] and median rank aggregation [8] are simply based on average rank positions or the number of pairwise wins. In the recent literature, probabilistic models on permutations, such as the Mallows model and the Luce model, have been introduced to solve the problem of rank aggregation. 
Previous studies have shown that probabilistic model based algorithms can outperform the heuristic methods in many settings. For example, the Mallows model has been shown to be very effective in both supervised and unsupervised rank aggregation, and the effectiveness of the Luce model has been demonstrated in the context of unsupervised rank aggregation. In the next subsection, we describe these two models in more detail.\n\n2.2 Probabilistic Models on Permutations\n\nIn order to better illustrate the probabilistic models on permutations, we first introduce some concepts and notation.\n\nLet $\\{1, 2, \\ldots, n\\}$ be a set of objects to be ranked. A ranking/permutation$^1$ $\\pi$ is a bijection from $\\{1, 2, \\ldots, n\\}$ to itself. We use $\\pi(i)$ to denote the position given to object $i$ and $\\pi^{-1}(i)$ to denote the object assigned to position $i$. We usually write $\\pi$ and $\\pi^{-1}$ as vectors whose $i$-th components are $\\pi(i)$ and $\\pi^{-1}(i)$, respectively. We also use the alternative bracket notation to represent a permutation, i.e., $\\pi = \\langle \\pi^{-1}(1), \\pi^{-1}(2), \\ldots, \\pi^{-1}(n) \\rangle$.\n\n$^1$ We use the two terms interchangeably in this paper.\n\nThe collection of all permutations of $n$ objects forms a non-abelian group under composition, called the symmetric group of degree $n$, denoted as $S_n$. Let $S_{n-k}$ denote the subgroup of $S_n$ consisting of all permutations whose first $k$ positions are fixed:\n\n$$S_{n-k} = \\{\\pi \\in S_n \\mid \\pi(i) = i, \\; \\forall i = 1, \\ldots, k\\}. \\tag{1}$$\n\nThe right coset $S_{n-k}\\pi = \\{\\sigma\\pi \\mid \\sigma \\in S_{n-k}\\}$ is a subset of permutations whose top-$k$ objects are exactly the same as in $\\pi$. 
In other words,\n\n$$S_{n-k}\\pi = \\{\\sigma \\mid \\sigma \\in S_n, \\; \\sigma^{-1}(i) = \\pi^{-1}(i), \\; \\forall i = 1, \\ldots, k\\}.$$\n\nWe also use $S_{n-k}(\\langle i_1, i_2, \\ldots, i_k \\rangle)$ to denote the right coset with object $i_1$ in position 1, $i_2$ in position 2, $\\ldots$, and $i_k$ in position $k$.\n\nThe Mallows model is a distance based probabilistic model on permutations. It uses a permutation distance $d$ on the symmetric group $S_n$ to define the probability of a permutation:\n\n$$P(\\pi \\mid \\theta, \\sigma) = \\frac{1}{Z(\\theta, \\sigma)} \\exp(-\\theta d(\\pi, \\sigma)), \\tag{2}$$\n\nwhere $\\sigma \\in S_n$ is the location permutation, $\\theta \\in \\mathbb{R}$ is a dispersion parameter, and\n\n$$Z(\\theta, \\sigma) = \\sum_{\\pi \\in S_n} \\exp(-\\theta d(\\pi, \\sigma)). \\tag{3}$$\n\nThere are many well-defined metrics to measure the distance between two permutations, such as Spearman\u2019s rank correlation $d_r(\\pi, \\sigma) = \\sum_{i=1}^{n} (\\pi(i) - \\sigma(i))^2$, Spearman\u2019s footrule $d_f(\\pi, \\sigma) = \\sum_{i=1}^{n} |\\pi(i) - \\sigma(i)|$, and Kendall\u2019s tau $d_t(\\pi, \\sigma) = \\sum_{i=1}^{n} \\sum_{j>i} 1_{\\{\\pi\\sigma^{-1}(i) > \\pi\\sigma^{-1}(j)\\}}$, where $1_{\\{x\\}} = 1$ if $x$ is true and 0 otherwise. One can (and sometimes should) choose different distances for different applications. In this regard, the Mallows model has rich expressiveness.\n\nNote that there are $n!$ permutations in $S_n$. The computation of $Z(\\theta, \\sigma)$ involves a sum of $n!$ terms. Although for some specific distances (such as $d_t$) there exist efficient ways for parameter estimation in the Mallows model, for many other distances (such as $d_r$ and $d_f$) there is no known efficient method to compute $Z(\\theta, \\sigma)$, and one has to pay the high computational cost of $O(n!)$ [9]. This has greatly limited the application of the Mallows model in real problems. 
Usually, one has to employ sampling methods such as MCMC to reduce the complexity [12, 11]. This, however, can hurt the effectiveness of the model.\n\nThe Luce model is a stagewise probabilistic model on permutations. It assumes that there is a (hidden) score $\\omega_i$, $i = 1, \\ldots, n$, for each individual object $i$. To generate a permutation $\\pi$, first the object $\\pi^{-1}(1)$ is assigned to position 1 with probability $\\frac{\\exp(\\omega_{\\pi^{-1}(1)})}{\\sum_{i=1}^{n} \\exp(\\omega_{\\pi^{-1}(i)})}$; second, the object $\\pi^{-1}(2)$ is assigned to position 2 with probability $\\frac{\\exp(\\omega_{\\pi^{-1}(2)})}{\\sum_{i=2}^{n} \\exp(\\omega_{\\pi^{-1}(i)})}$; the assignment continues until a complete permutation is formed. In this way, we obtain the permutation probability of $\\pi$ as follows:\n\n$$P(\\pi) = \\prod_{i=1}^{n} \\frac{\\exp(\\omega_{\\pi^{-1}(i)})}{\\sum_{j=i}^{n} \\exp(\\omega_{\\pi^{-1}(j)})}. \\tag{4}$$\n\nThe computation of the permutation probability in the Luce model is very efficient, as shown above. In fact, the corresponding complexity is polynomial in the number of objects. This is a clear advantage over the Mallows model. However, the Luce model is defined as a specific function of the scores of the objects, and therefore cannot make use of versatile permutation distances. As a result, its expressiveness is not as rich as that of the Mallows model, which may limit its applications.\n\n3 A New Probabilistic Model\n\nAs discussed in the above section, both the Mallows model and the Luce model have certain advantages and limitations. In this section, we propose a new probabilistic model on permutations, which can inherit their advantages and avoid their limitations. We call this model the coset-permutation distance based stagewise (CPS) model.\n\n3.1 The CPS Model\n\nAs indicated by the name, the CPS model is defined on the basis of the so-called coset-permutation distance. 
A coset-permutation distance is induced from a permutation distance, as shown in the following definition.\n\nDefinition 1. Given a permutation distance $d$, the coset-permutation distance $\\hat{d}$ from a coset $S_{n-k}\\pi$ to a target permutation $\\sigma$ is defined as the average distance between the permutations in the coset and the target permutation:\n\n$$\\hat{d}(S_{n-k}\\pi, \\sigma) = \\frac{1}{|S_{n-k}\\pi|} \\sum_{\\tau \\in S_{n-k}\\pi} d(\\tau, \\sigma), \\tag{5}$$\n\nwhere $|S_{n-k}\\pi|$ is the number of permutations in the set $S_{n-k}\\pi$.\n\nIt is easy to verify that if the permutation distance $d$ is right invariant, then the induced coset-permutation distance $\\hat{d}$ is also right invariant.\n\nWith the concept of coset-permutation distance, given a dispersion parameter $\\theta \\in \\mathbb{R}$ and a location permutation $\\sigma \\in S_n$, we can define the CPS model as follows. Specifically, the generative process of a permutation $\\pi$ of $n$ objects is decomposed into $n$ sequential stages. As an initialization, all the objects are placed in a working set. At the $k$-th stage, the task is to select the $k$-th object in the original permutation $\\pi$ out of the working set. The probability of this selection is defined with the coset-permutation distance between the right coset $S_{n-k}\\pi$ and the location permutation $\\sigma$:\n\n$$\\frac{\\exp(-\\theta \\hat{d}(S_{n-k}\\pi, \\sigma))}{\\sum_{j=k}^{n} \\exp(-\\theta \\hat{d}(S_{n-k}(\\pi, k, j), \\sigma))}, \\tag{6}$$\n\nwhere $S_{n-k}(\\pi, k, j)$ denotes the right coset including all the permutations that rank objects $\\pi^{-1}(1), \\ldots, \\pi^{-1}(k-1)$, and $\\pi^{-1}(j)$ in the top $k$ positions respectively.\n\nFrom Eq. (6), we can see that the closer the coset $S_{n-k}\\pi$ is to the location permutation $\\sigma$, the larger the selection probability is. 
Considering all the $n$ stages, we obtain the overall probability of generating $\\pi$, which is shown in the following definition.\n\nDefinition 2. The CPS model defines the probability of a permutation $\\pi$ conditioned on a dispersion parameter $\\theta$ and a location permutation $\\sigma$ as:\n\n$$P(\\pi \\mid \\theta, \\sigma) = \\prod_{k=1}^{n} \\frac{\\exp(-\\theta \\hat{d}(S_{n-k}\\pi, \\sigma))}{\\sum_{j=k}^{n} \\exp(-\\theta \\hat{d}(S_{n-k}(\\pi, k, j), \\sigma))}, \\tag{7}$$\n\nwhere $S_{n-k}(\\pi, k, j)$ is defined in the sentence after Eq. (6).\n\nIt is easy to verify that the probabilities $P(\\pi \\mid \\theta, \\sigma)$, $\\pi \\in S_n$, defined in the CPS model naturally form a distribution over $S_n$. That is, for each $\\pi \\in S_n$, we always have $P(\\pi \\mid \\theta, \\sigma) \\ge 0$, and $\\sum_{\\pi \\in S_n} P(\\pi \\mid \\theta, \\sigma) = 1$.\n\nIn rank aggregation, one usually needs to combine multiple input rankings. To deal with this scenario, we further extend the CPS model, following the methodology used in [12]:\n\n$$P(\\pi \\mid \\boldsymbol{\\theta}, \\boldsymbol{\\sigma}) = \\prod_{i=1}^{n} \\frac{\\exp\\left(-\\sum_{m=1}^{M} \\theta_m \\hat{d}(S_{n-i}\\pi, \\sigma_m)\\right)}{\\sum_{j=i}^{n} \\exp\\left(-\\sum_{m=1}^{M} \\theta_m \\hat{d}(S_{n-i}(\\pi, i, j), \\sigma_m)\\right)}, \\tag{8}$$\n\nwhere $\\boldsymbol{\\theta} = \\{\\theta_1, \\ldots, \\theta_M\\}$ and $\\boldsymbol{\\sigma} = \\{\\sigma_1, \\ldots, \\sigma_M\\}$.\n\nThe CPS model defined as above can be computed in a highly efficient manner, as discussed in the following subsection.\n\n3.2 Computational Complexity\n\nAccording to the definition of the CPS model, at the $k$-th stage, one needs to compute $(n - k + 1)$ coset-permutation distances, one for each term $j = k, \\ldots, n$ in the denominator of Eq. (7). 
At first glance, the complexity of computing each coset-permutation distance is about $O((n-k)!)$, since the coset contains this number of permutations. This is clearly intractable. The good news is that the real complexity of computing the coset-permutation distance induced by several popular permutation distances is much lower than $O((n-k)!)$. In fact, it can be as low as $O(n^2)$, according to the following theorem.\n\nTheorem 1. The coset-permutation distances induced from Spearman\u2019s rank correlation $d_r$, Spearman\u2019s footrule $d_f$, and Kendall\u2019s tau $d_t$ can all be computed with a complexity of $O(n^2)$. More specifically, for $k = 1, 2, \\ldots, n-2$, we have$^2$\n\n$$\\hat{d}_r(S_{n-k}\\pi, \\sigma) = \\sum_{i=1}^{k} (\\sigma(\\pi^{-1}(i)) - i)^2 + \\frac{1}{n-k} \\sum_{i=k+1}^{n} \\sum_{j=k+1}^{n} (\\sigma(\\pi^{-1}(i)) - j)^2, \\tag{9}$$\n\n$$\\hat{d}_f(S_{n-k}\\pi, \\sigma) = \\sum_{i=1}^{k} |\\sigma(\\pi^{-1}(i)) - i| + \\frac{1}{n-k} \\sum_{i=k+1}^{n} \\sum_{j=k+1}^{n} |\\sigma(\\pi^{-1}(i)) - j|, \\tag{10}$$\n\n$$\\hat{d}_t(S_{n-k}\\pi, \\sigma) = \\frac{1}{4}(n-k)(n-k-1) + \\sum_{i=1}^{k} \\sum_{j=i+1}^{n} 1_{\\{\\sigma(\\pi^{-1}(i)) > \\sigma(\\pi^{-1}(j))\\}}. \\tag{11}$$\n\nAccording to the above theorem, each induced coset-permutation distance can be computed with a time complexity of $O(n^2)$. If we compute the CPS model according to Eq. (7), the time complexity will then be $O(n^4)$. This is clearly much more efficient than $O((n-k)!)$. Moreover, with careful implementation, the time complexity of $O(n^4)$ can be further reduced to $O(n^2)$, as indicated by the following theorem.\n\nTheorem 2. For the coset distances induced from $d_r$, $d_f$ and $d_t$, the CPS model in Eq. 
(7) can be computed with a time complexity of $O(n^2)$.\n\n3.3 Relationship with Previous Models\n\nThe CPS model as defined above has strong connections with both the Luce model and the Mallows model, as shown below.\n\nThe similarity between the CPS model and the Luce model is that they are both defined in a stagewise manner. This stagewise definition enables efficient inference for both models. The difference between them is that the CPS model has much richer expressiveness than the Luce model. This is mainly because the CPS model is a distance based model while the Luce model is not. Our experiments in Section 5 show that different distances may be appropriate for different applications and datasets, which means a model with rich expressiveness has the potential to be applied in versatile applications.\n\nThe similarity between the CPS model and the Mallows model is that they are both based on distances. In fact, when the coset-permutation distance in the CPS model is induced by Kendall\u2019s tau $d_t$, the CPS model is mathematically equivalent to the Mallows model defined with $d_t$. The major difference between the CPS model and the Mallows model lies in the computational efficiency. The CPS model can be computed efficiently with a polynomial time complexity, as discussed in the previous subsection. However, for most permutation distances, the complexity of the Mallows model is as high as $O(n!)$.$^3$\n\nAccording to the above discussion, we can see that the CPS model inherits the advantages of both the Luce model and the Mallows model, and avoids their limitations.\n\n4 Algorithms for Rank Aggregation\n\nIn this section, we show how to apply the extended CPS model to solve the problem of rank aggregation. 
Here we take meta search as an example, and consider the supervised case of rank aggregation. That is, given a set of training queries, we need to learn the parameters $\\boldsymbol{\\theta}$ in the CPS model and apply the model with the learned parameters to aggregate rankings for new test queries.\n\n$^2$ Note that $\\hat{d}(S_{n-k}\\pi, \\sigma) = d(\\pi, \\sigma)$ for $k = n-1, n$.\n\n$^3$ An exception is that for Kendall\u2019s tau distance, the Mallows model can be as efficient as the CPS model because they are mathematically equivalent.\n\nAlgorithm 1 Sequential inference\n\nInput: parameters $\\boldsymbol{\\theta}$, input rankings $\\boldsymbol{\\sigma}$\nInference:\n1: Initialize the set of $n$ objects: $D = \\{1, 2, \\ldots, n\\}$.\n2: $\\pi^{-1}(1) = \\arg\\min_{j \\in D} \\sum_{m} \\theta_m \\hat{d}(S_{n-1}(\\langle j \\rangle), \\sigma_m)$.\n3: Remove object $\\pi^{-1}(1)$ from set $D$.\n4: for $k = 2$ to $n$\n   (4.1): $\\pi^{-1}(k) = \\arg\\min_{j \\in D} \\sum_{m} \\theta_m \\hat{d}(S_{n-k}(\\langle \\pi^{-1}(1), \\ldots, \\pi^{-1}(k-1), j \\rangle), \\sigma_m)$;\n   (4.2): Remove object $\\pi^{-1}(k)$ from set $D$.\n5: end\nOutput: the final ranking $\\pi$.\n\n4.1 Learning\n\nLet $D = \\{(\\pi^{(l)}, \\boldsymbol{\\sigma}^{(l)})\\}$ be the set of training queries, in which $\\pi^{(l)}$ is the ground truth ranking for query $q_l$, and $\\boldsymbol{\\sigma}^{(l)}$ is the set of $M$ input rankings.\n\nIn order to learn the parameters $\\boldsymbol{\\theta}$ in Eq. (8), we employ maximum likelihood estimation. 
Specifically, the log likelihood of the training data for the CPS model can be written as below:\n\n$$L(\\boldsymbol{\\theta}) = \\log \\prod_{l} P(\\pi^{(l)} \\mid \\boldsymbol{\\theta}, \\boldsymbol{\\sigma}^{(l)}) = \\sum_{l} \\log P(\\pi^{(l)} \\mid \\boldsymbol{\\theta}, \\boldsymbol{\\sigma}^{(l)}) = \\sum_{l} \\sum_{k=1}^{n} \\left\\{ -\\sum_{m=1}^{M} \\theta_m \\hat{d}(S_{n-k}\\pi^{(l)}, \\sigma_m^{(l)}) - \\log \\sum_{j=k}^{n} \\exp\\left( -\\sum_{m=1}^{M} \\theta_m \\hat{d}(S_{n-k}(\\pi^{(l)}, k, j), \\sigma_m^{(l)}) \\right) \\right\\}. \\tag{12}$$\n\nIt is not difficult to prove that $L(\\boldsymbol{\\theta})$ is concave with respect to $\\boldsymbol{\\theta}$: the first term inside the braces is linear in $\\boldsymbol{\\theta}$, and the second is the negative of a log-sum-exp of linear functions. Therefore, we can use simple optimization techniques like gradient ascent to find the globally optimal $\\boldsymbol{\\theta}$.\n\n4.2 Inference\n\nIn the test phase, given a new query and its associated $M$ input rankings, we need to infer a final ranking with the learned parameters $\\boldsymbol{\\theta}$.\n\nA straightforward method is to find the permutation with the largest probability conditioned on the $M$ input rankings, just as in the widely-used inference algorithm for the Mallows model [12]. We call this method global inference since it finds the globally most likely permutation among all possible permutations.\n\nThe problem with global inference is that its complexity is as high as $O(n!)$. As a consequence, it cannot handle applications with a large number of objects to rank. Considering the stagewise definition of the CPS model, we propose a sequential inference algorithm. The algorithm decomposes the inference into $n$ steps. At the $k$-th step, we select the object $j$ that minimizes the weighted coset-permutation distance $\\sum_{m} \\theta_m \\hat{d}(S_{n-k}(\\langle \\pi^{-1}(1), \\ldots, \\pi^{-1}(k-1), j \\rangle), \\sigma_m)$, and put it at the $k$-th position. The procedure is listed in Algorithm 1.\n\nIn fact, sequential inference is an approximation of global inference, with a much lower complexity. Theorem 3 shows that the complexity of sequential inference is just $O(Mn^2)$. 
Our experiments in the next section indicate that such an approximation does not hurt the ranking accuracy by much, while significantly speeding up the inference process.\n\nTheorem 3. For the coset distances induced from $d_r$, $d_f$, and $d_t$, the stagewise inference shown in Algorithm 1 can be conducted with a time complexity of $O(Mn^2)$.\n\n5 Experimental Results\n\nWe have performed experiments to test the efficiency and effectiveness of the proposed CPS model.\n\n5.1 Settings\n\nWe take meta search as the target application, and use the LETOR [13] benchmark datasets in the experiments. LETOR is a public collection created for ranking research.$^4$ There are two meta search datasets in LETOR, MQ2007-agg and MQ2008-agg. In addition to using them, we also composed a smaller dataset from MQ2008-agg, referred to as MQ2008-small, by selecting queries with no more than 8 documents from the MQ2008-agg dataset. This small dataset is used to perform detailed investigations on the CPS model and other baseline models.\n\nThere are three levels of relevance labels in all the datasets: highly relevant, relevant, and irrelevant. We used NDCG [10] as the evaluation measure in our experiments. NDCG is a widely-used IR measure for multi-level relevance judgments. The larger the NDCG value, the better the aggregation accuracy.\n\nThe 5-fold cross validation strategy was adopted for all the datasets. All the results reported in this section are the average results over the five folds.\n\nFor the CPS model, we tested two inference methods: global inference (denoted as CPS-G) and sequential inference (denoted as CPS-S). For comparison, we implemented the Mallows model. When applied to supervised rank aggregation, the learning process of the Mallows model is also maximum likelihood estimation. For inference, we chose the permutation with the maximal probability as the final aggregated ranking. 
The time complexity of both learning and inference of the Mallows model with distances $d_r$ and $d_f$ is $O(n!)$. We also implemented an approximate algorithm as suggested by [12], using MCMC sampling to speed up the learning process. We refer to this approximate algorithm as MallApp. Note that the time complexity of the inference of MallApp is still $O(n!)$ for distances $d_r$ and $d_f$. Furthermore, as a reference, we also tested a traditional method, BordaCount [1], which is based on majority voting. We did not compare with the Luce model because, as far as we know, it is not straightforward to apply it to supervised rank aggregation.\n\nNote that Mallows, MallApp and CPS-G cannot handle the large datasets MQ2007-agg and MQ2008-agg, and were only tested on the small dataset MQ2008-small.\n\n5.2 Results\n\nFirst, we report the results of these algorithms on the MQ2008-small dataset.\n\nThe aggregation accuracies in terms of NDCG are listed in Table 1(a). Note that the accuracy of Mallows($d_t$) is the same as that of CPS-G($d_t$) because of the mathematical equivalence of the two models. Therefore, we omit Mallows($d_t$) from the table. We did not implement the sampling-based learning algorithm for the Mallows model with distance $d_t$, because in this case the learning algorithm is already efficient enough.\n\nFrom the table, we have the following observations.\n\n\u2022 For the Mallows model, exact learning is a little better than approximate learning, especially for distance $d_f$. This is in accordance with our intuition. Sampling can improve the efficiency of the algorithm, but may also miss some information contained in the original permutation probability space.\n\u2022 For the CPS model, sequential inference does not lead to much of an accuracy drop as compared to global inference. For distances $d_f$ and $d_r$, the CPS model outperforms the Mallows model. 
For example, when $d_f$ is used, the CPS model outperforms the Mallows model by about 0.04 in terms of NDCG@2, which corresponds to a relative improvement of roughly 10%.\n\u2022 For the same model, with different distance functions, the performances differ significantly. This indicates that one should select the most suitable distance for a given application.\n\u2022 All the probabilistic model based methods are better than BordaCount, the heuristic method.\n\nTable 1: Results\n\n(a) Results on MQ2008-small\n\n| NDCG | @2 | @4 | @6 | @8 |\n|---|---|---|---|---|\n| BordaCount | 0.335 | 0.421 | 0.479 | 0.420 |\n| CPS-G($d_f$) | 0.392 | 0.471 | 0.518 | 0.446 |\n| CPS-S($d_f$) | 0.389 | 0.471 | 0.517 | 0.444 |\n| Mallows($d_f$) | 0.350 | 0.449 | 0.490 | 0.422 |\n| MallApp($d_f$) | 0.343 | 0.440 | 0.491 | 0.420 |\n| CPS-G($d_r$) | 0.387 | 0.476 | 0.519 | 0.443 |\n| CPS-S($d_r$) | 0.388 | 0.478 | 0.519 | 0.441 |\n| Mallows($d_r$) | 0.333 | 0.442 | 0.491 | 0.420 |\n| MallApp($d_r$) | 0.343 | 0.440 | 0.490 | 0.419 |\n| CPS-G($d_t$) | 0.414 | 0.485 | 0.530 | 0.451 |\n| CPS-S($d_t$) | 0.419 | 0.489 | 0.534 | 0.454 |\n\n(b) Results on MQ2008-agg and MQ2007-agg\n\n| NDCG on MQ2008-agg | @2 | @4 | @6 | @8 |\n|---|---|---|---|---|\n| BordaCount | 0.281 | 0.343 | 0.389 | 0.372 |\n| CPS-S($d_t$) | 0.312 | 0.379 | 0.420 | 0.403 |\n| CPS-S($d_r$) | 0.314 | 0.376 | 0.419 | 0.398 |\n| CPS-S($d_f$) | 0.276 | 0.352 | 0.399 | 0.383 |\n\n| NDCG on MQ2007-agg | @2 | @4 | @6 | @8 |\n|---|---|---|---|---|\n| BordaCount | 0.201 | 0.213 | 0.225 | 0.238 |\n| CPS-S($d_t$) | 0.298 | 0.311 | 0.322 | 0.335 |\n| CPS-S($d_r$) | 0.332 | 0.341 | 0.352 | 0.362 |\n| CPS-S($d_f$) | 0.298 | 0.312 | 0.323 | 0.336 |\n\n$^4$ The datasets can be downloaded from http://research.microsoft.com/~letor.\n\nIn addition to the comparison of aggregation accuracy, we have also logged the running time of each model. For example, on our test machine (with a 2.13GHz CPU and 4GB memory), it took about 12 seconds for CPS-G($d_f$),$^5$ 30 seconds for MallApp($d_f$), and 12 hours for Mallows($d_f$) to finish the training process. 
The inference of the Mallows model based algorithms and the global inference of the CPS model based algorithms took more time than the sequential inference of the CPS model, although the difference was not significant (this is mainly because $n \\le 8$ for MQ2008-small). From these results, we can see that the proposed CPS model with sequential inference is the most efficient one, and its accuracy is also very good compared to the other methods.\n\nSecond, we report the results on MQ2008-agg and MQ2007-agg in Table 1(b). Note that the results of the Mallows model based algorithms and those of the CPS model with global inference are not available because of the high computational complexity of their learning or inference. The results show that the CPS model with sequential inference outperforms BordaCount, no matter which distance is used. Moreover, the CPS model with $d_t$ performs the best on MQ2008-agg, and the model with $d_r$ performs the best on MQ2007-agg. This indicates that we can achieve good ranking performance by choosing the most suitable distances for different datasets (and thus applications). This provides supporting evidence that it is beneficial for a probabilistic model on permutations to have rich expressiveness.\n\nTo sum up, the experimental results indicate that the CPS model based learning and sequential inference algorithms can achieve state-of-the-art ranking accuracy and are more efficient than other algorithms.\n\n6 Conclusions and Future Work\n\nIn this paper, we have proposed a new probabilistic model on permutations for rank aggregation, named the CPS model. The model is based on the coset-permutation distance and defined in a stagewise manner. It inherits the advantages of the Luce model (high efficiency) and the Mallows model (rich expressiveness), and avoids their limitations. We have applied the model to supervised rank aggregation and investigated how to perform learning and inference. 
Experiments on public datasets demonstrate the effectiveness and efficiency of the CPS model.\n\nAs future work, we plan to investigate the following issues. (1) We have shown that three induced coset-permutation distances can be computed efficiently. We will explore whether other distances also have such properties. (2) We have applied the CPS model to the supervised case of rank aggregation. We will study the unsupervised case. (3) We will investigate other applications of the model, and discuss how to select the most suitable distance for a given application.\n\n$^5$ The training process of CPS-G and CPS-S is exactly the same.\n\nReferences\n\n[1] J. Aslam and M. Montague. Models for metasearch. In Proceedings of the 24th SIGIR, pages 276\u2013284, 2001.\n\n[2] J. A. Aslam and M. Montague. Models for metasearch. In SIGIR \u201901: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 276\u2013284, New York, NY, USA, 2001. ACM.\n\n[3] M. Beg. Parallel Rank Aggregation for the World Wide Web. World Wide Web. Kluwer Academic Publishers, 6(1):5\u201322, 2004.\n\n[4] D. Critchlow. Metric methods for analyzing partially ranked data. 1980.\n\n[5] H. Daniels. Rank correlation and population models. Journal of the Royal Statistical Society. Series B (Methodological), pages 171\u2013191, 1950.\n\n[6] P. Diaconis. Group representations in probability and statistics. Institute of Mathematical Statistics, Hayward, CA, 1988.\n\n[7] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW \u201901: Proceedings of the 10th international conference on World Wide Web, pages 613\u2013622, New York, NY, USA, 2001. ACM.\n\n[8] R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. 
In SIGMOD \u201903: Proceedings of the 2003 ACM SIGMOD international conference\non Management of data, pages 301\u2013312, New York, NY, USA, 2003. ACM.\n\n[9] M. Fligner and J. Verducci. Distance based ranking models. Journal of the Royal Statistical\nSociety. Series B (Methodological), 48(3):359\u2013369, 1986.\n\n[10] K. J\u00e4rvelin and J. Kek\u00e4l\u00e4inen. Cumulated gain-based evaluation of IR techniques. ACM Trans.\nInf. Syst., 20(4):422\u2013446, 2002.\n\n[11] A. Klementiev, D. Roth, and K. Small. Unsupervised rank aggregation with distance-based\nmodels. In Proceedings of the 25th ICML, pages 472\u2013479, 2008.\n\n[12] G. Lebanon and J. Lafferty. Cranking: Combining rankings using conditional probability\nmodels on permutations. In ICML2002, pages 363\u2013370, 2002.\n\n[13] T. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning\nto rank for information retrieval. In SIGIR2007 Workshop on Learning to Rank for Information\nRetrieval, pages 3\u201310, 2007.\n\n[14] R. D. Luce. Individual Choice Behavior. Wiley, 1959.\n\n[15] C. L. Mallows. Non-null ranking models. Biometrika, 44:114\u2013130, 1957.\n\n[16] R. Manmatha, T. Rath, and F. Feng. Modeling score distributions for combining the outputs\nof search engines. In SIGIR \u201901: Proceedings of the 24th annual international ACM SIGIR\nconference on Research and development in information retrieval, pages 267\u2013275, New York,\nNY, USA, 2001. ACM.\n\n[17] M. Montague and J. A. Aslam. Relevance score normalization for metasearch. In CIKM \u201901:\nProceedings of the tenth international conference on Information and knowledge management,\npages 427\u2013433, New York, NY, USA, 2001. ACM.\n\n[18] R. L. Plackett. The analysis of permutations. Applied Statistics, 24(2):193\u2013202, 1975.\n\n[19] L. Thurstone. A law of comparative judgment. 
Psychological Review, 34(4):273\u2013286, 1927.", "award": [], "sourceid": 354, "authors": [{"given_name": "Tao", "family_name": "Qin", "institution": null}, {"given_name": "Xiubo", "family_name": "Geng", "institution": null}, {"given_name": "Tie-yan", "family_name": "Liu", "institution": null}]}