{"title": "McRank: Learning to Rank Using Multiple Classification and Gradient Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 897, "page_last": 904, "abstract": null, "full_text": "McRank: Learning to Rank Using Multiple\n\nClassi\ufb01cation and Gradient Boosting\n\nPing Li \u2217\n\nDept. of Statistical Science\n\nCornell University\n\nChristopher J.C. Burges\n\nMicrosoft Research\n\nMicrosoft Corporation\n\nQiang Wu\n\nMicrosoft Research\n\nMicrosoft Corporation\n\npingli@cornell.edu\n\ncburges@microsoft.com\n\nqiangwu@microsoft.com\n\nAbstract\n\nWe cast the ranking problem as (1) multiple classi\ufb01cation (\u201cMc\u201d) (2) multiple or-\ndinal classi\ufb01cation, which lead to computationally tractable learning algorithms\nfor relevance ranking in Web search. We consider the DCG criterion (discounted\ncumulative gain), a standard quality measure in information retrieval. Our ap-\nproach is motivated by the fact that perfect classi\ufb01cations result in perfect DCG\nscores and the DCG errors are bounded by classi\ufb01cation errors. We propose us-\ning the Expected Relevance to convert class probabilities into ranking scores. The\nclass probabilities are learned using a gradient boosting tree algorithm. Evalua-\ntions on large-scale datasets show that our approach can improve LambdaRank [5]\nand the regressions-based ranker [6], in terms of the (normalized) DCG scores. An\nef\ufb01cient implementation of the boosting tree algorithm is also presented.\n\n1 Introduction\nThe general ranking problem has widespread applications including commercial search engines and\nrecommender systems. We develop McRank, a computationally tractable learning algorithm for the\ngeneral ranking problem; and we present our approach in the context of ranking in Web search.\n\nFor a given user input query, a commercial search engine returns many pages of URLs, in an order\ndetermined by the underlying proprietary ranking algorithm. 
The quality of the returned results is largely evaluated on the URLs displayed in the very first page. The type of ranking problem in this study is sometimes referred to as dynamic ranking (or simply, just ranking), because the URLs are dynamically ranked (in real-time) according to the specific user input query. This is different from the query-independent static ranking based on, for example, “page rank” [3] or “authorities and hubs” [12], which may, at least conceptually, serve as an important “feature” for dynamic ranking or to guide the generation of a list of URLs fed to the dynamic ranker.

There are two main categories of ranking algorithms. A popular scheme is based on learning pairwise preferences, including RankNet [4], LambdaRank [5], RankSVM [11], RankBoost [7], GBRank [14], and FRank [13]. Both LambdaRank and RankNet used neural nets.¹ RankNet used a cross-entropy type of loss function and LambdaRank used a gradient based on NDCG smoothed by the RankNet loss. Another scheme is based on regression [6]. [6] considered the DCG measure (discounted cumulative gain) [10] and showed that the DCG errors are bounded by regression errors.

In this study, we also consider the DCG measure. From the definition of DCG, it appears more direct to cast the ranking problem as multiple classification (“Mc”) as opposed to regression. In order to convert classification results into ranking scores, we propose a simple and stable mechanism by using the Expected Relevance. Our evaluations on large-scale datasets demonstrate the superiority of the classification-based ranker (McRank) over both the regression-based and pair-based schemes.

2 Discounted Cumulative Gain (DCG)
For an input query, the ranker returns n ordered URLs. Suppose the URLs fed to the ranker are originally ordered {1, 2, 3, ..., n}.
The ranker will output a permutation mapping π : {1, 2, 3, ..., n} → {1, 2, 3, ..., n}. We denote the inverse mapping by σ_i = σ(i) = π^{−1}(i). The DCG score is computed from the relevance levels of the n URLs as

    DCG = Σ_{i=1}^{n} c_{[i]} (2^{y_{σ_i}} − 1) = Σ_{i=1}^{n} c_{[π_i]} (2^{y_i} − 1),   (1)

where [i] is the rank order, and y_i ∈ {0, 1, 2, 3, 4} is the relevance level of the ith URL in the original (pre-ranked) order. y_i = 4 corresponds to a “perfect” relevance and y_i = 0 corresponds to a “poor” relevance. For generating training datasets, human judges have manually labeled a large number of queries and URLs. In this study, we assume these labels are “gold-standard.”

∗ Much of the work was conducted while Ping Li was an intern at Microsoft in 2006.
¹ In fact LambdaRank supports any preference function, although the reported results in [5] are for pairwise.

In the definition of DCG, c_{[i]}, which is a non-increasing function of i, is typically set as

    c_{[i]} = 1 / log(1 + i), if i ≤ L,   and   c_{[i]} = 0, if i > L,   (2)

where L is the “truncation level” and is typically set to be L = 10, to reflect the fact that the search quality of commercial search engines is mainly determined by the URLs displayed in the first page.

Suppose a dataset contains N_Q queries. It is a common practice to normalize the DCG score for each query and report the normalized DCG (“NDCG”) score averaged over all queries.
In other words, the NDCG for the jth query (NDCG_j) and the final NDCG of the dataset (NDCG_F) are

    NDCG_j = DCG_j / DCG_{j,g},    NDCG_F = (1 / N_Q) Σ_{j=1}^{N_Q} NDCG_j,   (3)

where DCG_{j,g} is the maximum possible (or “gold standard”) DCG score of the jth query.

3 Learning to Rank Using Classification
The definition of DCG suggests that we can cast the ranking problem naturally as multiple classification (i.e., K = 5 classes), because obviously perfect classifications will lead to perfect DCG scores. While the DCG criterion is non-convex and non-smooth, classification is very well-studied.

We should mention that one does not really need perfect classifications in order to produce perfect DCG scores. For example, suppose within a query, the URLs are all labeled level 1 or higher. If an algorithm always classifies the URLs one level lower (i.e., URLs labeled level 4 are classified as level 3, and so on), we still have the perfect DCG score but the classification “error” is 100%. This phenomenon, to an extent, may provide some “safety cushion” for casting ranking as classification.

[6] cast ranking as regression and showed that the DCG errors are bounded by regression errors. It appears to us that the regression-based approach is less direct and possibly also less accurate than our classification-based proposal. For example, it is well-known that, although one can use regression for classification, it is often better to use logistic regression, especially for multiple classification [8].

3.1 Bounding DCG Errors by Classification Errors
Following [6, Theorem 2], we show that the DCG errors can be bounded by classification errors. For a permutation mapping π, the error is DCG_g − DCG_π. One simple way to obtain the perfect DCG_g is to rank the URLs directly according to the gold-standard relevance levels.
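As a side illustration (not from the paper), Eq. (1)–(3), with DCG_g obtained by sorting on the gold labels as just described, can be sketched in Python. The log base is not specified in Eq. (2); it cancels in the NDCG ratio, so natural log is used here:

```python
import math

def dcg(relevances_in_rank_order, L=10):
    """DCG per Eq. (1)-(2): gain 2^y - 1, discount 1/log(1+i), truncated at L."""
    return sum((2 ** y - 1) / math.log(1 + i)
               for i, y in enumerate(relevances_in_rank_order[:L], start=1))

def ndcg(relevances_in_rank_order, L=10):
    """NDCG per Eq. (3): normalize by the 'gold standard' DCG, obtained by
    ranking the same labels in descending order."""
    ideal = dcg(sorted(relevances_in_rank_order, reverse=True), L)
    return dcg(relevances_in_rank_order, L) / ideal if ideal > 0 else 0.0

# A ranker that returns labels in the order [2, 4, 0] vs. the perfect [4, 2, 0]:
print(round(ndcg([2, 4, 0]), 4))  # → 0.7378
```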
That is, all URLs with relevance level k + 1 are ranked higher than those with relevance level ≤ k; and the URLs with the same relevance levels are arbitrarily ranked without affecting DCG_g. We denote the corresponding permutation mapping also by g.

Lemma 1 Given n URLs, originally ordered as {1, 2, 3, ..., n}. Suppose a classifier assigns a relevance level ŷ_i ∈ {0, 1, 2, 3, 4} to the ith URL, for all n URLs. A permutation mapping π ranks the URLs according to ŷ_i, i.e., π(i) < π(j) if ŷ_i > ŷ_j, and URL i and URL j are arbitrarily ranked if ŷ_i = ŷ_j. The corresponding DCG error is bounded by the square root of the classification error,

    DCG_g − DCG_π ≤ 15 √2 ( Σ_{i=1}^{n} c_{[i]}^2 − n Π_{i=1}^{n} c_{[i]}^{2/n} )^{1/2} ( Σ_{i=1}^{n} 1_{y_i ≠ ŷ_i} )^{1/2}.   (4)

Proof:  Since π ranks the URLs by ŷ_i, it maximizes the DCG computed from the predicted labels; in particular, Σ_{i=1}^{n} c_{[π_i]} (2^{ŷ_i} − 1) ≥ Σ_{i=1}^{n} c_{[g_i]} (2^{ŷ_i} − 1). Therefore,

    DCG_π = Σ_{i=1}^{n} c_{[π_i]} (2^{y_i} − 1)
          = Σ_{i=1}^{n} c_{[π_i]} (2^{ŷ_i} − 1) + Σ_{i=1}^{n} c_{[π_i]} (2^{y_i} − 2^{ŷ_i})
          ≥ Σ_{i=1}^{n} c_{[g_i]} (2^{ŷ_i} − 1) + Σ_{i=1}^{n} c_{[π_i]} (2^{y_i} − 2^{ŷ_i})
          = Σ_{i=1}^{n} c_{[g_i]} (2^{y_i} − 1) + Σ_{i=1}^{n} c_{[g_i]} (2^{ŷ_i} − 2^{y_i}) + Σ_{i=1}^{n} c_{[π_i]} (2^{y_i} − 2^{ŷ_i})
          = DCG_g + Σ_{i=1}^{n} (c_{[π_i]} − c_{[g_i]}) (2^{y_i} − 2^{ŷ_i}).

Therefore,

    DCG_g − DCG_π ≤ Σ_{i=1}^{n} (c_{[g_i]} − c_{[π_i]}) (2^{y_i} − 2^{ŷ_i})
                  ≤ ( Σ_{i=1}^{n} (c_{[g_i]} − c_{[π_i]})^2 )^{1/2} ( Σ_{i=1}^{n} (2^{y_i} − 2^{ŷ_i})^2 )^{1/2}
                  ≤ ( 2 Σ_{i=1}^{n} c_{[i]}^2 − 2n Π_{i=1}^{n} c_{[i]}^{2/n} )^{1/2} · 15 ( Σ_{i=1}^{n} 1_{y_i ≠ ŷ_i} )^{1/2},

where the second inequality is Cauchy–Schwarz. For the last step, note that Σ_{i=1}^{n} c_{[g_i]}^2 = Σ_{i=1}^{n} c_{[π_i]}^2 = Σ_{i=1}^{n} c_{[i]}^2 and, by the AM–GM inequality, Σ_{i=1}^{n} c_{[g_i]} c_{[π_i]} ≥ n Π_{i=1}^{n} (c_{[g_i]} c_{[π_i]})^{1/n} = n Π_{i=1}^{n} c_{[i]}^{2/n}; also |2^{y_i} − 2^{ŷ_i}| ≤ 2^4 − 2^0 = 15 whenever y_i ≠ ŷ_i. Collecting constants yields the bound (4).

Thus, we can minimize the classification error Σ_{i=1}^{n} 1_{y_i ≠ ŷ_i} as a surrogate for minimizing the DCG error. Of course, since the classification error itself is non-convex and non-smooth, we actually should use other (well-known) surrogate loss functions such as (7).

3.2 Input Data for Classification
A training dataset contains N_Q queries. The jth query corresponds to n_j URLs; each URL is manually labeled by one of the K = 5 relevance levels. Engineers have developed methodologies to construct “features” by combining the query and URLs, but the details are usually “trade secret.”

One important aspect in designing features, at least for the convenience of using traditional machine learning algorithms, is that these features should be comparable across queries. For example, one (artificial) feature could be the number of times the query appears in the Web page, which is comparable across queries.
Both pair-based rankers and regression-based rankers implicitly made this assumption, as they tried to learn a single rank function for all queries using the same set of features.

Thus, after we have generated feature vectors by combining the queries and URLs, we can create a “training data matrix” of size N × P, where N = Σ_{j=1}^{N_Q} n_j is the total number of “data points” (i.e., Query+URL) and P is the total number of features. This way, we can use the traditional machine learning notation {y_i, x_i}_{i=1}^{N} to denote the training dataset. Here x_i ∈ R^P is the ith feature vector in P dimensions; and y_i ∈ {0, 1, 2, 3, 4 = K − 1} is the class (relevance) label of the ith data point.

3.3 From Classification to Ranking
Although perfect classifications lead to perfect DCG scores, in reality, we will need a mechanism to convert (imperfect) classification results into ranking scores.

One possibility is already mentioned in Lemma 1. That is, we classify each data point into one of the K = 5 classes and rank the data points according to the class labels (data points with the same labels are arbitrarily ranked). This suggestion, however, will lead to highly unstable ranking results.

Our proposed solution is very simple. We first learn the class probabilities by some soft classification algorithm and then score each data point (query+URL) according to the Expected Relevance.

Recall we assume a training dataset {y_i, x_i}_{i=1}^{N}, where the class label y_i ∈ {0, 1, 2, 3, 4 = K − 1}. We learn the class probabilities p_{i,k} = Pr(y_i = k), denoted by p̂_{i,k}, and define a scoring function:

    S_i = Σ_{k=0}^{K−1} p̂_{i,k} T(k),   (5)

where T(k) is some monotone (increasing) function of the relevance level k. Once we have computed the scores S_i for all data points, we can then sort the data points within each query by the descending order of S_i.
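A minimal sketch of this scoring step (Eq. (5) with T(k) = k), assuming the class probabilities have already been learned by some soft classifier:

```python
def expected_relevance(class_probs):
    """Eq. (5) with T(k) = k: score = sum_k p_k * k."""
    return sum(k * p for k, p in enumerate(class_probs))

def rank_urls(prob_rows):
    """Sort the URLs of one query by descending Expected Relevance;
    returns the original URL indices in ranked order."""
    scores = [expected_relevance(p) for p in prob_rows]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Two URLs with K = 5 class probabilities each:
probs = [[0.1, 0.2, 0.4, 0.2, 0.1],   # expected relevance 2.0
         [0.0, 0.1, 0.2, 0.3, 0.4]]   # expected relevance 3.0
print(rank_urls(probs))  # → [1, 0]
```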
This approach is apparently sensible and highly stable. In fact, we experimented with both T(k) = k and T(k) = 2^k; the performance difference in terms of the NDCG scores was negligible, although T(k) = k appeared to be a slightly better choice (see Figure 3(c) in Appendix II). In this paper, the reported experimental results were based on T(k) = k.

When T(k) = k, the scoring function S_i is the Expected Relevance. Note that any monotone transformation on S_i (e.g., 2^{S_i} − 1) will not change the ranking results. Consequently, the ranking results are not affected by any affine transformation on T(k), aT(k) + b, (a > 0), because

    Σ_{k=0}^{K−1} p_{i,k} (a T(k) + b) = a ( Σ_{k=0}^{K−1} p_{i,k} T(k) ) + b,   since Σ_{k=0}^{K−1} p_{i,k} = 1.   (6)

3.4 The Boosting Tree Algorithm for Learning Class Probabilities
For multiple classification, we consider the following common (e.g., [8, 9]) surrogate loss function

    Σ_{i=1}^{N} Σ_{k=0}^{K−1} − log(p_{i,k}) 1_{y_i = k}.   (7)

Algorithm 1 implements a boosting tree algorithm for learning class probabilities p_{i,k}; and we use basically the same implementation later for regression as well as multiple ordinal classification.

Algorithm 1  The boosting tree algorithm for multiple classification, taken from [9, Algorithm 6], although the presentation is slightly different.
0: ỹ_{i,k} = 1, if y_i = k, and ỹ_{i,k} = 0 otherwise.
1: F_{i,k} = 0, k = 0 to K − 1, i = 1 to N
2: For m = 1 to M Do
3:   For k = 0 to K − 1 Do
4:     p_{i,k} = exp(F_{i,k}) / Σ_{s=0}^{K−1} exp(F_{i,s})
5:     {R_{j,k,m}}_{j=1}^{J} = J-terminal node regression tree for {ỹ_{i,k} − p_{i,k}, x_i}_{i=1}^{N}
6:     β_{j,k,m} = ((K − 1) / K) · Σ_{x_i ∈ R_{j,k,m}} (ỹ_{i,k} − p_{i,k}) / Σ_{x_i ∈ R_{j,k,m}} (1 − p_{i,k}) p_{i,k}
7:     F_{i,k} = F_{i,k} + ν Σ_{j=1}^{J} β_{j,k,m} 1_{x_i ∈ R_{j,k,m}}
8:   End
9: End

There are three main parameters.
M is the total number of boosting iterations, J is the tree size (number of terminal nodes), and ν is the shrinkage coefficient. As commented in [9] and verified in our experiments, the performance of the algorithm is not sensitive to these parameters.

In Algorithm 1, Line 5 contains most of the implementation work, i.e., building the regression trees with J terminal nodes. Appendix I describes an efficient implementation for building the trees.

4 Multiple Ordinal Classification to Further Improve Ranking

It is possible to (slightly) further improve our classification-based ranking scheme by taking into account the natural order among the class labels, i.e., by multiple ordinal classification.

A common approach for multiple ordinal classification is to learn the cumulative probabilities Pr(y_i ≤ k) instead of the class probabilities Pr(y_i = k) = p_{i,k}. We suggest a simple method similar to the so-called cumulative logits approach known in statistics [1, Section 7.2.1]. We first partition the training data points into two groups: {y_i ≥ 4} and {y_i ≤ 3}. Now we have a binary classification problem and hence we can use exactly the same boosting tree algorithm as for multiple classification. Thus we can learn Pr(y_i ≤ 3) easily. We can similarly partition the data and learn Pr(y_i ≤ 2), Pr(y_i ≤ 1), and Pr(y_i ≤ 0), separately. We then infer the class probabilities

    p_{i,k} = Pr(y_i = k) = Pr(y_i ≤ k) − Pr(y_i ≤ k − 1),   (8)

and again we use the Expected Relevance to compute the ranking scores and sort the URLs.

We refer to both rankers, based on multiple classification and on multiple ordinal classification, as McRank.

5 Regression-based Ranking Using the Boosting Tree Algorithm

With slight modifications, the boosting tree algorithm can be used for regression.
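Before moving on, the cumulative-probability conversion of Section 4 (Eq. (8)) can be sketched as follows; the K − 1 binary classifiers are assumed to be already trained, and the monotonicity clipping is our added safeguard, not part of the paper:

```python
def class_probs_from_cumulative(cum):
    """Eq. (8): p_k = Pr(y <= k) - Pr(y <= k-1), with Pr(y <= K-1) = 1.
    `cum` holds [Pr(y<=0), Pr(y<=1), ..., Pr(y<=K-2)]."""
    cum = list(cum) + [1.0]             # Pr(y <= K-1) is always 1
    # Clip to enforce monotonicity (our addition): the K-1 binary
    # classifiers are trained separately and need not be consistent.
    for k in range(1, len(cum)):
        cum[k] = max(cum[k], cum[k - 1])
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

probs = class_probs_from_cumulative([0.1, 0.3, 0.6, 0.9])
print([round(p, 2) for p in probs])  # → [0.1, 0.2, 0.3, 0.3, 0.1]
```

The resulting probabilities feed directly into the Expected Relevance scoring of Eq. (5).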
Recall the input data are {y_i, x_i}_{i=1}^{N}, where y_i ∈ {0, 1, 2, 3, 4}. [6] suggested regressing the feature vectors x_i on the response values 2^{y_i} − 1.

Algorithm 2 implements the least-square boosting tree algorithm. The pseudo code is similar to [9, Algorithm 3], with the (l1) least absolute deviation (LAD) loss replaced by the (l2) least square loss. In fact, we also implemented the LAD boosting tree algorithm, but we found the performance was considerably worse than the least-square tree boost.

Algorithm 2  The boosting tree algorithm for regression. After we have learned the values for S_i, we use them directly as the ranking scores to order the data points within each query.
0: ỹ_i = 2^{y_i} − 1
1: S_i = (1 / N) Σ_{s=1}^{N} ỹ_s, i = 1 to N
2: For m = 1 to M Do
3:   {R_{j,m}}_{j=1}^{J} = J-terminal node regression tree for {ỹ_i − S_i, x_i}_{i=1}^{N}
4:   β_{j,m} = mean_{x_i ∈ R_{j,m}} (ỹ_i − S_i)
5:   S_i = S_i + ν Σ_{j=1}^{J} β_{j,m} 1_{x_i ∈ R_{j,m}}
6: End

6 Experimental Results
We present the evaluations of 4 ranking algorithms (LambdaRank with two-layer nets, regression, multiple classification, and multiple ordinal classification) on 4 datasets, including one artificial dataset and three Web search datasets, denoted by Web-1, Web-2, and Web-3. The artificial dataset and Web-1 are the same datasets used in [5]. Web-2 is the main dataset used in [13].

For the artificial data and Web-1, [5] reported that LambdaRank improved RankNet by about 1.0 (%) NDCG. For Web-2, [13] reported that FRank slightly improved RankNet (by about 0.5 (%) NDCG) and considerably improved RankSVM and RankBoost; but [13] did not compare with LambdaRank. Our experiment showed that LambdaRank improved FRank by about 0.9 (%) NDCG on Web-2.

6.1 The Datasets
The artificial dataset [5] was meant to remove any variance caused by the quality of features and/or relevance labels.
The data were generated from random cubic polynomials, with 50 features, 50 URLs per query, and 10,000/5,000/10,000 queries for train/validation/test.

The Web search dataset Web-1 [5] has 367 features and 10,000/5,000/10,000 queries for train/validation/test, with in total 652,500 URLs.

Web-2 [13] has 619 features and 12,000/3,800/3,800 queries for train/validation/test, with in total 1,741,930 URLs. Note that this dataset is only partially labeled, with 20 unlabeled URLs per query. These unlabeled URLs were assigned the level 0 [13].

Web-3 has 450 features and 26,000 queries, with in total 474,590 URLs. We conducted five-fold cross-validations and report the average NDCG scores.

6.2 The Parameters: M, J, ν
There are three main parameters in the boosting tree algorithm. M is the total number of iterations, J is the number of terminal nodes in each tree, and ν is the shrinkage factor. Our experiments verify that the performance is not sensitive to these parameters as long as they are within some “reasonable” ranges [9]. Since these experiments are time-consuming, we did not tune the parameters (M, J, ν) exhaustively; but the experiments appear to be convincing enough to establish the superiority of McRank.

[9] suggested setting ν ≤ 0.1 to avoid over-fitting. We fix ν = 0.05 for the artificial dataset and Web-1, and fix ν = 0.02 for Web-2 and Web-3. The number of terminal nodes, J, should be reasonably big (but not too big) when the dataset is large with a large number of features, because the tree has to be deep enough to consider higher-order interactions [9]. We let J = 10 for the artificial dataset and Web-1, J = 40 for Web-2, and J = 20 for Web-3.

With these values of J and ν, we did not observe obvious over-fitting even for a very large number of boosting iterations M.
We will report the results with M = 1000 for the artificial data and Web-1, M = 2000 for Web-2, and M = 1500 for Web-3.

6.3 The Test NDCG Results at Truncation Level L = 10
Table 1 lists the NDCG results (both the mean and standard deviation, in percentages (%)) for all 4 datasets and all 4 ranking algorithms, evaluated at the truncation level L = 10. The NDCG scores indicate that McRank (ordinal classification and classification) considerably improves the regression-based ranker and LambdaRank. If we conduct a one-sided t-test, the improvements are significant at about the 98% level.

Table 1: The test NDCG scores produced by 4 rankers on 4 datasets. The average NDCG scores are presented in percentages (%) with the standard deviations in the parentheses. Note that for the artificial data and Web-1, the LambdaRank results were taken directly from [5]. We also report the (one-sided) p-values to measure the statistical significance of the improvement of McRank over regression and LambdaRank. For the artificial data, Web-1, and Web-3, we use the ordinal classification results to compute the p-values. However, for Web-2, because our implementation for testing ordinal classification required too much memory for M = 2000, we did not obtain the final test NDCG scores; the partial results indicated that ordinal classification did not improve classification for this dataset. Therefore, we compute the p-values using classification results for Web-2.

Datasets         Ordinal Classification   Classification   Regression, p-value     LambdaRank, p-value
Artificial [5]   85.0 (9.5)               83.7 (9.9)       82.9 (10.2), 0          74.9 (12.6), 0
Web-1 [5]        72.4 (24.1)              72.2 (24.1)      71.7 (24.4), 0.021      71.2 (24.5), 0.0002
Web-2 [13]       —                        75.8 (23.8)      74.7 (24.4), 0.023      74.3 (24.3), 0.003
Web-3            72.5 (26.5)              72.4 (27.3)      72.0 (27.6), 0.017      71.3 (28.8), 3.8 × 10^−7
However, multiple ordinal classification did not show significant improvement over multiple classification, except for the artificial dataset.

For the artificial data, all other 3 rankers exhibit very large improvements over LambdaRank. This is probably due to the fact that the artificial data are generated noise-free and hence the flexible (high-capacity) rankers using boosting tree algorithms tend to fit the data very well.

6.4 The NDCG Results at Various Truncation Levels (L = 1 to 10)
For the artificial dataset and Web-1, [5] also reported the NDCG scores at various truncation levels, L = 1 to 10. To make the comparisons more convincing, we also report similar results for the artificial dataset and Web-1, in Figure 1. For better clarity, we plot the standard deviations separately from the averages. Figure 1 verifies that the improvements shown in Table 1 are not only true for L = 10 but also (essentially) true for smaller truncation levels.

[Figure 1 omitted: four columns of panels (Artificial, Web-1, Web-2, Web-3), each plotting Ordinal, Classification, Regression, and LambdaRank against truncation levels 1–10; the Web-2 column has no Ordinal curve.]

Figure 1: The NDCG scores at truncation levels L = 1 to 10, for four datasets. Upper Panels: the average NDCG scores. Bottom Panels: the corresponding standard deviations.

7 Conclusion
The ranking problem has become an important topic in machine learning, partly due to its widespread applications in many decision-making processes, especially in commercial search engines. In one aspect, the ranking problem is difficult because the measures of rank quality are usually based on sorting, which is not directly optimizable (at least not efficiently). On the other hand, one can cast ranking into various classical learning tasks such as regression and classification.

The proposed classification-based ranking scheme is motivated by the fact that perfect classifications lead to perfect DCG scores and the DCG errors are bounded by the classification errors. It appears natural that the classification-based ranker is more direct and should work better than the regression-based ranker suggested in [6].
To convert classification results into ranking, we propose a simple and stable mechanism by using the Expected Relevance, computed from the learned class probabilities.

To learn the class probabilities, we implement a boosting tree algorithm for multiple classification and we use the same implementation for multiple ordinal classification and regression. Since commercial proprietary datasets are usually very large, an adaptive quantization-based approach efficiently implements the boosting tree algorithm, which avoids sorting and has lower memory cost.

Our experimental results have demonstrated that McRank (including multiple classification and multiple ordinal classification) outperforms both the regression-based ranker and the pair-based LambdaRank. However, except for the artificial dataset, we did not observe significant improvement of ordinal classification over classification.

In summary, we regard the McRank algorithm as (retrospectively) simple, robust, and capable of producing quality ranking results.

Appendix I: An Efficient Implementation for Building Boosting Trees
We use the standard regression tree algorithm [2], which recursively splits the training data points into two groups on the current “best” feature that will reduce the mean square error (MSE) the most. Efficient (in both time and memory) implementation needs some care. The standard practice [9] is to pre-sort all the features; then after every split, carefully keep track of the indexes of the data points and the sorted orders in all other features for the next split.

We suggest a simpler and more efficient approach, by taking advantage of some properties of the boosting tree algorithm.
While the boosting tree algorithm is well-known to be robust and also accurate, an individual tree has limited predictive power and usually can be built quite crudely.

When splitting on one feature, Figure 2(a) shows that sometimes the split point can be chosen within a certain range without affecting the accuracy (i.e., the reduced MSE due to the split). In Figure 2(b), we bin (quantize) the data points into two (0/1) levels on the horizontal (i.e., feature) axis. If we choose the quantization as shown in Figure 2(b), then the accuracy will not be affected either.

[Figure 2 omitted: three panels (a)–(c) plotting y against a feature x; (a) marks a split range between sL and sR, (b) shows a two-level binning (Bin 0 / Bin 1) around a split s, (c) shows adaptive bins labeled 0–12 placed only where data exist.]

Figure 2: To split on one feature (x), we seek a split point s on x such that after the splitting, the mean square error (MSE, in the y axis) of the data points at the left plus the MSE at the right is reduced the most. Panel (a) suggests that in some cases we can choose s in a range (within sL and sR) without affecting the reduced MSE. Panel (b) suggests that, if we bin the data on the x axis to be binary, the reduced MSE will not be affected either, if the data are binned in the way as in (b). Panel (c) pictures an adaptive binning scheme to make the accuracy loss (if any) as little as possible.

Of course, we would not know ahead of time how to bin the data to avoid losing accuracy. Therefore, we suggest an adaptive quantization scheme, pictured in Figure 2(c), to make the accuracy loss (if any) as small as possible. In the pre-processing stage, for each feature, the training data points are sorted according to the feature value; and we bin the feature values in the sorted order. We start with a very small initial bin length, e.g., 10^−8. As shown in Figure 2(c), we only bin the data where there are indeed data, because the boosting tree algorithm will not consider the area where there are no data anyway.
We set an allowed maximum number of bins, denoted by B. If the bin length is so small that we need more than B bins, we simply increase the bin length and re-do the quantization. After the quantization, we replace the original feature value by the bin labels (0, 1, 2, ...). Note that since we start with a small bin length, the ordinal categorical features are naturally taken care of.

This simple binning scheme is very effective particularly for the boosting tree algorithm:

• It simplifies the implementation. After the quantization, there is no need for sorting (and keeping track of the indexes after splitting) because we conduct “bucket sort” implicitly.
• It speeds up the computations for the tree-building step, the bottleneck of the algorithm.
• It reduces the memory cost for training. For example, if we set the maximum allowed number of bins to be B = 2^8, we only need one byte per data entry.
• It does not really result in loss of accuracy. We experimented with both B = 2^8 = 256 and B = 2^16 = 65536; and we did not observe real differences in the NDCG scores, although the reported experimental results were all based on B = 2^16. See Appendix II, Figure 3(a)(b).

Appendix II: Some More Experiments on Web-1
Figure 3 (a)(b) present the experiment with our adaptive quantization scheme on the Web-1 dataset. We binned the data with the maximum bin number B = 2^3, 2^4, 2^5, 2^6, 2^7, 2^8, and 2^16. In (a) and (b), the horizontal axis is the “exponent” of B. Panel (a) plots the relative number of total bins in Web-1 as a function of the exponent, normalized by the total number of bins at B = 2^16. Panel (b) plots the “NDCG loss” due to the quantization, relative to the NDCG scores at B = 2^16. When B = 2^8, the total number of bins is only about 6% of that when B = 2^16; however, both quantization levels achieved the same test NDCG scores.
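A minimal sketch of this adaptive quantization for a single feature column (not the paper's code; the doubling growth factor is our choice, as the paper only says to increase the bin length and re-do the quantization):

```python
def adaptive_bin(values, B=256, init_len=1e-8):
    """Quantize one feature into at most B bins, starting a new bin
    only where data exist (gaps wider than the bin length)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bin_len = init_len
    while True:
        labels = [0] * len(values)
        current_bin, bin_start = 0, values[order[0]]
        for i in order:
            if values[i] - bin_start > bin_len:  # data resume past this bin
                current_bin += 1
                bin_start = values[i]
            labels[i] = current_bin
        if current_bin + 1 <= B:
            return labels          # replace feature values by bin labels
        bin_len *= 2               # too many bins: coarsen and re-quantize

x = [0.01, 0.02, 0.5, 0.51, 9.0]
print(adaptive_bin(x, B=4))  # → [0, 0, 1, 1, 2]
```

After this pre-processing, the split search in each tree reduces to scanning at most B bucket statistics per feature.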
Besides the benefit of computational efficiency, quantization can also be considered as a way of “regularization” to slow down the training, as reflected in (b).

[Figure 3 omitted: (a) percentage of total bins vs. the max-bin exponent; (b) NDCG loss (%) vs. the max-bin exponent for train/validation/test; (c) NDCG (%) vs. iteration for the Relevance and Gain scoring functions on train/validation/test.]

Figure 3: Web-1. (a)(b): Experiment with our adaptive quantization scheme. (c): Experiment with two different scoring functions.

Figure 3 (c) compares two scoring functions to convert learned class probabilities into ranking scores, including the Expected Relevance S_i = Σ_{k=0}^{K−1} p̂_{i,k} k and the Expected Gain S_i = Σ_{k=0}^{K−1} p̂_{i,k} (2^k − 1). Panel (c) suggests that using the Expected Relevance is consistently better than using the Expected Gain, but the differences are small, especially for the test NDCG scores.

References
[1] A. Agresti. Categorical Data Analysis. John Wiley & Sons, Inc., Hoboken, NJ, second edition, 2002.
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. 1983.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107–117, 1998.
[4] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89–96, 2005.
[5] C. Burges, R. Ragno, and Q. Le. Learning to rank with nonsmooth cost functions. In NIPS, pages 193–200, 2007.
[6] D. Cossock and T. Zhang.
Subset ranking using regression. In COLT, pages 605–619, 2006.
[7] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
[8] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[9] J. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
[10] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, pages 41–48, 2000.
[11] T. Joachims. Optimizing search engines using clickthrough data. In KDD, pages 133–142, 2002.
[12] J. Kleinberg. Authoritative sources in a hyperlinked environment. In SODA, pages 668–677, 1998.
[13] M. Tsai, T. Liu, T. Qin, H. Chen, and W. Ma. FRank: a ranking method with fidelity loss. In SIGIR, pages 383–390, 2007.
[14] Z. Zheng, K. Chen, G. Sun, and H. Zha. A regression framework for learning ranking functions using relative relevance judgments. In SIGIR, pages 287–294, 2007.
", "award": [], "sourceid": 845, "authors": [{"given_name": "Ping", "family_name": "Li", "institution": null}, {"given_name": "Qiang", "family_name": "Wu", "institution": null}, {"given_name": "Christopher", "family_name": "Burges", "institution": null}]}