{"title": "Ranking via Robust Binary Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 2582, "page_last": 2590, "abstract": "We propose RoBiRank, a ranking algorithm that is motivated by observing a close connection between evaluation metrics for learning to rank and loss functions for robust classification. The algorithm shows a very competitive performance on standard benchmark datasets against other representative algorithms in the literature. Further, in large scale problems where explicit feature vectors and scores are not given, our algorithm can be efficiently parallelized across a large number of machines; for a task that requires 386,133 x 49,824,519 pairwise interactions between items to be ranked, our algorithm finds solutions that are of dramatically higher quality than that can be found by a state-of-the-art competitor algorithm, given the same amount of wall-clock time for computation.", "full_text": "Ranking via Robust Binary Classi\ufb01cation\n\nHyokun Yun\n\nAmazon\n\nSeattle, WA 98109\n\nyunhyoku@amazon.com\n\nParameswaran Raman, S. V. N. Vishwanathan\n\nDepartment of Computer Science\n\nUniversity of California\nSanta Cruz, CA 95064\n\n{params,vishy}@ucsc.edu\n\nAbstract\n\nWe propose RoBiRank, a ranking algorithm that is motivated by observing a close\nconnection between evaluation metrics for learning to rank and loss functions\nfor robust classi\ufb01cation. It shows competitive performance on standard bench-\nmark datasets against a number of other representative algorithms in the literature.\nWe also discuss extensions of RoBiRank to large scale problems where explicit\nfeature vectors and scores are not given. 
We show that RoBiRank can be efficiently parallelized across a large number of machines; for a task that requires 386,133 × 49,824,519 pairwise interactions between items to be ranked, RoBiRank finds solutions of dramatically higher quality than those found by a state-of-the-art competitor algorithm, given the same amount of wall-clock time for computation.

1 Introduction

Learning to rank is the problem of ordering a set of items according to their relevance to a given context [8]. While a number of approaches have been proposed in the literature, in this paper we provide a new perspective by showing a close connection between ranking and a seemingly unrelated topic in machine learning, namely robust binary classification.

In robust classification [13], we are asked to learn a classifier in the presence of outliers. Standard models for classification such as Support Vector Machines (SVMs) and logistic regression do not perform well in this setting, since the convexity of their loss functions does not let them give up performance on any of the data points [16]; for a classification model to be robust to outliers, it has to be capable of sacrificing its performance on some of the data points. We observe that this requirement is very similar to what standard metrics for ranking try to evaluate. Discounted Cumulative Gain (DCG) [17] and its normalized version NDCG, popular metrics for learning to rank, strongly emphasize the performance of a ranking algorithm at the top of the list; therefore, a good ranking algorithm in terms of these metrics has to be able to give up its performance at the bottom of the list if that can improve its performance at the top.

In fact, we will show that DCG and NDCG can indeed be written as a natural generalization of robust loss functions for binary classification. 
Based on this observation we formulate RoBiRank, a novel model for ranking, which maximizes a lower bound of (N)DCG. Although non-convexity seems unavoidable for the bound to be tight [9], our bound is based on the class of robust loss functions that have been found empirically easier to optimize [10]. Indeed, our experimental results suggest that RoBiRank reliably converges to a solution that is competitive with other representative algorithms even though its objective function is non-convex.

While standard deterministic optimization algorithms such as L-BFGS [19] can be used to estimate the parameters of RoBiRank, applying the model to large-scale datasets requires a more efficient parameter estimation algorithm. This is of particular interest in the context of latent collaborative retrieval [24]; unlike the standard ranking task, here the number of items to rank is very large and explicit feature vectors and scores are not given.

Therefore, we develop an efficient parallel stochastic optimization algorithm for this problem. It has two very attractive characteristics: first, the time complexity of each stochastic update is independent of the size of the dataset. Also, when the algorithm is distributed across multiple machines, no interaction between machines is required for most of the execution; therefore, the algorithm enjoys near-linear scaling. This is a significant advantage over serial algorithms, since it is very easy to deploy a large number of machines nowadays thanks to the popularity of cloud computing services, e.g. Amazon Web Services.

We apply our algorithm to the latent collaborative retrieval task on the Million Song Dataset [3], which consists of 1,129,318 users, 386,133 songs, and 49,824,519 records; for this task, a ranking algorithm has to optimize an objective function that consists of 386,133 × 49,824,519 pairwise interactions. 
With the same amount of wall-clock time given to each algorithm, RoBiRank leverages parallel computing to outperform the state-of-the-art with a 100% lift on the evaluation metric.

2 Robust Binary Classification

Suppose we are given training data consisting of n data points (x1, y1), (x2, y2), . . . , (xn, yn), where each xi ∈ R^d is a d-dimensional feature vector and yi ∈ {−1, +1} is a label associated with it. A linear model attempts to learn a d-dimensional parameter ω, and for a given feature vector x it predicts label +1 if ⟨x, ω⟩ ≥ 0 and −1 otherwise. Here ⟨·,·⟩ denotes the Euclidean dot product between two vectors. The quality of ω can be measured by the number of mistakes it makes:

L(ω) := Σ_{i=1}^{n} I(yi · ⟨xi, ω⟩ < 0).

The indicator function I(· < 0) is called the 0-1 loss function, because it has a value of 1 if the decision rule makes a mistake, and 0 otherwise. Unfortunately, since the 0-1 loss is a discrete function, its minimization is difficult [11]. The most popular solution to this problem in machine learning is to upper bound the 0-1 loss by an easy-to-optimize function [2]. For example, logistic regression uses the logistic loss function σ0(t) := log2(1 + 2^{−t}) to come up with a continuous and convex objective function

L̄(ω) := Σ_{i=1}^{n} σ0(yi · ⟨xi, ω⟩),   (1)

which upper bounds L(ω). It is clear that for each i, σ0(yi · ⟨xi, ω⟩) is a convex function in ω; therefore L̄(ω), a sum of convex functions, is also a convex function which is relatively easier to optimize [6]. Support Vector Machines (SVMs), on the other hand, can be recovered by using the hinge loss to upper bound the 0-1 loss.

However, convex upper bounds such as L̄(ω) are known to be sensitive to outliers [16]. 
The basic intuition here is that when yi · ⟨xi, ω⟩ is a very large negative number for some data point i, σ0(yi · ⟨xi, ω⟩) is also very large, and therefore the optimal solution of (1) will try to decrease the loss on such outliers at the expense of its performance on "normal" data points.

In order to construct robust loss functions, consider the following two transformation functions:

ρ1(t) := log2(t + 1),   ρ2(t) := 1 − 1/log2(t + 2),   (2)

which, in turn, can be used to define the following loss functions:

σ1(t) := ρ1(σ0(t)),   σ2(t) := ρ2(σ0(t)).   (3)

One can see that σ1(t) → ∞ as t → −∞, but at a much slower rate than σ0(t) does; its derivative σ1'(t) → 0 as t → −∞. Therefore, σ1(·) does not grow as rapidly as σ0(t) on hard-to-classify data points. Such loss functions are called Type-I robust loss functions by Ding [10], who also showed that they enjoy statistical robustness properties. σ2(t) behaves even better: σ2(t) converges to a constant as t → −∞, and therefore "gives up" on hard-to-classify data points. Such loss functions are called Type-II loss functions, and they also enjoy statistical robustness properties [10].

In terms of computation, of course, σ1(·) and σ2(·) are not convex, and therefore the objective function based on such loss functions is more difficult to optimize. However, it has been observed in Ding [10] that models based on optimization of Type-I functions are often empirically much more successful than those which optimize Type-II functions. Furthermore, the solutions of Type-I optimization are more stable to the choice of parameter initialization. 
Intuitively, this is because Type-II functions asymptote to a constant, reducing the gradient to almost zero in a large fraction of the parameter space; therefore, it is difficult for a gradient-based algorithm to determine which direction to pursue. See Ding [10] for more details.

3 Ranking Model via Robust Binary Classification

Let X = {x1, x2, . . . , xn} be a set of contexts, and Y = {y1, y2, . . . , ym} be a set of items to be ranked. For example, in movie recommender systems X is the set of users and Y is the set of movies. In some problem settings, only a subset of Y is relevant to a given context x ∈ X; e.g. in document retrieval systems, only a subset of documents is relevant to a query. Therefore, we define Yx ⊂ Y to be the set of items relevant to context x. Observed data can be described by a set W := {Wxy}_{x∈X, y∈Yx}, where Wxy is a real-valued score given to item y in context x.

We adopt a standard problem setting used in the literature of learning to rank. For each context x and an item y ∈ Yx, we aim to learn a scoring function f(x, y): X × Yx → R that induces a ranking on the item set Yx; the higher the score, the more important the associated item is in the given context. To learn such a function, we first extract joint features of x and y, which will be denoted by φ(x, y). Then, we parametrize f(·,·) using a parameter ω, which yields the linear model fω(x, y) := ⟨φ(x, y), ω⟩, where, as before, ⟨·,·⟩ denotes the Euclidean dot product between two vectors. ω induces a ranking on the set of items Yx; we define rankω(x, y) to be the rank of item y in a given context x induced by ω. Observe that rankω(x, y) can also be written as a sum of 0-1 loss functions (see e.g. Usunier et al. 
[23]):

rankω(x, y) = Σ_{y'∈Yx, y'≠y} I(fω(x, y) − fω(x, y') < 0).   (4)

3.1 Basic Model

If an item y is very relevant in context x, a good parameter ω should position y at the top of the list; in other words, rankω(x, y) has to be small, which motivates the following objective function [7]:

L(ω) := Σ_{x∈X} cx Σ_{y∈Yx} v(Wxy) · rankω(x, y),   (5)

where cx is a weighting factor for each context x, and v(·): R+ → R+ quantifies the relevance level of y in context x. Note that {cx} and v(Wxy) can be chosen to reflect the metric the model is going to be evaluated on (this will be discussed in Section 3.2). Note that (5) can be rewritten using (4) as a sum of indicator functions. Following the strategy in Section 2, one can form an upper bound of (5) by bounding each 0-1 loss function by a logistic loss function:

L̄(ω) := Σ_{x∈X} cx Σ_{y∈Yx} v(Wxy) · Σ_{y'∈Yx, y'≠y} σ0(fω(x, y) − fω(x, y')).   (6)

Just like (1), (6) is convex in ω and hence easy to minimize.

3.2 DCG

Although (6) enjoys convexity, it may not be a good objective function for ranking. This is because in most applications of learning to rank, it is more important to do well at the top of the list than at the bottom, as users typically pay attention only to the top few items. Therefore, it is desirable to give up performance on the lower part of the list in order to gain quality at the top. This intuition is similar to that of robust classification in Section 2; a stronger connection will be shown below.

Discounted Cumulative Gain (DCG) [17] is one of the most popular metrics for ranking. 
For each context x ∈ X, it is defined as:

DCG(ω) := cx Σ_{y∈Yx} v(Wxy) / log2(rankω(x, y) + 2),   (7)

where v(t) = 2^t − 1 and cx = 1. Since 1/log2(t + 2) decreases quickly and then asymptotes to a constant as t increases, this metric emphasizes the quality of the ranking at the top of the list. Normalized DCG (NDCG) simply normalizes the metric to bound it between 0 and 1 by calculating the maximum achievable DCG value mx and dividing by it [17].

3.3 RoBiRank

Now we formulate RoBiRank, which optimizes a lower bound of ranking metrics of the form (7). Observe that maxω DCG(ω) can be rewritten as

minω Σ_{x∈X} cx Σ_{y∈Yx} v(Wxy) · {1 − 1/log2(rankω(x, y) + 2)}.   (8)

Using (4) and the definition of the transformation function ρ2(·) in (2), we can rewrite the objective function in (8) as:

L2(ω) := Σ_{x∈X} cx Σ_{y∈Yx} v(Wxy) · ρ2( Σ_{y'∈Yx, y'≠y} I(fω(x, y) − fω(x, y') < 0) ).   (9)

Since ρ2(·) is a monotonically increasing function, we can bound (9) with a continuous function by bounding each indicator function using the logistic loss:

L̄2(ω) := Σ_{x∈X} cx Σ_{y∈Yx} v(Wxy) · ρ2( Σ_{y'∈Yx, y'≠y} σ0(fω(x, y) − fω(x, y')) ).   (10)

This is reminiscent of the basic model in (6); as we applied the transformation ρ2(·) on the logistic loss σ0(·) to construct the robust loss σ2(·) in (3), we are again applying the same transformation on (6) to construct a loss function that respects the DCG metric used in ranking. 
In fact, (10) can be seen as a generalization of robust binary classification obtained by applying the transformation to a group of logistic losses instead of a single loss. In both robust classification and ranking, the transformation ρ2(·) enables models to give up on part of the problem to achieve better overall performance.

As we discussed in Section 2, however, transformation of the logistic loss using ρ2(·) results in a Type-II loss function, which is very difficult to optimize. Hence, instead of ρ2(·) we use the alternative transformation ρ1(·), which generates a Type-I loss function, to define the objective function of RoBiRank:

L̄1(ω) := Σ_{x∈X} cx Σ_{y∈Yx} v(Wxy) · ρ1( Σ_{y'∈Yx, y'≠y} σ0(fω(x, y) − fω(x, y')) ).   (11)

Since ρ1(t) ≥ ρ2(t) for every t > 0, we have L̄1(ω) ≥ L̄2(ω) ≥ L2(ω) for every ω. Note that L̄1(ω) is continuous and twice differentiable. Therefore, standard gradient-based optimization techniques can be applied to minimize it. As is standard, a regularizer on ω can be added to avoid overfitting; for simplicity, we use the ℓ2-norm in our experiments.

3.4 Standard Learning to Rank Experiments

We conducted experiments to check the performance of RoBiRank (11) in a standard learning to rank setting, with a small number of labels to rank. We pit RoBiRank against the following algorithms: RankSVM [15], the ranking algorithm of Le and Smola [14] (called LSRank in the sequel), InfNormPush [22], IRPush [1], and 8 standard ranking algorithms implemented in RankLib¹, namely MART, RankNet, RankBoost, AdaRank, CoordAscent, LambdaMART, ListNet and RandomForests. 
We use three sources of datasets: LETOR 3.0 [8], LETOR 4.0² and YAHOO LTRC [20], which are standard benchmarks for ranking (see Table 2 for summary statistics). Each dataset consists of five folds; we consider the first fold, and use the training, validation, and test splits provided. We train with different values of the regularization parameter, and select the one with the best NDCG on the validation dataset. The performance of the model with this parameter on the test dataset is reported. We used the implementation of the L-BFGS algorithm provided by the Toolkit for Advanced Optimization (TAO)³ for estimating the parameters of RoBiRank. For the other algorithms, we either implemented them using our framework or used the implementations provided by the authors.

¹ http://sourceforge.net/p/lemur/wiki/RankLib
² http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx

Figure 1: Comparison of RoBiRank with a number of competing algorithms.

We use values of NDCG at different levels of truncation as our evaluation metric [17]; see Figure 1. RoBiRank outperforms its competitors on most of the datasets; due to space constraints, we only present plots for the TD 2004 dataset here, and other plots can be found in Appendix B. The performance of RankSVM seems insensitive to the level of truncation for NDCG. On the other hand, RoBiRank, which uses a non-convex loss function to concentrate its performance at the top of the ranked list, performs much better, especially at low truncation levels. It is also interesting to note that the NDCG@k curve of LSRank is similar to that of RoBiRank, but RoBiRank consistently outperforms it at each level. RoBiRank dominates InfNormPush and IRPush at all levels. 
When compared to the standard algorithms (Figure 1, right), RoBiRank again outperforms, especially at the top of the list. Overall, RoBiRank outperforms IRPush and InfNormPush on all datasets except TD 2003 and OHSUMED, where IRPush seems to fare better at the top of the list. Compared to the 8 standard algorithms, RoBiRank again either outperforms or performs comparably to the best algorithm, except on two datasets (TD 2003 and HP 2003), where MART and Random Forests overtake RoBiRank at a few values of NDCG. We present a summary of the NDCG values obtained by each algorithm in Table 2 in the appendix. On the MSLR30K dataset, some of the additional algorithms like InfNormPush and IRPush did not complete within the time period available; this is indicated by dashes in the table.

4 Latent Collaborative Retrieval

For each context x and an item y ∈ Y, the standard problem setting of learning to rank requires training data to contain a feature vector φ(x, y) and a score Wxy assigned to each (x, y) pair. When the number of contexts |X| or the number of items |Y| is large, it might be difficult to define φ(x, y) and measure Wxy for all (x, y) pairs. Therefore, in most learning to rank problems we define the set of relevant items Yx ⊂ Y to be much smaller than Y for each context x, and then collect data only for Yx. Nonetheless, this may not be realistic in all situations; in a movie recommender system, for example, for each user every movie is somewhat relevant.

On the other hand, implicit user feedback data is much more abundant. For example, many users on Netflix simply watch movie streams on the system but do not leave an explicit rating. By the action of watching a movie, however, they implicitly express their preference. Such data consist only of positive feedback, unlike traditional learning to rank datasets which have a score Wxy for each context-item pair (x, y). 
Again, we may not be able to extract feature vectors for each (x, y) pair. In such a situation, we can attempt to learn the score function f(x, y) without a feature vector φ(x, y) by embedding each context and item in a Euclidean latent space; specifically, we redefine the score function to be f(x, y) := ⟨Ux, Vy⟩, where Ux ∈ R^d is the embedding of the context x and Vy ∈ R^d is that of the item y. Then, we can learn these embeddings with a ranking model. This approach was introduced in Weston et al. [24], and was called latent collaborative retrieval.

Now we specialize the RoBiRank model to this task. Let us define Ω to be the set of context-item pairs (x, y) which were observed in the dataset. Let v(Wxy) = 1 if (x, y) ∈ Ω, and 0 otherwise; this is a natural choice since the score information is not available. For simplicity, we set cx = 1 for every x. Now RoBiRank (11) specializes to:

L̄1(U, V) = Σ_{(x,y)∈Ω} ρ1( Σ_{y'≠y} σ0(f(Ux, Vy) − f(Ux, Vy')) ).   (12)

Note that the summation inside the parentheses of (12) is now over all items Y instead of the smaller set Yx; therefore we omit specifying the range of y' from now on. 

³ http://www.mcs.anl.gov/research/projects/tao/index.html
To avoid overfitting, a regularizer is added to (12); for simplicity we use the Frobenius norm of U and V in our experiments.

4.1 Stochastic Optimization

When the size of the data |Ω| or the number of items |Y| is large, however, methods that require exact evaluation of the function value and its gradient become very slow, since each evaluation takes O(|Ω| · |Y|) computation. In this case, stochastic optimization methods are desirable [4]; in this subsection, we develop a stochastic gradient descent algorithm whose complexity is independent of |Ω| and |Y|.

For simplicity, let θ be a concatenation of all parameters {Ux}_{x∈X}, {Vy}_{y∈Y}. The gradient ∇θ L̄1(U, V) of (12) is

Σ_{(x,y)∈Ω} ∇θ ρ1( Σ_{y'≠y} σ0(f(Ux, Vy) − f(Ux, Vy')) ).

Finding an unbiased estimator of the gradient whose computation is independent of |Ω| is not difficult; if we sample a pair (x, y) uniformly from Ω, then it is easy to see that the following estimator

|Ω| · ∇θ ρ1( Σ_{y'≠y} σ0(f(Ux, Vy) − f(Ux, Vy')) )   (13)

is unbiased. This still involves a summation over Y, however, so it requires O(|Y|) calculation. Since ρ1(·) is a nonlinear function, it seems unlikely that an unbiased stochastic gradient which randomizes over Y can be found; nonetheless, to achieve convergence guarantees for the stochastic gradient descent algorithm, unbiasedness of the estimator is necessary [18].

We attack this problem by linearizing the objective function via parameter expansion. Note the following property of ρ1(·) [5]:

ρ1(t) = log2(t + 1) ≤ −log2 ξ + (ξ · (t + 1) − 1) / log 2.   (14)

This holds for any ξ > 0, and the bound is tight when ξ = 1/(t + 1). Now introducing an auxiliary parameter ξxy for each (x, y) ∈ Ω and applying this bound, we obtain an upper bound of (12) as

L(U, V, ξ) := Σ_{(x,y)∈Ω} { −log2 ξxy + ( ξxy · ( Σ_{y'≠y} σ0(f(Ux, Vy) − f(Ux, Vy')) + 1 ) − 1 ) / log 2 }.   (15)

Now we propose an iterative algorithm in which each iteration consists of a (U, V)-step and a ξ-step; in the (U, V)-step we minimize (15) in (U, V), and in the ξ-step we minimize in ξ. Pseudo-code can be found in Algorithm 1 in Appendix C.

Figure 2: Left: Scaling of RoBiRank on Million Song Dataset. Center, Right: Comparison of RoBiRank and Weston et al. [24] when the same amount of wall-clock computation time is given.

(U, V)-step The partial derivative of (15) with respect to U and V can be calculated as:

∇_{U,V} L(U, V, ξ) := (1/log 2) Σ_{(x,y)∈Ω} ξxy Σ_{y'≠y} ∇_{U,V} σ0(f(Ux, Vy) − f(Ux, Vy')).

Now it is easy to see that the following stochastic procedure unbiasedly estimates the above gradient:

• Sample (x, y) uniformly from Ω
• Sample y' uniformly from Y \ {y}
• Estimate the gradient by

|Ω| · (|Y| − 1) · (ξxy / log 2) · ∇_{U,V} σ0(f(Ux, Vy) − f(Ux, Vy')).   (16)

Therefore a stochastic gradient descent algorithm based on (16) will converge to a local minimum of the objective function (15) with probability one [21]. Note that the time complexity of calculating (16) is independent of |Ω| and |Y|. 
Also, it is a function of only Ux and Vy; the gradient is zero with respect to the other variables.

ξ-step When U and V are fixed, the minimization over each ξxy is independent of the others, and a simple analytic solution exists:

ξxy = 1 / ( Σ_{y'≠y} σ0(f(Ux, Vy) − f(Ux, Vy')) + 1 ).

This of course requires O(|Y|) work. In principle, we could avoid the summation over Y by taking a stochastic gradient in terms of ξxy as we did for U and V. However, since the exact solution is simple to compute, and also because most of the computation time is spent on the (U, V)-step, we found this update rule to be efficient.

Parallelization The linearization trick in (15) not only enables us to construct an efficient stochastic gradient algorithm, but also makes it possible to efficiently parallelize the algorithm across multiple machines. Due to lack of space, details are relegated to Appendix D.

4.2 Experiments

In this subsection we use the Million Song Dataset (MSD) [3], which consists of 1,129,318 users (|X|), 386,133 songs (|Y|), and 49,824,519 records (|Ω|) of a user x playing a song y in the training dataset. The objective is to predict the songs from the test dataset that a user is going to listen to⁴. Since explicit ratings are not given, NDCG is not applicable for this task; we use Precision at 1 and 10 [17] as our evaluation metrics. In our first experiment we study the scaling behavior of RoBiRank as a function of the number of machines. RoBiRank p denotes the parallel version of RoBiRank distributed across p machines. In Figure 2 (left) we plot mean Precision@1 as a function of the number of machines × the number of seconds elapsed; this is a proxy for CPU time. 
If an algorithm scales linearly across multiple processors, then all lines in the figure should overlap with each other. As can be seen, RoBiRank exhibits near-ideal speedup when going from 4 to 32 machines⁵.

In our next experiment we compare RoBiRank with a state-of-the-art algorithm from Weston et al. [24], which optimizes a similar objective function (17). We compare how fast the quality of the solution improves as a function of wall-clock time. Since the authors of Weston et al. [24] do not make their code available, we implemented their algorithm within our framework using the same data structures and libraries used by our method. Furthermore, for a fair comparison, we used the same initialization for U and V and performed an identical grid search over the step-size parameter.

⁴ The original data also provide the number of times a song was played by a user, but we ignored this in our experiment.
⁵ The graph for RoBiRank 1 is hard to see because it was run for only 100,000 CPU-seconds.

It can be seen from Figure 2 (center, right) that on a single machine the algorithm of Weston et al. [24] is very competitive and outperforms RoBiRank. The reason for this might be the introduction of the additional ξ variables in RoBiRank, which slows down convergence. However, RoBiRank training can be distributed across processors, while it is not clear how to parallelize the algorithm of Weston et al. [24]. 
Consequently, RoBiRank 32, which uses 32 machines for its computation, can produce a significantly better model within the same wall-clock time window.

5 Related Work

In terms of modeling, viewing ranking problems as a generalization of binary classification problems is not a new idea; for example, RankSVM defines its objective function as a sum of hinge losses, similarly to our basic model (6) in Section 3.1. However, it does not directly optimize a ranking metric such as NDCG; the objective function and the metric are not immediately related to each other. In this respect, our approach is closer to that of Le and Smola [14], which constructs a convex upper bound on the ranking metric, and Chapelle et al. [9], which improves the bound by introducing non-convexity. The objective function of Chapelle et al. [9] is also motivated by the ramp loss, which is used for robust classification; nonetheless, to our knowledge the direct connection between ranking metrics of the form (7) (DCG, NDCG) and the robust losses (3) is our novel contribution. Also, our objective function is designed to specifically bound the ranking metric, while Chapelle et al. [9] propose a general recipe for improving existing convex bounds.

Stochastic optimization of the objective function for latent collaborative retrieval has also been explored in Weston et al. [24]. They attempt to minimize

Σ_{(x,y)∈Ω} Φ( 1 + Σ_{y'≠y} I(f(Ux, Vy) − f(Ux, Vy') < 0) ),   (17)

where Φ(t) = Σ_{k=1}^{t} 1/k. This is similar to our objective function (15); Φ(·) and ρ2(·) are asymptotically equivalent. However, we argue that our formulation (15) has two major advantages. First, it is a continuous and differentiable function; therefore, gradient-based algorithms such as L-BFGS and stochastic gradient descent have convergence guarantees. On the other hand, the objective function of Weston et al. 
[24] is not even continuous, since their formulation is based on a function Φ(·) that is defined only for natural numbers. Also, through the linearization trick in (15) we are able to obtain an unbiased stochastic gradient, which is necessary for the convergence guarantee, and to parallelize the algorithm across multiple machines as discussed in Appendix D. It is unclear how these techniques can be adapted for the objective function of Weston et al. [24].

6 Conclusion

In this paper, we developed RoBiRank, a novel model for ranking, based on insights and techniques from robust binary classification. We then proposed a scalable and parallelizable stochastic optimization algorithm that can be applied to the latent collaborative retrieval task, in which large-scale data without feature vectors and explicit scores must be handled. Experimental results on both learning to rank datasets and a latent collaborative retrieval dataset suggest the advantage of our approach.

As a final note, the experiments in Section 4.2 are arguably unfair towards WSABIE. For instance, one could envisage using clever engineering tricks to derive a parallel variant of WSABIE (e.g., by averaging gradients from various machines), which might outperform RoBiRank on the MSD dataset. While performance on a specific dataset might be better, we would lose global convergence guarantees. Therefore, rather than obsess over the performance of a specific algorithm on a specific dataset, via this paper we hope to draw the attention of the community to the need for developing principled parallel algorithms for this important problem.

Acknowledgments We thank the anonymous reviewers for their constructive comments, and the Texas Advanced Computing Center for infrastructure and support for the experiments. This material is partially based upon work supported by the National Science Foundation under grant No. IIS-1117705.

References

[1] S. 
Agarwal. The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In SDM, pages 839-850. SIAM, 2011.

[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

[3] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.

[4] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, page 351, 2011.

[5] G. Bouchard. Efficient bounds for the softmax function, applications to inference in hybrid models. 2007.

[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England, 2004.

[7] D. Buffoni, P. Gallinari, N. Usunier, and C. Calauzènes. Learning scoring functions with order-preserving losses and standardized supervision. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 825-832, 2011.

[8] O. Chapelle and Y. Chang. Yahoo! learning to rank challenge overview. Journal of Machine Learning Research - Proceedings Track, 14:1-24, 2011.

[9] O. Chapelle, C. B. Do, C. H. Teo, Q. V. Le, and A. J. Smola. Tighter bounds for structured estimation. In Advances in Neural Information Processing Systems, pages 281-288, 2008.

[10] N. Ding. Statistical Machine Learning in T-Exponential Family of Distributions. PhD thesis, Purdue University, West Lafayette, Indiana, USA, 2013.

[11] V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558-1590, 2012.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis.
Large-scale matrix factorization with distributed stochastic gradient descent. In Conference on Knowledge Discovery and Data Mining, pages 69-77, 2011.

[13] P. J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.

[14] Q. V. Le and A. J. Smola. Direct optimization of ranking measures. Technical Report 0704.3359, arXiv, April 2007. http://arxiv.org/abs/0704.3359.

[15] C.-P. Lee and C.-J. Lin. Large-scale linear RankSVM. Neural Computation, 2013. To appear.

[16] P. Long and R. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning Journal, 78(3):287-304, 2010.

[17] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. URL http://nlp.stanford.edu/IR-book/.

[18] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

[19] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 2nd edition, 2006.

[20] T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346-374, 2010.

[21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.

[22] C. Rudin. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. The Journal of Machine Learning Research, 10:2233-2271, 2009.

[23] N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the International Conference on Machine Learning, 2009.

[24] J. Weston, C. Wang, R. Weiss, and A. Berenzweig. Latent collaborative retrieval.
arXiv preprint arXiv:1206.4603, 2012.