{"title": "Structured ranking learning using cumulative distribution networks", "book": "Advances in Neural Information Processing Systems", "page_first": 697, "page_last": 704, "abstract": "Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification, in which we predict outputs independently, in ranking, we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for ranking learning.", "full_text": "Structured Ranking Learning using\nCumulative Distribution Networks\n\nJim C. Huang\n\nBrendan J. Frey\n\nProbabilistic and Statistical Inference Group\n\nProbabilistic and Statistical Inference Group\n\nUniversity of Toronto\n\nToronto, ON, Canada M5S 3G4\n\njim@psi.toronto.edu\n\nUniversity of Toronto\n\nToronto, ON, Canada M5S 3G4\nfrey@psi.toronto.edu\n\nAbstract\n\nRanking is at the heart of many information retrieval applications. 
Unlike standard regression or classification, in which we predict outputs independently, in ranking we are interested in predicting structured outputs, so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked, and either approximate structured prediction methods are required or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank.

1 Introduction

Ranking is the central problem for many information retrieval applications such as web search, collaborative filtering and document retrieval [8]. In these problems, we are given a set of objects to be ranked and a series of observations, where each observation consists of some subset of the objects, a feature vector and some ordering of the objects, with highly ranked objects corresponding to a higher relevance or degree of importance. The goal is then to learn a model which allows us to assign a score to new test objects: this often takes the form of a ranking function [2, 4] which assigns a higher score to objects with higher rankings.
Unlike the canonical problems of regression or classification, in which we predict outputs independently of one another, in ranking we are interested in predicting structured outputs: the rank of one item can only be determined given the scores of all other items, and so complex inter-dependencies exist between outputs. This requires measures of loss which are multivariate and structured. However, such ranking measures are typically difficult to optimize directly [3], making the problem of learning difficult. One previous approach has been to treat the problem as one of structured prediction [7], where the aim is to directly optimize ranking measures. Another approach has been to approximate these ranking measures with smooth differentiable loss functionals by formulating probabilistic models on pairwise preferences between objects (RankNet; [2]) or on ordered lists of objects (ListNet and ListMLE; [4, 13]). In practice, these methods either require approximating a learning problem with an intractable number of constraints, require observations containing complete orderings over the objects to be ranked, or must make independence assumptions on pairwise preferences.

We can, however, take advantage of the fact that each observation in the training set only provides preference information about a small subset of the objects to be ranked, so that a sensible probabilistic representation would be the probability of observing a partial ordering over nodes for a given observation. We will show that 1) a probability over orderings is equivalent to a probability over pairwise inequalities between objects to be ranked and 2) this amounts to specifying a joint cumulative distribution function (CDF) over pairwise object preferences.
We will present a framework for ranking using the recently developed probabilistic graphical modelling framework of CDNs, which compactly represents this joint CDF as a product of local functions [5]. While the problem of inference in CDNs was addressed in [5], here we address the problem of learning in CDNs in the context of ranking learning, where we estimate model parameters under a structured loss functional that accounts for dependencies between pairwise object preferences. We will then test the proposed framework on the OHSUMED dataset [8], a benchmark dataset used in information retrieval research. Finally, we will show that the frameworks proposed by [2, 4, 13] can be viewed as particular types of CDNs, so that novel classes of flexible structured loss functionals for ranking learning can be specified under our framework.

2 Cumulative distribution networks

The CDN [5] is an undirected graphical model in which the joint CDF $F(\mathbf{z})$ over a set of random variables is represented as a product over functions defined over subsets of these variables. More formally,

\[ F(\mathbf{z}) = \prod_{c \in C} \phi_c(\mathbf{z}_c), \tag{1} \]

where $\phi_c(\mathbf{z}_c)$ is a function defined over some subset of variables. An example of a CDN is shown in Figure 1(a), along with an example bivariate density which can be obtained by differentiating a product of 2 Gaussian CDF functions (Figure 1(b)).

In contrast to undirected models for probability density functions, the global normalization constraint on the CDF does not require computing a partition function and can be enforced locally for each $\phi_c(\mathbf{z}_c)$. Thus, in order for the CDN to represent a valid CDF, it is sufficient that each of the local functions $\phi_c$ satisfy all of the properties of a multivariate CDF.
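As a concrete illustration of Equation (1), the following sketch (our own toy example, not code from the paper) builds a small CDN from univariate Gaussian CDF factors and checks the global CDF properties numerically:

```python
import math

def phi(z):
    # Standard Gaussian CDF: bounded in [0, 1] and monotonically
    # non-decreasing, hence a valid CDN local function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cdn_cdf(z1, z2):
    # Toy CDN as in Equation (1): F(z1, z2) = phi_a(z1) phi_b(z2) phi_c(z2 - 1).
    # Each factor is a valid CDF, so the product needs no partition function.
    return phi(z1) * phi(z2) * phi(z2 - 1.0)

top = cdn_cdf(50.0, 50.0)                        # normalization: F(inf, inf) = 1
mono = cdn_cdf(0.0, 0.0) <= cdn_cdf(1.0, 0.0)    # non-decreasing in z1
marg_z2 = cdn_cdf(50.0, 0.3)                     # marginalize z1 by letting z1 -> inf
```

Normalization and marginalization here are local operations, exactly the property the paper exploits: no summation over configurations is ever needed.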
These properties include the requirements that each CDN function $\phi_c$ be bounded between 0 and 1, and that each $\phi_c$ be monotonically non-decreasing with respect to all of its argument variables $\mathbf{z}_c$, so that the joint CDF $F(\mathbf{z})$ is also bounded between 0 and 1 and is monotonically non-decreasing with respect to any and all subsets of variables. In a CDN, disjoint sets of variables $A, B$ are marginally independent if they share no functions in common, and disjoint sets of variables $A, B$ are conditionally independent given variable set $C$ if no path linking any variable in $A$ to any variable in $B$ passes through $C$. In addition, marginalization of variables in a CDN can be done in constant time via a trivial maximization of the joint CDF with respect to the variables being marginalized. The problem of inference in a CDN can be solved efficiently using a message-passing algorithm called derivative-sum-product. For detailed derivations of the properties of CDNs, including marginal and conditional independence properties, we refer the reader to [5]. The CDN framework provides us with a means to compactly represent multivariate joint CDFs over many variables: in the next section we will formulate a loss functional for learning to rank which takes on such a form.

Figure 1: a) Cumulative distribution network representing the joint CDF $F(z_1, z_2, z_3, z_4, z_5) = \phi_a(z_2)\phi_b(z_1, z_2, z_3)\phi_c(z_3)\phi_d(z_4)\phi_e(z_3, z_4, z_5)\phi_f(z_5)$; b) Example of a bivariate density $P(x, y)$ corresponding to differentiating a CDF $F(x, y)$ obtained from taking the product of 2 Gaussian bivariate CDFs.

3 Structured loss functionals for ranking learning

We now proceed to formulate the problem of learning to rank in a structured setting. Suppose we wish to rank $N$ nodes in the set $V = \{V_1, \cdots, V_N\}$ and we are given a set of observations $D_1, \cdots, D_T$.
Each observation $D_t$ consists of an ordering over the nodes in a subset $V_t \subseteq V$, where each node is provided with a corresponding feature vector $\mathbf{x} \in \mathbb{R}^L$ which may be specific to the given observation. The orderings could be provided in the form of ordinal node labels¹, or in the form of pairwise node preferences. The orderings can be represented as a directed graph over the nodes in which a directed edge $e = (V_i \rightarrow V_j)$ is drawn between 2 nodes $V_i, V_j$ iff $V_i$ is preferred to node $V_j$, which we denote as $V_i \succ V_j$. In general, we assume that for any given observation we observe a partial ordering over nodes, with complete orderings being a special case. We denote the above graph, consisting of the edges $e = (V_i \rightarrow V_j) \in E_t$ and the node set $V_t$, as the order graph $G_t = (V_t, E_t)$ for observation $D_t$, so that $D_t = \{G_t, \{\mathbf{x}^t_n\}_{V_n \in V_t}\}$. A toy example of an observation over 4 nodes is shown in Figure 2(a). Note that under this framework, the absence of an edge between two nodes $V_i, V_j$ in the order graph indicates we cannot assert any preference between the two nodes for the given observation.

Figure 2: a) An example of an order graph over 4 nodes $V_1, V_2, V_3, V_4$ corresponding to the objects to be ranked. The graph represents the set of preference relationships $V_1 \succ V_2, V_1 \succ V_3, V_1 \succ V_4, V_2 \succ V_4, V_3 \succ V_4$; b) Learning the ranking function from training data. The training data consists of a set of order graphs over subsets of the objects to be ranked. For each order graph, the ranking function $\rho$ maps each node to the real line. The goal is to learn $\rho$ such that we minimize our probability of misranking on test observations.

We now define $\rho : V \rightarrow \mathbb{R}$ as a ranking function which assigns scores to nodes via their feature vectors, so that for node $V_i$,

\[ S_i = \rho(V_i) + \pi_i, \tag{2} \]

where $S_i$ is a scalar and $\pi_i$ is a random variable specific to node $V_i$.
We wish to learn such a function given multiple observations $D_1, \cdots, D_T$ so that we minimize the probability of misranking on test observations (Figure 2(b)). The above model allows us to account for the fact that the amount of uncertainty about a node's rank may depend on unobserved features for that node (e.g., documents associated with certain keywords might have less variability in their rankings than other documents). Under this model, the preference relation $V_i \succ V_j$ is completely equivalent to

\[ \rho(V_i) + \pi_i \ge \rho(V_j) + \pi_j \iff \pi_{ij} = \pi_j - \pi_i \le \rho(V_i) - \rho(V_j), \tag{3} \]

where we have defined $\pi_{ij}$ as a preference variable between nodes $V_i, V_j$.

For each edge $e = (V_i \rightarrow V_j) \in E_t$ in the order graph, we can define $r(\rho; e, D_t) \equiv \rho(V_i) - \rho(V_j)$ and collect these into the vector $\mathbf{r}(\rho; G_t) \in \mathbb{R}^{|E_t|}$. Similarly, let $\pi_e \equiv \pi_{ij}$. Having defined the preferences, we must select an appropriate loss measure. A sensible metric here [13] is the joint probability of observing the order graph $G_t = (V_t, E_t)$ corresponding to the partial ordering of nodes in $V_t$. From Equation (3), this will take the form of a probability measure over events of the type $\pi_e \le r(\rho; e, D_t)$, so that we obtain

\[ Pr\{E_t | V_t, \rho\} = Pr\Big\{ \bigcap_{e \in E_t} \big[\pi_e \le r(\rho; e, D_t)\big] \Big\} = F_\pi\big(\mathbf{r}(\rho; G_t)\big), \tag{4} \]

where $F_\pi$ is the joint CDF over the preference variables $\pi_e$.

¹It is crucial to note that node labels may in general not be directly comparable with one another from one observation to the next (e.g., documents with the same rating might not truly have the same degree of relevance for different queries), or the scale of the labels may be arbitrary.
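To make Equation (4) concrete, the sketch below (illustrative only; it assumes, unlike the full model, that the preference variables are independent zero-mean Gaussians) evaluates the probability of an order graph under a given set of scores:

```python
import math

def order_graph_prob(scores, edges, sigma=1.0):
    # Probability of observing the order graph, Equation (4), under the
    # simplifying assumption -- for illustration only -- that the preference
    # variables pi_e are independent zero-mean Gaussians. Each edge (i, j)
    # encodes V_i preferred to V_j, i.e. the event pi_e <= rho(V_i) - rho(V_j).
    p = 1.0
    for i, j in edges:
        r_e = scores[i] - scores[j]
        p *= 0.5 * (1.0 + math.erf(r_e / (sigma * math.sqrt(2.0))))
    return p

# Order graph of Figure 2(a): V1 preferred to V2, V3, V4; V2 and V3 to V4.
edges = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
good = order_graph_prob([3.0, 2.0, 2.0, 1.0], edges)  # scores consistent with graph
bad = order_graph_prob([1.0, 2.0, 2.0, 3.0], edges)   # scores reversed
loss = -math.log(good)                                # one term of Equation (5)
```

Scores consistent with the order graph receive higher probability (lower loss) than reversed scores; the CDN machinery below replaces the independence assumption with structured dependencies.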
Given an observation $D_t$, the goal is to learn the ranking function $\rho$ by maximizing Equation (4). Note that under this framework, the set of edges $E_t$ corresponding to the set of pairwise preferences are treated as random variables which may have a high degree of dependence between one another, so that $F_\pi\big(\mathbf{r}(\rho; G_t)\big)$ is a joint CDF over multiple pairwise preferences. The problem of learning the ranking function then consists of scoring multiple nodes simultaneously whilst accounting for dependencies between node scores. Now, if we are given multiple independent (but not necessarily identically distributed) observations $D = \{D_1, \cdots, D_T\}$, we can define a structured loss functional

\[ L(\rho, F_\pi, D) = -\sum_{t=1}^{T} \log F_\pi\big(\mathbf{r}(\rho; G_t)\big), \tag{5} \]

where each term in the loss functional depends on multiple preference relationships specified by the order graph for observation $t$. The problem of learning then consists of solving the optimization problem

\[ \inf_{\rho, F_\pi} L(\rho, F_\pi, D). \tag{6} \]

In general, the above structured loss functional may be difficult to specify, as it takes on the form of a joint CDF over many random variables with a high degree of inter-dependency, which may require a large number of parameters to specify. We can, however, compactly represent this using the CDN framework, as we will now show.

3.1 Transforming order graphs into CDNs

Figure 3: Transforming the order graph $G_t$ into a CDN. For each edge $e = (V_i \rightarrow V_j)$ in the order graph (left), a preference variable $\pi_{ij}$ is created. All such random variables are then connected to one another in a CDN (right), allowing for complex dependencies between preferences.

The representation of the structured loss functional in Equation (5) as a CDN consists of transforming the order graph $G_t$ for each observation into a set of variable nodes in a CDN.
More precisely, for each edge $e = (V_i \rightarrow V_j)$ in the order graph, the preference variable $\pi_{ij}$ is created. All such variables are then connected to one another in a CDN (Figure 3), where the pattern of connectivity used will determine the set of dependencies between these preferences $\pi_{ij}$, as given by the marginal and conditional independence properties of CDNs [5]. Thus, for any given CDN topology, each preference node $\pi_e$ is a member of some neighborhood of preference nodes $\pi_{e'}$, so that neighboring preference nodes are marginally dependent on one another.

One possible concern here is that we may require a fully connected CDN topology over all possible pairwise preferences between all nodes in order to capture all of these dependencies, leading to a model which is cumbersome to learn. In practice, because any observation only conveys information about a small subset of the nodes in $V$, and because we typically observe only partial orderings between these, the order graph is sparse, and so the number of preference nodes in the CDN for the given observation will be much smaller than the worst-case number of all possible pairwise preferences between nodes. Furthermore, we do not have to store a large CDN in memory during training, as we only need to store a single CDN over a relatively small number of preference variables for the current observation. We can thus perform ranking learning in an online fashion by constructing a single CDN for each observation $D_t$ and optimizing the loss $-\log F_\pi\big(\mathbf{r}(\rho; G_t)\big)$ defined by that CDN for the given observation.

4 StructRank: a probabilistic model for structured ranking learning with node labels

Suppose now that each node in the training set is provided with an ordinal node label $y$ along with a feature vector $\mathbf{x}$.
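The order-graph-to-CDN transformation of Section 3.1 can be sketched as follows (a hypothetical helper of our own, here with a fully connected topology over the preference nodes):

```python
from itertools import combinations

def order_graph_to_cdn(edges):
    # Hypothetical helper: one preference variable pi_ij per order-graph
    # edge V_i -> V_j; in a fully connected CDN topology, every pair of
    # preference variables shares a CDN function, making neighboring
    # preferences marginally dependent.
    pref_vars = ["pi_%d%d" % (i, j) for (i, j) in edges]
    cdn_functions = list(combinations(pref_vars, 2))
    return pref_vars, cdn_functions

pref_vars, funcs = order_graph_to_cdn([(1, 2), (1, 3), (2, 4)])
```

A sparse order graph with $|E_t|$ edges yields only $|E_t|$ preference nodes, which is why a per-observation CDN stays small even when $V$ is large.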
For any given order graph over some subset of the nodes, the node labels $y$ allow us to establish edges in the order graph, so that an edge $V_i \rightarrow V_j$ exists between two nodes $V_i, V_j$ iff $y_i > y_j$. We can then parametrically model the ranking function $\rho(V) \equiv \rho(\mathbf{x}; \mathbf{a})$ (where $\mathbf{a}$ is a set of parameters) using a Nadaraya-Watson [10, 12] local estimator with a Gaussian kernel, so that

\[ \rho(\mathbf{x}; \mathbf{a}) = \frac{\sum_i K(\mathbf{x}_i, \mathbf{x}; \mathbf{a})\, y_i}{\sum_i K(\mathbf{x}_i, \mathbf{x}; \mathbf{a})}, \qquad K(\tilde{\mathbf{x}}, \mathbf{x}; \mathbf{a}) = \exp\Big( -\frac{1}{2} (\mathbf{x} - \tilde{\mathbf{x}})^T A (\mathbf{x} - \tilde{\mathbf{x}}) \Big), \tag{7} \]

where the summations are taken over all feature vector-label pairs in the training set, with $A = \mathrm{diag}(a_1^2, \cdots, a_L^2)$. Consider now an edge $e = (V_i \rightarrow V_j)$ in the order graph and define $r_e \equiv r_e(\mathbf{a}; D_t) = \rho(\mathbf{x}^t_i; \mathbf{a}) - \rho(\mathbf{x}^t_j; \mathbf{a})$. For a given order graph, the structured loss functional $L(\theta; D_t)$ is given by

\[ L(\theta; D_t) = -\log F_\pi\big(\mathbf{r}(\rho; G_t)\big) = -\sum_{e, e'} \log \phi\big(r_e(\mathbf{a}; D_t), r_{e'}(\mathbf{a}; D_t)\big), \tag{8} \]

where $\theta = [\mathbf{a}\ w_1\ w_2]$ is the parameter vector and the function $\phi(r_1, r_2)$ is set to a multivariate sigmoidal function, so that

\[ \phi(r_1, r_2) = \frac{1}{1 + \exp(-w_1 r_1) + \exp(-w_2 r_2)}, \qquad w_1, w_2 \ge 0, \tag{9} \]

where $w_1, w_2$ are weights parameterizing the CDN function $\phi(r_1, r_2)$. It can be readily shown that this choice of CDN function $\phi(r_1, r_2)$, when combined with the constraints $w_1, w_2 \ge 0$, satisfies all of the necessary and sufficient conditions required for the CDN to represent a valid CDF, as $0 \le \phi(r_1, r_2) \le 1$ and $\phi$ is monotonically non-decreasing with respect to all of its arguments.
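A minimal sketch of the ranking function of Equation (7) and the bivariate sigmoid CDN function of Equation (9), using toy data and hypothetical parameter values:

```python
import math

def rho(x, train, a):
    # Nadaraya-Watson estimator of Equation (7): a kernel-weighted average
    # of training labels, with Gaussian kernel and A = diag(a_1^2, ..., a_L^2).
    def k(xi):
        return math.exp(-0.5 * sum(al ** 2 * (u - v) ** 2
                                   for al, u, v in zip(a, x, xi)))
    num = sum(k(xi) * yi for xi, yi in train)
    den = sum(k(xi) for xi, _ in train)
    return num / den

def phi_pair(r1, r2, w1, w2):
    # Bivariate sigmoid CDN function of Equation (9); valid for w1, w2 >= 0.
    return 1.0 / (1.0 + math.exp(-w1 * r1) + math.exp(-w2 * r2))

train = [([0.0, 0.0], 0.0), ([1.0, 1.0], 2.0)]  # toy (feature vector, label) pairs
score = rho([1.0, 1.0], train, a=[1.0, 1.0])
p = phi_pair(2.0, 2.0, w1=1.0, w2=1.0)
```

The score is pulled toward the labels of nearby training points, and $\phi$ increases as either score difference $r_1$ or $r_2$ grows, as required of a CDN function.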
For the given CDN and ranking functions, the learning problem for the current observation $D_t$ then becomes

\[ \inf_\theta \sum_t \sum_{e, e'} \log\Big[ 1 + \exp\big(-w_1 r_e(\mathbf{a}; D_t)\big) + \exp\big(-w_2 r_{e'}(\mathbf{a}; D_t)\big) \Big] \quad \text{s.t.}\ \theta \ge 0,\ \|\theta\|_1 \le t, \tag{10} \]

where we have introduced a regularizer in the form of an L1-norm constraint. Notice that our model has one parameter per data feature and 2 parameters defining the CDN for any given observation. The gradient $\nabla_{\mathbf{a}} L(\theta; D_t)$ and the derivatives with respect to the CDN function weights $w_1, w_2$ for a given observation $D_t$ are provided in the Supplementary Information.

5 Results

To compare the performance of our proposed framework to other methods, we use the following three metrics commonly in use in information retrieval research: Precision, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) [6]. The NDCG accounts for the fact that less relevant documents are less likely to be examined by a user by putting more weight on highly relevant documents than on marginally relevant ones.

We downloaded the OHSUMED dataset provided as part of the LETOR 2.0 benchmark [8]. The dataset consists of a set of 106 queries, with a feature vector and relevance judgment provided for each query-document pair, where queries correspond to medical searches associated with patient and topic information. There are a total of 16,140 query-document pairs with relevance judgments provided by humans on three ordinal levels: definitely relevant, partially relevant or not relevant.

Figure 4: a) Average NDCG as a function of truncation level $n$ for the OHSUMED dataset. NDCG values are averaged over 5 cross-validation splits; b) Mean average precision as a function of truncation level $n$; c) Mean average precision value for several methods.
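For reference, one common formulation of the NDCG measure used above can be sketched as follows (the exact gain and discount functions used by the LETOR evaluation scripts may differ):

```python
import math

def ndcg(rels, n):
    # NDCG at truncation level n: DCG with gain (2^rel - 1) discounted by
    # log2(rank + 1), normalized by the DCG of the ideal (relevance-sorted)
    # ordering of the same documents.
    def dcg(rs):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rs[:n]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

perfect = ndcg([2, 1, 0], n=3)  # relevance-sorted ranking
worse = ndcg([0, 1, 2], n=3)    # reversed ranking
```

The exponential gain is what puts more weight on highly relevant documents, and the logarithmic discount is what downweights documents a user is unlikely to examine.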
For any given query, we used the ordinal labels $y$ for each document in the query in order to establish preferences between documents for that query. Each node in the order graph is provided with 25 query-specific features, including term frequency, document length, BM25 and LMIR features, as well as combinations thereof [1, 11, 14]. In accordance with the nomenclature above, we use the terms query and observation interchangeably.

The OHSUMED dataset is provided in the form of 5 training/validation/test splits of sizes 63/21/22 observations each. To ensure that features are comparable across all observations, we normalized each feature vector within each observation as described in [8]. We performed learning of our model using a constrained stochastic gradients algorithm where, for each observation, we prevent updates from violating the inequality constraints in the optimization problem defined by Equation (10) by reducing the learning rate $\alpha$ until the update becomes feasible. We set the default learning rate to $\alpha = 0.5$ and we randomly initialized the model parameters $\mathbf{a}, w_1, w_2$ in the range $[0, 1]$. This optimization was run for 10 epochs (passes through the training set) and $\alpha$ was scaled by $\frac{1}{\sqrt{2}}$ at the end of each epoch. We set the regularization parameter using the validation set for a given data split. Due to the nonconvex nature of the optimization problem, for each cross-validation split we performed learning using 3 random initializations, and we then selected the model which achieved the best MAP score on the validation set.

We tested a fully connected CDN, which models full interdependence between preferences, and a completely disconnected CDN, which models preferences independently of one another. The above 3 performance metrics are shown in Figures 4(a), 4(b), 4(c), in addition to the performances of seven state-of-the-art methods which are part of the LETOR 2.0 benchmarks.
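The constrained update described above can be sketched as follows (a simplified version: the halving backoff and the tolerance are our own choices, since the paper only says the rate is reduced until the update is feasible, and only the nonnegativity constraint of Equation (10) is enforced here):

```python
def constrained_sgd_step(theta, grad, alpha):
    # One update of a constrained stochastic gradient scheme: shrink the
    # learning rate until the gradient step stays feasible (here only the
    # nonnegativity constraint theta >= 0 of Equation (10) is checked).
    while alpha > 1e-8:
        cand = [t - alpha * g for t, g in zip(theta, grad)]
        if all(c >= 0.0 for c in cand):
            return cand, alpha
        alpha *= 0.5  # backoff rule is an assumption; the paper does not specify it
    return theta, alpha

theta, used_alpha = constrained_sgd_step([0.2, 1.0], [1.0, -0.5], alpha=0.5)
```

Starting from $\alpha = 0.5$, the step is shrunk until the first parameter stays nonnegative, so the accepted update uses a smaller effective learning rate.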
At the time of submission, numerical performance scores for ListMLE [13] were not available and so were not included in these plots. With the exception of ListNet and ListMLE, none of the above methods explicitly model dependencies between pairwise preferences. As can be seen, accounting for dependencies between pairwise preferences provides a significant gain in performance compared to modelling preferences as being independent. Additional results on the TREC2004 dataset from LETOR 2.0 are provided in the Supplemental Information.

6 Discussion

We have proposed here a novel framework for ranking learning using structured loss functionals. We have shown that the problem of learning to rank can be reduced to maximizing a joint CDF over multiple pairwise preferences. We have shown how to compactly represent this using the CDN framework and have applied it to the OHSUMED benchmark dataset. We have demonstrated that representing the dependencies between pairwise preferences leads to improved performance over modelling preferences as being independent of one another.

6.1 Relation to RankNet and ListNet/ListMLE

The probability models for ranking proposed by [2, 4, 13] can all be expressed as special cases of models defined by different CDNs. In the case of RankNet [2], the corresponding probability over a given pairwise preference $V_i \succ V_j$ is modelled by a logistic function of $\rho(\mathbf{x}_i) - \rho(\mathbf{x}_j)$, and the model was optimized using a cross-entropy loss. The joint probability of preferences can thus be represented as a completely disconnected CDN with logistic functions, in which all pairwise object preferences are treated as being independent.
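This correspondence can be sketched numerically: under a completely disconnected CDN, the probability of an order graph factorizes into independent pairwise logistic terms (illustrative code of our own, not from the paper):

```python
import math

def ranknet_pair_prob(s_i, s_j):
    # RankNet models P(V_i preferred to V_j) as a logistic function of the
    # score difference rho(x_i) - rho(x_j).
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def disconnected_cdn_prob(scores, edges):
    # Treating all pairs independently, the joint probability of an order
    # graph is a product of logistic factors: a completely disconnected CDN.
    p = 1.0
    for i, j in edges:
        p *= ranknet_pair_prob(scores[i], scores[j])
    return p

p = disconnected_cdn_prob([2.0, 1.0, 0.0], [(0, 1), (1, 2), (0, 2)])
```

Each logistic factor is itself a valid univariate CDF in the preference variable, so the product is a valid, but fully factorized, joint CDF.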
In the case of ListNet [4] and ListMLE [13], the probability of observing a complete ordering $V_1 \succ \cdots \succ V_N$ over $N$ objects is defined as a product of functions of the type

\[ P(V_1 \succ \cdots \succ V_N | D) = \prod_{i=1}^{N} \frac{\exp(\rho(\mathbf{x}_i))}{\sum_{k=i}^{N} \exp(\rho(\mathbf{x}_k))} = \prod_{i=1}^{N} \frac{1}{1 + \sum_{k=i+1}^{N} \exp\big( -(\rho(\mathbf{x}_i) - \rho(\mathbf{x}_k)) \big)} = \prod_{i=1}^{N} \phi_i(r_i), \]

which we see is equivalent to a CDN with $N$ multivariate sigmoids. As noted by the authors of [13], the above model is also an example of the Plackett-Luce class of probability models over object scores [9]. In addition, the ListNet/ListMLE frameworks both require a complete ordering over objects by definition: under the CDN framework, we can model partial orderings, with complete orderings as a special case. The connections between RankNet, ListNet and ListMLE and the CDN framework are illustrated in Supplementary Figure 2. Our proposed framework unifies the above views of ranking as different instantiations of a joint CDF over pairwise preferences, and hence as particular types of CDNs. This allows us to consider flexible joint CDFs defined over different subsets of object preferences and over different families of CDN functions, so as to capture various data-specific properties.

6.2 Future directions

Our work here suggests several future directions for research. In [13], it was shown that the log-likelihood corresponding to the probability of an ordering is a good surrogate for the 0-1 loss between the predicted ordering and the true ordering, as the former is differentiable and penalizes mis-orderings in a sensible way. One could investigate connections between the structured loss functionals proposed in this paper and other ranking measures such as NDCG.
Another possible direction is to generalize StructRank to products over Gaussian multivariate CDFs or other classes of functions which satisfy the requirements of CDN functions, as in this paper we have elected to use a product of bivariate sigmoids $\phi(r_e, r_{e'})$ to represent our loss functional. It may also be fruitful to investigate different CDN topologies: for example, we found that averages of randomly connected CDNs are very fast to learn and perform comparably to the fully connected CDN we used in this paper (data not shown). In addition, we have only investigated representing the loss functional using a single CDN function: this could easily be generalized to $K$ functions. Lastly, alternatives to the Nadaraya-Watson local estimator, such as the neural networks used in [2, 4, 13], can be investigated.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.

[2] C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton and G. Hullender. Learning to rank using gradient descent. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), 2005.

[3] C.J.C. Burges, R. Ragno and Q.V. Le. Learning to rank with nonsmooth cost functions. In Proceedings of the Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), 2007.

[4] Z. Cao, T. Qin, T.Y. Liu, M.F. Tsai and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), 2007.

[5] J.C. Huang and B.J. Frey. Cumulative distribution networks and the derivative-sum-product algorithm. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), 2008.

[6] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 2002.

[7] T. Joachims.
A support vector method for multivariate performance measures. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), 2005.

[8] T.Y. Liu, J. Xu, T. Qin, W. Xiong and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. LR4IR 2007, in conjunction with SIGIR 2007, 2007.

[9] J.I. Marden. Analyzing and Modeling Rank Data. CRC Press, 1995.

[10] E.A. Nadaraya. On estimating regression. Theory of Probability and its Applications 9(1), pp. 141-142, 1964.

[11] S.E. Robertson. Overview of the OKAPI projects. Journal of Documentation 53(1), pp. 3-7, 1997.

[12] G.S. Watson. Smooth regression analysis. The Indian Journal of Statistics, Series A 26, pp. 359-372, 1964.

[13] F. Xia, T.Y. Liu, J. Wang, W. Zhang and H. Li. Listwise approach to learning to rank: theory and algorithm. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), 2008.

[14] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR 2001, 2001.
", "award": [], "sourceid": 252, "authors": [{"given_name": "Jim", "family_name": "Huang", "institution": null}, {"given_name": "Brendan", "family_name": "Frey", "institution": null}]}