{"title": "Learning Rankings via Convex Hull Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 395, "page_last": 402, "abstract": "", "full_text": "Learning Rankings via Convex Hull Separation\n\nGlenn Fung, R\u00b4omer Rosales, Balaji Krishnapuram\n\nComputer Aided Diagnosis, Siemens Medical Solutions USA, Malvern, PA 19355\n\n{glenn.fung, romer.rosales, balaji.krishnapuram}@siemens.com\n\nAbstract\n\nWe propose ef\ufb01cient algorithms for learning ranking functions from or-\nder constraints between sets\u2014i.e. classes\u2014of training samples. Our al-\ngorithms may be used for maximizing the generalized Wilcoxon Mann\nWhitney statistic that accounts for the partial ordering of the classes: spe-\ncial cases include maximizing the area under the ROC curve for binary\nclassi\ufb01cation and its generalization for ordinal regression. Experiments\non public benchmarks indicate that: (a) the proposed algorithm is at least\nas accurate as the current state-of-the-art; (b) computationally, it is sev-\neral orders of magnitude faster and\u2014unlike current methods\u2014it is easily\nable to handle even large datasets with over 20,000 samples.\n\n1\n\nIntroduction\n\nMany machine learning applications depend on accurately ordering the elements of a set\nbased on the known ordering of only some of its elements. In the literature, variants of this\nproblem have been referred to as ordinal regression, ranking, and learning of preference\nrelations. Formally, we want to \ufb01nd a function f : \u211cn \u2192 \u211c such that, for a set of test\nsamples {xk \u2208 \u211cn}, the output of the function f (xk) can be sorted to obtain a ranking. In\norder to learn such a function we are provided with training data, A, containing S sets (or\ni=1), where the j-th set Aj contains\nj=1 mj samples in A. 
Further, we are also\nprovided with a directed order graph G = (S, E) each of whose vertices corresponds to a\nclass Aj, and the existence of a directed edge EP Q\u2014corresponding to AP \u2192 AQ\u2014means\nthat all training samples xp \u2208 AP should be ranked higher than any sample xq \u2208 AQ: i.e.\n\u2200 (xp\u2208 AP , xq\u2208 AQ), f (xp) \u2264 f (xq).\n\nclasses) of training samples: A = SS\nmj samples, so that we have a total of m = PS\n\nj=1(Aj = {xj\n\ni }mj\n\nIn general the number of constraints on the ranking function grows as O(m2) so that naive\nsolutions are computationally infeasible even for moderate sized training sets with a few\nthousand samples. Hence, we propose a more stringent problem with a larger (in\ufb01nite) set\nof constraints, that is nevertheless much more tractably solved. In particular, we modify\nthe constraints to: \u2200 (xp \u2208 CH(AP ), xq \u2208 CH(AQ)), f (xp) \u2264 f (xq), where CH(Aj)\ndenotes the set of all points in the convex hull of Aj.\nWe show how this leads to: (a) a family of approximations to the original problem; and (b)\nconsiderably more ef\ufb01cient solutions that still enforce all of the original inter-group order\nconstraints. Notice that, this formulation subsumes the standard ranking problem (e.g. [4])\nas a special case when each set Aj is reduced to a singleton and the order graph is equal to\n\n\f{v,w,x}\n\n{v,w}\n\n{v,w}\n\n{v}\n\n{w}\n\n{v,w,x}\n\n{v,w}\n\n{x}\n\n{y,z}\n\n(a)\n\n{x}\n\n{y,z}\n\n{x}\n\n{y,z}\n\n{y}\n\n(b)\n\n(c)\n\n{x}\n\n(d)\n\n{z}\n\n{y}\n\n{z}\n\n{y}\n\n{z}\n\n(e)\n\n(f)\n\nFigure 1: Various instances of the proposed ranking problem consistent with the training set\n{v, w, x, y, z} satisfying v > w > x > y > z. Each problem instance is de\ufb01ned by an order\ngraph. (a-d) A succession of order graphs with an increasing number of constraints (e-f) Two order\ngraphs de\ufb01ning the same partial ordering but different problem instances.\n\na full graph. 
However, as illustrated in Figure 1, the formulation is more general and does not require a total ordering of the sets of training samples A_j, i.e. it allows any order graph G to be incorporated into the problem.

1.1 Generalized Wilcoxon-Mann-Whitney Statistics

A distinction is usually made between classification and ordinal regression methods on one hand, and ranking on the other. In particular, the loss functions used for classification and ordinal regression evaluate whether each test sample is correctly classified: in other words, the loss functions that are used to evaluate these algorithms, e.g. the 0-1 loss for binary classification, are computed for every sample individually, and then averaged over the training or test set.

By contrast, bipartite ranking solutions are evaluated using the Wilcoxon-Mann-Whitney (WMW) statistic, which measures the (sample averaged) probability that any pair of samples is ordered correctly; intuitively, the WMW statistic may be interpreted as the area under the ROC curve (AUC). We define a slight generalization of the WMW statistic that accounts for our notion of class-ordering:

WMW(f, A) = Σ_{E_{ij}} [ Σ_{k=1}^{m_i} Σ_{l=1}^{m_j} δ( f(x^j_l) < f(x^i_k) ) ] / [ Σ_{k=1}^{m_i} Σ_{l=1}^{m_j} 1 ].

Hence, if a sample is individually misclassified because it falls on the wrong side of the decision boundary between classes, it incurs a penalty in ordinal regression; whereas, in ranking, it may still be correctly ordered with respect to every other test sample, and thus it may incur no penalty in the WMW statistic.

1.2 Previous Work

Ordinal regression and methods for handling structured output classes: For a classic description of generalized linear models for ordinal regression, see [11]. A non-parametric Bayesian model for ordinal regression based on Gaussian processes (GP) was defined in [1].
Several recent machine learning papers consider structured output classes: e.g. [13] presents SVM based algorithms for handling structured and interdependent output spaces, and [5] discusses automatic document categorization into pre-defined hierarchies or taxonomies of topics.

Learning Rankings: The problem of learning rankings was first treated as a classification problem on pairs of objects by Herbrich [4] and subsequently used on a web page ranking task by Joachims [6]; a variety of authors have investigated this approach recently. The major advantage of this approach is that it considers a more explicit notion of ordering. However, the naive optimization strategy proposed there suffers from the O(m^2) growth in the number of constraints mentioned in the previous section. This computational burden renders these methods impractical even for medium sized datasets with a few thousand samples. In other related work, boosting methods have been proposed for learning preferences [3], and a combinatorial structure called the ranking poset was used for conditional modeling of partially ranked data [8], in the context of combining ranked sets of web pages produced by various web-page search engines. Another, less related, approach is [2].

Relationship to the proposed work: Our algorithm penalizes wrong ordering of pairs of training instances in order to learn ranking functions (similar to [4]), but in addition, it can also utilize the notion of a structured class order graph. Nevertheless, using a formulation based on constraints over convex hulls of the training classes, our method avoids the prohibitive computational complexity of the previous algorithms for ranking.

1.3 Notation and Background

In the following, vectors will be assumed to be column vectors unless transposed to a row vector by a prime superscript ′.
The cardinality of a set A will be denoted by #(A). The scalar (inner) product of two vectors x and y in the n-dimensional real space ℜ^n will be denoted by x′y, and the 2-norm of x will be denoted by ‖x‖. For a matrix A ∈ ℜ^{m×n}, A_i is the ith row of A, which is a row vector in ℜ^n, while A·j is the jth column of A. A column vector of ones of arbitrary dimension will be denoted by e. For A ∈ ℜ^{m×n} and B ∈ ℜ^{n×k}, the kernel K(A, B) maps ℜ^{m×n} × ℜ^{n×k} into ℜ^{m×k}. In particular, if x and y are column vectors in ℜ^n, then K(x′, y) is a real number, K(x′, A′) is a row vector in ℜ^m, and K(A, A′) is an m × m matrix. The identity matrix of arbitrary dimension will be denoted by I.

2 Convex Hull formulation

We are interested in learning a ranking function f : ℜ^n → ℜ given known ranking relationships between some training instances A_i, A_j ⊂ A. Let the ranking relationships be specified by a set E = {(i, j) | A_i ≺ A_j}.

To begin with, let us consider the linearly separable binary ranking case, which is equivalent to the problem of classifying m points in the n-dimensional real space ℜ^n, represented by the m × n matrix A, according to membership of each point x = A_i in the class A+ or A− as specified by a given vector of labels d. In other words, for binary classifiers, we want a linear ranking function f_w(x) = w′x that satisfies the following constraints:

∀ (x+ ∈ A+, x− ∈ A−), f(x−) ≤ f(x+) ⇒ f(x−) − f(x+) = w′x− − w′x+ ≤ −1 < 0. (1)

Clearly, the number of constraints grows as O(m+ m−), which is roughly quadratic in the number of training samples (unless we have severe class imbalance).
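To make the O(m+ m−) growth concrete, a short sketch (toy class sizes chosen for illustration, not taken from the paper) that enumerates the pairwise constraints of (1):

```python
from itertools import product

# Toy binary ranking problem: the naive formulation adds one constraint
# of the form w'x- - w'x+ <= -1 for EVERY (x+, x-) cross-class pair.
m_pos, m_neg = 300, 200                      # samples in A+ and A-
pairs = list(product(range(m_pos), range(m_neg)))
print(len(pairs))                            # 60000 constraints already
```

Even at a few hundred samples per class the constraint count reaches tens of thousands, which motivates the convex-hull reformulation below.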
While easily overcome (based on additional insights) in the separable problem, in the non-separable case the quadratic growth in the number of constraints poses huge computational burdens on the optimization algorithm; indeed, direct optimization with these constraints is infeasible even for moderate sized problems. We overcome this computational problem based on three key insights that are explained below.

First, notice that (by negation) the feasibility constraints in (1) can also be defined as:

∀ (x+ ∈ A+, x− ∈ A−), w′x− − w′x+ ≤ −1 ⇔ ∄ (x+ ∈ A+, x− ∈ A−), w′x− − w′x+ > −1.

In other words, a solution w is feasible iff there exists no pair of samples from the two classes such that f_w(·) orders them incorrectly.

Second, we will make the constraints in (1) more stringent: instead of requiring that equation (1) be satisfied for each possible pair (x+ ∈ A+, x− ∈ A−) in the training set, we will require (1) to be satisfied for each pair (x+ ∈ CH(A+), x− ∈ CH(A−)), where CH(A_i) denotes the convex hull of the set A_i [12].

Figure 2: Example binary problem where points belonging to the A+ and A− sets are represented by blue circles and red triangles respectively. Note that two elements x_i and x_j of the set A− are not correctly ordered and hence generate positive values of the corresponding slack variables y_i and y_j. Note that the point x_k (hollow triangle) is in the convex hull of the set A−, and hence the corresponding error y_k can be written as a convex combination (y_k = λ^k_i y_i + λ^k_j y_j) of the two nonzero errors corresponding to points of A−.
Thus, our constraints become:

∀ (λ+, λ−) such that { 0 ≤ λ+ ≤ 1, Σλ+ = 1; 0 ≤ λ− ≤ 1, Σλ− = 1 }:  w′A−′λ− − w′A+′λ+ ≤ −1. (2)

Next, notice that all the linear inequality and equality constraints on (λ+, λ−) may be conveniently grouped together as Bλ ≤ b, where

λ = [ λ− ; λ+ ] ∈ ℜ^{m×1},

B− = [ −I_{m−}  0 ; e′  0 ; −e′  0 ] ∈ ℜ^{(m−+2)×m},   b− = [ 0_{m−} ; 1 ; −1 ] ∈ ℜ^{(m−+2)×1},   (3)

B+ = [ 0  −I_{m+} ; 0  e′ ; 0  −e′ ] ∈ ℜ^{(m++2)×m},   b+ = [ 0_{m+} ; 1 ; −1 ] ∈ ℜ^{(m++2)×1},

B = [ B− ; B+ ] ∈ ℜ^{(m+4)×m},   b = [ b− ; b+ ].   (4)

Thus, our constraints on w can be written as:

∀ λ s.t. Bλ ≤ b,  w′A−′λ− − w′A+′λ+ ≤ −1   (5)
⇔ ∄ λ s.t. Bλ ≤ b,  w′A−′λ− − w′A+′λ+ > −1   (6)
⇔ ∃ u s.t. B′u − [ A−w ; −A+w ] = 0,  b′u ≤ −1,  u ≥ 0,   (7)

where the second equivalent form of the constraints was obtained by negation (as before), and the third equivalent form results from our third key insight: the application of Farkas' theorem of alternatives [9].
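The reason the infinite family of constraints in (2) is tractable is that a linear functional attains its extrema over a convex hull at the hull's vertices, so checking the training points suffices. A small numpy illustration on toy separable data (the data and the candidate w are made up for this sketch, not the paper's solver):

```python
import numpy as np

# Toy separable binary data in R^2: rows are samples.
A_pos = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.0]])   # A+
A_neg = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])   # A-
w = np.array([1.0, 1.0])                                  # candidate ranking direction

# Worst case of w'A-'lam- - w'A+'lam+ over ALL convex combinations
# is attained at vertices: max score over A- minus min score over A+.
worst_gap = (A_neg @ w).max() - (A_pos @ w).min()
print(worst_gap <= -1)                                    # constraint (2) holds for every hull point

# Spot-check against 1000 random convex combinations (lam-, lam+):
rng = np.random.default_rng(0)
for _ in range(1000):
    lam_n = rng.random(len(A_neg)); lam_n /= lam_n.sum()
    lam_p = rng.random(len(A_pos)); lam_p /= lam_p.sum()
    assert lam_n @ (A_neg @ w) - lam_p @ (A_pos @ w) <= worst_gap + 1e-12
```

No sampled convex combination can exceed the vertex-based worst case, which is the geometric fact the Farkas-based reformulation exploits.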
The resulting linear system of m equalities and m + 5 inequalities in m + n + 4 variables can be used while minimizing any regularizer (such as ‖w‖^2) to obtain the linear ranking function that satisfies (1); notice, however, that we avoid the O(m^2) scaling in the number of constraints.

2.1 The binary non-separable case

In the non-separable case, CH(A+) ∩ CH(A−) ≠ ∅, so the requirements have to be relaxed by introducing slack variables. To this end, we allow one slack variable y_i ≥ 0 for each training sample x_i, and consider the slack for any point inside the convex hull CH(A_j) to also be a convex combination of y (see Fig. 2). For example, this implies that if only a subset of training samples have non-zero slacks y_i > 0 (i.e. they are possibly misclassified), then the slacks of any points inside the convex hull also only depend on those y_i. Thus, our constraints now become:

∀ λ s.t. Bλ ≤ b,  w′A−′λ− − w′A+′λ+ ≤ −1 + (λ−′y− + λ+′y+),  y+ ≥ 0, y− ≥ 0.   (8)

Applying Farkas' theorem of alternatives, we get:

(2) ⇔ ∃ u s.t.
B\u2032u \u2212(cid:20) A\u2212w\n\n\u2212A+w (cid:21) +(cid:20) y\u2212\n\ny+ (cid:21) = 0, b\u2032u \u2264 \u22121, u \u2265 0\n\nu+\u2032\n\n] \u2265 0 we get the constraints:\n\n(9)\n\n(10)\n(11)\n(12)\n\nReplacing B from equation (4) and de\ufb01ning u\u2032 = [u\u2212\u2032\nu+ + A+w + y+ = 0,\nu\u2212 \u2212 A\u2212w + y\u2212 = 0,\n\nB+\u2032\nB\u2212\u2032\n\nb+u+ + b\u2212u\u2212 \u2264 \u22121, u \u2265 0\n\n2.2 The general ranking problem\n\nNow we can extend the idea presented in the previous section for any given arbitrary di-\nrected order graph G = (S, E), as stated in the introduction, each of whose vertices corre-\nsponds to a class Aj and the existence of a directed edge Eij means that all training samples\nxi \u2208 Ai should be ranked higher than any sample xj \u2208 Aj, that is:\n\nf (xj) \u2264 f (xi) \u21d2 f (xj) \u2212 f (xi) = w\u2032xj \u2212 w\u2032xi \u2264 \u22121 \u2264 0\n\n(13)\nAnalogously we obtain the following set of equations that enforced the ordering between\nsets Ai and Aj:\n\nuij + Aiw + yi = 0\n\u02c6uij \u2212 Ajw + yj = 0\n\n(14)\n(15)\n(16)\n(17)\nIt can be shown that using the de\ufb01nitions of Bi,Bj,bi,bj and the fact that uij, \u02c6uij \u2265 0,\nequations (14) can be rewritten in the following way:\n\nBi\u2032\nBj \u2032\nbiuij + bj \u02c6uij \u2264 \u22121\nuij, \u02c6uij \u2265 0\n\n(18)\n(19)\n(20)\n(21)\nwhere \u03b3ij = biuij and \u02c6\u03b3ij = bj \u02c6uij. Note that enforcing the constraints de\ufb01ned above\nindeed implies the desired ordering, since we have:\n\n\u03b3ij + Aiw + yi \u2265 0\n\u02c6\u03b3ij \u2212 Ajw + yj \u2265 0\n\u03b3ij + \u02c6\u03b3ij \u2264 \u22121\nyi, yj \u2265 0\n\nAiw + yi \u2265 \u2212\u03b3ij \u2265 \u02c6\u03b3ij + 1 \u2265 \u02c6\u03b3ij \u2265 Ajw \u2212 yj\n\nIt is also important to note the connection with Support Vector Machines (SVM) formu-\nlation [10, 14] for the binary case. 
If we impose the extra constraints −γ_ij = γ + 1 and γ̂_ij = γ − 1, then equations (18)-(21) imply the constraints included in the standard primal SVM formulation. To obtain a more general formulation, we can “kernelize” equations (18)-(21) by making a transformation of the variable w as w = A′v, where v can be interpreted as an arbitrary variable in ℜ^m; this transformation can be motivated by duality theory [10]. Equations (18)-(21) then become:

γ_ij + A_i A′v + y_i ≥ 0   (22)
γ̂_ij − A_j A′v + y_j ≥ 0   (23)
γ_ij + γ̂_ij ≤ −1   (24)
y_i, y_j ≥ 0   (25)

If we now replace the linear kernels A_i A′ and A_j A′ by the more general kernels K(A_i, A′) and K(A_j, A′), we obtain a “kernelized” version of these constraints:

E_{ij} ≡ { γ_ij + K(A_i, A′)v + y_i ≥ 0,  γ̂_ij − K(A_j, A′)v + y_j ≥ 0,  γ_ij + γ̂_ij ≤ −1,  y_i, y_j ≥ 0 }   (26)

Given a graph G = (V, E) representing the ordering of the training data and using equations (26), we next present a general mathematical programming formulation of the ranking problem:

min_{v, y_i, γ_ij | (i,j) ∈ E}  ν ε(y) + R(v)
s.t.  E_{ij}  ∀ (i, j) ∈ E   (27)

where ε is a given loss function for the slack variables y_i and R(v) represents a regularizer on v, which parameterizes the normal to the hyperplane. For an arbitrary kernel K(x, x′), the number of variables in formulation (27) is 2m + 2#(E), and the number of linear constraints (excluding the nonnegativity constraints) is m#(E) + #(E) = #(E)(m + 1). For a linear kernel, i.e. K(x, x′) = xx′, the number of variables in formulation (27) becomes m + n + 2#(E), and the number of linear constraints remains the same.
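The variable and constraint counts just stated can be tabulated directly; a minimal sketch (the helper name is ours, not the paper's):

```python
def formulation_size(m, n, num_edges, linear_kernel=False):
    # Size of formulation (27): m slack variables y, plus either
    # v in R^m (arbitrary kernel) or w in R^n (linear kernel),
    # plus the 2#(E) variables gamma_ij and gamma_hat_ij.
    num_vars = (m + n if linear_kernel else 2 * m) + 2 * num_edges
    # Each order-graph edge contributes m + 1 linear constraints
    # (nonnegativity constraints excluded): #(E)(m + 1) in total.
    num_constraints = num_edges * (m + 1)
    return num_vars, num_constraints

# Chain graph over 5 classes (4 edges), m = 500 samples, n = 10 features:
print(formulation_size(500, 10, 4))                      # (1008, 2004)
print(formulation_size(500, 10, 4, linear_kernel=True))  # (518, 2004)
```

Note that both counts grow linearly in m for a fixed order graph, in contrast to the O(m^2) pairwise formulations.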
When using a linear kernel and using ε(x) = R(x) = ‖x‖^2, the optimization problem (27) becomes a linearly constrained quadratic optimization problem, for which a unique solution exists due to the convexity of the objective function:

min_{w, y_i, γ_ij | (i,j) ∈ E}  ν‖y‖^2 + (1/2) w′w
s.t.  E_{ij}  ∀ (i, j) ∈ E   (28)

Unlike other SVM-like methods for ranking, which need O(m^2) slack variables y, our formulation requires only one slack variable per training sample, i.e. only m slack variables in total, giving our formulation a computational advantage over those ranking methods. Next, we demonstrate the effectiveness of our algorithm by comparing it to two state-of-the-art algorithms.

3 Experimental Evaluation

We tested our approach on a set of nine publicly available datasets 1 shown in Tab. 1 (several large datasets are not reported since only the algorithm presented in this paper was able to run on them). These datasets have been frequently used as a benchmark for ordinal regression methods (e.g. [1]). Here we use them for evaluating ranking performance. We compare our method against SVM for ranking (e.g. [4, 6]) using the SVM-light package 2 and an efficient Gaussian process method (the informative vector machine) 3 [7].

These datasets were originally designed for regression; thus the continuous target values for each dataset were discretized into five equal-size bins. We use these bins to define our ranking constraints: all the datapoints with target values falling in the same bin were grouped together. Each dataset was divided into 10% for testing and 90% for training. Thus, the input to all of the algorithms tested was, for each point in the training set: (1) a vector in ℜ^n (where n is different for each set) and (2) a value from 1 to 5 denoting the rank of the group to which it belongs.

Performance is defined in terms of the Wilcoxon statistic.
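The generalized WMW statistic of Section 1.1 can be computed directly from its definition; a sketch (our own helper, which averages the per-edge pair-ordering accuracy so the value lies in [0, 1] like an AUC):

```python
import numpy as np

def generalized_wmw(scores, edges):
    # scores: dict mapping group id -> 1-D array of ranking scores f(x)
    # edges:  (i, j) pairs meaning group i should outrank group j
    fracs = []
    for i, j in edges:
        hi = np.asarray(scores[i], dtype=float)
        lo = np.asarray(scores[j], dtype=float)
        # mi x mj matrix of all cross-group comparisons via broadcasting
        correct = np.sum(hi[:, None] > lo[None, :])
        fracs.append(correct / (hi.size * lo.size))
    return float(np.mean(fracs))

# Chain graph 1 -> 2 -> 3 on toy scores: every cross-edge pair ordered correctly
s = {1: [3.0, 2.5], 2: [2.0], 3: [1.0, 0.5]}
print(generalized_wmw(s, [(1, 2), (2, 3)]))   # 1.0
```

For a single edge (the bipartite case) this reduces to the usual WMW statistic, i.e. the AUC.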
1 Available at http://www.liacc.up.pt/~ltorgo/Regression/DataSets.html
2 http://www.cs.cornell.edu/People/tj/svm_light/
3 http://www.dcs.shef.ac.uk/~neil/ivm/

Table 1: Benchmark Datasets

Name                 m      n
1 Abalone            4177   9
2 Airplane Comp.     950    10
3 Auto-MPG           392    8
4 CA Housing         20640  9
5 Housing-Boston     506    14
6 Machine-CPU        209    7
7 Pyrimidines        74     28
8 Triazines          186    61
9 WI Breast Cancer   194    33

Figure 3: Experimental comparison of the ranking SVM (SVM-light), IVM and the proposed method (full-graph and chain-graph variants) on nine benchmark datasets. Along with the mean values in 10 fold cross-validation, the entire range of variation is indicated in the error-bars. (a) The overall accuracy (generalized Wilcoxon statistic, AUC) for all the three methods is comparable. (b) The proposed method has a much lower run time (log scale) than the other methods, even in the full graph case, for medium to large size datasets. NOTE: Both SVM-light and IVM ran out of memory and crashed on dataset 4; on dataset 1, SVM-light failed to complete even one fold after more than 24 hours of run time, so its results could not be compiled in time for submission.

Since we do not employ information about the ranking of the elements within each group, order constraints within a group cannot be verified.
Letting b(m) = m(m − 1)/2, the total number of order constraints is equal to b(m) − Σ_i b(m_i), where m_i is the number of instances in group i.

The results for all of the algorithms are shown in Fig. 3. Our formulation was tested employing two order graphs, the full directed acyclic graph and the chain graph. The performance for all datasets is generally comparable or significantly better for our algorithm (when using a chain order graph). Note that the performance for the full graph is consistently lower than that for the chain graph. Thus, interestingly, enforcing more order constraints does not necessarily imply better performance. We suspect that this is due to the role that the slack variables play in both formulations, since the number of slack variables remains the same while the number of constraints increases. Adding more slack variables may positively affect performance in the full graph case, but this comes at a computational cost. An interesting open problem is to find the right compromise. A different but potentially related problem is that of finding a good order graph given a dataset. Note also that the chain graph is much more stable with regard to performance overall. Regarding run-time, our algorithm runs an order of magnitude faster than current implementations of state-of-the-art methods, even approximate ones (like IVM).

4 Discussions and future work

We propose a general method for learning a ranking function from structured order constraints on sets of training samples. The proposed algorithm was illustrated on benchmark ranking problems with two different constraint graphs: (a) a chain graph; and (b) a full ordering graph.
Although the chain graph was more accurate in the experiments shown in Figure 3, with either type of graph structure the proposed method is at least as accurate (in terms of the WMW statistic for ordinal regression) as state-of-the-art algorithms such as the ranking-SVM and Gaussian Processes for ordinal regression.

Besides being accurate, the computational requirements of our algorithm scale much more favorably with the number of training samples than those of other state-of-the-art methods. Indeed, it was the only algorithm capable of handling several large datasets, while the other methods either crashed due to lack of memory or ran for so long that they were not practically feasible. While our experiments illustrate only specific order graphs, we stress that the method is general enough to handle arbitrary constraint relationships.

While the proposed formulation reduces the computational complexity of enforcing order constraints, it is entirely independent of the regularizer that is minimized (under these constraints) while learning the optimal ranking function. Though we have used a simple margin regularizer (via ‖w‖^2 in (28)) and RKHS regularization in (27) in order to learn in a supervised setting, we can just as easily use a graph-Laplacian based regularizer that exploits unlabeled data, in order to learn in semi-supervised settings. We plan to explore this in future work.

References

[1] W. Chu and Z. Ghahramani, Gaussian processes for ordinal regression, Tech. report, University College London, 2004.

[2] K. Crammer and Y. Singer, Pranking with ranking, Neural Info. Proc. Systems, 2002.

[3] Y. Freund, R. Iyer, and R. Schapire, An efficient boosting algorithm for combining preferences, Journal of Machine Learning Research 4 (2003), 933-969.

[4] R. Herbrich, T. Graepel, and K.
Obermayer, Large margin rank boundaries for ordinal regression, Advances in Large Margin Classifiers (2000), 115-132.

[5] T. Hofmann, L. Cai, and M. Ciaramita, Learning with taxonomies: Classifying documents and words, NIPS Workshop on Syntax, Semantics, and Statistics, 2003.

[6] T. Joachims, Optimizing search engines using clickthrough data, Proc. ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002.

[7] N. Lawrence, M. Seeger, and R. Herbrich, Fast sparse Gaussian process methods: The informative vector machine, Neural Info. Proc. Systems, 2002.

[8] G. Lebanon and J. Lafferty, Conditional models on the ranking poset, Neural Info. Proc. Systems, 2002.

[9] O. L. Mangasarian, Nonlinear programming, McGraw-Hill, New York, 1969. Reprint: SIAM Classics in Applied Mathematics 10, 1994, Philadelphia.

[10] O. L. Mangasarian, Generalized support vector machines, Advances in Large Margin Classifiers, 2000, pp. 135-146.

[11] P. McCullagh and J. Nelder, Generalized linear models, Chapman & Hall, 1983.

[12] R. T. Rockafellar, Convex analysis, Princeton University Press, Princeton, New Jersey, 1970.

[13] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support vector machine learning for interdependent and structured output spaces, Int. Conf. on Machine Learning, 2004.

[14] V. N. Vapnik, The nature of statistical learning theory, second ed., Springer, New York, 2000.
", "award": [], "sourceid": 2804, "authors": [{"given_name": "Glenn", "family_name": "Fung", "institution": null}, {"given_name": "R\u00f3mer", "family_name": "Rosales", "institution": null}, {"given_name": "Balaji", "family_name": "Krishnapuram", "institution": null}]}