{"title": "Learning Preferences for Multiclass Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 24, "abstract": null, "full_text": " Learning Preferences for Multiclass Problems\n\n\n\n Fabio Aiolli Alessandro Sperduti\n Dept. of Computer Science Dept. of Pure and Applied Mathematics\n University of Pisa, Italy University of Padova, Italy\n aiolli@di.unipi.it sperduti@math.unipd.it\n\n\n\n\n Abstract\n\n Many interesting multiclass problems can be cast in the general frame-\n work of label ranking defined on a given set of classes. The evaluation\n for such a ranking is generally given in terms of the number of violated\n order constraints between classes. In this paper, we propose the Prefer-\n ence Learning Model as a unifying framework to model and solve a large\n class of multiclass problems in a large margin perspective. In addition,\n an original kernel-based method is proposed and evaluated on a ranking\n dataset with state-of-the-art results.\n\n\n1 Introduction\n\nThe presence of multiple classes in a learning domain introduces interesting tasks besides\nthe one to select the most appropriate class for an object, the well-known (single-label)\nmulticlass problem. Many others, including learning rankings, multi-label classification,\nhierarchical classification and ordinal regression, just to name a few, have not yet been\nsufficiently studied even though they should not be considered less important. One of the\nmajor problems when dealing with this large set of different settings is the lack of a single\nuniversal theory encompassing all of them.\n\nIn this paper we focus on multiclass problems where labels are given as partial order con-\nstraints over the classes. 
Tasks naturally falling into this family include category ranking, which is the task of inferring full orders over the classes, binary category ranking, which is the task of inferring orders such that a given subset of classes is top-ranked, and any general (q-label) classification problem.

Recently, efforts have been made towards unifying different ranking problems. In particular, in [5, 7] two frameworks have been proposed which aim at inducing a label ranking function from examples. Similarly, here we consider labels coded into sets of preference constraints, expressed as preference graphs over the set of classes. The multiclass problem is then reduced to learning a good set of scoring functions able to correctly rank the classes according to the constraints associated with the labels of the examples. Each preference graph disagreeing with the obtained ranking function will count as an error.

The primary contribution of this work is to make a further step towards the unification of different multiclass settings, and of different models to solve them, by proposing the Preference Learning Model, a very general framework to model and study several kinds of multiclass problems. In addition, a kernel-based method particularly suited for this setting is proposed and evaluated on a binary category ranking task with very promising results.

The Multiclass Setting  Let Ω be a set of classes. We consider a multiclass setting where data are supposed to be sampled according to a probability distribution D over X × Y, X ⊆ R^d, and a hypothesis space of functions F = {f : X × Ω → R} with parameters θ. Moreover, a cost function c(x, y|θ) defines the cost suffered by a given hypothesis on a pattern x ∈ X having label y ∈ Y. A multiclass learning algorithm searches for a set of parameters θ minimizing the true cost, that is, the expected value of the cost according to the true distribution of the data, i.e. R_t[θ] = E_{(x,y)∼D}[c(x, y|θ)]. 
The distribution D is typically unknown, while a training set S = {(x_1, y_1), ..., (x_n, y_n)} with examples drawn i.i.d. from D is available. An empirical approximation of the true cost, also referred to as the empirical cost, is defined by R_e[θ, S] = (1/n) Σ_{i=1}^n c(x_i, y_i|θ).

2 The Preference Learning Model

In this section, starting from the general multiclass setting described above, we propose a general technique to solve a large family of multiclass settings. The basic idea is to \"code\" the labels of the original multiclass problem as sets of ranking constraints given as preference graphs. Then, we introduce the Preference Learning Model (PLM) for the induction of optimal scoring functions that uses those constraints as supervision.

In the case of ranking-based multiclass settings, labels are given as partial orders over the classes (see [1] for a detailed taxonomy of multiclass learning problems). Moreover, as observed in [5], ranking problems can be generalized by considering labels given as preference graphs over a set of classes Ω = {1, ..., m}, and trying to find a consistent ranking function f_R : X → Π(Ω), where Π(Ω) is the set of permutations over Ω. More formally, considering a set Ω, a preference graph or \"p-graph\" over Ω is a directed graph v = (N, A), where N is the set of nodes and A is the set of arcs of the graph, accessed by the function A(v). An arc a ∈ A is associated with its starting node s = s(a) and its ending node e = e(a), and represents the information that the class s is preferred to, and should be ranked higher than, e. The set of p-graphs over Ω will be denoted by G(Ω).

Let a set of scoring functions f : X × Ω → R with parameters θ be given, working as predictors of the relevance of the associated class to given instances. A definition of a ranking function naturally follows by taking the permutation of elements in Ω corresponding to the sorting of the values of these functions, i.e. f_R(x|θ) = argsort_{ω∈Ω} f(x, ω|θ). 
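This sorting step can be sketched in a few lines. The linear scoring matrix W below is a hypothetical illustration, not the paper's model; any family of scoring functions would do:

```python
import numpy as np

def rank_classes(W, x):
    """Return the class indices sorted from highest to lowest score,
    i.e. the permutation f_R(x) induced by the scores f(x, r) = <W_r, x>."""
    scores = W @ x
    return list(np.argsort(-scores))  # negate for descending order

# toy example with three classes in R^2
W = np.array([[1.0, 0.0],    # class 0
              [0.0, 1.0],    # class 1
              [0.5, 0.5]])   # class 2
x = np.array([2.0, 1.0])     # scores: [2.0, 1.0, 1.5]
print(rank_classes(W, x))    # -> [0, 2, 1]
```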
We say that a preference arc a = (s, e) is consistent with a ranking hypothesis f_R(x|θ), and we write a ⊑ f_R(x|θ), when f(x, s|θ) ≥ f(x, e|θ) holds. Generalizing to graphs, a p-graph g is said to be consistent with a hypothesis f_R(x|θ), and we write g ⊑ f_R(x|θ), if every arc composing it is consistent, i.e. g ⊑ f_R(x|θ) ⇔ ∀a ∈ A(g), a ⊑ f_R(x|θ).

The PLM Mapping  Let us start by considering the way a multiclass problem is transformed into a PLM problem. As seen before, to evaluate the quality of a ranking function f_R(x|θ) it is necessary to specify the nature of the cost function c(x, y|θ). Specifically, we consider cost definitions that associate penalties whenever incorrect decisions are made (e.g. a classification error for classification problems, or a wrong ordering for ranking problems). To this end, as in [5], we consider a label mapping G : y → {g_1(y), ..., g_{q_y}(y)} where a set of subgraphs g_i(y) ∈ G(Ω) is associated with each label y ∈ Y. The total cost suffered by a ranking hypothesis f_R on the example x ∈ X labeled y ∈ Y is the number of p-graphs in G(y) not consistent with the ranking, i.e. c(x, y|θ) = Σ_{j=1}^{q_y} [[g_j(y) ⋢ f_R(x|θ)]], where [[b]] is 1 if the condition b holds, 0 otherwise. Let us describe three particular mappings proposed in [5] that seem worthy of note: (i) the identity mapping, denoted by G_I, where the label is mapped on itself and every inconsistent graph will have a unitary cost; (ii) the disagreement mapping, denoted by G_d, where a simple (single-preference) subgraph is built for each arc in A(y); and (iii) the domination mapping, denoted by G_D, where for each node r in y a subgraph consisting of r plus the nodes of its outgoing set is built. To clarify, Figure 1 shows a set of mapping examples.

Figure 1: Examples of label mappings for 2-label classification (a-c) and ranking (d-f).
Considering Ω = {1, 2, 3, 4, 5}, Figure 1-(a) shows the label y = [1, 2|3, 4, 5] for a 2-label classification setting. In particular, this corresponds to the mapping G(y) = G_I(y) = y, where a single wrong ranking of a class makes the predictor pay a unit of cost. Similarly, in Figure 1-(b) the label mapping G(y) = G_D(y) is presented for the same problem. Another variant is presented in Figure 1-(c), where the label mapping G(y) = G_d(y) is used and the target classes are independently evaluated and their errors cumulated. Note that all these graphs are subgraphs of the original label in 1-(a). As an additional example, we consider the three cases depicted in the right-hand side of Figure 1, which refer to a ranking problem with three classes Ω = {1, 2, 3}. In Figure 1-(d) the label y = [1|2|3] is given. As before, this also corresponds to the label mapping G(y) = G_I(y). Two alternative cost definitions can be obtained by using the p-graphs (actually sets of basic preferences) depicted in Figure 1-(e) and 1-(f). Note that the cost functions in these cases are different. For example, assuming f_R(x|θ) = [3|1|2], the p-graph in (e) induces a cost c(x, y_b|θ) = 2 while the p-graph in (f) induces a cost c(x, y_c|θ) = 1.

The PLM Setting  Once the label mapping G is fixed, the preference constraints of the original multiclass problem can be arranged into a set of preference constraints. Specifically, we consider the set V(S) = ∪_{(x_i,y_i)∈S} V(x_i, y_i), where V(x, y) = {(x, g_j(y))}_{j∈{1,...,q_y}} and each pair (x, g) ∈ X × G(Ω) is a preference constraint. Note that the same instance can be replicated in V(S). This can happen, for example, when multiple ranking constraints are associated with the same example of the original multiclass problem. 
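The three mappings and the resulting costs are easy to check on the 2-label example above. A minimal sketch (helper names are illustrative only) for the label y = [1, 2|3, 4, 5], where classes 1 and 2 are relevant:

```python
def consistent(scores, g):
    """A p-graph (set of arcs) is consistent iff every arc (s, e)
    has score[s] >= score[e]."""
    return all(scores[s] >= scores[e] for s, e in g)

def cost(scores, G_y):
    """Number of p-graphs in the label mapping not consistent with the scores."""
    return sum(not consistent(scores, g) for g in G_y)

relevant, irrelevant = [1, 2], [3, 4, 5]
arcs = [(s, e) for s in relevant for e in irrelevant]
G_I = [arcs]                                            # identity: one graph with all arcs
G_D = [[(r, e) for e in irrelevant] for r in relevant]  # domination: one graph per relevant class
G_d = [[a] for a in arcs]                               # disagreement: one graph per arc

scores = {1: 0.0, 2: 0.5, 3: 1.0, 4: -1.0, 5: -1.0}  # class 3 wrongly outranks 1 and 2
print(cost(scores, G_I), cost(scores, G_D), cost(scores, G_d))  # -> 1 2 2
```

The same mistake costs 1 under the identity mapping but 2 under the domination and disagreement mappings, illustrating how the mapping choice changes the cost being counted.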
Because of this, in the following we prefer to use a different notation for the instances in preference constraints, to avoid confusion with training examples.

Notions defined for the standard classification setting are easily extended to PLM. For a preference constraint (v, g) ∈ V, the constraint error incurred by the ranking hypothesis f_R(v|θ) is given by ε(v, g|θ) = [[g ⋢ f_R(v|θ)]]. The empirical cost is then defined as the cost over the whole constraint set, i.e. R_e[θ, V] = Σ_{i=1}^N ε(v_i, g_i|θ). In addition, we define the margin of a hypothesis on a pattern v for a preference arc a = (s, e), expressing how well the preference is satisfied, as the difference between the scores of the two linked nodes, i.e. ρ_A(v, a|θ) = f(v, s|θ) - f(v, e|θ). The margin for a p-graph constraint (v, g) is then defined as the minimum of the margins of the compounding preferences, ρ_G(v, g|θ) = min_{a∈A(g)} ρ_A(v, a|θ), and gives a measure of how well the hypothesis fulfills a given preference constraint. Note that, consistently with the classification setting, the margin is greater than 0 if and only if g ⊑ f_R(v|θ).

Learning in PLM  In the PLM we try to learn a \"simple\" hypothesis able to minimize the empirical cost of the original multiclass problem or, equivalently, to satisfy the constraints in V(S) as much as possible. The learning setting of the PLM can be reduced to the following scheme. Given a set V of pairs (v_i, g_i) ∈ X × G(Ω), i ∈ {1, ..., N}, N = Σ_{i=1}^n q_{y_i}, find a set of parameters θ for the ranking function f_R(v|θ) able to minimize a combination of a regularization and an empirical loss term, θ̂ = arg min_θ {R_e[θ, V] + λ R(θ)}, with λ a given constant. However, since the direct minimization of this functional is hard due to the discontinuous form of the empirical error term, we use an upper bound on the true empirical error. 
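The arc and p-graph margins can be sketched directly from their definitions (helper names are illustrative):

```python
def arc_margin(scores, a):
    """rho_A(v, a) = f(v, s) - f(v, e) for an arc a = (s, e)."""
    s, e = a
    return scores[s] - scores[e]

def graph_margin(scores, g):
    """rho_G(v, g): the worst (minimum) arc margin over the p-graph."""
    return min(arc_margin(scores, a) for a in g)

# the margin is > 0 iff the whole p-graph is consistent with the ranking
scores = {1: 2.0, 2: 1.5, 3: 0.2}
g = [(1, 2), (2, 3)]            # "1 above 2" and "2 above 3"
print(graph_margin(scores, g))  # -> 0.5 (the tight arc is (1, 2))
```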
To this end, let a monotonically non-increasing loss function L be defined such that L(ρ) ≥ 0 and L(0) = 1. Then, by defining a margin-based loss

    L_C(v, g|θ) = L(ρ_G(v, g|θ)) = max_{a∈A(g)} L(ρ_A(v, a|θ))    (1)

for a p-graph constraint (v, g) ∈ V and recalling the margin definition, the condition ε(v, g|θ) ≤ L_C(v, g|θ) always holds, thus obtaining R_e[θ, V] ≤ Σ_{i=1}^N L_C(v_i, g_i|θ).

The problem of learning with multiple classes (up to constant factors) is then reduced to the minimization of a (possibly regularized) loss functional

    θ̂ = arg min_θ {L(V|θ) + λ R(θ)}    (2)

where L(V|θ) = Σ_{i=1}^N max_{a∈A(g_i)} L(f(v_i, s(a)|θ) - f(v_i, e(a)|θ)).

Many different choices can be made for the function L(ρ). Some well-known examples are given in the following table:

    Method               L(ρ)
    β-margin Perceptron  [1 - β^{-1} ρ]_+
    Logistic Regression  log_2(1 + exp(-ρ))
    Soft margin          [1 - ρ]_+
    Mod. Least Square    [1 - ρ]_+^2
    Exponential          exp(-ρ)

Note that, if the function L(ρ) is convex with respect to the parameters θ, the minimization of the functional in Eq. (2) turns out to be quite easy given a convex regularization term. The only difficulty in this case is represented by the max term. A shortcut around this problem consists in upper-bounding the max with the sum operator, though this would probably lead to a quite rough approximation of the indicator function when considering p-graphs with many arcs. It can be shown that a number of related works, e.g. [5, 7], after minor modifications, can be seen as PLM instances using the sum approximation. Interestingly, PLM highlights that this approximation in fact corresponds to a change of the label mapping, obtained by decomposing a complex preference graph into a set of binary preferences, and thus changes the cost definition we are indeed minimizing. 
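With the soft margin loss L(ρ) = [1 - ρ]_+, the chain "indicator error ≤ max loss of Eq. (1) ≤ sum approximation" can be checked numerically. A small sketch under those definitions:

```python
def hinge(rho):
    """Soft margin loss L(rho) = [1 - rho]_+."""
    return max(0.0, 1.0 - rho)

arc_margins = [0.3, 1.5, -0.2]                 # margins of the arcs of one p-graph
violated = 1 if min(arc_margins) < 0 else 0    # indicator error of the p-graph
max_loss = max(hinge(r) for r in arc_margins)  # loss (1): hinge of the min margin
sum_loss = sum(hinge(r) for r in arc_margins)  # sum upper bound (looser)

# the sum bound also pays for arcs that are satisfied but with small margin
assert violated <= max_loss <= sum_loss
print(violated, max_loss, sum_loss)
```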
In this case, using either G_D or G_d is not going to make any difference at all.

Multiclass Prediction through PLM  A multiclass prediction is a function H : X → Y mapping instances to their associated label. Let a label mapping be given, defined as G(y) = {g_1(y), ..., g_{q_y}(y)}. Then, the PLM multiclass prediction is given as the label whose induced preference constraints mostly agree with the current hypothesis, i.e. H(x) = arg min_y L(V(x, y)|θ), where V(x, y) = {(x, g_j(y))}_{j∈{1,...,q_y}}. It can be shown that many of the most effective methods used for learning with multiple classes, including output coding (ECOC, OvA, OvO), boosting, least squares methods and all the methods in [10, 3, 7, 5], fit into the PLM setting. This issue is discussed in more detail in [1].

3 Preference Learning with Kernel Machines

In this section, we focus on a particular setting of the PLM framework consisting of a multivariate embedding h : X → R^s of linear functions parameterized by a set of vectors W_k ∈ R^d, k ∈ {1, ..., s}, accommodated in a matrix W ∈ R^{s×d}, i.e. h(x) = [h_1(x), ..., h_s(x)] = [⟨W_1, x⟩, ..., ⟨W_s, x⟩]. Furthermore, we consider the set of classes Ω = {1, ..., m} and a matrix M ∈ R^{m×s} of codes of length s, with as many rows as classes. This matrix has the same role as the coding matrix in multiclass coding, e.g. in ECOC. Finally, the scoring function for a given class is computed as the dot product between the embedding function and the class code vector

    f(x, r|W, M) = ⟨h(x), M_r⟩ = Σ_{k=1}^s M_{rk} ⟨W_k, x⟩    (3)

Now we are able to describe a kernel-based method for the effective solution of the PLM problem. In particular, we present the problem formulation and the associated optimization method for the task of learning the embedding function given fixed codes for the classes (embedding problem). 
Another worthwhile task consists in the optimization of the codes for the classes when the embedding function is kept fixed (coding problem), or even in performing a combination of the two (see for example [8]). A deeper study of the embedding-coding version of PLM and a set of examples can be found in [1].

PLM Kesler's Construction  As a first step, we generalize Kesler's construction, originally defined for single-label classification (see [6]), to the PLM setting, thus showing that the embedding problem can be formulated as a binary classification problem in a higher dimensional space when new variables are appropriately defined. Specifically, consider the vector y(a) = (M_{s(a)} - M_{e(a)}) ∈ R^s defined for every preference arc in a given preference constraint, that is, a = (s, e) ∈ A(g). For every instance v_i and preference (s, e), the preference condition ρ_A(v_i, a) ≥ 0 can be rewritten as

    ρ_A(v_i, a) = f(v_i, s) - f(v_i, e) = ⟨y(a), h(v_i)⟩ = Σ_{k=1}^s y_k(a) ⟨W_k, v_i⟩
                = Σ_{k=1}^s ⟨W_k, y_k(a) v_i⟩ = Σ_{k=1}^s ⟨W_k, [z_i^a]_k⟩ = ⟨W̄, z_i^a⟩ ≥ 0    (4)

where [·]_k denotes the k-th chunk of an s-chunk vector, W̄ ∈ R^{s·d} is the vector obtained by sequentially arranging the vectors W_k, and z_i^a = y(a) ⊗ v_i ∈ R^{s·d} is the embedded vector made of the s chunks defined by [z_i^a]_k = y_k(a) v_i, k ∈ {1, ..., s}. From this derivation it turns out that each preference of a constraint in the set V can be viewed as an example of dimension s·d in a binary classification problem. Each pair (v_i, g_i) ∈ V then generates a number of examples in this extended binary problem equal to the number of arcs of the p-graph g_i, for a total of Σ_{i=1}^N |A(g_i)| examples. In particular, the set Z = {z_i^a} is linearly separable in the higher dimensional problem if and only if there exists a consistent solution for the original PLM problem. 
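The identity in Eq. (4) is easy to verify numerically: z_i^a is the Kronecker product of the code difference y(a) with the instance, and its dot product with the stacked weight vector recovers the preference margin. A sketch (dimensions and the random data are assumptions for illustration):

```python
import numpy as np

def kesler_vector(M, v, arc):
    """z = y(a) (x) v with y(a) = M[s] - M[e]; chunk k equals y_k(a) * v."""
    s, e = arc
    return np.kron(M[s] - M[e], v)

rng = np.random.default_rng(0)
s_dim, d = 3, 4
W = rng.normal(size=(s_dim, d))   # rows are the vectors W_k
M = np.eye(s_dim)                 # standard-basis class codes
v = rng.normal(size=d)

z = kesler_vector(M, v, (0, 2))
scores = M @ (W @ v)              # f(v, r) = <h(v), M_r>
# <W_bar, z> equals f(v, 0) - f(v, 2), as claimed by Eq. (4)
print(np.isclose(W.ravel() @ z, scores[0] - scores[2]))  # -> True
```

Note that `W.ravel()` stacks the rows W_k sequentially, matching the chunk layout of z.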
Very similar considerations, omitted for space reasons, could be given for the coding problem as well.

The Kernel Preference Learning Optimization  As pointed out before, the central task in PLM is to learn scoring functions so as to be as consistent as possible with the set of constraints in V. This is done by finding a set of parameters minimizing a loss function that is an upper bound on the empirical error function. For the embedding problem, instantiating problem (2) and choosing the 2-norm of the parameters as regularizer, we obtain Ŵ = arg min_W (1/N) Σ_{i=1}^N L_C(v_i, g_i|W, M) + λ ||W||^2, where, according to Eq. (1), the loss for each preference constraint is computed as the maximum over the losses of all the associated preferences, that is, L_i = max_{a∈A(g_i)} L(⟨W̄, z_i^a⟩).

When the constraint set V contains basic preferences only (that is, p-graphs consisting of a single arc a_i = A(g_i)), the optimization problem can be simplified into the minimization of a standard functional combining a loss function with a regularization term. Specifically, all the losses presented before can be used and, for many of them, it is possible to give a kernel-based solution. See [11] for a set of examples of loss functions and the formulation of the associated problem with kernels.

The Kernel Preference Learning Machine  For the general case of p-graphs possibly containing multiple arcs, we propose a kernel-based method (hereafter referred to as the Kernel Preference Learning Machine, or KPLM for brevity) for PLM optimization which adopts the max loss in Eq. (2). Borrowing the idea of soft margin [9], for each preference arc a linear loss is used, giving an upper bound on the indicator function loss. Specifically, we use the SVM-like soft margin loss L(ρ) = [1 - ρ]_+.

Summarizing, we require a set of small-norm predictors that fulfill the soft constraints of the problem. 
These requirements can be expressed by the following quadratic problem:

    min_{W,ξ} (1/2) ||W̄||^2 + C Σ_{i=1}^N ξ_i
    subject to: ⟨W̄, z_i^a⟩ ≥ 1 - ξ_i, i ∈ {1, .., N}, a ∈ A(g_i)    (5)
                ξ_i ≥ 0, i ∈ {1, .., N}

Note that, differently from the SVM formulation for the binary classification setting, here the slack variables ξ_i are associated with multiple examples, one for each preference arc in the p-graph. Moreover, the optimal value of ξ_i corresponds to the loss value as defined by L_i. As is easily verifiable, this problem is convex and it can be solved in the usual way by resorting to the optimization of the Wolfe dual problem. Specifically, we have to find the saddle point (minimization w.r.t. the primal variables {W, ξ} and maximization w.r.t. the dual variables {α, λ}) of the following Lagrangian:

    Q(W, ξ, α, λ) = (1/2) ||W̄||^2 + C Σ_{i=1}^N ξ_i + Σ_{i=1}^N Σ_{a∈A(g_i)} α_i^a (1 - ξ_i - ⟨W̄, z_i^a⟩) - Σ_{i=1}^N λ_i ξ_i,   s.t. α_i^a, λ_i ≥ 0    (6)

By differentiating the Lagrangian with respect to the primal variables and imposing the optimality conditions, we obtain the set of constraints that the variables have to fulfill in order to be an optimal solution:

    ∂Q/∂W̄ = W̄ - Σ_{i=1}^N Σ_{a∈A(g_i)} α_i^a z_i^a = 0  ⇒  W̄ = Σ_{i=1}^N Σ_{a∈A(g_i)} α_i^a z_i^a
    ∂Q/∂ξ_i = C - Σ_{a∈A(g_i)} α_i^a - λ_i = 0  ⇒  Σ_{a∈A(g_i)} α_i^a ≤ C    (7)

Substituting conditions (7) in (6) and omitting constants that do not change the solution, the problem can be restated as

    max_α Σ_{i,a_i} α_i^{a_i} - (1/2) Σ_{i,a_i} Σ_{j,a_j} Σ_{k=1}^s y_k(a_i) y_k(a_j) α_i^{a_i} α_j^{a_j} ⟨v_i, v_j⟩
    subject to: α_i^a ≥ 0, i ∈ {1, .., N}, a ∈ A(g_i)    (8)
                Σ_a α_i^a ≤ C, i ∈ {1, .., N}

Since W_k = Σ_{i,a} y_k(a) α_i^a v_i = Σ_{i,a} [M_{s(a)} - M_{e(a)}]_k α_i^a v_i, k = 1, .., s, we obtain h_k(x) = ⟨W_k, x⟩ = Σ_{i,a} [M_{s(a)} - M_{e(a)}]_k α_i^a ⟨v_i, x⟩. 
Note that any kernel k(·, ·) can be substituted in place of the linear dot product ⟨·, ·⟩ to allow for non-linear decision functions.

Embedding Optimization  The problem in (8) recalls the one obtained for single-label multiclass SVM [1, 2] and, in fact, its optimization can be performed in a similar way. Assuming a number of arcs for each preference constraint equal to q, the dual problem in (8) involves N·q variables, leading to a very large scale problem. However, it can be noted that the independence of constraints among the different preference constraints allows for the separation of the variables into N disjoint sets of q variables each.

The algorithm we propose for the optimization of the overall problem consists in iteratively selecting a preference constraint (a p-graph) from the constraint set and then optimizing with respect to the variables associated with it, that is, one for each arc of the p-graph. From the convexity of the problem and the separation of the variables, since on each iteration we optimize on a different subset of variables, this guarantees that the optimal solution for the Lagrangian will be found when no new selection can lead to improvements.

The graph to optimize at each step is selected on the basis of a heuristic selection strategy. Let the preference constraint (v_i, g_i) ∈ V be selected at a given iteration. To enforce the constraint Σ_{a∈A(g_i)} α_i^a + λ_i = C, λ_i ≥ 0, two elements from the set of variables {α_i^a | a ∈ A(g_i)} ∪ {λ_i} will be optimized in pairs while keeping the solution inside the feasible region α_i^a ≥ 0. In particular, let α_1 and α_2 be the two selected variables; we restrict the updates to the form α_1 ← α_1 - ν and α_2 ← α_2 + ν with optimal choices for ν. The variables which most violate the constraints are iteratively selected until they satisfy the KKT optimality conditions. To this end, we have devised a KKT-based procedure which is able to select these variables in time linear in the number of classes. 
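Whatever optimization scheme produces the dual variables, the learned embedding is a kernel expansion over the constraint instances, h_k(x) = Σ_{i,a} α_i^a [M_{s(a)} - M_{e(a)}]_k k(v_i, x). A minimal sketch (the support-set representation below is an assumption for illustration):

```python
import numpy as np

def embed(x, support, M, kernel):
    """h(x) from a dual solution; support is a list of
    (v_i, (s, e), alpha_i_a) triples with alpha_i_a > 0."""
    h = np.zeros(M.shape[1])
    for v, (s, e), alpha in support:
        h += alpha * (M[s] - M[e]) * kernel(v, x)
    return h

def linear(u, w):
    return float(u @ w)

M = np.eye(3)                                   # standard-basis codes
support = [(np.array([1.0, 0.0]), (0, 1), 2.0)]  # one support preference "0 above 1"
print(embed(np.array([1.0, 1.0]), support, M, linear))  # -> [ 2. -2.  0.]
```

Replacing `linear` with any other kernel yields the non-linear decision functions mentioned above.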
For space reasons we omit the details and do not consider implementation issues here. Details and optimized versions of this basic algorithm can be found in [1].

Generalization of KPLM  As a first immediate result, we can give an upper bound on the leave-one-out error by utilizing the sparsity of a KPLM solution, namely LOO ≤ |V*|/N, where V* = {i ∈ {1, ..., N} : max_{a∈A(g_i)} α_i^a > 0} is the set of support vectors. Another interesting result about the generalization ability of a KPLM is given in the following theorem.

Theorem 1  Consider a KPLM hypothesis θ = (W, M) with Σ_{r=1}^s ||W_r||^2 = 1 and ||M_r||^2 ≤ R_M such that min_{(v,g)∈V} ρ_G(v, g|θ) ≥ γ. Then, for any probability distribution D on X × Y with support in a ball of radius R_X around the origin, with probability 1 - δ over n random examples S, the following bound for the true cost holds:

    R_t[θ] ≤ (2QA/n) ( (64R^2/γ^2) log(enγ/(8R^2)) log(32n/γ^2) + log(4/δ) )

where ∀y ∈ Y, q_y ≤ Q, |A(g_r(y))| ≤ A, r ∈ {1, ..., q_y}, and R = 2 R_M R_X.

Proof. Similar to that of Theorem 4.11 in [7], noting that the size of the examples in Z is upper-bounded by R = 2 R_M R_X.

4 Experiments

Experimental Setting  We performed experiments on the \"ModApte\" split of the Reuters-21578 dataset. We selected the 10 most popular categories, thus obtaining a reduced set of 6,490 training documents and a set of 2,545 test documents. The corpus was then pre-processed by discarding numbers and punctuation and converting letters to lowercase. We used a stop-list to remove very frequent words, and stemming was performed by means of Porter's stemmer. Term weights were calculated according to the tf-idf function. Term selection was not considered, thus obtaining a set of 28,006 distinct features.

We evaluated our framework on the binary category ranking task induced by the original multi-label classification task, thus requiring rankings having the target classes of the original multi-label problem on top. 
Five different well-known cost functions have been used. Let x be an instance having ranking label y. IErr is the cost function indicating a non-perfect ranking and corresponds to the identity mapping in Figure 1-(a). DErr is the cost defined as the number of relevant classes incorrectly ranked by the algorithm and corresponds to the domination mapping in Figure 1-(b). dErr is the cost obtained by counting the number of incorrect rankings and corresponds to the disagreement mapping in Figure 1-(c). Two other well-known Information Retrieval (IR) based cost functions have been used: the OneErr cost function, which is 1 whenever the top-ranked class is not a relevant class, and the average precision cost function, which is AvgP = (1/|y|) Σ_{r∈y} |{r' ∈ y : rank(x, r') ≤ rank(x, r)}| / rank(x, r).

Results  The model evaluation has been performed by comparing three different label mappings for KPLM and the baseline MMP algorithm [4], a variant of the Perceptron algorithm for ranking problems, with respect to the above-mentioned ranking losses. We used the configuration which gave the best results in the experiments reported in [4]. KPLM has been implemented setting s = m and using the standard basis vectors e_r ∈ R^m as codes associated with the classes. A linear kernel k(x, y) = ⟨x, y⟩ + 1 was used. Model selection for KPLM has been performed by means of a 5-fold cross validation for different values of the parameter C. The optimal parameters have been chosen as the ones minimizing the mean of the values of the loss (the one used for training) over the different folds. In Table 1 we report the obtained results. It is clear that KPLM definitely outperforms the MMP method. This is probably due to the use of margins in KPLM. Moreover, using the identity and domination mappings seems to lead to models that outperform the ones obtained by using the disagreement mapping. Interestingly, this also happens when comparing with respect to its own corresponding cost. 
This can be due to a looser approximation (a sum of approximations) of the true cost function. The same trend was confirmed by another set of experiments on artificial datasets that we are not able to report here due to space limitations.

    Method       IErr %   DErr %   dErr %   OneErr %   AvgP %
    MMP          5.07     4.92     0.89     4.28       97.49
    KPLM (G_I)   3.77     3.66     0.55     3.10       98.25
    KPLM (G_D)   3.81     3.59     0.54     3.14       98.24
    KPLM (G_d)   4.12     4.13     0.66     3.58       97.99

Table 1: Comparison of ranking performance for different methods using different loss functions according to different evaluation metrics. Best results are shown in bold.

5 Conclusions and Future Work

We have presented a common framework for the analysis of general multiclass problems and proposed a kernel-based method as an instance of this setting, which has shown very good results on a binary category ranking task. Promising directions of research, which we are currently pursuing, include experimenting with coding optimization and extending the current setting to on-line learning, interdependent labels (e.g. hierarchical or any other structured classification), ordinal regression problems, and classification with costs.

References

[1] F. Aiolli. Large Margin Multiclass Learning: Models and Algorithms. PhD thesis, Dept. of Computer Science, University of Pisa, 2004. http://www.di.unipi.it/~aiolli/thesis.ps.

[2] F. Aiolli and A. Sperduti. Multi-prototype support vector machine. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[3] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 35-46, 2000.

[4] K. Crammer and Y. Singer. A new family of online algorithms for category ranking. Journal of Machine Learning Research, 2003.

[5] O. Dekel, C.D. Manning, and Y. Singer. 
Log-linear models for label ranking. In Advances in Neural Information Processing Systems, 2003.

[6] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, chapter 5, page 266. Wiley, 2001.

[7] S. Har-Peled, D. Roth, and D. Zimak. Constraint classification: A new approach to multiclass classification. In Proceedings of the 13th International Conference on Algorithmic Learning Theory (ALT-02), 2002.

[8] G. Rätsch, A. Smola, and S. Mika. Adapting codes and embeddings for polychotomies. In Advances in Neural Information Processing Systems, 2002.

[9] V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.

[10] J. Weston and C. Watkins. Multiclass support vector machines. In M. Verleysen, editor, Proceedings of ESANN'99. D-Facto Press, 1999.

[11] T. Zhang and F.J. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5-31, 2001.
", "award": [], "sourceid": 2579, "authors": [{"given_name": "Fabio", "family_name": "Aiolli", "institution": null}, {"given_name": "Alessandro", "family_name": "Sperduti", "institution": null}]}