{"title": "A kernel method for multi-labelled classification", "book": "Advances in Neural Information Processing Systems", "page_first": 681, "page_last": 687, "abstract": null, "full_text": "A kernel method for multi-labelled classification\n\nAndr\u00e9 Elisseeff and Jason Weston\n\nBIOwulf Technologies, 305 Broadway, New York, NY 10007\n\n{andre,jason}@barhilltechnologies.com\n\nAbstract\n\nThis article presents a Support Vector Machine (SVM) like learning system to handle multi-label problems. Such problems are usually decomposed into many two-class problems but the expressive power of such a system can be weak [5, 7]. We explore a new direct approach. It is based on a large margin ranking system that shares a lot of common properties with SVMs. We tested it on a Yeast gene functional classification problem with positive results.\n\n1 Introduction\n\nMany problems in Text Mining or Bioinformatics are multi-labelled. That is, each point in a learning set is associated with a set of labels. Consider for instance the classification task of determining the subjects of a document, or of relating one protein to its many effects on a cell. In either case, the learning task would be to output a set of labels whose size is not known in advance: one document can for instance be about food, meat and finance, whereas another one would concern only food and fat. Two-class and multi-class classification or ordinal regression problems can all be cast into multi-label ones. This makes the latter quite attractive but at the same time it gives a warning: their generality hides how difficult they are to solve.
The number of publications does not contradict this statement: we are aware of only a few works on the subject [4, 5, 7] and they all concern text mining applications.\n\nIn Schapire and Singer\u2019s work on Boostexter, one of the only general purpose multi-label ranking systems [7], they observe that overfitting occurs on learning sets of relatively small size ([...]). They conclude that controlling the complexity of the overall learning system is an important research goal. The aim of the current paper is to provide a way of controlling this complexity while having a small empirical error. For that purpose, we consider only architectures based on linear models and follow the same reasoning as for the definition of Support Vector Machines [1]. Defining a cost function (section 2) and margin for multi-label models, we focus our attention mainly on an approach based on a ranking method combined with a predictor of the size of the sets (sections 3 and 4). Sections 5 and 6 present experiments on a toy problem and on a real dataset.\n\n2 Cost functions\n\nLet $\\mathcal{X} = \\mathbb{R}^d$ be a d-dimensional input space. We consider as an output space the space $\\mathcal{Y}$ formed by all the sets of integers between 1 and $Q$, identified here as the labels of the learning problem. Such an output space contains $2^Q$ elements and one output corresponds to one set of labels. The learning problem we are interested in is to find from a learning set $S = ((x_1, Y_1), \\ldots, (x_m, Y_m)) \\in (\\mathcal{X} \\times \\mathcal{Y})^m$, drawn identically and independently from an unknown distribution $D$, a function $f$ such that the following generalization error is as low as possible:\n\n$R(f) = E_{(x, Y) \\sim D}[c(f, x, Y)]$\n\nThe function $c$ is a real-valued loss and can take different forms depending on how $f$ is computed. Here, we consider only linear models. Given $Q$ vectors $w_1, \\ldots, w_Q$ and $Q$ biases $b_1, \\ldots, b_Q$, we follow two schemes:\n\nWith the binary approach: $f(x) = \\mathrm{sign}(\\langle w_1, x \\rangle + b_1, \\ldots, \\langle w_Q, x \\rangle + b_Q)$, where the sign function applies component-wise. The value of $f(x)$ is a binary vector from which the set of labels can be retrieved easily by stating that label $k$ is in the set iff $\\mathrm{sign}(\\langle w_k, x \\rangle + b_k) > 0$. For example this can be achieved by using an SVM for each binary problem and applying the latter rule [4].\n\nWith the ranking approach: assume that $s(x)$, the size of the label set for the input $x$, is known. We define $r_k(x) = \\langle w_k, x \\rangle + b_k$ and consider that a label $k$ is in the label set of $x$ iff $r_k(x)$ is among the largest $s(x)$ elements of $(r_1(x), \\ldots, r_Q(x))$. The algorithm Boostexter [7] is an example of such a system. The ranking approach is analyzed more precisely in section 3.\n\nWe consider the same loss functions as in [7] for any multi-label system built from real functions $(f_1, \\ldots, f_Q)$. These include the so-called Hamming Loss defined as\n\n$HL(f, x, Y) = \\frac{1}{Q} |f(x) \\Delta Y|$\n\nwhere $\\Delta$ stands for the symmetric difference of sets. When $|Y| = 1$, a multi-label system is in fact a multi-class one and the Hamming Loss is $2/Q$ times the usual classification loss. We also consider the one-error:\n\n$\\text{1-err}(f, x, Y) = 0$ if $\\arg\\max_k f_k(x) \\in Y$, and $1$ otherwise,\n\nwhich is exactly the same as the classification error for multi-class problems (it ignores the rankings apart from the highest ranked one and so does not address the quality of the other labels).\n\nOther losses concern only ranking systems (a system that specifies a ranking but no set size predictor $s(x)$). Let us denote by $\\bar{Y}$ the complementary set of $Y$ in $\\{1, \\ldots, Q\\}$. We define the Ranking Loss [7] to be:\n\n$RL(f, x, Y) = \\frac{1}{|Y| |\\bar{Y}|} |\\{(i, j) \\in Y \\times \\bar{Y} \\text{ s.t. } r_i(x) \\leq r_j(x)\\}|$\n\nIt represents the average fraction of pairs that are not correctly ordered. For ranking systems, this loss is natural and is related to the precision, which is a common error measure in Information Retrieval:\n\n$\\mathrm{precision}(f, x, Y) = \\frac{1}{|Y|} \\sum_{k \\in Y} \\frac{|\\{l \\in Y \\text{ s.t. } r_l(x) \\geq r_k(x)\\}|}{|\\{l \\in \\{1, \\ldots, Q\\} \\text{ s.t. } r_l(x) \\geq r_k(x)\\}|}$\n\nfrom which a loss can be directly deduced. All these loss functions have been discussed in [7]. Good systems should have a high precision and a low Hamming or Ranking Loss. We do not consider the one-error to be a good loss for multi-label systems but we retain it because it was measured in [7].\n\nFor multi-label linear models, we need to define a way of minimizing the empirical error measured by the appropriate loss and at the same time to control the complexity of the resulting model. A direct method would be to use the binary approach and thus take the benefit of good two-class systems. However, as has been raised in [5, 7], the binary approach does not take into account the correlation between labels and therefore does not capture the structure of some learning problems. We propose here to instead focus on the ranking approach. This will be done by introducing notions of margin and regularization as has been done for the two-class case in the definition of SVMs.\n\n3 Ranking based system\n\nOur goal is to define a linear model that minimizes the Ranking Loss while having a large margin. For systems that rank the values of $\\langle w_k, x \\rangle + b_k$, the decision boundaries for $x$ are defined by the hyperplanes whose equations are $\\langle w_k - w_l, x \\rangle + b_k - b_l = 0$, where $k$ belongs to the label set of $x$ and $l$ does not. So, the margin of $(x, Y)$ can be expressed as:\n\n$\\min_{k \\in Y, l \\in \\bar{Y}} \\frac{\\langle w_k - w_l, x \\rangle + b_k - b_l}{\\|w_k - w_l\\|}$\n\nIt represents the signed distance of $x$ to the decision boundary. Considering that all the data in the learning set $S$ are well ranked, we can normalize the parameters $w_k$ such that:\n\n$\\langle w_k - w_l, x \\rangle + b_k - b_l \\geq 1$\n\nwith equality for some $x \\in S$ and $(k, l) \\in Y \\times \\bar{Y}$. Maximizing the margin on the whole learning set can then be done via the following problem:\n\n$\\max_{w_j, j = 1, \\ldots, Q} \\min_{(x_i, Y_i) \\in S} \\min_{(k, l) \\in Y_i \\times \\bar{Y}_i} \\frac{1}{\\|w_k - w_l\\|^2}$\n\nsubject to: $\\langle w_k - w_l, x_i \\rangle + b_k - b_l \\geq 1$, $(k, l) \\in Y_i \\times \\bar{Y}_i$\n\nIn the case where the problem is not ill-conditioned (ill-conditioned meaning that two labels always co-occur), the objective function can be replaced by: $\\max_{w_j} \\min_{k, l} \\frac{1}{\\|w_k - w_l\\|^2} = \\min_{w_j} \\max_{k, l} \\|w_k - w_l\\|^2$. In order to get a simpler optimization procedure we approximate this maximum by the sum and, after some calculations (see [3] for details), we obtain:\n\n$\\min_{w_j, j = 1, \\ldots, Q} \\sum_{k=1}^{Q} \\|w_k\\|^2$\n\nsubject to: $\\langle w_k - w_l, x_i \\rangle + b_k - b_l \\geq 1$, $(k, l) \\in Y_i \\times \\bar{Y}_i$\n\nTo generalize this problem to the case where the learning set cannot be ranked exactly we follow the same reasoning as for the binary case: the ultimate goal would be to maximize the margin and at the same time to minimize the Ranking Loss. The latter can be expressed quite directly by extending the constraints of the previous problems. Indeed, if we have $\\langle w_k - w_l, x_i \\rangle + b_k - b_l \\geq 1 - \\xi_{ikl}$ for $(k, l) \\in Y_i \\times \\bar{Y}_i$, then the Ranking Loss on the learning set $S$ is:\n\n$\\frac{1}{m} \\sum_{i=1}^{m} \\frac{1}{|Y_i| |\\bar{Y}_i|} \\sum_{(k, l) \\in Y_i \\times \\bar{Y}_i} \\theta(-1 + \\xi_{ikl})$\n\nwhere $\\theta$ is the Heaviside function. As for SVMs we approximate the functions $\\theta(-1 + \\xi_{ikl})$ by only $\\xi_{ikl}$ and this gives the final quadratic optimization problem:\n\n$\\min_{w_j, j = 1, \\ldots, Q} \\sum_{k=1}^{Q} \\|w_k\\|^2 + C \\sum_{i=1}^{m} \\frac{1}{|Y_i| |\\bar{Y}_i|} \\sum_{(k, l) \\in Y_i \\times \\bar{Y}_i} \\xi_{ikl}$\n\nsubject to: $\\langle w_k - w_l, x_i \\rangle + b_k - b_l \\geq 1 - \\xi_{ikl}$, $(k, l) \\in Y_i \\times \\bar{Y}_i$, and $\\xi_{ikl} \\geq 0$\n\nIn the case where the label sets $Y_i$ all have a size of 1, we find the same optimization problem as has been derived for multi-class Support Vector Machines [8]. For this reason, we call the solution of this problem a ranking Support Vector Machine (Rank-SVM). Another common property with SVM is the possibility to use kernels rather than linear dot products. This can be achieved by computing the dual of the former optimization problem. We refer the reader to [3] for the dual formulation and to [2] and references therein for more information about kernels and SVMs.\n\nSolving a constrained quadratic problem like those we just introduced requires an amount of memory that is quadratic in terms of the learning set size and it is generally solved in [...] computational steps, where we have put into the $O$ the number of labels. Such a complexity is too high to apply these methods to many real datasets. To circumvent this limitation, we propose to use a linearization method in conjunction with a predictor-corrector logarithmic barrier procedure. Details are described in [3] with all the calculations relative to the implementation. The memory cost of the method then becomes [...], where [...] is the maximum number of labels. In many applications [...] is much larger than [...]. The time cost of each iteration is [...].\n\n4 Set size prediction\n\nSo far we have only developed ranking systems. To obtain a complete multi-label system we need to design a set size predictor $s(x)$. A natural way of doing this is to look for inspiration from the binary approach. The latter can indeed be interpreted as a ranking system whose ranks are derived from the real values $(f_1, \\ldots, f_Q)$. The predictor of the set size is then quite simple: $s(x) = |\\{k : f_k(x) > 0\\}|$ is the number of $f_k$ that are greater than $0$. The function $s(x)$ is computed from a threshold value that differentiates labels in the target set from others. For the ranking system introduced in the previous section we generalize this idea by designing a function $s(x) = |\\{k : f_k(x) > t(x)\\}|$. The remaining problem now is to choose $t(x)$, which is done by solving a learning problem. The training data are composed of the $(f_1(x_i), \\ldots, f_Q(x_i))$ given by the ranking system, and of the target values $t(x_i)$ defined by:\n\n$t(x_i) = \\arg\\min_t |\\{k \\in Y_i \\text{ s.t. } f_k(x_i) \\leq t\\}| + |\\{k \\in \\bar{Y}_i \\text{ s.t. } f_k(x_i) \\geq t\\}|$\n\nWhen the minimum is not unique and the optimal values are a segment, we choose the middle of this segment. We refer to this method of predicting the set size as the threshold based method. In the following, we have used linear least squares, and we applied it not only to Rank-SVM but also to Boostexter in order to transform these algorithms from ranking methods to multi-label ones.\n\nNote that we could have followed a much simpler scheme to build the function $s(x)$. A naive method would be to consider the set size prediction as a regression problem on the original training data with the targets $(|Y_i|)_{i = 1, \\ldots, m}$ and to use any regression learning system. This however does not provide a satisfactory solution, mainly because it does not take into account how the ranking is performed. In particular, when there are some errors in the ranking, it does not learn how to compensate for these errors, whereas the threshold based approach tries to learn the best threshold with respect to these errors.\n\n5 Toy problem\n\nAs previously noticed the binary approach is not appropriate for problems where correlations between labels exist. To illustrate this point consider figure 2. There are only three labels. One of them (label 1) is present for all points in the learning set. The binary approach leads to a system that will fail to separate, for instance, points with label 3 from points of label sets not containing 3, that is, on points of label 1 and 2. We see then that the expressive power of a binary system can be quite low when simple configurations occur. If we consider the ranking approach, one can imagine the following solution: [...] is the hyperplane separating class 2 from class 3, and [...]. By taking the number of labels at point $x$ to be $s(x) = $ [...], we have a simple multi-label system that separates all the regions exactly.\n\nFigure 2: Three labels and three regions in the input space. The upper left region is labelled with 1. The bottom right region is partitioned into two sub-regions with labels 1,2 or 1,3.\n\nTo make this point more concrete we sampled [...] points uniformly on [...] and solved all optimization problems with $C = $ [...]. On the learning set the Hamming Loss for the binary approach was [...], whereas for the direct approach it was [...] as expected.\n\n6 Experiments on real data\n\n[Figure: tree rooted at Yeast (Saccharomyces cerevisiae) whose first-level classes are Metabolism; Energy; Cell Growth, Cell Division, DNA synthesis; Transcription; Protein Synthesis; Protein Destination; Transport Facilitation; Cell. Transport, Transport Mechanisms; Cellular Biogenesis; Cell. communication, Signal Transduction; Cell. Rescue, Defense, Cell Death and Aging; Ionic Homeostasis; Cellular Organization; Transposable elements, Viral and Plasmid proteins. The gene YAL041W is shown as an example.]\n\nFigure 3: First level of the hierarchy of the gene functional classes. There are 14 groups. One gene, for instance the gene YAL041W, can belong to different groups (shaded in grey on the figure).\n\nThe Yeast dataset is formed by micro-array expression data and phylogenetic profiles with 1500 genes in the learning set and 917 in the test set. The input dimension is 103. Each gene is associated with a set of functional classes whose maximum size can be potentially more than 190. This dataset has already been analyzed with a two-class approach [6] and is known to be difficult. In order to make it easier, we used the known structure of the functional classes. The whole set of classes is indeed structured in a tree whose leaves are the functional categories (see http://mips.gsf.de/proj/yeast/catalogues/funcat/ for more details).
Given a gene, knowing which edge to take from one level to another leads directly to a leaf and thus to a functional class. Here we try to predict which edge to take from the root to the first level of the tree (see figure 3).\n\nSince one gene can have many functional classes this is a multi-label problem: one gene is associated with different edges. We then have $Q = 14$ and the average number of labels for all genes in the learning set is 4.2. We assessed the quality of our method from two perspectives. First as a ranking system with the Ranking Loss and the precision. In that case, for the binary approach, the real outputs of the two-class SVMs were used as ranking values. Second, the methods were compared as multi-label systems using the Hamming Loss. We computed the latter for the binary approach used in conjunction with SVMs, for the Rank-SVM and for Boostexter. To measure the Hamming Loss with Boostexter we used a threshold based $s(x)$ function in combination with the ranking given by the algorithm.\n\nFor rank-SVMs and for two-class SVMs in the binary approach we chose polynomial kernels of degrees two to nine (experiments on two-class problems using the Yeast data in [6] already showed that polynomial kernels were appropriate for this task). Boostexter was used with the standard stump weak learner and was stopped after 1000 iterations. Results are reported in figures 4, 5 and 6.\n\n[Table: Precision, Ranking Loss, Hamming Loss and one-error for Rank-SVM and Binary-SVM with polynomial kernels of degree 2, 3, 4 and 5; numeric entries not recoverable.]\n\nFigure 4: Polynomials of degree 2-5. Loss functions for the rank-SVM and the binary approach based on two-class SVMs. Considering the size of the problem, two values that differ by less than [...] are not significantly different. Bold values represent superior performance comparing classifiers with the same kernel.\n\n[Table: Precision, Ranking Loss, Hamming Loss and one-error for Rank-SVM and Binary-SVM with polynomial kernels of degree 6, 7, 8 and 9; numeric entries not recoverable.]\n\nFigure 5: Polynomials of degree 6-9. Loss functions for the rank-SVM and the binary approach based on two-class SVMs. Considering the size of the problem, two values that differ by less than [...] are not significantly different. Bold values represent superior performance comparing classifiers with the same kernel.\n\n[Table: Precision, Ranking Loss, Hamming Loss and one-error for Boostexter (1000 iterations); numeric entries not recoverable.]\n\nFigure 6: Loss functions for Boostexter.
Note that these results are worse than with the binary approach or with rank-SVM.\n\nNote that Boostexter performs quite poorly on this dataset compared to SVM-based approaches. This may be due to the simple decision function realized by Boostexter. One of the main advantages of the SVM-based approaches is the ability to incorporate prior knowledge into the kernel and control complexity via the kernel and regularization. We believe this may also be possible with Boostexter but we are not aware of any work in this area.\n\nTo compare the binary and the rank-SVM we put in bold the best results for each kernel. For all kernels and for almost all losses, the ranking based SVM approach is better than the binary one. In terms of the Ranking Loss, the difference is significantly in favor of the rank-SVM. This is consistent with the fact that this system tends to minimize this particular loss function. It is worth noticing that when the kernel becomes more and more complex the difference between rank-SVM and the binary method disappears.\n\n7 Discussion and conclusion\n\nIn this paper we have defined a whole system to deal with multi-label problems. The main contribution is the definition of a ranking based SVM that extends the use of the latter to many problems in the area of Bioinformatics and Text Mining.\n\nWe have seen on complex, real data that rank-SVMs lead to better performance than Boostexter and the binary approach. On its own this could be interpreted as a sufficient argument to motivate the use of such a system. However, we can also extend the rank-SVM system to perform feature selection on ranking problems [3]. This application can be very useful in the field of bioinformatics as one is often interested in interpretability of a multi-label decision rule.
For example one could be interested in a small set of genes which is discriminative in a multi-condition physical disorder.\n\nWe have presented only first experiments using multi-labelled systems applied to Bioinformatics. Our future work is to conduct more investigations in this area.\n\nReferences\n\n[1] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144\u2013152, Pittsburgh, 1992. ACM.\n\n[2] N. Cristianini and J. Shawe-Taylor. Introduction to Support Vector Machines. Cambridge University Press, 2000.\n\n[3] Andr\u00e9 Elisseeff and Jason Weston. Kernel methods for multi-labelled classification and categorical regression problems. Technical report, BIOwulf Technologies, 2001. http://www.bht-\nlabs.com/public/.\n\n[4] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire N\u00e9dellec and C\u00e9line Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137\u2013142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.\n\n[5] A. McCallum. Multi-label text classification with a mixture model trained by EM. AAAI\u201999 Workshop on Text Learning, 1999.\n\n[6] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Combining microarray expression data and phylogenetic profiles to learn functional categories using support vector machines. In RECOMB, pages 242\u2013248, 2001.\n\n[7] R.E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135\u2013168, 2000.\n\n[8] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report 98-04, Royal Holloway, University of London, 1998.", "award": [], "sourceid": 1964, "authors": [{"given_name": "Andr\u00e9", "family_name": "Elisseeff", "institution": null}, {"given_name": "Jason", "family_name": "Weston", "institution": null}]}