{"title": "Reducing multiclass to binary by coupling probability estimates", "book": "Advances in Neural Information Processing Systems", "page_first": 1041, "page_last": 1048, "abstract": null, "full_text": "Reducing multiclass to binary by coupling probability estimates

Bianca Zadrozny
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093-0114
zadrozny@cs.ucsd.edu

Abstract

This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. It extends to arbitrary code matrices a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with boosted naive Bayes show that our method produces calibrated class membership probability estimates, while having classification accuracy similar to that of loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.

1 Introduction

The two best-known approaches for reducing a multiclass classification problem to a set of binary classification problems are one-against-all and all-pairs. In the one-against-all approach, we train a classifier for each class, using as positive examples the training examples that belong to that class and as negatives all the other training examples. In the all-pairs approach, we train a classifier for each possible pair of classes, ignoring the examples that do not belong to the two classes in question.

Although these two approaches are the most obvious, Allwein et al. [Allwein et al., 2000] have shown that there are many other ways in which a multiclass problem can be decomposed into a number of binary classification problems. 
We can represent each such decomposition by a code matrix M ∈ {-1, 0, +1}^(k × l), where k is the number of classes and l is the number of binary classification problems. If M(c, b) = +1 then the examples belonging to class c are considered to be positive examples for the binary classification problem b. Similarly, if M(c, b) = -1 the examples belonging to c are considered to be negative examples for b. Finally, if M(c, b) = 0 the examples belonging to c are not used in training a classifier for b.

For example, in the 3-class case, the all-pairs code matrix is

        b1   b2   b3
  c1    +1   +1    0
  c2    -1    0   +1
  c3     0   -1   -1

This approach for representing the decomposition of a multiclass problem into binary problems is a generalization of the Error-Correcting Output Codes (ECOC) scheme proposed by Dietterich and Bakiri [Dietterich and Bakiri, 1995]. The ECOC scheme does not allow zeros in the code matrix, meaning that all examples are used in each binary classification problem.

Orthogonal to the problem of choosing a code matrix for reducing multiclass to binary is the problem of classifying an example given the labels assigned by each binary classifier. Given an example x, Allwein et al. [Allwein et al., 2000] first create a vector v of length l containing the {-1, +1} labels assigned to x by each binary classifier. Then, they compute the Hamming distance between v and each row of M, and find the row c that is closest to v according to this metric. The label c is then assigned to x. 
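As an illustration, the decomposition and decoding procedure described above can be sketched in a few lines of Python. This is a sketch in the paper's notation, not code from the paper; the helper names are hypothetical, and the convention that zero entries of M contribute a distance of 1/2 follows Allwein et al.

```python
# Sketch (hypothetical helpers): all-pairs code matrix and Hamming decoding.

def all_pairs_matrix(k):
    """Rows are classes c, columns are class pairs (i, j) with i < j.
    Class i is the positive class (+1), class j the negative class (-1),
    and the remaining classes (0) are ignored when training column b."""
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    M = [[0] * len(pairs) for _ in range(k)]
    for b, (i, j) in enumerate(pairs):
        M[i][b] = +1
        M[j][b] = -1
    return M

def hamming_decode(M, v):
    """Return the row (class index) of M closest to the {-1,+1} label
    vector v; zero entries contribute 1/2, as in Allwein et al."""
    def dist(row):
        return sum((1 - m * s) / 2 for m, s in zip(row, v))
    return min(range(len(M)), key=lambda c: dist(M[c]))
```

For k = 3 this reproduces the matrix shown above, and decoding a new example's vector of binary predictions returns the index of the nearest class row.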
This method is called Hamming decoding.

For the case in which the binary classifiers output a score whose magnitude is a measure of confidence in the prediction, they use a decoding approach that takes the scores into account when calculating the distance between v and each row of M, instead of using the Hamming distance. This method is called loss-based decoding. Allwein et al. [Allwein et al., 2000] present theoretical and experimental results indicating that this method is better than Hamming decoding.

However, both of these methods simply assign a class label to each example. They do not output class membership probability estimates P̂(C = c | X = x) for an example x. These probability estimates are important when the classification outputs are not used in isolation and must be combined with other sources of information, such as misclassification costs [Zadrozny and Elkan, 2001a] or the outputs of another classifier.

Given a code matrix M and a binary classification learning algorithm that outputs probability estimates, we would like to couple the estimates given by each binary classifier in order to obtain class membership probability estimates for the multiclass problem.

Hastie and Tibshirani [Hastie and Tibshirani, 1998] describe a solution for obtaining probability estimates P̂(C = c | X = x) in the all-pairs case by coupling the pairwise probability estimates, which we describe in Section 2. In Section 3, we extend the method to arbitrary code matrices. In Section 4 we discuss the loss-based decoding approach in more detail and compare it mathematically to the method of Hastie and Tibshirani. In Section 5 we present experimental results.

2 Coupling pairwise probability estimates

We are given pairwise probability estimates r_ij(x) for every pair of classes i ≠ j, obtained by training a classifier using the examples belonging to class i as positives and the examples belonging to class j as negatives. 
We would like to couple these estimates to obtain a set of class membership probabilities p_i(x) for each example x. The r_ij are related to the p_i according to

    r_ij(x) ≈ P(C = i | C = i or C = j, X = x) = p_i(x) / (p_i(x) + p_j(x)).

Since we additionally require that Σ_i p_i(x) = 1, there are k - 1 free parameters and k(k-1)/2 constraints. This implies that there may not exist p_i satisfying these constraints.

Let n_ij be the number of training examples used to train the binary classifier that predicts r_ij. In order to find the best approximation r̂_ij(x) = p̂_i(x) / (p̂_i(x) + p̂_j(x)), Hastie and Tibshirani fit the Bradley-Terry model for paired comparisons [Bradley and Terry, 1952] by minimizing the average weighted Kullback-Leibler distance l(x) between r_ij(x) and r̂_ij(x) for each x, given by

    l(x) = Σ_{i<j} n_ij [ r_ij(x) log( r_ij(x) / r̂_ij(x) ) + (1 - r_ij(x)) log( (1 - r_ij(x)) / (1 - r̂_ij(x)) ) ].

The algorithm is as follows:

1. Start with some guess for the p̂_i(x) and corresponding r̂_ij(x).
2. Repeat until convergence:

   (a) For each i = 1, 2, ..., k, update

           p̂_i(x) ← p̂_i(x) · [ Σ_{j≠i} n_ij r_ij(x) ] / [ Σ_{j≠i} n_ij r̂_ij(x) ].

   (b) Renormalize the p̂_i(x).

   (c) Recompute the r̂_ij(x).

Hastie and Tibshirani [Hastie and Tibshirani, 1998] prove that the Kullback-Leibler distance between r_ij(x) and r̂_ij(x) decreases at each step. Since this distance is bounded below by zero, the algorithm converges. At convergence, the r̂_ij are consistent with the p̂_i. The class predicted for each example x is ĉ(x) = argmax_i p̂_i(x).

Hastie and Tibshirani also prove that the p̂_i(x) are in the same order as the non-iterative estimates p̃_i(x) ∝ Σ_{j≠i} r_ij(x). Thus, the p̃_i(x) are sufficient for predicting the most likely class for each example. However, as shown by Hastie and Tibshirani, they are not accurate probability estimates because they tend to underestimate the differences between the p̂_i(x) values.

3 Extending the Hastie-Tibshirani method to arbitrary code matrices
For an arbitrary code matrix M, instead of having pairwise probability estimates, we have an estimate r_b(x) for each column b of M, such that

    r_b(x) ≈ P(C ∈ I_b | C ∈ I_b ∪ J_b, X = x) = Σ_{c∈I_b} p_c(x) / ( Σ_{c∈I_b} p_c(x) + Σ_{c∈J_b} p_c(x) ),

where I_b and J_b are the sets of classes for which M(c, b) = +1 and M(c, b) = -1, respectively.

We would like to obtain a set of class membership probabilities p_i(x) compatible with the r_b(x) for each example, subject to Σ_i p_i(x) = 1. In this case, the number of free parameters is k - 1 and the number of constraints is l, where l is the number of columns of the code matrix. Since for most code matrices l is greater than k - 1, in general there is no exact solution to this problem. For this reason, we propose an algorithm analogous to the Hastie-Tibshirani method presented in the previous section to find the best approximate probability estimates p̂_i(x) such that

    r̂_b(x) = Σ_{c∈I_b} p̂_c(x) / ( Σ_{c∈I_b} p̂_c(x) + Σ_{c∈J_b} p̂_c(x) )

and the Kullback-Leibler distance between r̂_b(x) and r_b(x) is minimized.

Let n_b be the number of training examples used to train the binary classifier that corresponds to column b of the code matrix. The algorithm is as follows:

1. Start with some guess for the p̂_i(x) and corresponding r̂_b(x).
2. Repeat until convergence:

   (a) For each i = 1, 2, ..., k, update

           p̂_i(x) ← p̂_i(x) · [ Σ_{b: M(i,b)=+1} n_b r_b(x) + Σ_{b: M(i,b)=-1} n_b (1 - r_b(x)) ] / [ Σ_{b: M(i,b)=+1} n_b r̂_b(x) + Σ_{b: M(i,b)=-1} n_b (1 - r̂_b(x)) ].

   (b) Renormalize the p̂_i(x).

   (c) Recompute the r̂_b(x).

If the code matrix is the all-pairs matrix, this algorithm reduces to the original method by Hastie and Tibshirani.

Let B+_i be the set of matrix columns for which M(i, b) = +1 and B-_i be the set of matrix columns for which M(i, b) = -1. By analogy with the non-iterative estimates suggested by Hastie and Tibshirani, we can define non-iterative estimates

    p̃_i(x) = Σ_{b∈B+_i} r_b(x) + Σ_{b∈B-_i} (1 - r_b(x)).

For the all-pairs code matrix, these estimates are the same as the ones suggested by Hastie and Tibshirani. 
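The iterative procedure above can be sketched for a single example as follows. This is an illustrative sketch, not the paper's code: the function name is hypothetical, a fixed iteration count stands in for a convergence test, and for the all-pairs matrix it performs the original Hastie-Tibshirani updates.

```python
# Sketch (hypothetical helper): couple column-wise estimates r[b] into class
# probabilities p̂ for one example, given a code matrix M with entries in
# {-1, 0, +1} and per-column training-set sizes n[b].

def couple_code_matrix(M, r, n, iters=200):
    k, l = len(M), len(M[0])
    p = [1.0 / k] * k                                  # step 1: uniform guess
    for _ in range(iters):                             # step 2: iterate
        # (c) estimates r̂_b induced by the current p̂
        rhat = []
        for b in range(l):
            pos = sum(p[c] for c in range(k) if M[c][b] == +1)
            neg = sum(p[c] for c in range(k) if M[c][b] == -1)
            rhat.append(pos / (pos + neg))
        # (a) multiplicative update for each class
        for i in range(k):
            num = sum(n[b] * (r[b] if M[i][b] == +1 else 1 - r[b])
                      for b in range(l) if M[i][b] != 0)
            den = sum(n[b] * (rhat[b] if M[i][b] == +1 else 1 - rhat[b])
                      for b in range(l) if M[i][b] != 0)
            p[i] *= num / den
        # (b) renormalize
        s = sum(p)
        p = [pi / s for pi in p]
    return p
```

For example, with the one-against-all matrix for three classes and column estimates r_b that are exactly consistent with some underlying class probabilities, the iteration recovers those probabilities.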
However, for arbitrary matrices, we cannot prove that the non-iterative estimates predict the same class as the iterative estimates.

4 Loss-based decoding

In this section, we discuss how to apply the loss-based decoding method to classifiers that output class membership probability estimates. We also study the conditions under which this method predicts the same class as the Hastie-Tibshirani method, in the all-pairs case.

The loss-based decoding method [Allwein et al., 2000] requires that each binary classifier output a margin score satisfying two requirements. First, the score should be positive if the example is classified as positive, and negative if the example is classified as negative. Second, the magnitude of the score should be a measure of confidence in the prediction.

The method works as follows. Let f_b(x) be the margin score predicted by the classifier corresponding to column b of the code matrix for example x. For each row c of the code matrix M and for each example x, we compute the distance between f and M(c) as

    d_L(x, c) = Σ_{b=1}^{l} L( M(c, b) f_b(x) ),                                      (1)

where L is a loss function that is dependent on the nature of the binary classifier and M(c, b) = 0, +1 or -1. We then label each example x with the label c for which d_L(x, c) is minimized.

If the binary classification learning algorithm outputs scores that are probability estimates, they do not satisfy the first requirement, because the probability estimates are all between 0 and 1. 
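The distance minimization in Equation 1 translates directly into code. The sketch below is illustrative (the function name is hypothetical, and loss functions are passed in as callables):

```python
import math

# Sketch (hypothetical helper): loss-based decoding of one example.

def loss_decode(M, f, L):
    """Label the example with the row c of M minimizing
    d_L(x, c) = sum_b L(M[c][b] * f[b]), where f[b] is the margin
    score of the binary classifier for column b."""
    def d(c):
        return sum(L(M[c][b] * f[b]) for b in range(len(f)))
    return min(range(len(M)), key=d)

linear_loss = lambda y: -y          # L(y) = -y
exp_loss = lambda y: math.exp(-y)   # L(y) = e^(-y), used by Allwein et al.
```

For instance, with the one-against-all matrix for three classes and margin scores f = [0.2, -0.3, -0.4], both losses select class 0.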
However, we can transform the probability estimates r_b(x) output by each classifier b into margin scores by subtracting 1/2 from the scores, so that we consider as positives the examples x for which r_b(x) is above 1/2, and as negatives the examples x for which r_b(x) is below 1/2.

We now prove a theorem that relates the loss-based decoding method to the Hastie-Tibshirani method, for a particular class of loss functions.

Theorem 1 The loss-based decoding method for all-pairs code matrices predicts the same class label as the iterative estimates p̂_i(x) given by Hastie and Tibshirani, if the loss function is of the form L(y) = -ay, for any a > 0.

Proof: We first show that, if the loss function is of the form L(y) = -ay, the loss-based decoding method predicts the same class label as the non-iterative estimates p̃_i(x), for the all-pairs code matrix.

  Dataset     #Training Examples   #Test Examples   #Attributes   #Classes
  satimage    4435                 2000             36            6
  pendigits   7494                 3498             16            10
  soybean     307                  376              35            19

Table 1: Characteristics of the datasets used in the experiments.
The non-iterative estimates p̃_i(x) are given by

    p̃_c(x) = Σ_{b∈B+_c} r_b(x) + Σ_{b∈B-_c} (1 - r_b(x)),

where B+_c and B-_c are the sets of matrix columns for which M(c, b) = +1 and M(c, b) = -1, respectively.

Considering that L(y) = -ay and f_b(x) = r_b(x) - 1/2, and eliminating the terms for which M(c, b) = 0, we can rewrite Equation 1 as

    d_L(x, c) = -a Σ_{b∈B+_c} ( r_b(x) - 1/2 ) + a Σ_{b∈B-_c} ( r_b(x) - 1/2 )
              = -a [ Σ_{b∈B+_c} r_b(x) + Σ_{b∈B-_c} (1 - r_b(x)) ] + a ( |B+_c| + |B-_c| ) / 2.

For the all-pairs code matrix the following relationship holds: |B+_c| + |B-_c| = k - 1, where k is the number of classes. So the distance d_L(x, c) is

    d_L(x, c) = -a [ Σ_{b∈B+_c} r_b(x) + Σ_{b∈B-_c} (1 - r_b(x)) - (k - 1)/2 ] = -a [ p̃_c(x) - (k - 1)/2 ].

It is now easy to see that the class c which minimizes d_L(x, c) for example x also maximizes p̃_c(x). Furthermore, if d_L(x, i) < d_L(x, j) then p̃_i(x) > p̃_j(x), which means that the ranking of the classes for each example is the same.

Since the non-iterative estimates p̃_c(x) are in the same order as the iterative estimates p̂_c(x), we can conclude that the Hastie-Tibshirani method is equivalent to the loss-based decoding method with L(y) = -ay, in terms of class prediction, for the all-pairs code matrix.

Allwein et al. do not consider loss functions of the form L(y) = -ay, and use non-linear loss functions such as L(y) = e^(-y). In this case, the class predicted by loss-based decoding may differ from the one predicted by the method of Hastie and Tibshirani.
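The first step of the proof can also be checked numerically. The sketch below (illustrative, with randomly drawn pairwise estimates; the function name is hypothetical) verifies that for the all-pairs matrix and a linear loss L(y) = -ay, the class minimizing d_L coincides with the class maximizing the non-iterative estimates p̃_c:

```python
import itertools
import random

# Sketch: numeric check that argmin_c d_L(x, c) == argmax_c p̃_c(x)
# for the all-pairs matrix with L(y) = -a*y. An illustration, not a proof.

def check_equivalence(k, a=2.0, trials=100, seed=0):
    rng = random.Random(seed)
    pairs = list(itertools.combinations(range(k), 2))
    for _ in range(trials):
        r = [rng.random() for _ in pairs]      # arbitrary pairwise estimates
        f = [rb - 0.5 for rb in r]             # margin scores f_b = r_b - 1/2
        d, ptilde = [], []
        for c in range(k):
            dc = ptc = 0.0
            for b, (i, j) in enumerate(pairs):
                m = +1 if c == i else -1 if c == j else 0
                dc += -a * m * f[b]            # L(M[c][b] * f[b]) with L(y) = -a*y
                if m == +1:
                    ptc += r[b]
                elif m == -1:
                    ptc += 1 - r[b]
            d.append(dc)
            ptilde.append(ptc)
        if min(range(k), key=lambda c: d[c]) != max(range(k), key=lambda c: ptilde[c]):
            return False
    return True
```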
This theorem applies only to the all-pairs code matrix. For other matrices such that |B+_c| + |B-_c| is the same for every row c (such as the one-against-all matrix), we can prove that loss-based decoding (with L(y) = -ay) predicts the same class as the non-iterative estimates. However, in this case, the non-iterative estimates do not necessarily predict the same class as the iterative ones.

5 Experiments

We performed experiments using the following multiclass datasets from the UCI Machine Learning Repository [Blake and Merz, 1998]: satimage, pendigits and soybean. Table 1 summarizes the characteristics of each dataset.

The binary learning algorithm used in the experiments is boosted naive Bayes [Elkan, 1997], since this is a method that cannot be easily extended to handle multiclass problems directly. For all the experiments, we ran 10 rounds of boosting.
\n\n\u0005\n\u000b\n\n\u0005\n\u000b\n\u000b\n\u000b\n\n\u000b\n\f\n\n\u000b\n\f\n\n\b\n\n\u0003\n\n\u0007\n\n\n\b\n\n\n\u000b\n\f\n\fy\ny\n\ny)\ne\u0005 y)\n\nCode Matrix\n\nAll-pairs\nAll-pairs\nAll-pairs\nAll-pairs\n\nMethod\nLoss-based (L\nLoss-based (L\nHastie-Tibshirani (non-iterative)\nHastie-Tibshirani (iterative)\nOne-against-all\nLoss-based (L\nLoss-based (L\nOne-against-all\nExtended Hastie-Tibshirani (non-iterative) One-against-all\nExtended Hastie-Tibshirani (iterative)\nOne-against-all\nLoss-based (L\nLoss-based (L\nExtended Hastie-Tibshirani (non-iterative)\nExtended Hastie-Tibshirani (iterative)\nMulticlass Naive Bayes\n\nSparse\nSparse\nSparse\nSparse\n\ny)\ne\u0005 y)\n\ny)\ne\u0005 y)\n\ny\ny\n\ny\ny\n\n-\n\nError Rate MSE\n\n0.1385\n0.1385\n0.1385\n0.1385\n0.1445\n0.1425\n0.1445\n0.1670\n0.1435\n0.1425\n0.1480\n0.1330\n0.2040\n\n-\n-\n\n0.0999\n0.0395\n\n-\n-\n\n0.1212\n0.0396\n\n-\n-\n\n0.1085\n0.0340\n0.0651\n\nTable 2: Test set results on the satimage dataset.\n\n15 log2 k\n\nWe use three different code matrices for each dataset: all-pairs, one-against-all and a sparse\nrandom matrix. The sparse random matrices have\ncolumns, and each element\nis 0 with probability 1/2 and -1 or +1 with probability 1/4 each. This is the same type of\nsparse random matrix used by Allwein et al.[Allwein et al., 2000]. In order to have good\nerror correcting properties, the Hamming distance r between each pair of rows in the matrix\nmust be large. We select the matrix by generating 10,000 random matrices and selecting\nthe one for which r\nis maximized, checking that each column has at least one \u0003 1 and one\nWe evaluate the performance of each method using two metrics. The \ufb01rst metric is the\nerror rate obtained when we assign each example to the most likely class predicted by\nthe method. 
This metric is sufficient if we are only interested in classifying the examples correctly and do not need accurate probability estimates of class membership.

The second metric is squared error, defined for one example x as

    SE(x) = Σ_j ( t_j(x) - p_j(x) )^2,

where p_j(x) is the probability estimated by the method for example x and class j, and t_j(x) is the true probability of class j for x. Since for most real-world datasets true labels are known, but not probabilities, t_j(x) is defined to be 1 if the label of x is j and 0 otherwise. We calculate the squared error for each x and average over the test examples to obtain the mean squared error (MSE). The mean squared error is an adequate metric for assessing the accuracy of probability estimates [Zadrozny and Elkan, 2001b]. This metric cannot be applied to the loss-based decoding method, since it does not produce probability estimates.

Table 2 shows the results of the experiments on the satimage dataset for each type of code matrix. As a baseline for comparison, we also show the results of applying multiclass naive Bayes to this dataset. We can see that the iterative Hastie-Tibshirani procedure (and its extension to arbitrary code matrices) succeeds in lowering the MSE significantly compared to the non-iterative estimates, which indicates that it produces more accurate probability estimates. In terms of error rate, the differences between the methods are small. For one-against-all matrices, the iterative method performs consistently worse, while for sparse random matrices, it performs consistently better. Figure 1 shows how the MSE is lowered at each iteration of the Hastie-Tibshirani algorithm, for the three types of code matrices.

Table 3 shows the results of the same experiments on the pendigits and soybean datasets. Again, the MSE is significantly lowered by the iterative procedure, in all cases. 
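The squared-error metric used in Tables 2 and 3 is straightforward to compute; a minimal sketch with 0/1 targets derived from the labels (hypothetical helper names):

```python
# Sketch: squared error for one example and MSE over a test set.

def squared_error(p, label):
    """SE(x) = sum_j (t_j(x) - p_j(x))^2 with t_j = 1 iff j is the true label."""
    return sum(((1.0 if j == label else 0.0) - pj) ** 2 for j, pj in enumerate(p))

def mean_squared_error(probs, labels):
    """Average squared error over probability vectors and true class indices."""
    return sum(squared_error(p, y) for p, y in zip(probs, labels)) / len(labels)
```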
For the soybean dataset, using the sparse random matrix, the iterative method again has a lower error rate than the other methods, which is even lower than the error rate using the all-pairs matrix. This is an interesting result, since in this case the all-pairs matrix has 171 columns (corresponding to 171 classifiers), while the sparse matrix has only 64 columns.

[Figure 1 appears here: MSE versus iteration number (0 to 35) for the satimage dataset, with one curve each for the all-pairs, one-against-all and sparse code matrices.]

Figure 1: Convergence of the MSE for the satimage dataset.

                                                        pendigits            soybean
  Method                             Code Matrix        Error Rate  MSE      Error Rate  MSE
  Loss-based (L(y) = -y)             All-pairs          0.0723      -        0.0665      -
  Loss-based (L(y) = e^(-y))         All-pairs          0.0715      -        0.0665      -
  Hastie-Tibshirani (non-iterative)  All-pairs          0.0723      0.0747   0.0665      0.0454
  Hastie-Tibshirani (iterative)      All-pairs          0.0718      0.0129   0.0665      0.0066
  Loss-based (L(y) = -y)             One-against-all    0.0963      -        0.0824      -
  Loss-based (L(y) = e^(-y))         One-against-all    0.0963      -        0.0931      -
  Ext. Hastie-Tibshirani (non-it.)   One-against-all    0.0963      0.0862   0.0824      0.0493
  Ext. Hastie-Tibshirani (it.)       One-against-all    0.1023      0.0160   0.0931      0.0073
  Loss-based (L(y) = -y)             Sparse             0.1284      -        0.0718      -
  Loss-based (L(y) = e^(-y))         Sparse             0.1266      -        0.0718      -
  Ext. Hastie-Tibshirani (non-it.)   Sparse             0.1484      0.0789   0.0798      0.0463
  Ext. Hastie-Tibshirani (it.)       Sparse             0.1261      0.0216   0.0636      0.0062
  Multiclass Naive Bayes             -                  0.2779      0.0509   0.0745      0.0996

Table 3: Test set results on the pendigits and soybean datasets.

6 Conclusions

We have presented a method for producing class membership probability estimates for multiclass problems, given probability estimates for a series of binary problems determined by an arbitrary code matrix.

Since research on designing optimal code matrices is still on-going [Crammer and Singer, 2000], [Utschick and Weichselberger, 2001], it is important to be able to obtain class membership probability estimates from arbitrary code matrices. In current research, the effectiveness of a code matrix is determined primarily by the classification accuracy. However, since many applications require accurate class membership probability estimates for each of the classes, it is important to also compare the different types of code matrices according to their ability to produce such estimates. Our extension of Hastie and Tibshirani's method is useful for this purpose.

Our method relies on the probability estimates given by the binary classifiers to produce the multiclass probability estimates. 
However, the probability estimates produced by boosted naive Bayes are not calibrated probability estimates. An interesting direction for future work is to determine whether calibrating the probability estimates given by the binary classifiers improves the calibration of the multiclass probabilities.

References

[Allwein et al., 2000] Allwein, E. L., Schapire, R. E., and Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141.

[Blake and Merz, 1998] Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. Department of Information and Computer Sciences, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html.

[Bradley and Terry, 1952] Bradley, R. and Terry, M. (1952). Rank analysis of incomplete block designs, I: The method of paired comparisons. Biometrika, pages 324-345.

[Crammer and Singer, 2000] Crammer, K. and Singer, Y. (2000). On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 35-46.

[Dietterich and Bakiri, 1995] Dietterich, T. G. and Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286.

[Elkan, 1997] Elkan, C. (1997). Boosting and naive Bayesian learning. Technical Report CS97-557, University of California, San Diego.

[Hastie and Tibshirani, 1998] Hastie, T. and Tibshirani, R. (1998). Classification by pairwise coupling. In Advances in Neural Information Processing Systems, volume 10. MIT Press.

[Utschick and Weichselberger, 2001] Utschick, W. 
and Weichselberger, W. (2001). Stochastic organization of output codes in multiclass learning problems. Neural Computation, 13(5):1065-1102.

[Zadrozny and Elkan, 2001a] Zadrozny, B. and Elkan, C. (2001a). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pages 204-213. ACM Press.

[Zadrozny and Elkan, 2001b] Zadrozny, B. and Elkan, C. (2001b). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 609-616. Morgan Kaufmann Publishers, Inc.", "award": [], "sourceid": 2122, "authors": [{"given_name": "B.", "family_name": "Zadrozny", "institution": null}]}