{"title": "Improved Output Coding for Classification Using Continuous Relaxation", "book": "Advances in Neural Information Processing Systems", "page_first": 437, "page_last": 443, "abstract": null, "full_text": "Improved Output Coding for Classification Using Continuous Relaxation \n\nKoby Crammer and Yoram Singer \nSchool of Computer Science & Engineering \nThe Hebrew University, Jerusalem 91904, Israel \n{kobics,singer}@cs.huji.ac.il \n\nAbstract \n\nOutput coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes. \n\n1 Introduction \nThe problem of multiclass categorization is about assigning labels to instances where the labels are drawn from some finite set. Many machine learning problems include a multiclass categorization component. Examples of such applications are text classification, optical character recognition, medical analysis, and object recognition in machine vision. There are many algorithms for the binary classification problem, where there are only two possible labels, such as SVM [17], CART [4] and C4.5 [14]. Some of them can be extended to handle multiclass problems. An alternative and general approach is to reduce a multiclass problem to multiple binary problems. 
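A minimal sketch of the reduction mentioned above: one-against-all turns a k-class problem into k binary problems and classifies by the most confident binary vote. The toy "binary learner" below (a centroid-difference scorer with hypothetical helper names) is only a stand-in for a real algorithm such as an SVM.

```python
def fit_binary(X, b):
    """Toy binary learner: score by a hyperplane through the two class means."""
    pos = [x for x, lbl in zip(X, b) if lbl > 0]
    neg = [x for x, lbl in zip(X, b) if lbl < 0]
    mean = lambda pts: [sum(c) / len(pts) for c in zip(*pts)]
    mp, mn = mean(pos), mean(neg)
    w = [p - n for p, n in zip(mp, mn)]
    bias = sum(wi * (p + n) / 2 for wi, p, n in zip(w, mp, mn))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) - bias

def train_one_vs_rest(X, y, k):
    # One binary problem per class r: label +1 if y_i == r, else -1.
    return [fit_binary(X, [1 if yi == r else -1 for yi in y]) for r in range(k)]

def predict(classifiers, x):
    # Vote: pick the class whose binary classifier is most confident.
    scores = [h(x) for h in classifiers]
    return scores.index(max(scores))

X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.1, 5.2]]
y = [0, 0, 1, 1, 2, 2]
H = train_one_vs_rest(X, y, 3)
print([predict(H, x) for x in X])  # recovers the training labels here
```

One-against-all is only one choice of code; the error correcting codes discussed next generalize the idea by letting each binary problem separate arbitrary subsets of the labels.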
\nIn [9] Dietterich and Bakiri described a method for reducing multiclass problems to multiple binary problems based on error correcting output codes (ECOC). Their method consists of two stages. In the training stage, a set of binary classifiers is constructed, where each classifier is trained to distinguish between two disjoint subsets of the labels. In the classification stage, each of the trained binary classifiers is applied to test instances and a voting scheme is used to decide on the label. Experimental work has shown that the output coding approach can improve performance in a wide range of problems such as text classification [3], text-to-speech synthesis [8], cloud classification [1] and others [9, 10, 15]. The performance of output coding was also analyzed in statistics and learning theoretic contexts [11, 12, 16, 2]. \nMost previous work on output coding has concentrated on solving multiclass problems using predefined output codes, independently of the specific application and the learning algorithm used to construct the binary classifiers. Furthermore, the "decoding" scheme assigns the same weight to each learned binary classifier, regardless of its performance. Finally, the induced binary problems are treated as separate problems and are learned independently. Thus, there might be strong statistical correlations between the resulting classifiers, especially when the induced binary problems are similar. These problems call for an improved output coding scheme. \nIn a recent theoretical work [7] we suggested a relaxation of discrete output codes to continuous codes where each entry of the code matrix is a real number. As in discrete codes, each column of the code matrix defines a partition of the set of the labels into two subsets which are labeled positive (+) and negative (-). 
The sign of each entry in the code matrix determines the subset association (+ or -) and its magnitude corresponds to the confidence in this association. In this paper we discuss the use of continuous codes for multiclass problems using a two-phase approach. First, we create a binary output code matrix that is used to train binary classifiers in the same way suggested by Dietterich and Bakiri. Given the trained classifiers and some training data we look for a more suitable continuous code by casting the problem as a constrained optimization problem. We then replace the original binary code with the improved continuous code and proceed analogously to classify new test instances. \nAn important property of our algorithm is that the resulting continuous code can be expressed as a linear combination of a subset of the training patterns. Since classification of new instances is performed using scalar products between the prediction vector of the binary classifiers and the rows of the code matrix, we can exploit this particular form of the code matrix and use kernels [17] to construct high dimensional product spaces. This approach enables an efficient and simple way to take into account correlations between the different binary classifiers. \nThe rest of this paper is organized as follows. In the next section we formally describe the framework that uses output coding for multiclass problems. In Sec. 3 we describe our algorithm for designing a continuous code from a set of binary classifiers. We describe and discuss experiments with the proposed approach in Sec. 4 and conclude in Sec. 5. \n\n2 Multiclass learning using output coding \n\nLet S = {(x_1, y_1), ..., (x_m, y_m)} be a set of m training examples where each instance x_i belongs to a domain X. We assume without loss of generality that each label y_i is an integer from the set Y = {1, ..., k}. 
A multiclass classifier is a function H : X -> Y that maps an instance x into an element y of Y. In this work we focus on a framework that uses output codes to build multiclass classifiers from binary classifiers. A binary output code M is a matrix of size k x l over {-1, +1} where each row of M corresponds to a class y in Y. Each column of M defines a partition of Y into two disjoint sets. Binary learning algorithms are used to construct classifiers, one for each column t of M. That is, the set of examples induced by column t of M is (x_1, M_{y_1,t}), ..., (x_m, M_{y_m,t}). This set is fed as training data to a learning algorithm that finds a binary classifier. In this work we assume that each binary classifier h_t is of the form h_t : X -> R. This reduction yields l different binary classifiers h_1, ..., h_l. We denote the vector of predictions of these classifiers on an instance x by h(x) = (h_1(x), ..., h_l(x)). We denote the rth row of M by M_r. \nGiven an example x we predict the label y for which the row M_y is the "most similar" to h(x). We use a general notion of similarity and define it through an inner-product function K : R^l x R^l -> R. The higher the value of K(h(x), M_r), the more confident we are that r is the correct label of x according to the set of classifiers h. Note that this notion of similarity holds for both discrete and continuous matrices. An example of a simple similarity function is K(u, v) = u . v. It is easy to verify that when both the output code and the binary classifiers are over {-1, +1} this choice of K is equivalent to picking the row of M which attains the minimal Hamming distance to h(x). \nTo summarize, the learning algorithm receives a training set S, a discrete output code (matrix) of size k x l, and has access to a binary learning algorithm, denoted L. The learning algorithm L is called l times, once for each induced binary problem. 
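The framework just described can be sketched with toy values: rows of the code matrix M correspond to the k classes, columns to the l binary problems, and decoding with K(u, v) = u . v coincides with minimal Hamming distance when both the code and the predictions are over {-1, +1}.

```python
# A hypothetical discrete output code for k = 4 classes and l = 3 classifiers.
M = [[+1, +1, +1],   # class 0
     [+1, -1, -1],   # class 1
     [-1, +1, -1],   # class 2
     [-1, -1, +1]]   # class 3

def induced_labels(M, y, t):
    # Binary training labels for column t: example i gets the entry of M
    # in row y_i, column t.
    return [M[yi][t] for yi in y]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decode(M, h_x):
    # Predict the row of M most similar to the prediction vector h(x).
    sims = [dot(row, h_x) for row in M]
    return sims.index(max(sims))

def hamming_decode(M, h_x):
    dists = [sum(a != b for a, b in zip(row, h_x)) for row in M]
    return dists.index(min(dists))

h_x = [-1, +1, -1]                       # hypothetical predictions h(x)
print(induced_labels(M, [0, 3, 2], 1))   # [1, -1, 1]
print(decode(M, h_x), hamming_decode(M, h_x))  # both give 2
```

With a continuous code the rows of M become real vectors, but the decoding rule is unchanged.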
The result of this process is a set of binary classifiers h(x) = (h_1(x), ..., h_l(x)). These classifiers are fed, together with the original labels y_1, ..., y_m, to the second stage of the learning algorithm, which learns a continuous code. This continuous code is then used to classify new instances by choosing the class which corresponds to the row with the largest inner-product. The resulting classifier can be viewed as a two-layer neural network. The first (hidden) layer computes h_1(x), ..., h_l(x) and the output unit predicts the final class by choosing the label r which maximizes K(h(x), M_r). \n\n3 Finding an improved continuous code \nWe now describe our method for finding a continuous code that improves on a given ensemble of binary classifiers h. We would like to note that we do not need to know the code that was originally used to train the binary classifiers. For simplicity we use the standard scalar-product as our similarity function. At the end of this section we discuss more general similarity functions based on kernels which satisfy Mercer's conditions. \nThe approach we take is to cast the code design problem as a constrained optimization problem. The multiclass empirical error is given by \n\n\\epsilon_S(M, h) = \\frac{1}{m} \\sum_{i=1}^{m} [[ H(x_i) \\neq y_i ]] \n\nwhere [[\\pi]] is equal to 1 if the predicate \\pi holds and 0 otherwise. Borrowing the idea of soft margins [6] we replace the 0-1 multiclass error with the piecewise-linear bound max_r { h(x_i) . M_r + \\bar\\delta_{y_i,r} } - h(x_i) . M_{y_i}, where \\bar\\delta_{i,j} = 1 - \\delta_{i,j}, i.e., it is equal to 0 if i = j and 1 otherwise. We now get an upper bound on the empirical loss \n\n\\epsilon_S(M, h) \\le \\frac{1}{m} \\sum_{i=1}^{m} \\left[ \\max_r \\{ h(x_i) \\cdot M_r + \\bar\\delta_{y_i,r} \\} - h(x_i) \\cdot M_{y_i} \\right] \\quad (1) \n\nPut another way, the correct label should have a confidence value that is larger by at least one than any of the confidences for the rest of the labels. 
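A small numeric check of the bound in Eq. (1), with made-up code and prediction values: for one example the per-example term is max_r { h(x) . M_r + (1 - delta_{y,r}) } - h(x) . M_y, which is 0 only when the correct label wins by a margin of at least one.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_bound(M, h_x, y):
    # Per-example term of Eq. (1): add 1 to the confidence of every wrong
    # label, take the max, and subtract the correct label's confidence.
    conf = [dot(row, h_x) for row in M]
    penalized = [c + (1 if r != y else 0) for r, c in enumerate(conf)]
    return max(penalized) - conf[y]

M = [[0.9, 0.2], [-0.5, 1.0], [0.1, -0.8]]   # a made-up continuous code, k=3, l=2
h_x = [1.0, -1.0]
# Confidences are [0.7, -1.5, 0.9]; for y = 0 the bound is
# max(0.7, -0.5, 1.9) - 0.7 = 1.2, and indeed label 2 would be predicted.
print(sample_bound(M, h_x, 0))
```

Note the term can be positive even when the prediction is correct, since it also penalizes margins smaller than one.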
Otherwise, we suffer a loss which is linearly proportional to the difference between the confidence of the correct label and the maximum among the confidences of the other labels. \nDefine the l_2-norm of a code M to be the l_2-norm of the vector represented by the concatenation of M's rows, ||M||_2^2 = ||(M_1, ..., M_k)||_2^2 = \\sum_{i,j} M_{i,j}^2. With a regularization constant \\beta > 0, we now cast the problem of finding a good code which minimizes the bound of Eq. (1) as a quadratic optimization problem with "soft" constraints, \n\n\\min_{M, \\xi} \\; \\frac{1}{2\\beta} ||M||_2^2 + \\sum_{i=1}^{m} \\xi_i \\quad \\text{s.t.} \\quad \\forall i, r: \\; h(x_i) \\cdot M_{y_i} - h(x_i) \\cdot M_r \\ge \\bar\\delta_{y_i,r} - \\xi_i \\quad (2) \n\nSolving the above optimization problem is done using its dual problem (details are omitted due to lack of space). The solution of the dual problem results in the following form for M, \n\nM_r = \\beta \\sum_{i=1}^{m} (\\delta_{y_i,r} - \\eta_{i,r}) \\, h(x_i) \\quad (3) \n\nwhere \\eta_{i,r} are variables of the dual problem which satisfy \\forall i, r: \\eta_{i,r} \\ge 0 and \\sum_r \\eta_{i,r} = 1. Eq. (3) implies that when the optimum of the objective function is achieved, each row of the matrix M is a linear combination of the vectors h(x_i). We thus say that example i is a support pattern for class r if the coefficient (\\delta_{y_i,r} - \\eta_{i,r}) of h(x_i) in Eq. (3) is non-zero. There are two cases for which example i can be a support pattern for class r: the first is when y_i = r and \\eta_{i,r} < 1; the second is when y_i \\neq r and \\eta_{i,r} > 0. Put another way, fixing i, we can view \\eta_{i,r} as a distribution, \\bar\\eta_i, over the labels r. This distribution should give a high probability to the correct label y_i. Thus, an example i "participates" in the solution for M (Eq. (3)) if and only if \\bar\\eta_i is not a point distribution concentrating on the correct label y_i. Since the continuous output code is constructed from the support patterns, we call our algorithm SPOC, for Support Patterns Output Coding. \nDenote \\tau_i = 1_{y_i} - \\bar\\eta_i, where 1_{y_i} is the point distribution concentrating on y_i. Then from Eq. (3) we obtain the classifier (the positive constant \\beta is dropped since it does not affect the arg max), \n\nH(x) = \\arg\\max_r \\{ h(x) \\cdot M_r \\} = \\arg\\max_r \\left\\{ \\sum_i \\tau_{i,r} \\, [ h(x) \\cdot h(x_i) ] \\right\\} \\quad (4) \n\nNote that the solution defined by Eq. 
(4) is composed of inner-products of the prediction vector on a new instance with the support patterns. Therefore, we can transform each prediction vector to some high dimensional inner-product space Z using a transformation \\phi : R^l -> Z. We thus replace the inner-product in the dual program with a general inner-product kernel K that satisfies Mercer's conditions [17]. From Eq. (4) we obtain the kernel-based classification rule H(x), \n\nH(x) = \\arg\\max_r \\left\\{ \\sum_i \\tau_{i,r} \\, K( h(x), h(x_i) ) \\right\\} \\quad (5) \n\nThe ability to use kernels as a means for calculating inner-products enables a simple and efficient way to take into account correlations between the binary classifiers. For instance, a second-order polynomial kernel of the form (1 + u . v)^2 corresponds to a transformation to a feature space that includes all the products of pairs of binary classifiers. Therefore, the relaxation of discrete codes to continuous codes offers a partial remedy by assigning a different importance weight to each binary classifier while taking into account the statistical correlations between the binary classifiers. \n\n4 Experiments \nIn this section we describe experiments comparing discrete and continuous output codes. We selected eight multiclass datasets, seven from the UCI repository^1 and the mnist dataset available from AT&T^2. When a test set was provided we used the original split into training and test sets; otherwise we used 5-fold cross validation for evaluating the test error. Since we ran multiple experiments with 3 different codes, 7 kernels, and two base-learners, we used a subset of the training set for mnist, letter, and shuttle. We are in the process of performing experiments with the complete datasets and other datasets using a subset of the kernels. A summary of the datasets is given in Table 1. \n\n1 http://www.ics.uci.edu/~mlearn/MLRepository.html \n2 http://www.research.att.com/~yann/ocr/mnist \n\nName | No. of Training Examples | No. of Test Examples | No. of Classes | No. of Attributes \nsatimage | 4435 | 2000 | 6 | 36 \nshuttle | 5000 | 9000 | 7 | 9 \nmnist | 5000 | 10000 | 10 | 784 \nisolet | 6238 | 1559 | 26 | 6 \nletter | 5000 | 4000 | 26 | 16 \nvowel | 528 | 462 | 11 | 10 \nglass | 214 | 5-fold eval | 7 | 10 \nsoybean | 307 | 376 | 19 | 35 \n\nTable 1: Description of the datasets used in the experiments. \n\nWe tested three different types of codes: one-against-all (denoted "id"), BCH (a linear error correcting code), and random codes. For a classification problem with k classes we set the random code to have about 10 log_2(k) columns. We then set each entry in the matrix defining the code to be -1 or +1 uniformly at random. We used SVM as the base binary learning algorithm in two different modes. In the first mode we used the margin of the vector machine classifier as its real-valued prediction; that is, each binary classifier h_t is of the form h_t(x) = w . x + b where w and b are the parameters of the separating hyperplane. In the second mode we thresholded the prediction of the classifiers, h_t(x) = sign(w . x + b), so each binary classifier h_t in this case is of the form h_t : X -> {-1, +1}. For brevity, we refer to these classifiers as thresholded-SVMs. We would like to note in passing that this setting is by no means superficial, as there are learning algorithms, such as RIPPER [5], that build classifiers of this type. We ran SPOC with 7 different kernels: homogeneous and non-homogeneous polynomials of degrees 1, 2, and 3, and radial-basis-functions (RBF). \nA summary of the results is depicted in Figure 1. The figure contains four plots. Each plot shows the relative test error difference between discrete and continuous codes. Formally, the height of each bar is proportional to (E_d - E_c) / E_d where E_d (E_c) is the test error when using a discrete (continuous) code. 
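The random codes and the relative-error statistic used in these experiments can be sketched in a few lines. The column count 10 log_2(k) and the uniform {-1, +1} entries follow the text; the seed is arbitrary.

```python
import math
import random

def random_code(k, rng):
    # About 10*log2(k) columns, each entry -1 or +1 uniformly at random.
    l = round(10 * math.log2(k))
    return [[rng.choice([-1, +1]) for _ in range(l)] for _ in range(k)]

def relative_change(err_discrete, err_continuous):
    # Height of a bar in Figure 1: (E_d - E_c) / E_d.
    return (err_discrete - err_continuous) / err_discrete

rng = random.Random(0)
M = random_code(26, rng)           # e.g. the letter dataset, k = 26
print(len(M), len(M[0]))           # 26 rows, 47 columns
print(relative_change(3.0, 0.48))  # shuttle: about 0.84, an 84% reduction
```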
For each problem there are three bars, one for each type of code (one-against-all, BCH, and random). The datasets are plotted left to right in decreasing order with respect to the number of training examples per class. The left plots correspond to the results obtained using thresholded-SVM as the base binary classifier and the right plots show the results using the real-valued predictions. For each mode we show the results of the best performing kernel on each dataset (top plots) and the average performance over the 7 different kernels (bottom plots). \nIn general, the continuous output code relaxation indeed results in an improved performance over the original discrete output codes. The most significant improvements are achieved with thresholded-SVM as the base binary classifiers. On most problems all the kernels achieve some improvement. However, the best performing kernel seems to be problem dependent. Impressive improvements are achieved for datasets with a large number of training examples per class, shuttle being a notable example. For this dataset the test error is reduced from an average of over 3% when using a discrete code to an average test error which is significantly lower than 1% for continuous codes. Furthermore, using a non-homogeneous polynomial of degree 3 reduces the test error rate down to 0.48%. In contrast, for the soybean dataset, which contains 307 training examples and 19 classes, none of the kernels achieved any improvement, and the continuous code often resulted in an increase in the test error. Examining the training error reveals that the greater the decrease in the training error due to the continuous code relaxation, the greater the increase in the corresponding test error. This behavior indicates that SPOC overfitted the relatively small training set. \n\nFigure 1: Comparison of the performance of discrete and continuous output codes using SVM (right figures) and thresholded-SVM (left figures) as the base learner for three different families of codes. The top figures show the relative change in test error for the best performing kernel and the bottom figures show the relative change in test error averaged across seven different kernels. \n\nTo conclude this section we describe an experiment that evaluated the performance of the SPOC algorithm as a function of the length of random codes. Using the same setting described above we ran SPOC with random codes of lengths 5 through 35 for the vowel dataset and lengths 15 through 50 for the letter dataset. In Figure 2 we show the test error rate as a function of the code length with SVM as the base binary learner. (Similar results were obtained using thresholded-SVM as the base binary classifiers.) For the letter dataset we see consistent and significant improvements of the continuous codes over the discrete ones, whereas for the vowel dataset there is a major improvement for short codes that decays with the code's length. Therefore, since continuous codes can achieve performance comparable to much longer discrete codes, they may serve as a viable alternative to discrete codes when computational power is limited or for classification tasks on large datasets. \n\n5 Discussion \nIn this paper we described and experimented with an algorithm for continuous relaxation of output codes for multiclass categorization problems. The algorithm appears to be especially useful when the codes are short. 
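For concreteness, the kernel-based rule of Eq. (5) that the algorithm ultimately applies at classification time can be sketched as follows. All numbers below are made-up illustrative values: in practice the dual variables come from solving the quadratic program, and tau[i][r] = delta_{y_i,r} - eta[i][r], so each row of tau sums to zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2(u, v):
    # Second-order non-homogeneous polynomial kernel from Sec. 3.
    return (1.0 + dot(u, v)) ** 2

def spoc_predict(h_x, support, tau, kernel):
    # Eq. (5): H(x) = argmax_r sum_i tau[i][r] * K(h(x), h(x_i)).
    k_vals = [kernel(h_x, h_i) for h_i in support]
    scores = [sum(t[r] * kv for t, kv in zip(tau, k_vals))
              for r in range(len(tau[0]))]
    return scores.index(max(scores))

support = [[1.0, -1.0], [-1.0, 1.0]]   # prediction vectors h(x_i) of support patterns
tau = [[0.8, -0.4, -0.4],              # support pattern with y_i = 0
       [-0.5, 1.0, -0.5]]              # support pattern with y_i = 1
print(spoc_predict([1.0, -1.0], support, tau, poly2))  # predicts class 0
```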
An interesting question is whether the proposed approach can be generalized by calling the algorithm successively on the previous code it improved. Another viable direction is to try to combine the algorithm with other schemes for reducing multiclass problems to multiple binary problems, such as tree-based codes and directed acyclic graphs [13]. We leave this for future research. \n\nFigure 2: Comparison of the performance of discrete random codes and their continuous relaxation as a function of the code length. \n\nReferences \n[1] D. W. Aha and R. L. Bankert. Cloud classification using error-correcting output codes. In Artificial Intelligence Applications: Natural Science, Agriculture, and Environmental Science, volume 11, pages 13-28, 1997. \n[2] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. In Machine Learning: Proceedings of the Seventeenth International Conference, 2000. \n[3] A. Berger. Error-correcting output coding for text classification. In IJCAI'99: Workshop on machine learning for information filtering, 1999. \n[4] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth & Brooks, 1984. \n[5] William Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123, 1995. \n[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995. \n[7] Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. 
In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000. \n[8] Ghulum Bakiri and Thomas G. Dietterich. Achieving high-accuracy text-to-speech with machine learning. In Data mining in speech synthesis, 1999. \n[9] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, January 1995. \n[10] Tom Dietterich and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Oregon State University, 1995. Available via the WWW at http://www.cs.orst.edu:8001/~tgd/cv/tr.html. \n[11] Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998. \n[12] G. James and T. Hastie. The error coding method and PiCT. Journal of Computational and Graphical Statistics, 7(3):377-387, 1998. \n[13] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12. MIT Press, 2000. (To appear.) \n[14] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. \n[15] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 313-321, 1997. \n[16] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999. \n[17] Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998. \n", "award": [], "sourceid": 1789, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}