{"title": "A Multi-class Linear Learning Algorithm Related to Winnow", "book": "Advances in Neural Information Processing Systems", "page_first": 519, "page_last": 525, "abstract": null, "full_text": "A Multi-class Linear Learning Algorithm Related to Winnow \n\nChris Mesterharm* \n\nRutgers Computer Science Department \n110 Frelinghuysen Road \nPiscataway, NJ 08854 \nmesterha@paul.rutgers.edu \n\nAbstract \n\nIn this paper, we present Committee, a new multi-class learning algorithm related to the Winnow family of algorithms. Committee is an algorithm for combining the predictions of a set of sub-experts in the on-line mistake-bounded model of learning. A sub-expert is a special type of attribute that predicts with a distribution over a finite number of classes. Committee learns a linear function of sub-experts and uses this function to make class predictions. We provide bounds for Committee that show it performs well when the target can be represented by a few relevant sub-experts. We also show how Committee can be used to solve more traditional problems composed of attributes. This leads to a natural extension that learns on multi-class problems that contain both traditional attributes and sub-experts. \n\n1 Introduction \n\nIn this paper, we present a new multi-class learning algorithm called Committee. Committee learns a k-class target function by combining information from a large set of sub-experts. A sub-expert is a special type of attribute that predicts with a distribution over the target classes. The target space of functions is the set of linear-max functions. We define these as functions that take a linear combination of sub-expert predictions and return the class with maximum value. It may be useful to think of the sub-experts as individual classifying functions that are attempting to predict the target function. 
Even though the individual sub-experts may not be perfect, Committee attempts to learn a linear-max function that represents the target function. In truth, this picture is not quite accurate. The reason we call them sub-experts and not experts is that even though an individual sub-expert might be poor at prediction, it may still be useful when used in a linear-max function. For example, some sub-experts might be used to add constant weights to the linear-max function. \n\nThe algorithm is analyzed in the on-line mistake-bounded model of learning [Lit89]. This is a useful model for a type of incremental learning where an algorithm can use feedback about its current hypothesis to improve its performance. In this model, the algorithm goes through a series of learning trials. A trial is composed of three steps. First, the algorithm receives an instance, in this case, the predictions of the sub-experts. Second, the algorithm predicts a label for the instance; this is the global prediction of Committee. And last, the algorithm receives the true label of the instance; Committee uses this information to update its estimate of the target. The goal of the algorithm is to minimize the total number of prediction mistakes the algorithm makes while learning the target. \n\n*Part of this work was supported by NEC Research Institute, Princeton, NJ. \n\nThe analysis and performance of Committee is similar to another learning algorithm, Winnow [Lit89]. Winnow is an algorithm for learning a linear-threshold function that maps attributes in [0, 1] to a binary target. It is an algorithm that is effective when the concept can be represented with a few relevant attributes, irrespective of the behavior of the other attributes. Committee is similar but deals with learning a target that contains only a few relevant sub-experts. 
While learning with sub-experts is interesting in its own right, it turns out the distinction between the two tasks is not significant. We will show in section 4 how to transform attributes from [0, 1] into sub-experts. Using particular transformations, Committee is identical to the Winnow algorithms Balanced and WMA [Lit89]. Furthermore, we can generalize these transformations to handle attribute problems with multi-class targets. These transformations naturally lead to a hybrid algorithm that allows a combination of sub-experts and attributes for multi-class learning problems. This opens up a range of new practical problems that did not easily fit into the previous framework of [0, 1] attributes and binary classification. \n\n2 Previous work \n\nMany people have successfully tried the Winnow algorithms on real-world tasks. In the course of their work, they have made modifications to the algorithms to fit certain aspects of their problem. These modifications include multi-class extensions. \n\nFor example, [DKR97] use Winnow algorithms on text classification problems. This multi-class problem has a special form; a document can belong to more than one class. Because of this property, it makes sense to learn a different binary classifier for each class. The linear functions are allowed, even desired, to overlap. However, this paper is concerned with cases where this is not possible. For example, in [GR96] the correct spelling of a word must be selected from a set of many possibilities. In this setting, it is more desirable to have the algorithm select a single word. \n\nThe work in [GR96] presents many interesting ideas and modifications of the Winnow algorithms. At a minimum, these modifications are useful for improving the performance of Winnow on those particular problems. Part of that work also extends the Winnow algorithm to general multi-class problems. 
While the results are favorable, the contribution of this paper is to give a different algorithm that has a stronger theoretical foundation for customization to a particular multi-class problem. \n\nBlum also works with multi-class Winnow algorithms on the calendar scheduling problem of [MCF+94]. In [Blu95], a modified Winnow is given with theoretical arguments for good performance on certain types of multi-class disjunctions. In this paper, these results are extended, with the new algorithm Committee, to cover a wider range of multi-class linear functions. \n\nOther related theoretical work on multi-class problems includes the regression algorithm EG±. In [KW97], Kivinen and Warmuth introduce EG±, an algorithm related to Winnow but used on regression problems. In general, while regression is a useful framework for many multi-class problems, it is not straightforward how to extend regression to the concepts learned by Committee. A particular problem is the inability of current regression techniques to handle 0-1 loss. \n\n3 Algorithm \n\nThis section of the paper describes the details of Committee. Near the end of the section, we will give a formal statement of the algorithm. \n\n3.1 Prediction scheme \n\nAssume there are n sub-experts. Each sub-expert has a positive weight that is used to vote for k different classes; let w_i be the weight of sub-expert i. A sub-expert can vote for several classes by spreading its weight with a prediction distribution. For example, if k = 3, a sub-expert may give 3/5 of its weight to class 1, 1/5 of its weight to class 2, and 1/5 of its weight to class 3. Let x_i represent this prediction distribution, where x_i^j is the fraction of the weight sub-expert i gives to class j. The vote for class j is Σ_{i=1}^n w_i x_i^j. Committee predicts the class that has the highest vote. 
(On ties, the algorithm picks one of the classes involved in the tie.) We call the function computed by this prediction scheme a linear-max function, since it is the maximum class value taken from a linear combination of the sub-expert predictions. \n\n3.2 Target function \n\nThe goal of Committee is to minimize the number of mistakes by quickly learning sub-expert weights that correctly classify the target function. Assume there exists μ, a vector of nonnegative weights that correctly classifies the target. Notice that μ can be multiplied by any constant without changing the target. To remove this ambiguity, we will normalize the weights to sum to 1, i.e., Σ_{i=1}^n μ_i = 1. Let ζ(j) be the target's vote for class j. \n\nζ(j) = Σ_{i=1}^n μ_i x_i^j \n\nPart of the difficulty of the learning problem is hidden in the target weights. Intuitively, a target function will be more difficult to learn if there is a small difference between the ζ votes of the correct and incorrect classes. We measure this difficulty by looking at the minimum difference, over all trials, between the vote of the correct label and the votes of the other labels. Assume for trial t that p_t is the correct label. \n\nδ = min_{t ∈ Trials} ( min_{j ≠ p_t} ( ζ(p_t) − ζ(j) ) ) \n\nBecause these are the weights of the target, and the target always makes the correct prediction, δ > 0. \n\nOne problem with the above assumptions is that they do not allow noise (cases where δ ≤ 0). However, there are variations of the analysis that allow for limited amounts of noise [Lit89, Lit91]. Also, experimental work [Lit95, LM] shows the family of Winnow algorithms to be much more robust to noise than the theory would predict. Based on the similarity of the algorithm and analysis, and some preliminary experiments, Committee should be able to tolerate some noise. \n\n3.3 Updates \n\nCommittee only updates on mistakes, using multiplicative updates. 
The algorithm starts by initializing all weights to 1/n. During the trials, let p be the correct label and λ be the predicted label of Committee. When λ ≠ p, the weight of each sub-expert i is multiplied by α^(x_i^p − x_i^λ). This corresponds to increasing the weights of the sub-experts that predicted the correct label instead of the label Committee predicted. The value of α is initialized at the start of the algorithm. The optimal value of α for the bounds depends on δ. Often δ is not known in advance, but experiments on Winnow algorithms suggest that these algorithms are flexible, often performing well with a wide range of α values [LM]. Last, the weights are renormalized to sum to 1. While this is not strictly necessary, normalizing has several advantages, including reducing the likelihood of underflow/overflow errors. \n\n3.4 Committee code \n\nInitialization \n∀i ∈ {1, ..., n}: w_i := 1/n. \nSet α > 1. \n\nTrials \nInstance: the sub-expert predictions (x_1, ..., x_n). \nPrediction: λ is the first class c such that for all other classes j, Σ_{i=1}^n w_i x_i^c ≥ Σ_{i=1}^n w_i x_i^j. \nUpdate: let p be the correct label. If mistake (λ ≠ p), then for i := 1 to n: w_i := α^(x_i^p − x_i^λ) w_i, and normalize the weights so that Σ_{i=1}^n w_i = 1. \n\n3.5 Mistake bound \n\nWe do not have the space to give the proof of the mistake bound of Committee, but the technique is similar to the proof of the Winnow algorithm Balanced given in [Lit89]. For the complete proof, the reader can refer to [Mes99]. \n\nTheorem 1 Committee makes at most 2 ln(n)/δ² mistakes when the target conditions in section 3.2 are satisfied and α is set to (1 − δ)^(−1/2). \n\nSurprisingly, this bound does not refer to the number of classes. The effects of larger values of k show up indirectly in the δ value. 
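The pseudocode of section 3.4 can be sketched directly in Python. This is a minimal sketch, not code from the paper; the list-of-lists input format (x[i][j] is sub-expert i's prediction for class j), the constructor arguments, and the default α are our own illustrative choices.

```python
class Committee:
    """Multiplicative-update learner over n sub-experts and k classes.

    Follows section 3.4: weights start uniform at 1/n, the prediction is
    the first class with the highest weighted vote, and on a mistake each
    weight w_i is multiplied by alpha**(x_i^p - x_i^lam), then the weights
    are renormalized to sum to 1.
    """

    def __init__(self, n, k, alpha=1.5):
        assert alpha > 1
        self.n, self.k, self.alpha = n, k, alpha
        self.w = [1.0 / n] * n  # initialize all weights to 1/n

    def predict(self, x):
        # Vote for class j is sum_i w_i * x[i][j]; ties go to the
        # first class, matching the "first class c" rule in the paper.
        votes = [sum(self.w[i] * x[i][j] for i in range(self.n))
                 for j in range(self.k)]
        return max(range(self.k), key=lambda j: votes[j])

    def update(self, x, p):
        """One trial: predict, and update only if the prediction is wrong."""
        lam = self.predict(x)
        if lam != p:
            for i in range(self.n):
                self.w[i] *= self.alpha ** (x[i][p] - x[i][lam])
            total = sum(self.w)  # renormalize to sum to 1
            self.w = [wi / total for wi in self.w]
        return lam
```

For example, with three sub-experts and two classes, one mistaken trial raises the weight of a sub-expert that voted for the correct class and lowers the weights of those that voted against it.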
\n\nWhile it is not obvious, this bound shows that Committee performs well when the target can be represented by a small fraction of the sub-experts. Call the sub-experts in the target the relevant sub-experts. Since δ is a function of the target, δ only depends on the relevant sub-experts. On the other hand, the remaining sub-experts have a small effect on the bound since they are only represented in the ln(n) factor. This means that the mistake bound of Committee is fairly stable even when adding a large number of additional sub-experts. In truth, this doesn't mean that the algorithm will have a good bound whenever there are few relevant sub-experts. In some cases, a small number of sub-experts can give an arbitrarily small δ value. (This is a general problem with all the Winnow algorithms.) What it does mean is that, given any problem, increasing the number of irrelevant sub-experts will only have a logarithmic effect on the mistake bound. \n\n4 Attributes to sub-experts \n\nOften there are no obvious sub-experts to use in solving a learning problem. Many times the only information available is a set of attributes. For attributes in [0, 1], we will show how to use Committee to learn a natural kind of k-class target function, a linear machine. To learn this target, we will transform each attribute into k separate sub-experts. We will use some of the same notation as Committee to help understand the transformation. \n\n4.1 Attribute target (linear machine) \n\nA linear machine [DH73] is a prediction function that divides the feature space into disjoint convex regions where each class corresponds to one region. The predictions are made by comparing the values of k different linear functions where each function corresponds to a class. More formally, assume there are m − 1 attributes and k classes. Let z_i ∈ [0, 1] be attribute i. 
Assume the target function is represented using k linear functions of the attributes. Let ζ(j) = Σ_{i=1}^m μ_i^j z_i be the linear function for class j, where μ_i^j is the weight of attribute i in class j. Notice that we have added one extra attribute. This attribute is set to 1 and is needed for the constant portion of the linear functions. The target function labels an instance with the class of the largest ζ function. (Ties are not defined.) Therefore, ζ(j) is similar to the voting function for class j used in Committee. \n\n4.2 Transforming the target \n\nOne difficulty with these linear functions is that they may have negative weights. Since Committee only allows targets with nonnegative weights, we need to transform to an equivalent problem that has nonnegative weights. This is not difficult. Since we are only concerned with the relative difference between the ζ functions, we are allowed to add any function to the ζ functions as long as we add it to all ζ functions. This gives us a simple procedure to remove negative weights. For example, if ζ(1) = 3z_1 − 2z_2 + 1z_3 − 4, we can add 2z_2 + 4 to every ζ function to remove the negative weights from ζ(1). It is straightforward to extend this and remove all negative weights. \n\nWe also need to normalize the weights. Again, since only the relative differences between the ζ functions matter, we can divide all the ζ functions by any constant. We normalize the weights to sum to 1, i.e., Σ_{j=1}^k Σ_{i=1}^m μ_i^j = 1. At this point, without loss of generality, assume that the original ζ functions are nonnegative and normalized. \n\nThe last step is to identify a δ value. We use the same definition of δ as Committee, substituting the corresponding ζ functions of the linear machine. Assume for trial t that p_t is the correct label. 
\n\nδ = min_{t ∈ Trials} ( min_{j ≠ p_t} ( ζ(p_t) − ζ(j) ) ) \n\n4.3 Transforming the attributes \n\nThe transformation works as follows: convert attribute z_i into k sub-experts. Each sub-expert will always vote for one of the k classes with value z_i. The target weight for each of these sub-experts is the corresponding target weight of the attribute, label pair in the ζ functions. Do this for every attribute. \n\nNotice that we are not using distributions for the sub-expert predictions. A sub-expert's prediction can be converted to a distribution by adding a constant amount to each class prediction. For example, a sub-expert that predicts .7 for class 1, 0 for class 2, and 0 for class 3 can be changed to predict .8, .1, and .1 by adding .1 to each class. This conversion does not affect the predicting or updating of Committee. \n\nTheorem 2 Committee makes at most 2 ln(mk)/δ² mistakes on a linear machine, as defined in this section, when α is set to (1 − δ)^(−1/2). \n\nProof: The above target transformation creates mk normalized target sub-experts that vote with the same ζ functions as the linear machine. Therefore, this set of sub-experts has the same δ value. Plugging these values into the bound for Committee gives the result. \n\nThis transformation provides a simple procedure for solving linear machine problems. While the details of the transformation may look cumbersome, the actual implementation of the algorithm is relatively simple. There is no need to explicitly keep track of the sub-experts. Instead, the algorithm can use a linear machine type representation. Each class keeps a vector of weights, one weight for each attribute. During an update, only the correct class weights and the predicted class weights are changed. The correct class weights are multiplied by α^(z_i); the predicted class weights are multiplied by α^(−z_i). 
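The compact linear-machine representation just described might look as follows in Python. This is a sketch under our own naming; the paper specifies the per-class multiplicative update, while the uniform initialization over the k(m+1) implicit sub-experts, the appended constant attribute, and the default α are our reading of the transformation.

```python
class LinearMachineCommittee:
    """Committee run on m attributes in [0,1] via the attribute-to-
    sub-expert transformation, using the compact representation of
    section 4.3: one weight vector per class instead of explicitly
    storing the k(m+1) sub-experts. A constant attribute fixed at 1
    is appended to carry the bias term of each linear function."""

    def __init__(self, m, k, alpha=1.5):
        assert alpha > 1
        self.k, self.alpha = k, alpha
        # k classes x (m+1) attributes = k(m+1) implicit sub-experts,
        # each starting at the uniform weight 1/(k(m+1)).
        w0 = 1.0 / (k * (m + 1))
        self.w = [[w0] * (m + 1) for _ in range(k)]

    def predict(self, z):
        zz = list(z) + [1.0]  # append the constant attribute
        votes = [sum(wj[i] * zz[i] for i in range(len(zz))) for wj in self.w]
        return max(range(self.k), key=lambda j: votes[j])

    def update(self, z, p):
        """One trial; on a mistake, only the correct and predicted
        class weight vectors change."""
        lam = self.predict(z)
        if lam != p:
            zz = list(z) + [1.0]
            for i, zi in enumerate(zz):
                # correct class weights multiplied by alpha**z_i,
                # predicted class weights by alpha**(-z_i)
                self.w[p][i] *= self.alpha ** zi
                self.w[lam][i] *= self.alpha ** (-zi)
            total = sum(sum(wj) for wj in self.w)  # renormalize over all
            self.w = [[wi / total for wi in wj] for wj in self.w]
        return lam
```

On a separable two-class toy problem (label 0 when the first attribute is active, label 1 when the second is), Theorem 2 with δ = 0.5 and α = (1 − δ)^(−1/2) = √2 bounds the mistakes by 2 ln(6)/0.25 ≈ 15, so cycling through the examples a few dozen times is enough to reach a mistake-free pass.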
\n\nThe above procedure is very similar to the Balanced algorithm from [Lit89], in fact, for k = \n2, it is identical. A similar transformation duplicates the behavior of the linear-threshold \nlearning version ofWMA as given in [Lit89]. \n\nWhile this transformation shows some advantages for k = 2, more research is needed to \ndetermine the proper way to generalize to the multi-class case. For both of these transfor(cid:173)\nmations, the bounds given in this paper are equivalent (except for a superficial adjustment \nin the 8 notation of WMA) to the original bounds given in [Lit89] . \n\n4.4 Combining attributes and sub-experts \n\nThese transformations suggest the proper way to do a hybrid algorithm that combines \nsub-experts and attributes: use the transformations to create new sub-experts from the at(cid:173)\ntributes and combine them with the original sub-experts when running Committee. It may \neven be desirable to break original sub-experts into attributes and use both in the algorithm \nbecause some sub-experts may perform better on certain classes. For example, if it is felt \nthat a sub-expert is particularly good at class 1, we can perform the following transforma-\ntion. \n\nNow, instead of using one weight for the whole sub-expert, Committee can also learn based \non the sub-expert's performance for the first class. Even if a good target is representable \nonly with the original sub-experts, these additional sub-experts will not have a large effect \nbecause of the logarithmic bound. In the same vein, it may be useful to add constant at(cid:173)\ntributes to a set of sub-experts. These add only k extra SUb-experts, but allow the algorithm \nto represent a larger set of target functions . \n\n5 Conclusion \n\nIn this paper, we have introduced Committee, a multi-class learning algorithm. We feel that \nthis algori thm will be important in practice, extending the range of problems that can be han(cid:173)\ndled by the Winnow family of algorithms. 
With a solid theoretical foundation, researchers can customize Winnow algorithms to handle various multi-class problems. \n\nPart of this customization includes feature transformations. We show how Committee can handle general linear machine problems by transforming attributes into sub-experts. This suggests a way to build a hybrid learning algorithm that allows a combination of sub-experts and attributes. These same techniques can also be used to add to the representational power on a standard sub-expert problem. \n\nIn the future, we plan to empirically test Committee and the feature transformations on real-world problems. Part of this testing will include modifying the algorithm to use extra information that is related to the proof technique [Mes99], in an attempt to lower the number of mistakes. We speculate that adjusting the multiplier to increase the change in progress per trial will be useful for certain types of multi-class problems. \n\nAcknowledgments \n\nWe thank Nick Littlestone for stimulating this work by suggesting techniques for converting the Balanced algorithm to multi-class targets. Also, we thank Haym Hirsh, Nick Littlestone, and Warren Smith for providing valuable comments and corrections. \n\nReferences \n\n[Blu95] Avrim Blum. Empirical support for winnow and weighted-majority algorithms: results on a calendar scheduling domain. In ML-95, pages 64-72, 1995. \n\n[DH73] R. O. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973. \n\n[DKR97] I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. In EMNLP-97, pages 55-63, 1997. \n\n[GR96] A. R. Golding and D. Roth. Applying winnow to context-sensitive spelling correction. In ML-96, 1996. \n\n[KW97] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, 1997. \n\n[Lit89] Nick Littlestone. Mistake bounds and linear-threshold learning algorithms. PhD thesis, University of California, Santa Cruz, 1989. Technical Report UCSC-CRL-89-11. \n\n[Lit91] Nick Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold learning using winnow. In COLT-91, pages 147-156, 1991. \n\n[Lit95] Nick Littlestone. Comparing several linear-threshold learning algorithms on tasks involving superfluous attributes. In ML-95, pages 353-361, 1995. \n\n[LM] Nick Littlestone and Chris Mesterharm. A simulation study of winnow and related algorithms. Work in progress. \n\n[MCF+94] T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski. Experience with a personal learning assistant. CACM, 37(7):81-91, 1994. \n\n[Mes99] Chris Mesterharm. A multi-class linear learning algorithm related to winnow with proof. Technical report, Rutgers University, 1999. ", "award": [], "sourceid": 1718, "authors": [{"given_name": "Chris", "family_name": "Mesterharm", "institution": null}]}