{"title": "Boosting versus Covering", "book": "Advances in Neural Information Processing Systems", "page_first": 1109, "page_last": 1116, "abstract": "", "full_text": "Boosting versus Covering\n\nKohei Hatano\u2217\n\nTokyo Institute of Technology\n\nhatano@is.titech.ac.jp\n\nManfred K. Warmuth\n\nUC Santa Cruz\n\nmanfred@cse.ucsc.edu\n\nAbstract\n\nWe investigate improvements of AdaBoost that can exploit the fact\nthat the weak hypotheses are one-sided, i.e. either all its positive\n(or negative) predictions are correct.\nIn particular, for any set\nof m labeled examples consistent with a disjunction of k literals\n(which are one-sided in this case), AdaBoost constructs a consistent\nhypothesis by using O(k2 log m) iterations. On the other hand,\na greedy set covering algorithm \ufb01nds a consistent hypothesis of\nsize O(k log m). Our primary question is whether there is a simple\nboosting algorithm that performs as well as the greedy set covering.\nWe \ufb01rst show that InfoBoost, a modi\ufb01cation of AdaBoost pro-\nposed by Aslam for a di\ufb00erent purpose, does perform as well as\nthe greedy set covering algorithm. We then show that AdaBoost\nrequires \u2126(k2 log m) iterations for learning k-literal disjunctions.\nWe achieve this with an adversary construction and as well as in\nsimple experiments based on arti\ufb01cial data. Further we give a vari-\nant called SemiBoost that can handle the degenerate case when the\ngiven examples all have the same label. We conclude by showing\nthat SemiBoost can be used to produce small conjunctions as well.\n\n1 Introduction\n\nThe boosting method has become a powerful paradigm of machine learning. In this\nmethod a highly accurate hypothesis is built by combining many \u201cweak\u201d hypotheses.\nAdaBoost [FS97, SS99] is the most common boosting algorithm. The protocol is as\nfollows. We start with m labeled examples labeled with \u00b11. AdaBoost maintains\na distribution over the examples. 
At each iteration t, the algorithm receives a ±1 valued weak hypothesis ht whose error (weighted by the current distribution on the examples) is slightly smaller than 1/2. It then updates its distribution so that after the update, the hypothesis ht has weighted error exactly 1/2. The final hypothesis is a linear combination of the received weak hypotheses, and the algorithm stops when this final hypothesis is consistent with all examples.

It is well known [SS99] that if each weak hypothesis has weighted error at most 1/2 − γ/2, then the upper bound on the training error reduces by a factor of √(1 − γ²), and after O((1/γ²) log m) iterations, the final hypothesis is consistent with all examples. Also, it has been shown that if the final hypotheses are restricted to (unweighted) majority votes of weak hypotheses [Fre95], then this upper bound on the number of iterations cannot be improved by more than a constant factor.

* This research was done while K. Hatano was visiting UC Santa Cruz under the EAP exchange program.

However, if there always is a positively one-sided weak hypothesis (i.e. its positive predictions are always correct) that has error¹ at most 1/2 − γ/2, then a set cover algorithm can be used to reduce the training error by a factor² of 1 − γ, and O((1/γ) log m) weak hypotheses suffice to form a consistent hypothesis [Nat91]. In this paper we show that the improved factor is also achieved by InfoBoost, a modification of AdaBoost developed by Aslam [Asl00] based on a different motivation.

In particular, consider the problem of finding a consistent hypothesis for m examples labeled by a k-literal disjunction. Assume we use the literals as the pool of weak hypotheses and always choose as the weak hypothesis a literal that is consistent with all negative examples. 
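This covering view can be made concrete. Below is a minimal sketch (our own code, not from the paper; names are ours) of the greedy set covering algorithm in the disjunction setting: among the literals that are consistent with all negative examples, repeatedly pick the one covering the most still-uncovered positive examples.

```python
# Greedy set covering for learning a disjunction over +/-1 attribute vectors.
# A minimal illustration; function and variable names are our own.

def greedy_cover(examples, labels):
    n_vars = len(examples[0])
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == -1]
    # literals that never predict +1 on a negative example (one-sided literals)
    consistent = [j for j in range(n_vars)
                  if all(examples[i][j] == -1 for i in negatives)]
    uncovered = set(positives)
    chosen = []
    while uncovered:
        # pick the consistent literal covering the most uncovered positives
        best = max(consistent,
                   key=lambda j: sum(examples[i][j] == 1 for i in uncovered))
        covered = {i for i in uncovered if examples[i][best] == 1}
        if not covered:
            raise ValueError('no consistent disjunction over single literals')
        chosen.append(best)
        uncovered -= covered
    return chosen  # the hypothesis is the disjunction of these literals
```

The standard covering argument gives the O(k log m) bound quoted above: each greedy step covers at least a 1/k fraction of the remaining positives.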
Then it can be shown that, for any distribution D on the examples, there exists a literal (or a constant hypothesis) h with weighted error at most 1/2 − 1/(4k) (see e.g. [MG92]). Therefore, the upper bound on the training error of AdaBoost reduces by a factor of √(1 − 1/(4k²)) and O(k² log m) iterations suffice.

However, a trivial greedy set covering algorithm, which follows a strikingly similar protocol to the boosting algorithms, finds a consistent disjunction with O(k log m) literals. We show that InfoBoost mimics the set cover algorithm in this case (and attains the improved factor of 1 − 1/k).

We first explain the InfoBoost algorithm in terms of constraints on the updated distribution. We then show that Ω(k² log m) iterations are really required by AdaBoost, using both an explicit construction (which requires some assumptions) and artificial experiments. The differences are quite large: for m = 10,000 random examples and a disjunction of size k = 60, AdaBoost requires 2400 iterations (on average), whereas Covering and InfoBoost require 60 iterations. We then show that InfoBoost has the improved reduction factor if the weak hypotheses happen to be one-sided. Finally, we give a modified version of AdaBoost that exploits the one-sidedness of the weak hypotheses and avoids some technical problems that can occur with InfoBoost. We also discuss how this algorithm can be used to construct small conjunctions.

2 Minimizing relative entropy subject to constraints

Assume we are given a set of m examples (x1, y1), ..., (xm, ym). The instances xi are in some domain X and the labels yi are in {−1, 1}. The boosting algorithms maintain a distribution Dt over the examples. The initial distribution is D1 and is typically uniform. At the t-th iteration, the algorithm chooses a weak³ hypothesis ht : X → {−1, 1} and then updates its distribution. 
The most popular boosting algorithm does this as follows:

AdaBoost: Dt+1(i) = Dt(i) exp{−yi ht(xi) αt} / Zt.

¹ This assumes equal weight on both types of examples.
² Wipe out the weights of positive examples that are correctly classified and re-balance both types of examples.
³ For the sake of simplicity we focus on the case when the range of the labels and the weak hypotheses is ±1 valued. Many parts of this paper generalize to the range [−1, 1] [SS99, Asl00].

Here Zt is a normalization constant and the coefficient αt depends on the error εt at iteration t: αt = (1/2) ln((1 − εt)/εt), where εt = PrDt[ht(xi) ≠ yi]. The final hypothesis is given by the sign of the following linear combination of the chosen weak hypotheses: H(x) = Σ_{t=1}^T αt ht(x). Following [KW99, Laf99], we motivate the updates on the distributions of boosting algorithms as a constrained minimization of the relative entropy between the new and old distributions:

AdaBoost: Dt+1 = argmin_{D ∈ [0,1]^m, Σ_i D(i) = 1} Δ(D, Dt), s.t. PrD[ht(xi) ≠ yi] = 1/2.

Here the relative entropy is defined as Δ(D, D') = Σ_i D(i) ln(D(i)/D'(i)), and the error w.r.t. the updated distribution is constrained to half.

The constraint can be easily understood using the table of Figure 1. There are two types of misclassified examples: false positives (weight c) and false negatives (weight b). The AdaBoost constraint means b + c = 1/2 w.r.t. 
the updated distribution Dt+1.

Figure 1: Four types of examples.

yi \ ht |  +1   −1
   +1   |   a    b
   −1   |   c    d

The second boosting algorithm we discuss in this paper has the following update:

InfoBoost: Dt+1(i) = Dt(i) exp{−yi ht(xi) αt[ht(xi)]} / Zt,

where αt[±1] = (1/2) ln((1 − εt[±1])/εt[±1]), εt[±1] = PrDt[ht(xi) ≠ yi | ht(xi) = ±1], and Zt is the normalization factor. The final hypothesis is given by the sign of H(x) = Σ_{t=1}^T αt[ht(x)] ht(x).

In the original paper [Asl00], the InfoBoost update was motivated by seeking a distribution Dt+1 for which the error of ht is half and yi and ht(xi) have mutual information zero. Here we motivate InfoBoost as a minimization of the same relative entropy subject to the AdaBoost constraint b + c = 1/2 and a second simultaneously enforced constraint a + b = 1/2. Note that the second constraint is the AdaBoost constraint w.r.t. the constant hypothesis 1. A natural question is why not just do two steps of AdaBoost at each iteration t: one for ht and then, sequentially, one for 1. We call the latter algorithm AdaBoost with Bias, since the constant hypothesis introduces a bias into the final hypothesis. 
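The difference between the three updates can be computed directly at the level of the four cell weights a, b, c, d. The closed forms below are our own restatement of the update rules (e.g. the InfoBoost cells reduce to √(ac) and √(bd) before normalization; this is our derivation from the two constraints, not a formula stated in the paper); the test values are what we obtain for a positively one-sided ht with c = 0:

```python
import math

def adaboost_step(a, b, c, d):
    # error eps = b + c; correct cells shrink, wrong cells grow,
    # so that the updated error of h_t is exactly 1/2
    eps = b + c
    return (a/(2*(1-eps)), b/(2*eps), c/(2*eps), d/(2*(1-eps)))

def infoboost_step(a, b, c, d):
    # both constraints at once: unnormalized cell weights become
    # sqrt(a*c) on the h_t = +1 side and sqrt(b*d) on the h_t = -1 side
    z = 2*(math.sqrt(a*c) + math.sqrt(b*d))
    return (math.sqrt(a*c)/z, math.sqrt(b*d)/z,
            math.sqrt(a*c)/z, math.sqrt(b*d)/z)

def bias_step(a, b, c, d):
    # AdaBoost step for the constant hypothesis 1:
    # rebalance so each label class carries total weight 1/2
    return (a/(2*(a+b)), b/(2*(a+b)), c/(2*(c+d)), d/(2*(c+d)))
```

Note how InfoBoost wipes out the weight of the covered positives (cell a goes to 0 when c = 0), exactly as the covering algorithm does, while AdaBoost with Bias (adaboost_step followed by bias_step) keeps them at positive weight.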
See Figure 2 for an example of the different updates.

Figure 2: Updating based on a positively one-sided hypothesis ht (weight c is 0): the updated distributions on the four types of examples are quite different. (Cell names a, b, c, d as in Figure 1.)

Dt :                   a = 2/5, b = 2/5, c = 0, d = 1/5
Dt+1 : InfoB.          a = 0,   b = 1/2, c = 0, d = 1/2
Dt+1 : AdaB.           a = 1/3, b = 1/2, c = 0, d = 1/6
Dt+1 : AdaB. w. Bias   a = 1/5, b = 3/10, c = 0, d = 1/2

We will show in the next section that in the case of learning disjunctions, AdaBoost with Bias (and plain AdaBoost) can require many more iterations than InfoBoost and the trivial covering algorithm. This is surprising because AdaBoost with Bias and InfoBoost seem so similar to each other (simultaneous versus sequential enforcement of the same constraints). A natural extension would be to constrain the errors of all past hypotheses to half, which is the Totally Corrective Algorithm of [KW99]. However, this can lead to subtle convergence problems (see the discussion in [RW02]).

3 Lower bounds of AdaBoost for learning k-literal disjunctions

So far we did not specify how the weak hypothesis ht is chosen at iteration t. We assume there is a pool H of weak hypotheses and distinguish two methods:

Greedy: Choose an ht ∈ H for which the normalization factor Zt in the update of the algorithm is minimized.

Minimal: Choose an ht with error smaller than a given threshold 1/2 − δ.

The greedy method is motivated by the fact that ∏_t Zt upper bounds the training error of the final hypothesis ([SS99, Asl00]), and this method greedily minimizes this upper bound. Note that the Zt factors are different for AdaBoost and InfoBoost.

In our lower bounds on the number of iterations, the example set is always consistent with a k-literal monotone disjunction over N variables. 
More precisely, the instances xi are in {±1}^N and the label yi is xi,1 ∨ xi,2 ∨ ... ∨ xi,k. The pool of weak learners consists of the N literals Xj, where Xj(xi) = xi,j. For the greedy method we show that on random data sets InfoBoost and the covering algorithm use drastically fewer iterations than AdaBoost with Bias.

We chose 10,000 examples as follows: the first k bits of each example are chosen independently at random so that the probability of label +1 is half (i.e. the probability of +1 for each of the first k bits is 1 − 2^{−1/k}); the remaining N − k irrelevant bits of each example are chosen +1 with probability half. Figure 3 shows the number of iterations of AdaBoost with Bias (averaged over 20 runs) until consistency is reached on all 10,000 examples, as a function of the size k of the disjunction. The number of iterations in this very simple setting grows quadratically with k: if the number of iterations is divided by k², then the resulting curve stays above a constant. In contrast, the number of iterations of the greedy covering algorithm and InfoBoost is provably linear in k: for k = 60 and m = 10,000, the former require 60 iterations on average, whereas AdaBoost with Bias with the greedy choice of the weak hypothesis requires 1200, even though it never chooses irrelevant variables as weak learners (plain AdaBoost requires twice as many iterations).

Figure 3: Average # of steps of AdaBoost with Bias for k = 10, 20, 30, 40, 50, 60.

The above demonstration is empirical rather than theoretical. However, we now give an explicit construction for the minimal method of choosing the weak hypothesis for which the number of iterations of greedy covering and InfoBoost grows linearly in k while the number of iterations of AdaBoost with Bias is quadratic in k.

For any dimension N we define an example set consisting of the rows of the following (N + 1) × N dimensional matrix x: all entries on the main diagonal and above are +1 and the remaining entries are −1. 
In particular, the last row is all −1 (see Figure 4). The i-th instance xi is the i-th row of this matrix; the first N examples (rows) are labeled +1 and the label yN+1 of the last row is −1.

Clearly the literal XN is consistent with the labels and thus always has error 0 w.r.t. any distribution on the examples. But note that the disjunction of the last k literals is also consistent (for any k). We will construct a distribution on the rows that gives high probability to the early rows (see Figure 4) and allows the following “minimal” choice of weak hypotheses: at iteration t, AdaBoost with Bias is given the weak hypothesis Xt. This weak hypothesis will have error 1/2 − 1/(2k) (i.e. δ = 1/(2k)) w.r.t. the current distribution on the examples.

Contrary to our construction, the initial distribution for boosting applications is typically uniform. Using padding this can be accommodated, but it makes the construction more complicated. For any precision parameter ε ∈ (0, 1) and disjunction size k, we define the dimension

N := ⌈ (−ln(1/(2ε)) − ln(1 − 1/k)) / ln(1 − 2/(k(k+1))) ⌉ + 1 ≥ (k²/2) ln(1/(2ε)) − k/2.

The initial distribution D1 is defined as

D1(xt) := 1/(2k)                                              for t = 1,
D1(xt) := (1/(k(k+1))) (1 − 1/k) (1 − 2/(k(k+1)))^{t−2}        for 2 ≤ t ≤ N − 1,
D1(xt) := 1/2 − Σ_{t'=1}^{N−1} D1(xt')                         for t = N,
D1(xt) := 1/2                                                  for t = N + 1.

The example xN has the lowest probability w.r.t. D1 (see Figure 4). 
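The construction can be checked numerically. The sketch below (our own code; indices are 0-based, and the formulas follow our reading of the typeset definition of N and D1) builds the matrix, the labels, and the initial distribution, so that one can verify that D1 sums to one, that xN keeps probability at least ε, and that the first literal X1 has error exactly 1/2 − 1/(2k) under D1:

```python
import math

def build_adversary(k, eps):
    # dimension N: large enough that x_N keeps probability at least eps
    ratio = (-math.log(1/(2*eps)) - math.log(1 - 1/k)) \
            / math.log(1 - 2/(k*(k+1)))
    n = math.ceil(ratio) + 1
    # rows of the (N+1) x N matrix: +1 on and above the diagonal, -1 below
    xs = [[1 if j >= i else -1 for j in range(n)] for i in range(n + 1)]
    ys = [1]*n + [-1]              # first N rows positive, last row negative
    d1 = [0.0]*(n + 1)
    d1[0] = 1/(2*k)                                  # D1(x_1)
    for t in range(2, n):                            # D1(x_t), 2 <= t <= N-1
        d1[t-1] = (1 - 1/k)*(1 - 2/(k*(k+1)))**(t-2)/(k*(k+1))
    d1[n-1] = 0.5 - sum(d1[:n-1])                    # x_N: leftover weight
    d1[n] = 0.5                                      # the negative example
    return xs, ys, d1
```

Since X1 predicts +1 only on x1 and −1 on the negative example, its weighted error is the total weight of x2, ..., xN, which is 1/2 − 1/(2k) by construction.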
However, one can show that its probability is at least ε.

Figure 4: The examples (rows of the matrix), the labels, and the distribution D1. (The matrix has +1 on and above the main diagonal and −1 below it; the first N rows are labeled +1 and the last row −1.)

Also, for t ≤ N − 1, the probability D1(x≤t) of the first t examples is (1/2)(1 − (1 − 1/k)(1 − 2/(k(k+1)))^{t−1}). AdaBoost with Bias does two update steps at iteration t (constrain the error of ht to half and then, sequentially, the error of 1 to half):

D̂t(i) = Dt(i) exp{−yi ht(xi) αt} / Zt   and   Dt+1(i) = D̂t(i) exp{−yi α̂t} / Ẑt.

The Z's are normalization factors, αt = (1/2) ln((1 − εt)/εt) and α̂t = (1/2) ln(D̂t(x≤N)/D̂t(xN+1)). The final hypothesis is the sign of the following linear combination: H(x) = Σ_{t=1}^T αt ht(x) + Σ_{t=1}^T α̂t.

Proposition 1. For AdaBoost with Bias and t ≤ N, PrDt[Xt(xi) ≠ yi] = 1/2 − 1/(2k).

Proof. (Outline) Since each literal Xt is one-sided, Xt classifies the negative example xN+1 correctly. Since PrDt[Xt(xi) = yi] = Dt(xN+1) + Dt(x≤t) and Dt(xN+1) = 1/2, it suffices to show that Dt(x≤t) = 1/(2k) for t ≤ N. The proof is by induction on t. For t = 1, the statement follows from the definition of D1. Now assume that the statement holds for any t' < t. 
Then we have

Dt(x≤t) = Dt(x≤t−1) + Dt(xt) = Dt−1(x≤t−1) (e^{−αt−1}/Zt−1)(e^{−α̂t−1}/Ẑt−1) + Dt(xt).   (1)

Note that the example xt is not covered by any of the previous hypotheses X1, ..., Xt−1, and thus we have

Dt(xt) = D1(xt) ∏_{j=1}^{t−1} (e^{αj}/Zj)(e^{−α̂j}/Ẑj).   (2)

Using the inductive assumption that PrDt'[Xt'(xi) ≠ yi] = 1/2 − 1/(2k) for t' < t, one can show that αt' = (1/2) ln((k+1)/(k−1)), Zt' = (1/k)√((k−1)(k+1)), D̂t'(x≤N) = 1/2 + 1/(2(k+1)), D̂t'(xN+1) = 1/2 − 1/(2(k+1)), α̂t' = (1/2) ln((k+2)/k), and Ẑt' = (1/(k+1))√(k(k+2)). Substituting these values into the formulae (1) and (2) completes the proof.

Theorem 2. For the described example set, initial distribution D1, and minimal choice of weak hypotheses, AdaBoost with Bias needs at least N iterations to construct a final hypothesis whose error with respect to D1 is below ε.

Proof. Let t be any integer smaller than N. At the end of iteration t, the examples xt+1, ..., xN are not correctly classified by the past weak hypotheses X1, ..., Xt. In particular, the final linear combination evaluated at xN is

H(xN) = Σ_{j=1}^t αj Xj(xN) + Σ_{j=1}^t α̂j = −Σ_{j=1}^t αj + Σ_{j=1}^t α̂j = −(t/2) ln((k+1)/(k−1)) + (t/2) ln((k+2)/k) < 0.

Thus sign(H(xN)) = −1 and the final hypothesis has error at least D1(xN) ≥ ε with respect to D1.

To show a similar lower bound for plain AdaBoost we use the same example set and the following sequence of weak hypotheses X1, 1, X2, 1, ..., XN, 1. 
For odd iteration numbers t, the above proposition shows that the error of the weak hypothesis is 1/2 − 1/(2k), and for even iteration numbers one can show that the hypothesis 1 has error 1/2 − 1/(2(k+1)).

4 InfoBoost and SemiBoost for one-sided weak hypotheses

Aslam proved the following upper bound on the training error of InfoBoost [Asl00]:

Theorem 3. The training error of the final hypothesis produced by InfoBoost is bounded by ∏_{t=1}^T Zt, where Zt = PrDt[ht(xi) = +1] √(1 − γt[+1]²) + PrDt[ht(xi) = −1] √(1 − γt[−1]²) and the edge⁴ γt[±1] = 1 − 2εt[±1].

Let γt = 1 − 2εt. If γt[+1] = γt[−1] = γt, then Zt = √(1 − γt²), as for AdaBoost. However, if ht is one-sided, InfoBoost gives the improved factor of √(1 − γt):

Corollary 4. For t ≥ 2, if ht is one-sided w.r.t. Dt, then Zt = √(1 − γt).

⁴ The edge γ and error ε are related as follows: γ = 1 − 2ε and ε = 1/2 − γ/2; ε = 1/2 ⇔ γ = 0.

Proof. W.l.o.g. assume ht is always correct when it predicts +1. Then γt[+1] = 1 and the first summand in the expression for Zt given in the above theorem disappears. Recall that InfoBoost maintains the distribution Dt over the examples so that PrDt[yi = +1] = 1/2 for t ≥ 2. 
So the second summand becomes

PrDt[ht(xi) = −1] √(1 − γt[−1]²)
= 2 √( PrDt[ht(xi) = −1, yi = +1] PrDt[ht(xi) = −1, yi = −1] )
= 2 √( PrDt[yi = +1] PrDt[yi = −1] PrDt[ht(xi) = −1 | yi = +1] PrDt[ht(xi) = −1 | yi = −1] )
= √( PrDt[ht(xi) = −1 | yi = +1] PrDt[ht(xi) = −1 | yi = −1] )
= √( PrDt[ht(xi) = −1 | yi = +1] )   (because PrDt[ht(xi) = −1 | yi = −1] = 1 by one-sidedness).

By the definition of γt, we have

1 − γt = 2 PrDt[ht(xi) ≠ yi]
= 2 PrDt[ht(xi) = −1, yi = +1]   (because of one-sidedness of ht)
= 2 PrDt[yi = +1] PrDt[ht(xi) = −1 | yi = +1]
= PrDt[ht(xi) = −1 | yi = +1]   (because PrDt[yi = +1] = 1/2).

Hence Zt = √(1 − γt).

This corollary implies that if a one-sided hypothesis is chosen at each iteration, then InfoBoost constructs a final hypothesis consistent with all m examples within (2/γ) ln m iterations. When the considered weak hypotheses are positively one-sided, the trivial greedy covering algorithm (which simply chooses the set that covers the most uncovered positive examples) achieves the improved factor of 1 − γ, which means at most (1/γ) ln m iterations. By a careful analysis (not included), one can show that the factor for InfoBoost can be improved to 1 − γ if all weak hypotheses are one-sided. So in this case InfoBoost indeed matches the 1 − γ factor of the greedy covering algorithm.

A technical problem arises when InfoBoost is given a set of examples that are all labeled +1. Then we have α1[+1] = ∞ and α1[−1] = −∞. This implies H(xi) = α1[h1(xi)] h1(xi) = ∞ for any instance xi. Thus InfoBoost terminates in a single iteration and outputs a hypothesis that predicts +1 for any instance, and InfoBoost cannot be used for constructing a cover.

We propose a natural way to cope with this subtlety. Recall that the final hypothesis of InfoBoost is given by H(x) = Σ_{t=1}^T αt[ht(x)] ht(x). 
This doesn’t seem to be a linear combination of hypotheses from H, since the coefficients vary with the prediction of the weak hypotheses. However, observe that αt[ht(x)] ht(x) = αt[+1] h+_t(x) + αt[−1] h−_t(x), where h±(x) = h(x) if h(x) = ±1 and 0 otherwise. We call h+ and h− the semi hypotheses of h. Note that h+(x) = (h(x) + 1)/2 and h−(x) = (h(x) − 1)/2. So the final hypothesis of InfoBoost, and of the new algorithm we will define in a moment, is a bias plus a linear combination of the original weak learners in H.

We propose the following variant of AdaBoost (called SemiBoost): in each iteration execute one step of AdaBoost, but the chosen weak hypothesis must be a semi hypothesis of one of the original hypotheses h ∈ H which has a positive edge. SemiBoost avoids the outlined technical problem and can handle equally labeled example sets. Also, if all the chosen hypotheses are of the h+ type, then the final hypothesis is a disjunction. If hypotheses are chosen by smallest error (largest edge), then the greedy covering algorithm is simulated. Analogously, if all the chosen hypotheses are of the h− type, then one can show that the final hypothesis of SemiBoost is a conjunction. Furthermore, two steps of SemiBoost (with hypothesis h+ in the first step followed by the sibling hypothesis h− in the second step) are equivalent to one step of InfoBoost with hypothesis h.

Finally, we note that the final hypothesis of InfoBoost (or SemiBoost) is not well-defined when it includes both types of one-sided hypotheses, i.e. positive and negative infinite coefficients may conflict with each other. We propose two solutions. First, following [SS99], one can use the modified coefficients α[±1]' = (1/2) ln((1 − ε[±1] + Δ)/(ε[±1] + Δ)) for small Δ > 0. 
It can be shown that the new Z' increases by at most √(2Δ) ([SS99]). Second, we allow infinite coefficients but interpret the final hypothesis as a version of a decision list [Riv87]: whenever more than one semi hypothesis with an infinite coefficient is non-zero on the current instance, the semi hypothesis with the lowest iteration number determines the label. Once such a consistent decision list over some set of hypotheses ht and 1 has been found, it is easy to find an alternate linear combination of the same set of hypotheses (using linear programming) that maximizes the margin or minimizes the one-norm of the coefficient vector subject to consistency.

Conclusion: We showed that AdaBoost can require significantly more iterations than the simple greedy cover algorithm when the weak hypotheses are one-sided, and we gave a variant of AdaBoost that can readily exploit one-sidedness. The open question is whether the new SemiBoost algorithm gives improved performance on natural data and can be used for feature selection.

Acknowledgment: This research benefited from many discussions with Gunnar Rätsch. He encouraged us to analyze AdaBoost with Bias and suggested writing the final hypothesis of InfoBoost as a linear combination of semi hypotheses. We also thank the anonymous referees for helpful comments.

References

[Asl00] J. A. Aslam. Improving algorithms for boosting. In Proc. 13th Annu. Conference on Comput. Learning Theory, pages 200–207, 2000.

[Fre95] Y. Freund. Boosting a weak learning algorithm by majority. Inform. Comput., 121(2):256–285, September 1995. Also appeared in COLT90.

[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, 1997.

[KW99] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proc. 12th Annu. Conf. 
on Comput. Learning Theory, pages 134–144. ACM Press, New York, NY, 1999.

[Laf99] J. Lafferty. Additive models, boosting, and inference for generalized divergences. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 125–133. ACM, 1999.

[MG92] M. Goldmann, J. Håstad, and A. A. Razborov. Majority gates vs. general weighted threshold gates. Computational Complexity, 1(4):277–300, 1992.

[Nat91] B. K. Natarajan. Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA, 1991.

[Riv87] R. L. Rivest. Learning decision lists. Machine Learning, 2:229–246, 1987.

[RW02] G. Rätsch and M. K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory, pages 334–350. Springer, July 2002.

[SS99] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
", "award": [], "sourceid": 2532, "authors": [{"given_name": "Kohei", "family_name": "Hatano", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}