{"title": "FilterBoost: Regression and Classification on Large Datasets", "book": "Advances in Neural Information Processing Systems", "page_first": 185, "page_last": 192, "abstract": null, "full_text": "FilterBoost: Regression and Classification on Large Datasets

Joseph K. Bradley
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
jkbradle@cs.cmu.edu

Robert E. Schapire
Department of Computer Science
Princeton University
Princeton, NJ 08540
schapire@cs.princeton.edu

Abstract

We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm, which is based on a logistic regression technique proposed by Collins, Schapire, & Singer, requires fewer assumptions to achieve bounds equivalent to or better than previous work. Moreover, we give the first proof that the algorithm of Collins et al. is a strong PAC learner, albeit within the filtering setting. Our proofs demonstrate the algorithm's strong theoretical properties for both classification and conditional probability estimation, and we validate these results through extensive experiments. Empirically, our algorithm proves more robust to noise and overfitting than batch boosters in conditional probability estimation and proves competitive in classification.

1 Introduction

Boosting provides a ready method for improving existing learning algorithms for classification. Taking a weak learner as input, boosters use the weak learner to generate weak hypotheses which are combined into a classification rule more accurate than the weak hypotheses themselves. 
Boosters such as AdaBoost [1] have shown considerable success in practice.

Most boosters are designed for the batch setting, where the learner trains on a fixed example set. This setting is reasonable for many applications, yet it requires collecting all examples before training. Moreover, most batch boosters maintain distributions over the entire training set, making them computationally costly for very large datasets. To make boosting feasible on larger datasets, learners can be designed for the filtering setting. The batch setting provides the learner with a fixed training set, but the filtering setting provides an oracle which can produce an unlimited number of labeled examples, one at a time. This idealized model may describe learning problems with on-line example sources, including very large datasets which must be loaded piecemeal into memory. By using new training examples each round, filtering boosters avoid maintaining a distribution over a training set and so may use large datasets much more efficiently than batch boosters.

The first polynomial-time booster, by Schapire, was designed for filtering [2]. Later filtering boosters included two more efficient ones proposed by Freund, but both are non-adaptive, requiring a priori bounds on weak hypothesis error rates and combining weak hypotheses via unweighted majority votes [3,4]. Domingo & Watanabe's MadaBoost is competitive with AdaBoost empirically but theoretically requires weak hypotheses' error rates to be monotonically increasing, an assumption we found to be violated often in practice [5]. Bshouty & Gavinsky proposed another, but, like Freund's, their algorithm requires an a priori bound on weak hypothesis error rates [6]. 
Gavinsky's AdaFlatfilt algorithm and Hatano's GiniBoost do not have these limitations, but the former has worse bounds than other adaptive algorithms while the latter explicitly requires finite weak hypothesis spaces [7,8].

This paper presents FilterBoost, an adaptive boosting-by-filtering algorithm. We show it is applicable to both conditional probability estimation, where the learner predicts the probability of each label given an example, and classification. In Section 2, we describe the algorithm, after which we interpret it as a stepwise method for fitting an additive logistic regression model for conditional probabilities. We then bound the number of rounds and examples required to achieve any target error in (0, 1). These bounds match or improve upon those for previous filtering boosters but require fewer assumptions. We also show that FilterBoost can use the confidence-rated predictions from weak hypotheses described by Schapire & Singer [9].

In Section 3, we give results from extensive experiments. For conditional probability estimation, we show that FilterBoost often outperforms batch boosters, which prove less robust to overfitting. For classification, we show that filtering boosters' efficiency on large datasets allows them to achieve higher accuracies faster than batch boosters in many cases.

FilterBoost is based on a modification of AdaBoost by Collins, Schapire & Singer designed to minimize logistic loss [10]. Their batch algorithm has yet to be shown to achieve arbitrarily low test error, but we use techniques similar to those of MadaBoost to adapt the algorithm to the filtering setting and prove generalization bounds. The result is an adaptive algorithm with realistic assumptions and strong theoretical properties. 
Its robustness and efficiency on large datasets make it competitive with existing methods for both conditional probability estimation and classification.

2 The FilterBoost Algorithm

Let X be the set of examples and Y a discrete set of labels. For simplicity, assume X is countable, and consider only binary labels Y = {−1, +1}. We assume there exists an unknown target distribution D over labeled examples (x, y) ∈ X × Y from which training and test examples are generated. The goal in classification is to choose a hypothesis h : X → Y which minimizes the classification error Pr_D[h(x) ≠ y], where the subscript indicates that the probability is with respect to (x, y) sampled randomly from D.

In the batch setting, a booster is given a fixed training set S and a weak learner which, given any distribution Dt over training examples S, is guaranteed to return a weak hypothesis ht : X → R such that the error εt ≡ Pr_{Dt}[sign(ht(x)) ≠ y] < 1/2. For T rounds t, the booster builds a distribution Dt over S, runs the weak learner on S and Dt, and receives ht. The booster usually then estimates εt using S and weights ht with αt = αt(εt). After T rounds, the booster outputs a final hypothesis H which is a linear combination of the weak hypotheses (e.g. H(x) = Σt αt ht(x)). The sign of H(x) indicates the predicted label ŷ for x.

Two key elements of boosting are constructing Dt over S and weighting weak hypotheses. Dt is built such that misclassified examples receive higher weights than in Dt−1, eventually forcing the weak learner to classify previously poorly classified examples correctly. Weak hypotheses ht are generally weighted such that hypotheses with lower errors receive higher weights.

2.1 Boosting-by-Filtering

We describe a general framework for boosting-by-filtering which includes most existing algorithms as well as our algorithm FilterBoost. The filtering setting assumes the learner has access to an example oracle, allowing it to use entirely new examples sampled i.i.d. from D on each round. However, while maintaining the distribution Dt is straightforward in the batch setting, there is no fixed set S on which to define Dt in filtering. Instead, the booster simulates examples drawn from Dt by drawing examples from D via the oracle and reweighting them according to Dt. Filtering boosters generally accept each example (x, y) from the oracle for training on round t with probability proportional to the example's weight Dt(x, y). The mechanism which accepts examples from the oracle with some probability is called the filter.

Thus, on each round, a boosting-by-filtering algorithm draws a set of examples from Dt via the filter, trains the weak learner on this set, and receives a weak hypothesis ht. Though a batch booster would estimate εt using the fixed set S, filtering boosters may use new examples from the filter. Like batch boosters, filtering boosters may weight ht using αt = αt(εt), and they output a linear combination of h1, . . . , hT as a final hypothesis.

The filtering setting allows the learner to estimate the error of Ht to arbitrary precision by sampling from D via the oracle, so FilterBoost does this to decide when to stop boosting.

2.2 FilterBoost

FilterBoost, given in Figure 1, is modeled after the aforementioned algorithm by Collins et al. [10] and MadaBoost [5]. 
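Before the full pseudocode, the heart of the filtering mechanism described in Section 2.1 can be sketched as rejection sampling: draw from the oracle and accept with the current model's logistic weight. This is an illustrative sketch only; the oracle interface, the score function F, and all names are ours, not the paper's implementation.

```python
import math
import random

def logistic_weight(y, Fx):
    # FilterBoost-style acceptance weight: q(x, y) = 1 / (1 + exp(y * F(x))).
    # Margin 0 gives weight 1/2; confidently correct examples get weight near 0.
    return 1.0 / (1.0 + math.exp(y * Fx))

def filter_example(draw_from_oracle, F):
    # Rejection-sample one example from the reweighted distribution D_t.
    # draw_from_oracle() -> (x, y) drawn i.i.d. from D;
    # F(x) -> real-valued score of the current combined hypothesis.
    while True:
        x, y = draw_from_oracle()
        if random.random() < logistic_weight(y, F(x)):
            return x, y
```

Accepted examples are then distributed according to the reweighted distribution, so the weak learner can be trained on them directly, without per-example weights.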
Given an example oracle, weak learner, target error ε ∈ (0, 1), and confidence parameter δ ∈ (0, 1) upper-bounding the probability of failure, it iterates until the current combined hypothesis Ht has error ≤ ε. On round t, FilterBoost draws mt examples from the filter to train the weak learner and get ht. The number mt must be large enough to ensure ht has error εt < 1/2 with high probability. The edge of ht is γt = 1/2 − εt; this edge is estimated by the function getEdge(), discussed below, and is used to set ht's weight αt. The current combined hypothesis is defined as Ht = sign(Σ_{t'=1..t} αt' ht'), and we define Ft(x) ≡ Σ_{t'=1..t−1} αt' ht'(x).

Algorithm FilterBoost accepts Oracle(), ε, δ, τ:
  For t = 1, 2, 3, . . .
    δt ← δ / (3t(t+1))
    Call Filter(t, δt, ε) to get mt examples to train WL; get ht
    γ̂′t ← getEdge(t, τ, δt, ε)
    αt ← (1/2) ln((1/2 + γ̂′t) / (1/2 − γ̂′t))
    Define Ht(x) = sign(Ft+1(x))
  (Algorithm exits from the Filter() function.)

Function Filter(t, δt, ε) returns (x, y)
  Define r = # calls to Filter so far on round t
  δ′t ← δt / (r(r+1))
  For (i = 0; i < (2/ε) ln(1/δ′t); i = i + 1):
    (x, y) ← Oracle()
    qt(x, y) ← 1 / (1 + e^{yFt(x)})
    Return (x, y) with probability qt(x, y)
  End algorithm; return Ht−1

Function getEdge(t, τ, δt, ε) returns γ̂′t
  Let m ← 0, n ← 0, u ← 0, α ← ∞
  While (|u| < α(1 + 1/τ)):
    (x, y) ← Filter(t, δt, ε)
    n ← n + 1
    m ← m + I(ht(x) = y)
    u ← m/n − 1/2
    α ← √((1/(2n)) ln(n(n+1)/δt))
  Return u/(1 + τ)

Figure 1: The algorithm FilterBoost.

The Filter() function generates (x, y) from Dt by repeatedly drawing (x, y) from the oracle, calculating the weight qt(x, y) ∝ Dt(x, y), and accepting (x, y) with probability qt(x, y).

Function getEdge() uses a modification of the Nonmonotonic Adaptive Sampling method of Watanabe [11] and Domingo, Gavaldà & Watanabe [12]. Their algorithm draws an adaptively chosen number of examples from the filter and returns an estimate γ̂t of the edge of ht within relative error τ of the true edge γt with high probability. The getEdge() function revises this estimate as γ̂′t = γ̂t/(1 + τ).

2.3 Analysis: Conditional Probability Estimation

We begin our analysis of FilterBoost by interpreting it as an additive model for logistic regression, for this interpretation will later aid in the analysis for classification. Such models take the form

  log( Pr[y = 1|x] / Pr[y = −1|x] ) = Σt ft(x) = F(x),  which implies  Pr[y = 1|x] = 1 / (1 + e^{−F(x)}),

where, for FilterBoost, ft(x) = αt ht(x). Dropping subscripts, we can write the expected negative log likelihood of example (x, y) after round t as

  π(Ft + αt ht) = π(F + αh) = E[ −ln( 1 / (1 + e^{−y(F(x)+αh(x))}) ) ] = E[ ln(1 + e^{−y(F(x)+αh(x))}) ].

Taking a similar approach to the analysis of AdaBoost in [13], we show in the following theorem that FilterBoost performs an approximate stepwise minimization of this negative log likelihood. The proof is in the Appendix.

Theorem 1 Define the expected negative log likelihood π(F + αh) as above. Given F, FilterBoost chooses h to minimize a second-order Taylor expansion of π around h = 0. 
Given this h, it then chooses α to minimize an upper bound of π.

The batch booster given by Collins et al. [10] which FilterBoost is based upon is guaranteed to converge to the minimum of this objective when working over a finite sample. Note that FilterBoost uses weak learners which are simple classifiers to perform regression. AdaBoost too may be interpreted as an additive logistic regression model of the form Pr[y = 1|x] = 1/(1 + e^{−2F(x)}) with E[exp(−yF(x))] as the optimization objective [13].

2.4 Analysis: Classification

In this section, we interpret FilterBoost as a traditional boosting algorithm for classification and prove bounds on its generalization error. We first give a theorem relating errt, the error rate of the current combined hypothesis over the target distribution D, to pt, the probability with which the filter accepts a random example generated by the oracle on round t.

Theorem 2 Let errt = Pr_D[Ht−1(x) ≠ y], and let pt = E_D[qt(x, y)]. Then errt ≤ 2pt.

Proof:
  errt = Pr_D[Ht−1(x) ≠ y] = Pr_D[yFt(x) ≤ 0] = Pr_D[qt(x, y) ≥ 1/2] ≤ 2 · E_D[qt(x, y)] = 2pt,
where the inequality is Markov's. □

We next use the expected negative log likelihood π from Section 2.3 as an auxiliary function to aid in bounding the required number of boosting rounds. Viewing π as a function of the boosting round t, we can write πt = −Σ_{(x,y)} D(x, y) ln(1 − qt(x, y)). Our goal is then to minimize πt, and the following lemma captures the learner's progress in terms of the decrease in πt on each round. This lemma assumes edge estimates returned by getEdge() are exact, i.e. γ̂′t = γt, which leads to a simpler bound on T in Theorem 3. We then consider the error in edge estimates and give a revised bound in Lemma 2 and Theorem 5. The proofs of Lemmas 1 and 2 are in the Appendix.

Lemma 1 Assume for all t that γt ≠ 0 and γt is estimated exactly. Let πt = −Σ_{(x,y)} D(x, y) ln(1 − qt(x, y)). Then

  πt − πt+1 ≥ pt (1 − 2√(1/4 − γt²)).

Combining Theorem 2, which bounds the error of the current combined hypothesis in terms of pt, with Lemma 1 gives the following upper bound on the required rounds T.

Theorem 3 Let γ = mint |γt|, and let ε be the target error. Given Lemma 1's assumptions, if FilterBoost runs

  T > 2 ln(2) / ( ε (1 − 2√(1/4 − γ²)) )

rounds, then errt < ε for some t, 1 ≤ t ≤ T. In particular, this is true for T > ln(2)/(εγ²).

Proof: For all (x, y), since F1(x) = 0, then q1(x, y) = 1/2 and π1 = ln(2). Now, suppose errt ≥ ε for all t ∈ {1, ..., T}. Then, from Theorem 2, pt ≥ ε/2, so Lemma 1 gives

  πt − πt+1 ≥ (1/2) ε (1 − 2√(1/4 − γ²)).

Unraveling this recursion as Σ_{t=1..T} (πt − πt+1) = π1 − πT+1 ≤ π1 gives

  T ≤ 2 ln(2) / ( ε (1 − 2√(1/4 − γ²)) ).

So, errt ≥ ε for all t ∈ {1, ..., T} is contradicted if T exceeds the theorem's lower bound. The simplified bound follows from the first bound via the inequality 1 − √(1 − x) ≥ x/2 for x ∈ [0, 1]. □

Theorem 3 shows FilterBoost can reduce generalization error to any ε ∈ (0, 1), but we have thus far overlooked the probabilities of failure introduced by three steps: training the weak learner, deciding when to stop boosting, and estimating edges. 
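The per-round confidence allocation handled next can be checked numerically: budgeting δ/(3t(t+1)) for each of the three fallible steps on round t makes the total failure probability telescope to at most δ, since Σt 1/(t(t+1)) = 1. A small sketch under those assumptions (function names are ours):

```python
def round_confidence(delta, t):
    # Per-step failure budget on round t: delta / (3 t (t+1)).
    return delta / (3.0 * t * (t + 1))

def total_failure_budget(delta, num_rounds, steps_per_round=3):
    # Union bound over all steps of all rounds; the partial sums
    # telescope to delta * (1 - 1/(num_rounds + 1)) < delta.
    return sum(steps_per_round * round_confidence(delta, t)
               for t in range(1, num_rounds + 1))
```

Even over arbitrarily many rounds the accumulated budget never exceeds δ, which is exactly what lets FilterBoost run for an a-priori-unknown number of rounds.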
We bound the probability of each of these steps failing on round t with a confidence parameter δt = δ/(3t(t+1)) so that a simple union bound ensures the probability of some step failing to be at most FilterBoost's confidence parameter δ. Finally, we revise Lemma 1 and Theorem 3 to account for error in estimating edges.

The number mt of examples the weak learner trains on must be large enough to ensure weak hypothesis ht has a non-zero edge and should be set according to the choice of weak learner.

To decide when to stop boosting (i.e. when errt ≤ ε), we can use Theorem 2, which upper-bounds the error of the current combined hypothesis in terms of the probability pt that Filter() accepts a random example from the oracle. If the filter rejects enough examples in a single call, we know pt is small, so the current combined hypothesis is accurate enough. Theorem 4 formalizes this intuition; the proof is in the Appendix.

Theorem 4 In a single call to Filter(t), if n examples have been rejected, where n ≥ (2/ε) ln(1/δ′t), then errt ≤ ε with probability at least 1 − δ′t.

Theorem 4 provides a stopping condition which is checked on each call to Filter(). Each check may fail with probability at most δ′t = δt/(r(r+1)) on the rth call to Filter() so that a union bound ensures FilterBoost stops prematurely on round t with probability at most δt. Theorem 4 uses a similar argument to that used for MadaBoost, giving similar stopping criteria for both algorithms.

We estimate weak hypotheses' edges γt using the Nonmonotonic Adaptive Sampling (NAS) algorithm [11,12] used by MadaBoost. To compute an estimate γ̂t of the true edge γt within relative error τ ∈ (0, 1) with probability ≥ 1 − δt, the NAS algorithm uses at most (2(1 + 2τ)² / (τγt)²) ln(1/(τγtδt)) filtered examples. With this guarantee on edge estimates, we can rewrite Lemma 1 as follows:

Lemma 2 Assume for all t that γt ≠ 0 and γt is estimated to within τ ∈ (0, 1) relative error. Let πt = −Σ_{(x,y)} D(x, y) ln(1 − qt(x, y)). Then

  πt − πt+1 ≥ pt (1 − 2√(1/4 − γt² ((1 − τ)/(1 + τ))²)).

Using Lemma 2, the following theorem modifies Theorem 3 to account for error in edge estimates.

Theorem 5 Let γ = mint |γt|. Let ε be the target error. Given Lemma 2's assumptions, if FilterBoost runs

  T > 2 ln(2) / ( ε (1 − 2√(1/4 − γ² ((1 − τ)/(1 + τ))²)) )

rounds, then errt < ε for some t, 1 ≤ t ≤ T.

The bounds from Theorems 3 and 5 show FilterBoost requires at most O(ε⁻¹γ⁻²) boosting rounds. MadaBoost [5], which we test in our experiments, resembles FilterBoost but uses truncated exponential weights qt(x, y) = min{1, exp(−yFt−1(x))} instead of the logistic weights qt(x, y) = (1 + exp(yFt(x)))⁻¹ used by FilterBoost. The algorithms' analyses differ, with MadaBoost requiring the edges γt to be monotonically decreasing, but both lead to similar bounds on the number of rounds T proportional to ε⁻¹. The non-adaptive filtering boosters of Freund [3,4] and of Bshouty & Gavinsky [6] and the batch booster AdaBoost [1] have smaller bounds on T, proportional to log(ε⁻¹). However, we can use boosting tandems, a technique used by Freund [4] and Gavinsky [7], to create a filtering booster with T bounded by O(log(ε⁻¹)γ⁻²). 
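The two filtering weights contrasted above are closely related. Writing m = yF(x) for the margin (a sign convention we adopt here; MadaBoost's truncated exponential is min{1, e^(−m)} and FilterBoost's logistic weight is 1/(1 + e^m)), a short sketch makes the comparison concrete:

```python
import math

def madaboost_weight(margin):
    # Truncated exponential acceptance weight: min{1, exp(-margin)}.
    # Misclassified examples (margin < 0) are always accepted.
    return min(1.0, math.exp(-margin))

def filterboost_weight(margin):
    # Logistic acceptance weight: 1 / (1 + exp(margin)).
    return 1.0 / (1.0 + math.exp(margin))
```

For every margin the logistic weight is at most the truncated exponential weight and at least half of it, so the two schemes accept examples at comparable rates; the difference lies in the analyses the two weight functions admit.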
Following Gavinsky, we can use FilterBoost to boost the accuracy of the weak learner to some constant and, in turn, treat FilterBoost as a weak learner and use an algorithm from Freund to achieve any target error. As with AdaFlatfilt, boosting tandems turn FilterBoost into an adaptive booster with a bound on T proportional to log(ε⁻¹). (Without boosting tandems, AdaFlatfilt requires T ∝ ε⁻² rounds.) Note, however, that boosting tandems result in more complicated final hypotheses.

An alternate bound for FilterBoost may be derived using techniques from Shalev-Shwartz & Singer [14]. They use the framework of convex repeated games to define a general method for bounding the performance of online and boosting algorithms. For FilterBoost, their techniques, combined with Theorem 2, give a bound similar to that in Theorem 3 but proportional to ε⁻² instead of ε⁻¹.

Schapire & Singer [9] show AdaBoost benefits from confidence-rated predictions, where weak hypotheses return predictions whose absolute values indicate confidence. These values are chosen to greedily minimize AdaBoost's exponential loss function over training data, and this aggressive weighting can result in faster learning. FilterBoost may use confidence-rated predictions in an identical manner. In the proof of Lemma 1, the decrease in the negative log likelihood πt of the data (relative to Ht and the target distribution D) is lower-bounded by pt − pt Σ_{(x,y)} Dt(x, y) e^{−αt y ht(x)}. Since pt is fixed, maximizing this bound is equivalent to minimizing the exponential loss over Dt.

3 Experiments

Vanilla FilterBoost accepts examples (x, y) from the oracle with probability qt(x, y), but it may instead accept all examples and weight each with qt(x, y). 
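The two modes just mentioned, rejection sampling versus importance weighting, can be sketched side by side. In expectation both present the weak learner with the same reweighted distribution; filtering trades examples for speed, weighting keeps every example at the cost of a larger training set. An illustrative sketch (names and interfaces are ours):

```python
import random

def draw_filtered(examples, q, rng):
    # Filtering: keep each example independently with probability q(x, y).
    return [(x, y) for (x, y) in examples if rng.random() < q(x, y)]

def draw_weighted(examples, q):
    # Weighting: keep every example, attaching weight q(x, y) for a
    # weight-aware weak learner or estimator.
    return [(x, y, q(x, y)) for (x, y) in examples]
```

This mirrors the hybrid choice described below: filter when training the weak learner (smaller training sets), weight when estimating edges (lower-variance estimates from the same oracle draws).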
Weighting instead of filtering examples increases accuracy but also increases the size of the training set passed to the weak learner. For efficiency, we choose to filter when training the weak learner but weight when estimating edges γt. We also modify FilterBoost's getEdge() function for efficiency. The Nonmonotonic Adaptive Sampling (NAS) algorithm used to estimate edges γt uses many examples, but using several orders of magnitude fewer sacrifices little accuracy. The same is true for MadaBoost. In all tests, we use Cn log(t + 1) examples to estimate γt, where Cn = 300 and the log factor scales the number as the NAS algorithm would. For simplicity, we train weak learners with Cn log(t + 1) examples as well. These modifications mean τ (error in edge estimates) and δ (confidence) have no effect on our tests. To simulate an oracle, we randomly permute the data and use examples in the new order. In practice, filtering boosters can achieve higher accuracy by cycling through training sets again instead of stopping once examples are depleted, and we use this “recycling” in our tests.

We test FilterBoost with and without confidence-rated predictions (labeled “(C-R)” in our results). We compare FilterBoost against MadaBoost [5], which does not require an a priori bound on weak hypotheses' edges and has similar bounds without the complication of boosting tandems. We implement MadaBoost with the same modifications as FilterBoost. We test FilterBoost against two batch boosters: the well-studied and historically successful AdaBoost [1] and the algorithm from Collins et al. [10], which is essentially a batch version of FilterBoost (labeled “AdaBoost-LOG”). We test both with and without confidence-rated predictions as well as with and without resampling (labeled “(resamp)”). 
In resampling, the booster trains weak learners on small sets of examples sampled from the distribution Dt over the training set S rather than on the entire set S, and this technique often increases efficiency with little effect on accuracy. Our batch boosters use sets of size Cm log(t + 1) for training, like the filtering boosters, but use all of S to estimate edges γt since this can be done efficiently. We test the batch boosters using confidence-rated predictions and resampling in order to compare FilterBoost with batch algorithms optimized for the efficiency which boosting-by-filtering claims as its goal.

We test each booster using decision stumps and decision trees as weak learners to discern the effects of simple and complicated weak hypotheses. The decision stumps minimize training error, and the decision trees greedily maximize information gain and are pruned using 1/3 of the data. Both weak learners minimize exponential loss when outputting confidence-rated predictions.

We use four datasets, described in the Appendix. Briefly, we use two synthetic sets: Majority (majority vote) and Twonorm [15], and two real sets from the UCI Machine Learning Repository [16]: Adult (census data; from Ron Kohavi) and Covertype (forestry data with 7 classes merged to 2; copyright Jock A. Blackard & Colorado State U.). We average over 10 runs, using new examples for synthetic data (with 50,000 test examples except where stated) and cross validation for real data.

Figure 2 compares the boosters' runtimes. As expected, filtering boosters run slower per round than batch boosters on small datasets but much faster on large ones. Interestingly, filtering boosters take longer on very small datasets in some cases (not shown), for the probability the filter accepts an example quickly shrinks when the booster has seen that example many times.

Figure 2: Running times: Ada/Filter/MadaBoost. Majority; WL = stumps.

3.1 Results: Conditional Probability Estimation

In Section 2.3, we discussed the interpretation of FilterBoost and AdaBoost as stepwise algorithms for conditional probability estimation. We test both algorithms and the variants discussed above on all four datasets. We do not test MadaBoost, as it is not clear how to use it to estimate conditional probabilities. As Figure 3 shows, both FilterBoost variants are competitive with batch algorithms when boosting decision stumps. With decision trees, all algorithms except for FilterBoost overfit badly, including FilterBoost(C-R). In each plot, we compare FilterBoost with the best of AdaBoost and AdaBoost-LOG: AdaBoost was best with decision stumps and AdaBoost-LOG with decision trees. For comparison, batch logistic regression via gradient descent achieves RMSE 0.3489 and log (base e) loss .4259; FilterBoost, interpretable as a stepwise method for logistic regression, seems to be approaching these asymptotically.

Figure 3: Log (base e) loss & root mean squared error (RMSE). Majority; 10,000 train exs. Left two: WL = stumps (FilterBoost vs. AdaBoost); Right two: WL = trees (FilterBoost vs. AdaBoost-LOG).

On Adult and Twonorm, FilterBoost generally outperforms the batch boosters, which tend to overfit when boosting decision trees, though AdaBoost slightly outperforms FilterBoost on smaller datasets when boosting decision stumps.

The Covertype dataset is an exception to our results and highlights a danger in filtering and in resampling for batch learning: the complicated structure of some datasets seems to require decision trees to train on the entire dataset. With decision stumps, the filtering boosters are competitive, yet only the non-resampling batch boosters achieve high accuracies with decision trees. The first decision tree trained on the entire training set achieves about 94% accuracy, which is unachievable by any of the filtering or resampling batch boosters when using Cm = 300 as the base number of examples for training the weak learner. 
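The two conditional-probability metrics reported in this section, log (base e) loss and RMSE, can be computed directly from the boosted score F via the logistic link of Section 2.3. A minimal sketch (function names and label conventions are ours):

```python
import math

def predict_prob(Fx):
    # Conditional probability estimate Pr[y = 1 | x] = 1 / (1 + exp(-F(x))).
    return 1.0 / (1.0 + math.exp(-Fx))

def log_loss(scores, labels):
    # Mean negative log likelihood (base e); labels are in {-1, +1}.
    # -ln Pr[y | x] simplifies to ln(1 + exp(-y * F(x))).
    return sum(math.log(1.0 + math.exp(-y * Fx))
               for Fx, y in zip(scores, labels)) / len(scores)

def rmse(scores, labels):
    # Root mean squared error of Pr[y = 1 | x] against labels coded as {0, 1}.
    se = sum((predict_prob(Fx) - (1.0 if y == 1 else 0.0)) ** 2
             for Fx, y in zip(scores, labels))
    return math.sqrt(se / len(scores))
```

An uninformative model (F ≡ 0) scores log loss ln 2 ≈ 0.693 and RMSE 0.5, which gives a baseline for reading the plots.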
To compete with non-resampling batch boosters, the other boosters must use Cm on the order of 10⁵, by which point they become very inefficient.

3.2 Results: Classification

Vanilla FilterBoost and MadaBoost perform similarly in classification (Figure 4). Confidence-rated predictions allow FilterBoost to outperform MadaBoost when using decision stumps but sometimes cause FilterBoost to perform poorly with decision trees. Figure 5 compares FilterBoost with the best batch booster for each weak learner. With decision stumps, all boosters achieve higher accuracies with the larger dataset, on which filtering algorithms are much more efficient. Majority is represented well as a linear combination of decision stumps, so the boosters all learn more slowly when using the overly complicated decision trees. However, this problem generally affects filtering boosters less than most batch variants, especially on larger datasets. Adult and Twonorm gave similar results. As in Section 3.1, filtering and resampling batch boosters perform poorly on Covertype. Thus, while FilterBoost is competitive in classification, its best performance is in regression.

Figure 4: FilterBoost vs. MadaBoost.

Figure 5: FilterBoost vs. AdaBoost & AdaBoost-LOG. Majority.

References

[1] Freund, Y., & Schapire, R. E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, pp. 119-139.
[2] Schapire, R. E. (1990) The strength of weak learnability. Machine Learning, 5(2), pp. 197-227.
[3] Freund, Y. (1995) Boosting a weak learning algorithm by majority. Information and Computation, 121, pp. 256-285.
[4] Freund, Y. (1992) An improved boosting algorithm and its implications on learning complexity. 5th Annual Conference on Computational Learning Theory, pp. 391-398.
[5] Domingo, C., & Watanabe, O. (2000) MadaBoost: a modification of AdaBoost. 13th Annual Conference on Computational Learning Theory, pp. 180-189.
[6] Bshouty, N. H., & Gavinsky, D. (2002) On boosting with polynomially bounded distributions. Journal of Machine Learning Research, 3, pp. 483-506.
[7] Gavinsky, D. (2003) Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research, 4, pp. 101-117.
[8] Hatano, K. (2006) Smooth boosting using an information-based criterion. 17th International Conference on Algorithmic Learning Theory, pp. 304-319.
[9] Schapire, R. E., & Singer, Y. (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37, pp. 297-336.
[10] Collins, M., Schapire, R. E., & Singer, Y. (2002) Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48, pp. 253-285.
[11] Watanabe, O. (2000) Simple sampling techniques for discovery science. IEICE Trans. Information and Systems, E83-D(1), pp. 19-26.
[12] Domingo, C., Gavaldà, R., & Watanabe, O. (2002) Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Mining and Knowledge Discovery, 6, pp. 131-152.
[13] Friedman, J., Hastie, T., & Tibshirani, R. (2000) Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28, pp. 337-407.
[14] Shalev-Shwartz, S., & Singer, Y. (2006) Convex repeated games and Fenchel duality. Advances in Neural Information Processing Systems 20.
[15] Breiman, L. (1998) Arcing classifiers. The Annals of Statistics, 26, pp. 801-849.
[16] Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998) UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: U. of California, Dept. of Information & Computer Science.
", "award": [], "sourceid": 60, "authors": [{"given_name": "Joseph", "family_name": "Bradley", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}]}