{"title": "Generalization Error and Algorithmic Convergence of Median Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 657, "page_last": 664, "abstract": null, "full_text": "
Generalization Error and Algorithmic Convergence of Median Boosting

Balázs Kégl
Department of Computer Science and Operations Research, University of Montreal
CP 6128 succ. Centre-Ville, Montreal, Canada H3C 3J7
kegl@iro.umontreal.ca

Abstract

We have recently proposed an extension of ADABOOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for ADABOOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximization to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error.

1 Introduction

In a recent paper [1] we introduced MEDBOOST, a boosting algorithm that trains base regressors and returns their weighted median as the final regressor. In another line of research, [2, 3] extended ADABOOST to boost localized or confidence-rated experts with input-dependent weighting of the base classifiers. In [4] we propose a synthesis of the two methods, which we call LOCMEDBOOST. In this paper we analyze the algorithmic convergence of MEDBOOST and LOCMEDBOOST, and provide bounds on the generalization error.

We start by describing the algorithm in its most general form, and extend the result of [1] on the convergence of the robust (marginal) training error (Section 2). The robustness of the regressor is measured in terms of the dispersion of the expert population, and with respect to the underlying average confidence estimate. In Section 3, we analyze the algorithmic convergence.
In particular, we extend recent results [5] on efficient margin maximization to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. In Section 4, we provide confidence-interval-type bounds on the generalization error by generalizing results obtained for ADABOOST [6, 2, 3]. As in the case of ADABOOST, the bounds justify the algorithmic objective of minimizing the robust training error. Note that the omitted proofs can be found in [4].

2 The LOCMEDBOOST algorithm and the convergence result

For the formal description, let the training data be D_n = {(x_1, y_1), ..., (x_n, y_n)}, where the data points (x_i, y_i) are from the set R^d × R. The algorithm maintains a weight distribution w^(t) = (w_1^(t), ..., w_n^(t)) over the data points. The weights are initialized uniformly

LOCMEDBOOST(D_n, C_ε(y′, y), BASE(D_n, w), γ, T)
 1  w^(1) ← (1/n, ..., 1/n)
 2  for t ← 1 to T
 3      (h^(t), χ^(t)) ← BASE(D_n, w^(t))                                   see (1)
 4      for i ← 1 to n
 5          θ_i ← 1 − 2 C_ε(h^(t)(x_i), y_i)                                base rewards
 6          χ_i ← χ^(t)(x_i)                                                base confidences
 7      α^(t) ← argmin_α e^{αγ} Σ_{i=1}^n w_i^(t) e^{−α χ_i θ_i}
 8      if α^(t) = ∞                                                        equivalent to χ_i θ_i ≥ γ for all i = 1, ..., n
 9          return f^(t)(·) = med_α(h(·), χ(·))
10      if α^(t) < 0                                                        equivalent to Σ_{i=1}^n w_i^(t) χ_i θ_i < γ
11          return f^(t−1)(·) = med_α(h(·), χ(·))
12      for i ← 1 to n
13          w_i^(t+1) ← w_i^(t) e^{−α^(t) χ_i θ_i} / Σ_{j=1}^n w_j^(t) e^{−α^(t) χ_j θ_j} = w_i^(t) e^{−α^(t) χ_i θ_i} / Z^(t)
14  return f^(T)(·) = med_α(h(·), χ(·))

Figure 1: The pseudocode of the LOCMEDBOOST algorithm. D_n is the training data, C_ε(y′, y) ≥ I{|y − y′| > ε} is the cost function, BASE(D_n, w) is the base regression algorithm, γ is the robustness parameter, and T is the number of iterations.

in line 1, and are updated in each iteration in line 13 (Figure 1). We suppose that we are given a base learner algorithm BASE(D_n, w) that, in each iteration t, returns a base hypothesis that consists of a real-valued base regressor h^(t) ∈ H and a non-negative base confidence function χ^(t) ∈ K.
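To make the control flow of Figure 1 concrete, the following Python sketch (our own illustrative code, not from the paper) instantiates the algorithm in its simplest special case: constant confidence functions χ(x) ≡ 1 and the 0-1 cost, for which the base rewards θ_i are ±1 and the line search of line 7 has the closed form α = ½ log((1 − γ)(1 − e)/((1 + γ)e)), where e is the weighted error. The stump base learner is a hypothetical stand-in for BASE(D_n, w); all names here are ours.

```python
import math

def locmedboost(X, y, base_learner, eps, gamma, T):
    """Sketch of Figure 1 for chi(x) = 1 and the 0-1 cost C_eps.

    base_learner(X, y, w) -> h is a hypothetical interface returning a
    real-valued base regressor; theta_i = 1 - 2*C_eps(h(x_i), y_i) is +/-1.
    """
    n = len(y)
    w = [1.0 / n] * n
    ensemble = []                                    # list of (alpha, h) pairs
    for _ in range(T):
        h = base_learner(X, y, w)
        theta = [1.0 if abs(h(xi) - yi) <= eps else -1.0 for xi, yi in zip(X, y)]
        if all(th >= gamma for th in theta):         # line 8: alpha = inf, h dominates the median
            return [(1.0, h)]
        edge = sum(wi * th for wi, th in zip(w, theta))
        if edge < gamma:                             # line 10: keep the previous regressor
            break
        e = sum(wi for wi, th in zip(w, theta) if th < 0)      # weighted error
        alpha = 0.5 * math.log((1 - gamma) * (1 - e) / ((1 + gamma) * e))
        ensemble.append((alpha, h))
        w = [wi * math.exp(-alpha * th) for wi, th in zip(w, theta)]   # line 13
        Z = sum(w)
        w = [wi / Z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted median of the base predictions (eq. (7) with rho = 0, chi = 1)."""
    vals = sorted((h(x), a) for a, h in ensemble)
    total = sum(a for _, a in vals)
    for v, _ in vals:
        if sum(a for u, a in vals if u > v) < total / 2:
            return v

def stump_learner(X, y, w):
    """Toy base learner (ours): best threshold stump by weighted 0-1 cost."""
    eps = 0.25
    best, best_cost = None, float('inf')
    for s in sorted(set(X)):
        for a in set(y):
            for b in set(y):
                h = (lambda s, a, b: lambda x: a if x < s else b)(s, a, b)
                c = sum(wi for wi, xi, yi in zip(w, X, y) if abs(h(xi) - yi) > eps)
                if c < best_cost:
                    best, best_cost = h, c
    return best
```

On a toy sample where a single stump fits every point within ε, the algorithm exits through line 8 in the first iteration; with an outlier added, it runs several boosting rounds instead.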
In general, the base learner should attempt to minimize the base objective

  e_1^(t)(D_n) = 2 Σ_{i=1}^n w_i^(t) χ^(t)(x_i) C_ε(h^(t)(x_i), y_i) − χ̄^(t),     (1)

where C_ε(y, y′) is an ε-dependent loss function satisfying

  C_ε(y, y′) ≥ C_ε^(0-1)(y, y′) = I{|y − y′| > ε},¹     (2)

and

  χ̄^(t) = Σ_{i=1}^n w_i^(t) χ^(t)(x_i)     (3)

is the average confidence of χ^(t) on the training set. Intuitively, e_1^(t)(D_n) is a mixture of the two objectives of error minimization and confidence maximization. The first term is a weighted regression loss where the weight of a point x_i is the product of its \"constant\" weight w_i^(t) and the confidence χ^(t)(x_i) of the base hypothesis. Minimizing this term means placing the high-confidence region of the base regressor in areas where the regression error is small. On the other hand, the minimization of the second term drives the high-confidence region of the base regressor into dense areas. After Theorem 1, we will explain the derivation of the base objective (1).

¹The indicator function I{A} is 1 if its argument A is true and 0 otherwise.

To simplify the notation in Figure 1 and in Theorem 1 below, we define the base rewards θ_i^(t) and the base confidences χ_i^(t) for each training point (x_i, y_i), i = 1, ..., n, base regressor h^(t), and base confidence function χ^(t), t = 1, ..., T, as

  θ_i^(t) = 1 − 2 C_ε(h^(t)(x_i), y_i)   and   χ_i^(t) = χ^(t)(x_i),     (4)

respectively.²

After computing the base rewards and the base confidences in lines 5 and 6, the algorithm sets the weight α^(t) of the base regressor h^(t) to the value that minimizes the exponential loss

  E^(t)(α) = e^{αγ} Σ_{i=1}^n w_i^(t) e^{−α χ_i θ_i},     (5)

where γ is a robustness parameter that has a role in keeping the algorithm in its operating range, in avoiding over- and underfitting, and in maximizing the margin (Section 3). If χ_i θ_i ≥ γ for all training points, then α^(t) = ∞ and E^(t)(α^(t)) = 0, so the algorithm returns the actual regressor (line 9).
Intuitively, this means that the capacity of the set of base hypotheses is too large, so we are overfitting. If α^(t) < 0, the algorithm returns the regressor built up to the last iteration (line 11). Intuitively, this means that the capacity of the set of base hypotheses is too small, so we cannot find a new base regressor that would decrease the training loss. In general, α^(t) can be found easily by line search because of the convexity of E^(t)(α). In some special cases, α^(t) can be computed analytically.

In lines 9, 11, and 14, the algorithm returns the weighted median of the base regressors. For the analysis of the algorithm, we formally define the final regressor in a more general manner. First, let α̃^(t) = α^(t) / Σ_{j=1}^T α^(j) be the normalized coefficient of the base hypothesis (h^(t), χ^(t)), and let

  c^(T)(x) = Σ_{t=1}^T α̃^(t) χ^(t)(x) = Σ_{t=1}^T α^(t) χ^(t)(x) / Σ_{t=1}^T α^(t)     (6)

be the average confidence function³ after the Tth iteration. Let f_{ρ+}^(T)(x) and f_{ρ−}^(T)(x) be the weighted (1 + ρ/c^(T)(x))/2- and (1 − ρ/c^(T)(x))/2-quantiles, respectively, of the base regressors h^(1)(x), ..., h^(T)(x) with respective weights α^(1)χ^(1)(x), ..., α^(T)χ^(T)(x) (Figure 2(a)). Formally, for any ρ ∈ R, if −c^(T)(x) < ρ < c^(T)(x), let

  f_{ρ+}^(T)(x) = min_j { h^(j)(x) : Σ_{t=1}^T α^(t)χ^(t)(x) I{h^(j)(x) < h^(t)(x)} / Σ_{t=1}^T α^(t)χ^(t)(x) < (1 − ρ/c^(T)(x))/2 },     (7)

  f_{ρ−}^(T)(x) = max_j { h^(j)(x) : Σ_{t=1}^T α^(t)χ^(t)(x) I{h^(j)(x) > h^(t)(x)} / Σ_{t=1}^T α^(t)χ^(t)(x) < (1 − ρ/c^(T)(x))/2 },     (8)

otherwise (including the case when c^(T)(x) = 0) let f_{ρ+}^(T)(x) = ∞ and f_{ρ−}^(T)(x) = −∞.⁴
Then the weighted median is defined as f^(T)(·) = med_α(h(·), χ(·)) = f_{0+}^(T)(·).

²Note that we will omit the iteration index (t) where it does not cause confusion.
³Not to be confused with χ̄^(t) in (3), which is the average base confidence over the training data.
⁴In the degenerate case we define ρ/0 = 0/0 = ∞.

[Figure 2 appears here.] Figure 2: (a) Weighted (1 + ρ/c^(T)(x))/2- and (1 − ρ/c^(T)(x))/2-quantiles, and the weighted median of linear base regressors with equal weights α^(t) = 1/9, constant base confidence functions χ(x) ≡ 1, and ρ/c^(T)(x) = 0.25. (b) ρ-robust ε-precise regressor.

To assess the final regressor f^(T)(·), we say that f^(T)(·) is ρ-robust ε-precise on (x_i, y_i) if and only if f_{ρ+}^(T)(x_i) ≤ y_i + ε and f_{ρ−}^(T)(x_i) ≥ y_i − ε. For ρ ≥ 0, this condition is equivalent to both quantiles being in the \"ε-tube\" around y_i (Figure 2(b)).

In the rest of this section we show that the algorithm minimizes the relative frequency of training points on which f^(T)(·) is not ρ-robust ε-precise. Formally, let the ρ-robust ε-precise training error of f^(T) be defined as

  L^(ρ)(f^(T)) = (1/n) Σ_{i=1}^n I{ f_{ρ+}^(T)(x_i) > y_i + ε ∨ f_{ρ−}^(T)(x_i) < y_i − ε }.⁵     (9)

If ρ = 0, L^(0)(f^(T)) gives the relative frequency of training points on which the regressor f^(T) has a larger L_1 error than ε. If we have equality in (2), this is exactly the average loss of the regressor f^(T) on the training data. A small value of L^(0)(f^(T)) indicates that the regressor predicts most of the training points with ε-precision, whereas a small value of L^(ρ)(f^(T)) with a positive ρ suggests that the prediction is not only precise but also robust in the sense that a small perturbation of the base regressors and their weights will not increase L^(0)(f^(T)).
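As a concrete illustration of definitions (6)-(9), the following Python sketch (our own code; the function names and list-based representation are ours, not the paper's) computes the weighted quantiles f_{ρ+} and f_{ρ−} at a single input and the ρ-robust ε-precise training error.

```python
def robust_quantiles(h_vals, alphas, chis, rho):
    """f_plus and f_minus of eqs. (7)-(8): weighted (1 -/+ rho/c)/2-quantiles of the
    base predictions h_vals at one input x, with weights alpha_t * chi_t(x)."""
    A = sum(alphas)
    mass = [a * c for a, c in zip(alphas, chis)]
    W = sum(mass)
    c = W / A                                # average confidence c(x), eq. (6)
    if not -c < rho < c:                     # includes the degenerate case c = 0
        return float('inf'), float('-inf')
    thr = (1.0 - rho / c) / 2.0
    f_plus = min(hj for hj in h_vals
                 if sum(m for hv, m in zip(h_vals, mass) if hv > hj) / W < thr)
    f_minus = max(hj for hj in h_vals
                  if sum(m for hv, m in zip(h_vals, mass) if hv < hj) / W < thr)
    return f_plus, f_minus

def robust_training_error(H, alphas, chi_vals, y, rho, eps):
    """L^(rho)(f) of eq. (9): fraction of points that are not rho-robust eps-precise.
    H[i] lists the base predictions at x_i, chi_vals[i] the confidences chi_t(x_i)."""
    bad = 0
    for h_i, chi_i, y_i in zip(H, chi_vals, y):
        fp, fm = robust_quantiles(h_i, alphas, chi_i, rho)
        if fp > y_i + eps or fm < y_i - eps:
            bad += 1
    return bad / len(y)
```

With ρ = 0 the two quantiles coincide at the weighted median; increasing ρ pushes them apart, so the ρ-robust ε-precise condition becomes harder to satisfy.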
For classification with bi-valued base classifiers h: R^d → {−1, 1}, the definition (9) (with ε = 1) recovers the traditional notion of robust training error, that is, L^(ρ)(f^(T)) is the relative frequency of data points with margin smaller than ρ.

The following theorem upper bounds the ρ-robust ε-precise training error L^(ρ) of the regressor f^(T) output by LOCMEDBOOST.

Theorem 1 Let L^(ρ)(f^(T)) be defined as in (9), and suppose that condition (2) holds for the loss function C_ε(·, ·). Define the base rewards θ_i^(t) and the base confidences χ_i^(t) as in (4). Let w_i^(t) be the weight of training point x_i after the tth iteration (updated in line 13 in Figure 1), and let α^(t) be the weight of the base regressor h^(t)(·) (computed in line 7 in Figure 1). Then for all ρ ∈ R,

  L^(ρ)(f^(T)) ≤ Π_{t=1}^T E_ρ^(t)(α^(t)),     (10)

where E_ρ^(t)(α) is defined as in (5) with ρ in the place of γ.

⁵For the sake of simplicity, in the notation we suppress the fact that L^(ρ) depends on the whole sequence of base regressors, base confidences, and weights, not only on the final regressor f^(T).

The proof is based on the observation that if the median of the base regressors is farther than ε from the real response y_i at a training point x_i, then most of the base regressors must also be far from y_i, giving small base rewards to this point.

The goal of LOCMEDBOOST is to minimize L^(ρ)(f^(T)) at ρ = γ so, in view of Theorem 1, our goal in each iteration t is to minimize E^(t) (5). To derive the base objective (1), we follow the two-step functional gradient descent procedure [7], that is, first we maximize the negative gradient −E′(α) at α = 0, then we do a line search to determine α^(t). Using this approach, the base objective becomes e_1(D_n) = −Σ_{i=1}^n w_i^(t) χ_i θ_i, which is identical to (1).
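Since E^(t)(α) of (5) is convex in α, the minimization in line 7 of Figure 1 can be carried out by bisection on its derivative. The sketch below is our own illustrative code, not the paper's; it also detects the two boundary cases of Figure 1: if χ_i θ_i ≥ γ for every point the loss can be driven to 0 as α → ∞ (line 8), and if χ_i θ_i ≤ γ for every point the infimum is approached as α → −∞, so in particular α^(t) < 0 and the algorithm would stop (line 10).

```python
import math

def medboost_alpha(w, ct, gamma, iters=100):
    """Minimize E(alpha) = exp(alpha*gamma) * sum_i w_i * exp(-alpha*ct_i)
    by bisection on the derivative of the convex E, where ct_i = chi_i * theta_i."""
    def dE(a):
        # derivative of E at alpha = a; increasing in a because E is convex
        return sum(wi * (gamma - c) * math.exp(a * (gamma - c)) for wi, c in zip(w, ct))
    if all(c >= gamma for c in ct):      # line 8 of Figure 1: E -> 0 as alpha -> inf
        return math.inf
    if all(c <= gamma for c in ct):      # optimum at alpha -> -inf; edge below gamma
        return -math.inf
    lo, hi = -1.0, 1.0
    while dE(lo) > 0:                    # expand the bracket around the root of dE
        lo *= 2
    while dE(hi) < 0:
        hi *= 2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dE(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In the binary special case ct_i ∈ {−1, 1} with γ = 0, the routine reproduces the familiar closed form ½ log((1 − e)/e) of ADABOOST, where e is the weighted error.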
Note that since E^(t)(α) is convex and E^(t)(0) = 1, a positive α^(t) means that min_α E^(t)(α) = E^(t)(α^(t)) < 1, so the condition in line 10 in Figure 1 guarantees that the upper bound of (10) decreases in each step.

3 Setting γ and maximizing the minimum margin

In practice, ADABOOST works well with γ = 0, so setting γ to a positive value is only an alternative regularization option to early stopping. In the case of LOCMEDBOOST, however, one must carefully choose γ to keep the algorithm in its operating range and to avoid over- and underfitting. A too small γ means that the algorithm can overfit and stop in line 9. In binary classification this is an unrealistic situation: it means that there is a base classifier that correctly classifies all data points. On the other hand, it can happen easily in the abstaining classifier/regressor model, when χ^(t)(x) = 0 on a possibly large input region. In this case, a base classifier can correctly classify (or a base regressor can give positive base rewards θ_i to) all data points on which it does not abstain, so if γ = 0, the algorithm stops in line 9. At the other end of the spectrum, a large γ can make the algorithm underfit and stop in line 11, so one needs to set γ carefully in order to avoid early stopping in lines 9 or 11.

From the point of view of generalization, γ also has an important role as a regularization parameter. A larger γ decreases the step size α^(t) in the functional gradient view. From another aspect, a larger γ decreases the effective capacity of the class of base hypotheses by restricting the set of admissible base hypotheses to those having small errors.
In general, γ has a potential role in balancing between over- and underfitting so, in practice, we suggest that it be validated together with the number of iterations T and other possible complexity parameters of the base hypotheses.

In the context of ADABOOST, there have been several proposals to set γ in an adaptive way to effectively maximize the minimum margin. In the rest of this section, we extend the analysis of marginal boosting [5] to this general case. Although the aggressive maximization of the minimum margin can lead to overfitting, the analysis can provide valuable insight into the understanding of LOCMEDBOOST, and so it can guide the setting of γ in practice.

For the sake of simplicity, let us assume that base hypotheses (h, χ) come from a finite set⁶ H_N with cardinality N, and let H^(t) = {(h^(1), χ^(1)), ..., (h^(t), χ^(t))} be the set of base hypotheses after the tth iteration. Let us define the edge of the base hypothesis (h, χ) ∈ H_N as⁷

  γ_(h,χ)(w) = Σ_{i=1}^n w_i χ_i θ_i = Σ_{i=1}^n w_i χ(x_i)(1 − 2 C_ε(h(x_i), y_i)),

and the maximum edge in the tth iteration as γ*^(t) = max_{(h,χ)∈H_N} γ_(h,χ)(w^(t)). Note that γ_(h,χ)(w) = −e_1(D_n), so with this terminology, the objective of the base learner is to maximize the edge γ^(t) = γ_(h^(t),χ^(t))(w^(t)) (if the maximum is achieved, then γ^(t) = γ*^(t)), and the algorithm stops in line 11 if the edge γ^(t) is less than γ. On the other hand, let us define the margin on a point (x, y) as the average reward⁸

  ρ_(x,y)(α̃) = Σ_{j=1}^N α̃^(j) χ^(j) θ^(j) = Σ_{j=1}^N α̃^(j) χ^(j)(x)(1 − 2 C_ε(h^(j)(x), y)).

Let us denote the minimum margin over the data points in the tth iteration by

  ρ^(t) = min_{(x,y)∈D_n} ρ_(x,y)(α̃^(t−1)),     (11)

where α̃^(t−1) = (α̃^(1), ..., α̃^(t−1)) is the vector of base hypothesis coefficients up to the (t − 1)th iteration.

⁶The analysis can be extended to infinite base sets along the lines of [5].
⁷For the sake of simplicity, in the notation we suppress the dependence of γ_(h,χ) on D_n.

It is easy to see that in each iteration, the maximum edge over the base hypotheses is at least the minimum margin over the training points:

  γ*^(t) = max_{(h,χ)∈H_N} γ_(h,χ)(w^(t)) ≥ min_{(x,y)∈D_n} ρ_(x,y)(α̃^(t−1)) = ρ^(t).

Moreover, as several authors (e.g., [5]) noted in the context of ADABOOST, by the Min-Max Theorem of von Neumann [8] we have

  γ* = min_w max_{(h,χ)∈H_N} γ_(h,χ)(w) = max_α̃ min_{(x,y)∈D_n} ρ_(x,y)(α̃) = ρ*,

so the minimum achievable maximal edge by any weighting over the training points is equal to the maximum achievable minimal margin by any weighting over the base hypotheses. To converge to ρ* within a factor ν in finite time, [5] sets

  γ_RW^(t) = min_{j=1,...,t} γ^(j) − ν,

and shows that ρ^(t) exceeds ρ* − ν after (2 log n)/ν² + 1 steps.

In the following, we extend these results to the general case of LOCMEDBOOST. First we define the minimum and maximum achievable base rewards by

  θ_min = min_{(h,χ)∈H_N} min_{(x,y)∈D_n} χ(x)(1 − 2 C_ε(h(x), y)),     (12)

  θ_max = max_{(h,χ)∈H_N} max_{(x,y)∈D_n} χ(x)(1 − 2 C_ε(h(x), y)),     (13)

respectively. Let A = θ_max − θ_min, γ̃ = γ − θ_min, and γ̃^(t) = γ^(t) − θ_min.⁹

Lemma 1 (Generalization of Lemma 3 in [5]) Assume that θ_min ≤ γ ≤ γ^(t). Then

  E_γ^(t)(α^(t)) ≤ exp( −(γ̃/A) log(γ̃/γ̃^(t)) − ((A − γ̃)/A) log((A − γ̃)/(A − γ̃^(t))) ).     (14)

Finite convergence of LOCMEDBOOST, both with γ^(t) = γ = const. and with an adaptive γ^(t) = γ_RW^(t), is based on the following general result.

Theorem 2 Assume that γ^(t) ≥ γ*^(t) − ν. Let γ̄ = Σ_{t=1}^T α̃^(t) γ^(t). Then L^(γ̄)(f^(T)) = 0 (so ρ^(T) > γ̄) after at most T = (A² log n)/(2ν²) + 1 iterations.

⁸For the sake of simplicity, in the notation we suppress the dependence of ρ_(x,y) on H_N.
⁹In binary classification, θ_min = −1, θ_max = 1, A = 2, γ̃ = 1 + γ, and γ̃^(t) = 1 + γ^(t).

The first consequence is the convergence of LOCMEDBOOST with a constant γ.

Corollary 1 (Generalization of Corollary 4 in [5]) Assume that the weak learner always achieves an edge γ^(t) ≥ γ*.
If θ_min < γ ≤ γ*, then ρ^(T) > γ after at most T = (A² log n)/(2(γ* − γ)²) + 1 steps.

The second corollary shows that if γ^(t) is set adaptively to γ_RW^(t), then the minimum margin ρ^(t) will converge to ρ* within a precision ν in a finite number of steps.

Corollary 2 (Generalization of Theorem 6 in [5]) Assume that the weak learner always achieves an edge γ^(t) ≥ γ*^(t). If θ_min ≤ γ_RW^(t) = min_{j=1,...,t} γ^(j) − ν, ν > 0, then ρ^(t) > ρ* − ν after at most T = (A² log n)/(2ν²) + 1 iterations.

4 The generalization error

In this section we extend probabilistic bounds on the generalization error obtained for ADABOOST [6], confidence-rated ADABOOST [2], and localized boosting [3]. Here we suppose that the data set D_n is generated independently according to a distribution D over R^d × R. The results provide bounds on the confidence-interval-type error

  L(f^(T)) = P_D{ |f^(T)(X) − Y| > ε },

where (X, Y) is a random point generated according to D independently from the points in D_n. The bounds state that with large probability,

  L(f^(T)) < L^(ρ)(f^(T)) + C(n, ρ, H, K),

where the complexity term C depends on the size or the pseudodimension of the base regressor set H, and on the smoothness of the base confidence functions in K. As in the case of ADABOOST, these bounds qualitatively justify the minimization of the robust training error L^(ρ)(f^(T)).

Let C be the set of combined regressors obtained as a weighted median of base regressors from H, that is,

  C = { f(·) = med_α(h(·), χ(·)) : h ∈ H^N, α ∈ R_+^N, χ ∈ K^N, N ∈ Z_+ }.

In the simplest case, we assume that H is finite and the base confidence functions are constant.

Theorem 3 (Generalization of Theorem 1 in [6]) Let D be a distribution over R^d × R, and let D_n be a sample of n points generated independently at random according to D. Assume that the base regressor set H is finite, and K contains only χ(x) ≡ 1.
Then with\nprobability 1 - over the random choice of the training set Dn, any f C satisfies the\nfollowing bound for all > 0:\n\n 1 log n log 1 1/2\n L(f ) < L()(f ) + O |H| + log .\n n 2 \n\n\nSimilarly to the proof of Theorem 1 in [6], we construct a set CN that contains\nunweighted medians of N base functions from H, then approximate f by g() =\nmed1 h1(),...,hN() CN where the base functions hi are selected randlomly ac-\ncording to the coefficient distribution . We then separate the one-sided error into two\nterms by\n\nPD f (X) > Y + PD g+(X) > Y + + PD g+(X)\n 2 2 Y + f(X) > Y + ,\n\n\f\nand then upper bound the two terms as in [6].\n\nThe second theorem extends the first to the case of infinite base regressor sets.\n\nTheorem 4 (Generalization of Theorem 2 of [6]) Let D be a distribution over Rd R,\nand let Dn be a sample of n points generated independently at random according to D.\nAssume that the base regressor set H has pseudodimension p, and K contains only (x) \n1. Then with probability 1 - over the random choice of the training set Dn, any f C\nsatisfies the following bound for all > 0:\n\n 1/2\n 1 p log2(n/p) 1\n L(f ) < L()(f ) + O + log .\n n 2 \n\n\nThe proof goes as in Theorem 3 and in Theorem 2 in [6] until we upper bound the shatter\n\ncoefficient of the set A = (x, y) : g+(x) > y + : g , . . . , 2N by\n 2 CN, = 0, 4N N\n(N/2 + 1)(en/p)pN where p is the pseudodimension of H (or the VC dimension of H+ =\n {(x,y) : h(x) > y} : h H ).\nIn the most general case K can contain smooth functions.\nTheorem 5 (Generalization of Theorem 1 of [3]) Let D be a distribution over Rd R,\nand let Dn be a sample of n points generated independently at random according to D.\nAssume that the base regressor set H has pseudodimension p, and K contains functions\n(x) which are lower bounded by a constant a, and which satisfy for all x, x Rd the\nLipschitz condition |(x) - (x )| L x - x . 
Then with probability 1 − δ over the random choice of the training set D_n, any f ∈ C satisfies the following bound for all ρ > 0:

  L(f) < L^(ρ)(f) + O( (1/√n) ( ((L/(aρ))^d p log²(n/p))/ρ² + log(1/δ) )^{1/2} ).

5 Conclusion

In this paper we have analyzed the algorithmic convergence of LOCMEDBOOST by generalizing recent results on efficient margin maximization, and provided bounds on the generalization error by extending similar bounds obtained for ADABOOST.

References

[1] B. Kégl, \"Robust regression by boosting the median,\" in Proceedings of the 16th Conference on Computational Learning Theory, Washington, D.C., 2003, pp. 258-272.
[2] R. E. Schapire and Y. Singer, \"Improved boosting algorithms using confidence-rated predictions,\" Machine Learning, vol. 37, no. 3, pp. 297-336, 1999.
[3] R. Meir, R. El-Yaniv, and S. Ben-David, \"Localized boosting,\" in Proceedings of the 13th Annual Conference on Computational Learning Theory, 2000, pp. 190-199.
[4] B. Kégl, \"Confidence-rated regression by boosting the median,\" Tech. Rep. 1241, Department of Computer Science, University of Montreal, 2004.
[5] G. Rätsch and M. K. Warmuth, \"Efficient margin maximizing with boosting,\" Journal of Machine Learning Research (submitted), 2003.
[6] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, \"Boosting the margin: a new explanation for the effectiveness of voting methods,\" Annals of Statistics, vol. 26, no. 5, pp. 1651-1686, 1998.
[7] L. Mason, P. Bartlett, J. Baxter, and M. Frean, \"Boosting algorithms as gradient descent,\" in Advances in Neural Information Processing Systems, vol. 12, pp. 512-518, The MIT Press, 2000.
[8] J. von Neumann, \"Zur Theorie der Gesellschaftsspiele,\" Math. Ann., vol. 100, pp. 295-320, 1928.
", "award": [], "sourceid": 2606, "authors": [{"given_name": "Bal\u00e1zs", "family_name": "K\u00e9gl", "institution": null}]}