Advances in Neural Information Processing Systems (NeurIPS), pp. 11963–11972.

Margin-Based Generalization Lower Bounds for Boosted Classifiers

Allan Grønlund‡§   Lior Kamma§   Kasper Green Larsen§   Alexander Mathiasen§   Jelani Nelson∗

Abstract

Boosting is one of the most successful ideas in machine learning. The most well-accepted explanations for the low generalization error of boosting algorithms such as AdaBoost stem from margin theory. The study of margins in the context of boosting algorithms was initiated by Schapire, Freund, Bartlett and Lee (1998) and has inspired numerous boosting algorithms and generalization bounds. To date, the strongest known generalization bound (an upper bound) is the kth margin bound of Gao and Zhou (2013).
Despite the numerous generalization upper bounds that have been proved over the last two decades, nothing is known about the tightness of these bounds. In this paper, we give the first margin-based lower bounds on the generalization error of boosted classifiers. Our lower bounds nearly match the kth margin bound and thus almost settle the generalization performance of boosted classifiers in terms of margins.

1 Introduction

Boosting algorithms produce highly accurate classifiers by combining several less accurate classifiers and are amongst the most popular learning algorithms, obtaining state-of-the-art performance on several benchmark machine learning tasks [KMF+17, CG16]. The most famous of these boosting algorithms is arguably AdaBoost [FS97]. For binary classification, AdaBoost takes a training set S = ⟨(x1, y1), ..., (xm, ym)⟩ of m labeled samples as input, with xi ∈ X and labels yi ∈ {−1, 1}. It then produces a classifier f in iterations: in the jth iteration, a base classifier h_j : X → {−1, 1} is trained on a reweighed version of S that emphasizes data points that f struggles with, and this classifier is then added to f. The final classifier is obtained by taking the sign of f(x) = Σ_j α_j h_j(x), where the α_j's are non-negative coefficients carefully chosen by AdaBoost. The base classifiers h_j all come from a hypothesis set H; e.g., H could be a set of small decision trees or similar. As AdaBoost's training progresses, more and more base classifiers are added to f, which in turn causes the training error of f to decrease. If H is rich enough, AdaBoost will eventually classify all the data points in the training set correctly [FS97].

Early experiments with AdaBoost report a surprising generalization phenomenon [SFBL98].
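The boosting loop just described can be sketched in a few lines. The following is a minimal illustration (not the authors' code): threshold stumps on a toy 1-d data set play the role of the hypothesis set H, and all names and parameter values are ours.

```python
import numpy as np

def adaboost(X, y, stumps, T=50):
    """Minimal AdaBoost sketch: X is an (m, n) array, y in {-1, +1}^m,
    and `stumps` is a list of hypotheses h with h(X) in {-1, +1}^m."""
    m = len(y)
    w = np.full(m, 1.0 / m)          # sample weights, reweighed each round
    alphas, chosen = [], []
    for _ in range(T):
        # pick the base classifier with the smallest weighted error
        errs = [np.sum(w * (h(X) != y)) for h in stumps]
        j = int(np.argmin(errs))
        err = max(errs[j], 1e-12)    # clamp to avoid log(0) for a perfect stump
        if err >= 0.5:               # no base classifier beats random guessing
            break
        alpha = 0.5 * np.log((1 - err) / err)   # AdaBoost's coefficient
        pred = stumps[j](X)
        w = w * np.exp(-alpha * y * pred)       # emphasize points f struggles with
        w = w / w.sum()
        alphas.append(alpha)
        chosen.append(stumps[j])
    a = np.array(alphas) / np.sum(alphas)       # normalize so sum of alpha_j = 1
    def f(Xq):                                  # the voting classifier f
        return sum(aj * h(Xq) for aj, h in zip(a, chosen))
    return f

# Toy usage: six 1-d points, threshold stumps at a few cut points.
X = np.array([[0.1], [0.2], [0.4], [0.6], [0.8], [0.9]])
y = np.array([-1, -1, -1, 1, 1, 1])
stumps = [lambda Z, t=t, s=s: s * np.sign(Z[:, 0] - t + 1e-9).astype(int)
          for t in (0.3, 0.5, 0.7) for s in (1, -1)]
f = adaboost(X, y, stumps)
margins = y * f(X)    # margin(x_i) = y_i f(x_i), each in [-1, 1]
```

Since the normalized coefficients sum to one, f maps into [−1, 1], so `margins` are exactly the margin quantities discussed next.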
Even after perfectly classifying the entire training set, further iterations keep improving the test accuracy. This is contrary to what one would expect, as f gets more complicated with more iterations, and thus more prone to overfitting. The most prominent explanation for this phenomenon is margin theory, introduced by Schapire et al. [SFBL98]. The margin of a training point (xi, yi) is a number in [−1, 1], which can be interpreted, loosely speaking, as the classifier's confidence on that point. Formally, we say that f(x) = Σ_j α_j h_j(x) is a voting classifier if α_j ≥ 0 for all j.

‡All authors contributed equally, and are presented in alphabetical order.
§Department of Computer Science, Aarhus University, {jallan,lior.kamma,larsen,alexmath}@cs.au.dk
∗Department of EECS, UC Berkeley, minilek@berkeley.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Note that one can additionally assume without loss of generality that Σ_j α_j = 1, since normalizing each α_j by Σ_j α_j leaves the sign of f(xi) unchanged. The margin of a point (xi, yi) with respect to a voting classifier f is then defined as

    margin(xi) := yi f(xi) = yi Σ_j α_j h_j(xi).

Thus margin(xi) ∈ [−1, 1], and if margin(xi) > 0, then taking the sign of f(xi) correctly classifies (xi, yi). Informally speaking, margin theory guarantees that voting classifiers with large (positive) margins have a smaller generalization error. Experimentally, AdaBoost has been found to continue to improve the margins even when training past the point of perfectly classifying the training set. Margin theory may therefore explain the surprising generalization phenomenon of AdaBoost. Indeed, the original paper by Schapire et al. [SFBL98] that introduced margin theory proved the following margin-based generalization bound.
Let D be an unknown distribution over X × {−1, 1} and assume that the training data S is obtained by drawing m i.i.d. samples from D. Then with high probability over S it holds that for every margin θ ∈ (0, 1], every voting classifier f satisfies

    Pr_{(x,y)∼D}[yf(x) ≤ 0] ≤ Pr_{(x,y)∼S}[yf(x) < θ] + O(√(ln|H| ln m / (θ²m))).    (1)

The left-hand side of the equation is the out-of-sample error of f (since sign(f(x)) ≠ y precisely when yf(x) < 0). On the right-hand side, we use (x, y) ∼ S to denote a uniform random point from S. Hence Pr_{(x,y)∼S}[yf(x) < θ] is the fraction of training points with margin less than θ. The last term is increasing in |H| and decreasing in θ and m. Here it is assumed that H is finite. A similar bound can be proved for infinite H by replacing ln|H| with d lg m, where d is the VC-dimension of H. This holds for all the generalization bounds below as well. The generalization bound thus shows that f has low out-of-sample error if it attains large margins on most training points. This fits well with the observed behaviour of AdaBoost in practice.

The generalization bound above holds for every voting classifier f, i.e. regardless of how f was obtained. Hence a natural goal is to design boosting algorithms that produce voting classifiers with large margins on many points. This has been the focus of a long line of research and has resulted in numerous algorithms with various margin guarantees, see e.g. [GS98, Bre99, BDST00, RW02, RW05, GLM19]. One of the most well-known of these is Breiman's ArcGV [Bre99]. ArcGV produces a voting classifier maximizing the minimal margin, i.e. it produces a classifier f for which min_{(x,y)∈S} yf(x) is as large as possible.
Breiman complemented the algorithm with a generalization bound stating that with high probability over the sample S, it holds that every voting classifier f satisfies:

    Pr_{(x,y)∼D}[yf(x) ≤ 0] ≤ O(ln|H| ln m / (θ̂²m)),    (2)

where θ̂ = min_{(x,y)∈S} yf(x) is the minimal margin over all training examples. Notice that if one chooses θ as the minimal margin in the generalization bound (1) of Schapire et al. [SFBL98], then the term Pr_{(x,y)∼S}[yf(x) < θ] becomes 0 and one obtains the bound

    Pr_{(x,y)∼D}[yf(x) ≤ 0] ≤ O(√(ln|H| ln m / (θ̂²m))),

which is weaker than Breiman's bound and motivated his focus on maximizing the minimal margin. The minimal margin is, however, quite sensitive to outliers, and work by Gao and Zhou [GZ13] proved a generalization bound which provides an interpolation between (1) and (2). Their bound is known as the kth margin bound, and states that with high probability over the sample S, it holds for every margin θ ∈ (0, 1] and every voting classifier f that:

    Pr_{(x,y)∼D}[yf(x) < 0] ≤ Pr_{(x,y)∼S}[yf(x) < θ] + O( ln|H| ln m / (θ²m) + √( Pr_{(x,y)∼S}[yf(x) < θ] · ln|H| ln m / (θ²m) ) ).

The kth margin bound remains the strongest margin-based generalization bound to date (see Section 1.2 for further details). The kth margin bound recovers Breiman's minimal margin bound by choosing θ as the minimal margin (making Pr_{(x,y)∼S}[yf(x) < θ] = 0), and it is always at most the same as the bound (1) by Schapire et al.
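To get a feel for the relative magnitudes of the three error terms above, one can evaluate them (with all constants dropped) for illustrative parameter values. The numbers below are our own choices for illustration only, not values from the paper.

```python
import math

def schapire_term(theta, m, H):
    """sqrt(ln|H| ln m / (theta^2 m)) -- the additive term of bound (1)."""
    return math.sqrt(math.log(H) * math.log(m) / (theta**2 * m))

def breiman_term(theta_min, m, H):
    """ln|H| ln m / (theta_min^2 m) -- the right-hand side of bound (2)."""
    return math.log(H) * math.log(m) / (theta_min**2 * m)

def kth_margin_term(theta, m, H, tau):
    """First-order term plus cross term of the kth margin bound,
    with tau standing in for Pr_S[yf(x) < theta]."""
    r = math.log(H) * math.log(m) / (theta**2 * m)
    return r + math.sqrt(tau * r)

theta, m, H, tau = 0.1, 100_000, 1_000, 0.05
s = schapire_term(theta, m, H)
b = breiman_term(theta, m, H)
k = kth_margin_term(theta, m, H, tau)
```

For these values the kth margin term is smaller than the term of Schapire et al., and with tau = 0 it coincides with Breiman's bound (taking theta as the minimal margin), matching the interpolation described in the text.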
As with previous generalization bounds, it suggests that boosting algorithms should focus on obtaining a large margin on as large a fraction of training points as possible.

Despite the decades of progress on generalization upper bounds, we still do not know how tight these bounds are. That is, we do not have any margin-based generalization lower bounds. Generalization lower bounds are not only interesting from a theoretical point of view, but also from an algorithmic point of view: if one has a provably tight generalization bound, then a natural goal is to design a boosting algorithm minimizing a loss function that is equal to this generalization bound. This approach makes most sense with a matching lower bound, as the algorithm might otherwise minimize a sub-optimal loss function. Furthermore, a lower bound may also inspire researchers to look for parameters other than margins when explaining the generalization performance of voting classifiers. Such new parameters may even prove useful in designing new algorithms with even better generalization performance in practice.

1.1 Our Results

In this paper we prove the first margin-based generalization lower bounds for voting classifiers. Our lower bounds almost match the kth margin bound and thus essentially settle the generalization performance of voting classifiers in terms of margins.

To present our main theorems, we first introduce some notation. For a ground set X and hypothesis set H, let C(H) denote the family of all voting classifiers over H, i.e. C(H) contains all functions f : X → [−1, 1] that can be written as f(x) = Σ_{h∈H} α_h h(x) such that α_h ≥ 0 for all h and Σ_h α_h = 1. For a (randomized) learning algorithm A and a sample S of m points, let f_{A,S} denote the (possibly random) voting classifier produced by A when given the sample S as input.
With this notation, our first main theorem is the following:

Theorem 1. For every large enough integer N, every θ ∈ (1/N, 1/40) and every τ ∈ [0, 49/100], there exist a set X and a hypothesis set H over X, such that ln|H| = Θ(ln N), and for every m = Ω(θ⁻² ln|H|) and every (randomized) learning algorithm A, there exist a distribution D over X × {−1, 1} and a voting classifier f ∈ C(H) such that with probability at least 1/100 over the choice of samples S ∼ D^m and the random choices of A:

1. Pr_{(x,y)∼S}[yf(x) < θ] ≤ τ; and

2. Pr_{(x,y)∼D}[yf_{A,S}(x) < 0] ≥ τ + Ω( ln|H|/(mθ²) + √(τ · ln|H|/(mθ²)) ).

Theorem 1 states that for any algorithm A, there is a distribution D for which the out-of-sample error of the voting classifier produced by A is at least that in the second point of the theorem. At the same time, one can find a voting classifier f obtaining a margin of at least θ on at least a 1 − τ fraction of the sample points. Our proof of Theorem 1 not only shows that such a classifier exists, but also provides an algorithm that constructs such a classifier. Loosely speaking, the first part of the theorem reflects on the nature of the distribution D and the hypothesis set H. Intuitively, it means that the distribution is not too hard and the hypothesis set is rich enough, so that it is possible to construct a voting classifier with good empirical margins. Clearly, we cannot hope to prove that the algorithm A itself constructs a voting classifier that has a margin of at least θ on a 1 − τ fraction of the sample set, since we make no assumptions on the algorithm. For example, if the constant hypothesis h1 that always outputs 1 is in H, then A could be the algorithm that simply outputs h1.
The interpretation is thus: D and H allow for an algorithm A to produce a voting classifier f with margin at least θ on a 1 − τ fraction of samples. The second part of the theorem then guarantees that regardless of which voting classifier A produces, it still has large out-of-sample error. This implies that every algorithm that constructs a voting classifier by minimizing the empirical risk must have a large error. Formally, Theorem 1 implies that if Pr_{(x,y)∼S}[yf_{A,S}(x) < θ] ≤ τ, then

    Pr_{(x,y)∼D}[yf_{A,S}(x) < 0] ≥ Pr_{(x,y)∼S}[yf_{A,S}(x) < θ] + Ω( ln|H|/(mθ²) + √(τ · ln|H|/(mθ²)) ).

The first part of the theorem ensures that the condition is not void. That is, there exists an algorithm A for which Pr_{(x,y)∼S}[yf_{A,S}(x) < θ] ≤ τ.

Comparing Theorem 1 to the kth margin bound, we see that the parameter τ corresponds to Pr_{(x,y)∼S}[yf(x) < θ]. The magnitude of the out-of-sample error in the second point of the theorem thus matches that of the kth margin bound, except for a factor ln m in the first term inside the Ω(·) and a √(ln m) factor in the second term. If we consider the range of parameters θ, τ, ln|H| and m for which the lower bound applies, then these ranges are almost as tight as possible. For τ, note that the theorem cannot generally be true for τ > 1/2, as the algorithm A that outputs a uniform random choice of hypothesis among h1 and h−1 (the constant hypothesis outputting −1) gives a (random) voting classifier f_{A,S} with an expected out-of-sample error of 1/2. This is less than what the second point of the theorem would state if it were true for τ > 1/2. For ln|H|, observe that our theorem holds for arbitrarily large values of |H|.
That is, the integer N can be as large as desired, making ln|H| = Θ(ln N) as large as desired. Finally, for the constraint on m, notice again that the theorem simply cannot be true for smaller values of m, as then the term ln|H|/(mθ²) exceeds 1.

Our second main result gets even closer to the kth margin bound:

Theorem 2. For every large enough integer N, every θ ∈ (1/N, 1/40), τ ∈ [0, 49/100] and every m = (θ⁻² ln N)^{1+Ω(1)}, there exist a set X, a hypothesis set H over X and a distribution D over X × {−1, 1} such that ln|H| = Θ(ln N), and with probability at least 1/100 over the choice of samples S ∼ D^m there exists a voting classifier f_S ∈ C(H) such that

1. Pr_{(x,y)∼S}[yf_S(x) < θ] ≤ τ; and

2. Pr_{(x,y)∼D}[yf_S(x) < 0] ≥ τ + Ω( ln|H| ln m/(mθ²) + √(τ · ln|H|/(mθ²)) ).

Observe that the second point of Theorem 2 has an additional ln m factor on the first term inside the Ω(·) compared to Theorem 1. It is thus only off from the kth margin bound by a √(ln m) factor in the second term, and hence completely matches the kth margin bound for small values of τ. To obtain this strengthening, we gave up the guarantee in Theorem 1 that all algorithms A have such a large out-of-sample error. Instead, Theorem 2 demonstrates only the existence of a voting classifier f_S (chosen as a function of the sample S) that simultaneously achieves a margin of at least θ on a 1 − τ fraction of the sample points, and yet has out-of-sample error at least that in point 2. Since the kth margin bound holds with high probability for all voting classifiers, Theorem 2 rules out any strengthening of the kth margin bound, except possibly for a √(ln m) factor on the second additive term.
Again, our lower bound holds for almost the full range of parameters of interest. As for the bound on m, our proof assumes m = (θ⁻² ln N)^{1+1/8}; however, the theorem holds for any constant greater than 1 in the exponent.

Finally, we mention that both our lower bounds are proved for a finite hypothesis set H. This only makes the lower bounds stronger than if we proved them for an infinite H with bounded VC-dimension, since the VC-dimension of a finite H is at most lg|H|.

1.2 Related Work

We mentioned above that the kth margin bound is the strongest margin-based generalization bound to date. Technically speaking, it is incomparable to the so-called emargin bound by Wang et al. [WSJ+11]. The kth margin bound by Gao and Zhou [GZ13], the minimum margin bound by Breiman [Bre99] and the bound by Schapire et al. [SFBL98] all have the form Pr_{(x,y)∼D}[yf(x) < 0] ≤ Pr_{(x,y)∼S}[yf(x) < θ] + Γ(θ, m, |H|, Pr_{(x,y)∼S}[yf(x) < θ]) for some function Γ. The emargin bound has a different (and quite involved) form, making it harder to interpret and compute. We will not discuss it in further detail here and just remark that our results show that for generalization bounds of the form studied in most previous work [SFBL98, Bre99, GZ13], one cannot hope for much stronger upper bounds than the kth margin bound.

2 Proof Overview

The main argument at the heart of both proofs is a probabilistic-method argument. With every labeling ℓ ∈ {−1, 1}^u we associate a distribution D_ℓ over X × {−1, 1}. We then show that with some positive probability, if we sample ℓ ∈ {−1, 1}^u, D_ℓ satisfies the requirements of Theorem 1 (respectively Theorem 2).
We thus conclude the existence of a suitable distribution. We next give a more detailed high-level description of the proof of Theorem 1. The proof of Theorem 2 follows similar lines.

Constructing a Family of Distributions. We start by first describing the construction of D_ℓ for ℓ ∈ {−1, 1}^u. Our construction combines previously studied distribution patterns in a subtle manner. Ehrenfeucht et al. [EHKV89] observed that if a distribution D assigns each point in X a fixed (yet unknown) label, then, loosely speaking, every classifier f that is constructed using only information supplied by a sample S cannot do better than randomly guessing the labels for the points in X \ S. Intuitively, consider a uniform distribution D_ℓ over X. If we assume, for example, that |X| ≥ 10m, then with very high probability over a sample S of m points, many elements of X are not in S. Moreover, assume that D_ℓ associates every x ∈ X with a unique "correct" label ℓ(x). Consider some (perhaps random) learning algorithm A, and let f_{A,S} be the classifier it produces given a sample S as input. If ℓ is chosen randomly, then, loosely speaking, for every point x not in the sample, f_{A,S}(x) and ℓ(x) are independent, and thus A returns the wrong label with probability 1/2. In turn, this implies that there exists a labeling ℓ such that A is wrong on a constant fraction of X when receiving a sample S ∼ D_ℓ^m. While the argument above can in fact be used to prove an arbitrarily large generalization error, it requires |X| to be large, and specifically to increase with m. This conflicts with the first point in Theorem 1; that is, we have to argue that a voting classifier f with good margins exists for the sample S.
If S consists of m distinct points, and each point in X can have an arbitrary label, then intuitively H needs to be very large to ensure the existence of f. In order to overcome this difficulty, we set D_ℓ to assign very high probability to one designated point in X, and the rest of the probability mass is then equally distributed among all other points. The argument above still applies for the subset of small-probability points. More precisely, if D_ℓ assigns all but one point in X probability 1/(10m), then the expected generalization error (over the choice of ℓ) is still Ω(|X|/(10m)).

It remains to determine how large we can set |X|. In the notation of the theorem, in order for a hypothesis set H to satisfy ln|H| = Θ(ln N) and, at the same time, have an f ∈ C(H) obtaining margins of θ on most points in a sample, our proof (and specifically Lemma 3, described hereafter) requires X to be not significantly larger than ln N/θ², and therefore the generalization error we get is Ω(ln|H|/(θ²m)). This accounts for the first term inside the Ω-notation in the second point of Theorem 1.

Anthony and Bartlett [AB09, Chapter 5] additionally observed that for a distribution D that assigns each point in X a random label, if S does not sample a point x enough times, any classifier f that is constructed using only information supplied by S cannot determine with good probability the Bayes label of x, that is, the label of x that minimizes the error probability. Intuitively, consider once more a distribution D_ℓ that is uniform over X. However, instead of associating every point x ∈ X with one correct label ℓ(x), D_ℓ is now only slightly biased towards ℓ. That is, given that x is sampled, the label in the sample point is ℓ(x) with probability a little larger than 1/2, say (1 + α)/2 for some small α ∈ (0, 1). Note that every classifier f has an error probability of at least (1 − α)/2 on every given point in X.
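A small simulation illustrates the sample requirement behind this biased-label argument: with bias (1 + α)/2, a majority vote over the observed labels of a point recovers ℓ(x) reliably only once the point has been seen on the order of α⁻² times. The parameter values below are illustrative choices of ours.

```python
import random

def majority_correct_prob(alpha, n, trials=4000, seed=0):
    """Estimate the probability that the majority of n observed labels,
    each equal to the true label with probability (1 + alpha) / 2,
    recovers the true label."""
    rng = random.Random(seed)
    p = (1 + alpha) / 2
    hits = 0
    for _ in range(trials):
        agree = sum(rng.random() < p for _ in range(n))
        hits += agree > n / 2      # strict majority recovers the label
    return hits / trials

alpha = 0.1
few  = majority_correct_prob(alpha, n=5)      # far fewer than alpha**-2 = 100 samples
many = majority_correct_prob(alpha, n=1000)   # far more than alpha**-2 samples
```

With only a handful of samples the estimate stays close to coin-flipping, while well beyond α⁻² samples the Bayes label is recovered almost surely, which is exactly the dichotomy the proof exploits.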
Consider once again a learning algorithm A and the voting classifier f_{A,S} it constructs. Loosely speaking, if S does not sample a point x enough times, then with good probability f_{A,S}(x) ≠ ℓ(x). More formally, in order to correctly assign the Bayes label of x, an algorithm must see Ω(α⁻²) samples of x. Therefore, if we set the bias α to be √(|X|/(10m)), then with high probability the algorithm does not see a constant fraction of X enough times to correctly assign their label. In turn, this implies an expected generalization error of (1 − α)/2 + Ω(√(|X|/m)), where the expectation is over the choice of ℓ. By once again letting |X| = ln N/θ², we conclude that there exists a labeling ℓ such that for S ∼ D_ℓ^m, the expected generalization error of f_{A,S} is (1 − α)/2 + Ω(√(ln|H|/(θ²m))). This expression is almost the second term inside the Ω-notation in the theorem statement, though slightly larger. We note, however, that for large values of m, the in-sample error is arbitrarily close to 1/2. One challenge is therefore to reduce the in-sample error, and moreover to guarantee that we can find a voting classifier f where the (mτ)th smallest margin of f is at least θ, where τ, θ are the parameters provided by the theorem statement.

To this end, our proof subtly weaves together the two ideas described above and constructs a family of distributions {D_ℓ}_{ℓ∈{−1,1}^u}. Informally, we partition X into two disjoint sets, and conditioned on the sample point x ∈ X belonging to each of the subsets, D_ℓ is defined to follow one of the two distribution patterns described above. The main difficulty lies in delicately balancing all ingredients and ensuring that we can find an f with margins of at least θ on all but τm of the sample points, while still enforcing a large generalization error. Our proof refines the proofs given by Ehrenfeucht et al. and by Anthony and Bartlett, and shows not only that there exists a labeling ℓ such that f_{A,S} has a large generalization error with respect to D_ℓ (with probability at least 1/100 over the randomness of A and S), but rather that a large (constant) fraction of labelings ℓ share this property. This distinction becomes crucial in the proof.

Small yet Rich Hypothesis Sets. The technical crux of our proofs is the construction of an appropriate hypothesis set. Loosely speaking, the size of H has to be small and, most importantly, independent of the size m of the sample set. On the other hand, the set of voting classifiers C(H) is required to be rich enough to, intuitively, contain a classifier that with good probability has good in-sample margins for a sample S ∼ D_ℓ^m, for a large fraction of labelings ℓ ∈ {−1, 1}^u. Our main technical lemma presents a distribution μ over small hypothesis sets H ⊂ X → {−1, 1} such that for every sparse ℓ ∈ {−1, 1}^u, that is, with ℓ_i = −1 for a small number of entries i ∈ [u], with high probability over H ∼ μ there exists some voting classifier f ∈ C(H) that has minimum margin θ with respect to ℓ over the entire set X. In fact, the size of the hypothesis set does not depend on the size of X, but only on the sparsity parameter d. More formally, we show the following.

Lemma 3.
For every θ ∈ (0, 1/40), δ ∈ (0, 1) and integers d ≤ u, there exists a distribution μ = μ(u, d, θ, δ) over hypothesis sets H ⊂ X → {−1, 1}, where X is a set of size u, such that the following holds for N = Θ(θ⁻² ln d · ln(θ⁻²dδ⁻¹) · e^{Θ(θ²d)}):

1. For all H ∈ supp(μ), we have |H| = N; and

2. For every labeling ℓ ∈ {−1, +1}^u, if no more than d points x ∈ X satisfy ℓ(x) = −1, then

    Pr_{H∼μ}[∃f ∈ C(H) : ∀x ∈ X. ℓ(x)f(x) ≥ θ] ≥ 1 − δ.

In fact, we prove that if H is a random hypothesis set that also contains the hypothesis mapping all points to 1, then with good probability H satisfies the second requirement of the lemma.

To show the existence of a good voting classifier in C(H), our proof actually employs a slight variant of the celebrated AdaBoost algorithm, and shows that with high probability (over the choice of the random hypothesis set H) the voting classifier constructed by this algorithm attains minimum margin at least θ over the entire set X.

Existential Lower Bound. The difference between the generalization lower bounds (second points) of Theorems 1 and 2 is a ln m factor in the first term inside the Ω(·) notation. This term originates from having ln|H|/θ² points with a probability mass of 1/(10m) in D_ℓ and one point having the remaining probability mass. In the proof of Theorem 2, we first exploit that we are proving an existential lower bound by assigning all points the same label 1. Since we are not proving a lower bound for every algorithm, this causes no problems. We then change |X| to about m/ln m and assign each point the same probability mass ln m/m in the distribution D.
The key observation is that on a random sample S of m points, by a coupon-collector argument, there will still be m^{Ω(1)} points from X that were not sampled. From Lemma 3, we can now find a voting classifier f such that sign(f(x)) is 1 on all points x ∈ S, and −1 on a set of d = ln|H|/θ² points in X \ S. This means that f has out-of-sample error Ω(d ln m/m) = Ω(ln|H| ln m/(θ²m)) under distribution D and obtains a margin of θ on all points in the sample S.

3 Proof of Algorithmic Lower Bound

In this section we prove Theorem 1 assuming Lemma 3. The proof of Lemma 3, as well as the proof of Theorem 2, are deferred to the full version of the paper [GKL+19]. To prove Theorem 1, fix some integer N, and fix θ ∈ (1/N, 1/40). In order to ensure that the hypothesis set constructed using Lemma 3 is small enough, and specifically has size N^{O(1)}, we need the sparsity parameter to be not much larger than ln N/θ². As described in Section 2, the family of distributions we present will be defined separately over two subsets of X. To this end, and for ease of notation, we let the size u of X be 2 ln N/θ², and let d = ln N/θ² = u/2 be the size of each half. Finally, denote X = {ξ_1, ..., ξ_u}.

We start by constructing the family {D_ℓ}_{ℓ∈{−1,1}^u} of distributions over X × {−1, 1}. Fixing a labeling ℓ ∈ {−1, 1}^u, we define D_ℓ separately for the first u/2 points and the last u/2 points of X. Intuitively, every point in {ξ_i}_{i∈[u/2]} has a fixed label determined by ℓ; however, all points but one have a very small probability of being sampled according to D_ℓ.
Every point in {\u03bei}i\u2208[u/2+1,u],\non the other hand, has an equal probability of being sampled, however its label is not \ufb01xed by (cid:96),\nbut instead slightly biased towards (cid:96). Formally, let \u03b1, \u03b2, \u03b5 \u2208 [0, 1] be constants to be \ufb01xed later.\nWe construct D(cid:96) using the ideas described earlier in Section 2, by sewing them together over two\nparts of the set X . We assign probability 1 \u2212 \u03b2 to {\u03bei}i\u2208[u/2] and \u03b2 to {\u03bei}i\u2208[u/2+1,u]. That is, for\n(x, y) \u223c D(cid:96), the probability that x \u2208 {\u03bei}i\u2208[u/2] is 1 \u2212 \u03b2. Next, conditioned on x \u2208 {\u03bei}i\u2208[u/2],\n(\u03be1, (cid:96)1) is assigned high probability (1 \u2212 \u03b5) and the rest of the measure is distributed uniformly over\n{(\u03bei, (cid:96)i)}i\u2208[2,u/2]. That is\n\n[(\u03be1, (cid:96)1)] = (1 \u2212 \u03b2)(1 \u2212 \u03b5) , and \u2200j \u2208 [2, u/2]. PrD(cid:96)\nPrD(cid:96)\n\n[(\u03bej, (cid:96)j)] =\n\nFinally, conditioned on x \u2208 {\u03bei}i\u2208[u/2+1,u], x distributes uniformly over {\u03bei}i\u2208[u/2+1,u], and\nconditioned on x = \u03bei, we have y = (cid:96)i with probability 1+\u03b1\n\n2 . That is\n\n\u2200j \u2208 [u/2 + 1, u]. PrD(cid:96)\n\n[(\u03bej, (cid:96)j)] =\n\n(1 + \u03b1)\u03b2\n\n2d\n\n[(\u03bej,\u2212(cid:96)j)] =\n\n, and PrD(cid:96)\n\n(1 \u2212 \u03b2)\u03b5\nu/2 \u2212 1\n\n.\n\n(1 \u2212 \u03b1)\u03b2\n\n.\n\n2d\n\n(cid:88)\n\n\u03a81((cid:96), f ) =\n\n(1 \u2212 \u03b5)\u03b2\nu/2 \u2212 1\n\n(cid:88)\n\ni\u2208[2,u/2]\n\nIn order to give a lower bound on the out-of-sample error for an arbitrary voting classi\ufb01er f, we\nde\ufb01ne a new random variable that is dominated by Pr(x,y)\u223cD(cid:96)[yf (x) < 0], and give a lower bound\non that random variable. 
To this end, fix some $\ell \in \{-1,1\}^u$ and $f : \mathcal{X} \to \mathbb{R}$, and denote
$$\Psi_1(\ell, f) = \sum_{i \in [2,u/2]} \frac{(1-\beta)\varepsilon}{u/2-1}\, \mathbb{1}_{\ell_i f(\xi_i) < 0}\ ; \qquad \Psi_2(\ell, f) = \sum_{i \in [u/2+1,u]} \frac{\alpha\beta}{d}\, \mathbb{1}_{\ell_i f(\xi_i) < 0}\,. \tag{3}$$
When $f, \ell$ are clear from the context we shall simply write $\Psi_1, \Psi_2$. In this notation, we show the following.

Claim 4. $\Pr_{(x,y)\sim\mathcal{D}_\ell}[yf(x) < 0] \ge \frac{\beta(1-\alpha)}{2} + \Psi_1 + \Psi_2$.

While the proof of the claim is deferred to the full version of the paper [GKL+19], we explain why we focus on $\Psi_1 + \Psi_2$ rather than bounding the out-of-sample error directly. The reason lies in the fact that we need a lower bound that holds with constant probability over the choice of $\ell$ and $S$ (and in the case of Theorem 1 also the random choices made by the algorithm), and not only in expectation. While lower bounding $\mathbb{E}\left[\Pr_{(x,y)\sim\mathcal{D}_\ell}[yf(x) < 0]\right]$ is clearly not harder than lower bounding $\mathbb{E}[\Psi_1 + \Psi_2]$, showing that a lower bound holds with some constant probability is slightly more delicate. Our proof uses the fact that with probability $1$, $\Psi_1 + \Psi_2$ is at most a constant factor larger than its expectation, and therefore we can use Markov's inequality to lower bound $\Psi_1 + \Psi_2$ with constant probability.

We next show that there exists a small enough (with respect to $N$) hypothesis set $\hat{\mathcal{H}}$ that is rich enough. That is, with high probability over $\ell \in \{-1,1\}^u$, there exists a weighted average $f \in C(\hat{\mathcal{H}})$ that attains margin at least $\theta$ over the entire set $\mathcal{X}$. The following claim follows from Lemma 3 and Yao's minimax principle. Its proof is deferred to the full version of the paper [GKL+19].

Claim 5. There exists a hypothesis set $\hat{\mathcal{H}}$ such that $\ln|\hat{\mathcal{H}}| = \Theta(\ln N)$ and
$$\Pr_{\ell \in_R \{-1,1\}^u}\left[\exists f \in C(\hat{\mathcal{H}}) : \forall i \in [u].\ \ell_i f(\xi_i) \ge \theta\right] \ge 19/20\,.$$

We next show that there exist some distribution $\mathcal{D} \in \{\mathcal{D}_\ell\}_{\ell\in\{-1,1\}^u}$ and some classifier $\hat{f} \in C(\hat{\mathcal{H}})$ such that for every algorithm $A$, with constant probability over $S \sim \mathcal{D}^m$, $\hat{f}$ has large margins on the points in $S$, yet $f_{A,S}$ has large out-of-sample error. To this end, fix $A$ to be a (perhaps randomized) learning algorithm. For every $m$-point sample $S$, recall that $f_{A,S}$ denotes the (random) classifier returned by $A$ when run on the sample $S$.

The main challenge is to show that there exists a labeling $\hat{\ell} \in \{-1,1\}^u$ such that $C(\hat{\mathcal{H}})$ contains a good voting classifier for $\hat{\ell}$ and, in addition, $f_{A,S}$ has large out-of-sample error with respect to $\mathcal{D}_{\hat{\ell}}$. We will show that if $\alpha$ is small enough, then indeed such a labeling exists. Formally, we show the following.

Lemma 6. If $\alpha \le \sqrt{\frac{u}{40\beta m}}$, then there exists $\hat{\ell} \in \{-1,1\}^u$ such that
1. there exists $\hat{f} = \hat{f}_{\hat{\ell}} \in C(\hat{\mathcal{H}})$ such that for every $i \in [u]$, $\hat{\ell}_i \hat{f}(\xi_i) \ge \theta$; and
2. with probability at least $1/25$ over $S \sim \mathcal{D}^m_{\hat{\ell}}$ and the randomness of $A$ we have
$$\Psi_1(\hat{\ell}, f_{A,S}) + \Psi_2(\hat{\ell}, f_{A,S}) \ge \frac{(1-\beta)\varepsilon}{24} + \frac{\alpha\beta}{24}\,.$$

The proof of the lemma is quite involved technically, and is therefore also deferred to the full version of the paper [GKL+19]. We next show that the lemma implies Theorem 1.

Proof of Theorem 1. Fix some $\tau \in [0, 49/100]$. Assume first that $\tau \le \frac{u}{300m}$, and let $\varepsilon = \frac{u}{10m}$ and $\beta = \alpha = 0$.
Let $\hat{\ell}, \hat{f}$ be as in Lemma 6. Then for every sample $S \sim \mathcal{D}^m_{\hat{\ell}}$ we have $\Pr_{(x,y)\sim S}[y\hat{f}(x) < \theta] = 0 \le \tau$, and moreover, with probability at least $1/25$ over $S$ and the randomness of $A$,
$$\Pr_{(x,y)\sim\mathcal{D}_{\hat{\ell}}}\left[y f_{A,S}(x) < 0\right] \ge \frac{(1-\beta)\varepsilon}{24} \ge \tau + \Omega\left(\frac{u}{m}\right) = \tau + \Omega\left(\frac{\ln|\hat{\mathcal{H}}|}{m\theta^2} + \sqrt{\frac{\tau\ln|\hat{\mathcal{H}}|}{m\theta^2}}\right)\,,$$
where the first inequality follows from Claim 4 and the second point of Lemma 6, and the last transition is due to the fact that $u = 2\theta^{-2}\ln N = \Theta(\theta^{-2}\ln|\hat{\mathcal{H}}|)$ and $\tau = O(u/m)$.

Otherwise, assume $\tau > \frac{u}{300m}$, and let $\varepsilon = \frac{u}{10m}$, $\alpha = \sqrt{\frac{u}{2560\tau m}}$ and $\beta = \frac{64\tau}{32 - 31\alpha}$. Since $\tau \ge \frac{u}{300m}$, we have $\alpha \in [0,1]$. Moreover, if $m > Cu$ for a large enough but universal constant $C > 0$, then $32 - 31\alpha \ge 64 \cdot \frac{49}{100} \ge 64\tau$, and hence $\beta \in [0,1]$. Moreover, since $\alpha \le 1$, we get $\beta \le 64\tau$, and therefore $\alpha = \sqrt{\frac{u}{2560\tau m}} \le \sqrt{\frac{u}{40\beta m}}$. Let therefore $\hat{\ell}, \hat{f}$ be a labeling and a classifier in $C(\hat{\mathcal{H}})$ whose existence is guaranteed by Lemma 6. Let $\langle(x_1,y_1), \ldots, (x_m,y_m)\rangle \sim \mathcal{D}^m_{\hat{\ell}}$ be a sample of $m$ points drawn independently according to $\mathcal{D}_{\hat{\ell}}$. For every $j \in [m]$, we have $\mathbb{E}\left[\mathbb{1}_{y_j\hat{f}(x_j) < \theta}\right] = \frac{(1-\alpha)\beta}{2}$. Therefore, by a Chernoff bound, we get that for large enough $N$,
$$\Pr_{S\sim\mathcal{D}^m_{\hat{\ell}}}\left[\Pr_{(x,y)\sim S}\left[y\hat{f}(x) < \theta\right] \ge \tau\right] = \Pr_{S\sim\mathcal{D}^m_{\hat{\ell}}}\left[\frac{1}{m}\sum_{j\in[m]} \mathbb{1}_{y_j\hat{f}(x_j) < \theta} \ge \frac{(1 - 31\alpha/32)\beta}{2}\right] \le e^{-\Theta(\alpha^2\beta m)} \le e^{-\Theta(u)} \le 10^{-3}\,,$$
where the second-to-last inequality is due to the fact that $\alpha^2\beta m = \frac{u\beta}{2560\tau} = \Omega(u)$, since $\beta \ge 2\tau$. Moreover, from Claim 4 and the second point of Lemma 6 we get that with probability at least $1/25$ over $S$ and $A$ we have
$$\Pr_{(x,y)\sim\mathcal{D}_{\hat{\ell}}}\left[y f_{A,S}(x) < 0\right] \ge \frac{(1-\alpha)\beta}{2} + \frac{\alpha\beta}{32} = \frac{(1 - 31\alpha/32)\beta}{2} + \frac{\alpha\beta}{64} = \tau + \Omega\left(\sqrt{\frac{\tau u}{m}}\right) = \tau + \Omega\left(\frac{\ln|\hat{\mathcal{H}}|}{m\theta^2} + \sqrt{\frac{\tau\ln|\hat{\mathcal{H}}|}{m\theta^2}}\right)\,,$$
where the last transition is due to the fact that $\tau = \Omega(u/m)$.

4 Existence of a Small Hypothesis Set

This section is devoted to the proof of Lemma 3. That is, we present a distribution $\mu$ over fixed-size hypothesis sets and show that for every fixed labeling $\ell$ with not too many negative labels, with high probability over $H \sim \mu$, $C(H)$ contains a voting classifier $f$ that attains good margins with respect to $\ell$. In fact, our proof not only shows the existence of such a voting classifier, but also presents a procedure for constructing one.
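The construction procedure, formalized as Algorithm 1 below, can be previewed as a short Python sketch. This is an illustrative sketch only: hypotheses are modeled as $\pm 1$ prediction vectors over the whole of $\mathcal{X}$, and all function and variable names are ours.

```python
import math

def construct_voting_classifier(batches, y, gamma):
    """Sketch of the batch-boosting construction: batches[j] is the list of
    hypotheses H_j, each a length-u tuple of +-1 predictions; y is the target
    labeling ell. Returns the averaged classifier f = (1/k) sum_j h_j, or
    None ("fail") if some batch has no (1/2 - gamma)-good hypothesis."""
    u = len(y)
    alpha = 0.5 * math.log((1 + 2 * gamma) / (1 - 2 * gamma))
    D = [1.0 / u] * u                        # D_1 is the uniform distribution
    f = [0.0] * u
    for batch in batches:
        # Find h_j in H_j with weighted error at most 1/2 - gamma.
        h = next((h for h in batch
                  if sum(D[i] for i in range(u) if h[i] != y[i]) <= 0.5 - gamma),
                 None)
        if h is None:
            return None                      # no good hypothesis: fail
        f = [fi + hi for fi, hi in zip(f, h)]    # f_j = f_{j-1} + h_j
        # Exponential reweighting, normalized by Z_j.
        D = [Di * math.exp(-alpha * yi * hi) for Di, yi, hi in zip(D, y, h)]
        Z = sum(D)
        D = [Di / Z for Di in D]
    k = len(batches)
    return [fi / k for fi in f]              # return (1/k) f_k
```

Per Claim 7 below, with probability at least $1-\delta$ over the random batches every iteration finds a suitable hypothesis, and the returned average attains margin at least $\theta$ on every point.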
The presented algorithm is an adaptation of the AdaBoost algorithm.

More formally, fix some $\theta \in (0, 1/40)$, $\delta \in (0,1)$ and an integer $d \le u$. Let $\gamma = 4\theta \in (0, 1/10)$ and let $N = 2\gamma^{-2}\ln d \cdot \ln\frac{\gamma^{-2}\ln d}{\delta} \cdot e^{O(\theta^2 d)}$. We define the distribution $\mu$ via the following procedure, which samples a hypothesis set $H \sim \mu$. Let $\hat{h}$ be defined by $\hat{h}(x) = 1$ for all $x \in \mathcal{X}$. Sample independently and uniformly at random $N$ hypotheses $h_1, \ldots, h_N$, and define $H := \{\hat{h}\} \cup \{h_j\}_{j\in[N]}$.

Clearly every $H \in \operatorname{supp}(\mu)$ satisfies $|H| = N + 1$. We therefore turn to prove the second property. To this end, let $k = \gamma^{-2}\ln d$. In order to show the existence of a voting classifier, we conceptually change the procedure defining $\mu$, and think of the random hypotheses as being sampled in $k$ equally sized "batches", each of size $N/k$, with $\hat{h}$ added to each of them. Denote the batches by $H_1, H_2, \ldots, H_k$.

We next consider the following procedure to construct a voting classifier $f \in C(H)$ given $H \sim \mu$. We will use the main ideas from the AdaBoost algorithm. Recall that AdaBoost creates a voting classifier from a sample $S = ((x_1, y_1), \ldots, (x_u, y_u))$ in iterations. Starting with $f_0 = 0$, in iteration $j$ it computes a new voting classifier $f_j = f_{j-1} + \alpha_j h_j$ for some hypothesis $h_j \in H$ and weight $\alpha_j$. The heart of the algorithm lies in choosing $h_j$. In each iteration, AdaBoost computes a distribution $D_j$ over $S$ and chooses a hypothesis $h_j$ minimizing the empirical error probability $\varepsilon_j = \Pr_{i\sim D_j}[h_j(x_i) \ne y_i]$ with respect to $D_j$, and then reweighs the sample points to construct $D_{j+1}$.
The weight it then assigns to $h_j$ is $\alpha_j = \frac{1}{2}\ln\frac{1-\varepsilon_j}{\varepsilon_j}$. The first distribution $D_1$ is the uniform distribution.

We alter the above slightly, assigning uniform weights to the hypotheses and setting $\alpha_j = \frac{1}{2}\ln\frac{1+2\gamma}{1-2\gamma}$ for all iterations $j$. The algorithm is formally described as Algorithm 1.

Algorithm 1: Construct a Voting Classifier
Input: $(H_1, \ldots, H_k) \sim \mu$
Output: $f \in C\left(\bigcup_{j\in[k]} H_j\right)$
1: let $\alpha = \frac{1}{2}\ln\frac{1+2\gamma}{1-2\gamma}$
2: let $f_0(x) = 0$ for all $x \in \mathcal{X}$
3: let $D_1(i) = \frac{1}{u}$ for all $i \in [u]$
4: for $j = 1$ to $k$ do
5:     Find a hypothesis $h_j \in H_j$ satisfying $\sum_{i\in[u]} D_j(i)\,\mathbb{1}_{y_i \ne h_j(x_i)} \le \frac{1}{2} - \gamma$.
6:     If there is no such hypothesis, return fail.
7:     $f_j \leftarrow f_{j-1} + h_j$.
8:     for every $i \in [u]$ let $D_{j+1}(i) = \frac{1}{Z_j} D_j(i)\exp(-\alpha y_i h_j(x_i))$, where $Z_j \leftarrow \sum_{i\in[u]} D_j(i)\exp(-\alpha y_i h_j(x_i))$.
9: return $\frac{1}{k} f_k$.

First note that if $f$ is the classifier returned by the algorithm, then clearly $f = \frac{1}{k}\sum_{j\in[k]} h_j \in C(H)$ is a voting classifier. The following claim implies Lemma 3. Its proof is quite technical, and is deferred to the full version of the paper [GKL+19].

Claim 7. With probability at least $1 - \delta$, Algorithm 1 does not fail, and moreover, in that case, for every $i \in [u]$, $y_i f(x_i) \ge \theta$.

5 Conclusions

In this work, we showed almost tight margin-based generalization lower bounds for voting classifiers. These new bounds essentially complete the theory of generalization for voting classifiers based on margins alone. Closing the remaining gap between the upper and lower bounds is an intriguing open problem, and we hope our techniques might inspire further improvements. Our results come in the form of two theorems: one showing generalization lower bounds for any algorithm producing a voting classifier, and a slightly stronger lower bound showing the existence of a voting classifier with poor generalization.
This raises the important question of whether specific boosting algorithms can produce voting classifiers that avoid the $\lg m$ factor in the second lower bound via a careful analysis tailored to the algorithm. As a final important direction for future work, we suggest investigating whether natural parameters other than margins may be used to better explain the practical generalization error of voting classifiers. At the very least, we now have an almost tight understanding if no further parameters are taken into consideration.

Acknowledgments

This work was supported by a Villum Young Investigator Grant and an AUFF Starting Grant. Jelani Nelson is supported by NSF CAREER award CCF-1350670, NSF grant IIS-1447471, ONR grant N00014-18-1-2562, ONR DORECG award N00014-17-1-2127, an Alfred P. Sloan Research Fellowship, and a Google Faculty Research Award.

References

[AB09] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, New York, NY, USA, 1st edition, 2009.

[BDST00] K. P. Bennett, A. Demiriz, and J. Shawe-Taylor. A column generation algorithm for boosting. In ICML, pages 65–72, 2000.

[Bre99] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.

[CG16] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

[EHKV89] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning.
Information and Computation, 82(3):247–261, 1989.

[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[GKL+19] A. Grønlund, L. Kamma, K. G. Larsen, A. Mathiasen, and J. Nelson. Margin-based generalization lower bounds for boosted classifiers, 2019. arXiv:1909.12518.

[GLM19] A. Grønlund, K. G. Larsen, and A. Mathiasen. Optimal minimal margin maximization with boosting. In Proceedings of the 36th International Conference on Machine Learning, pages 4392–4401. PMLR, 2019.

[GS98] A. J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692–699, 1998.

[GZ13] W. Gao and Z.-H. Zhou. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1–18, 2013.

[KMF+17] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

[RW02] G. Rätsch and M. K. Warmuth. Maximizing the margin with boosting. In COLT, volume 2375, pages 334–350. Springer, 2002.

[RW05] G. Rätsch and M. K. Warmuth. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6(Dec):2131–2152, 2005.

[SFBL98] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.

[WSJ+11] L. Wang, M. Sugiyama, Z. Jing, C. Yang, Z.-H. Zhou, and J. Feng. A refined margin analysis for boosting algorithms via equilibrium margin.
Journal of Machine Learning Research, 12(Jun):1835–1863, 2011.