NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6431
Title:Margin-Based Generalization Lower Bounds for Boosted Classifiers

Reviewer 1

The paper is well written and even though notation might be heavy when reading the proofs, the authors try to give intuition behind their approach. I have minor questions about some of the presentation in the paper. On line 167 do the authors assume that the size of the population is |\mathcal{X}| = 10m? On line 176 shouldn’t the inequality be an equality, otherwise how is the above distribution over \mathcal{X} proper? While first reading the paper, I had the following confusion about the intuitive explanation between lines 179 and 181: we need the mass on the single point to be very large compared to m, however, in that case it is highly likely that the returned classifier will have small generalization error because it will classify the point with large mass correctly and then the probability to sample every other point upper bounds the generalization error. Maybe the authors can hint at what the size of \mathcal{X} is compared to m? From my understanding an important part of the lower bound proof is carefully balancing between the distribution which puts large mass on a single point and the distribution which slightly biases one of the labels compared to the other. It still a bit mysterious why exactly these two types of distributions are needed. It seems that the terms appearing in Eq. 3 are the ones governed by the two distributions. Elaborating more on why these terms show up and their relation to the two distributions might make the presentation better. Overall the contributions in terms of lower bounds are significant and almost match the state of the art known upper bounds. The proofs are non-trivial and to my knowledge are novel.

Reviewer 2

This paper provides a lower bound that matches existing margin-based upper bounds for voting classifiers. These existing upper bounds bound the true error of a voting classifier, as generated by boosting algorithms, by the sum of its empirical probability of obtaining a large margin, and a generalization term that depends on this margin. The provided lower bound nearly matches this upper bound, up to log factors and small powers. The paper presents two theorems. Theorem 2 is a lower bound that can be directly compared to the upper bound. Theorem 1, in contrast, replaces the empirical margin of the voting classifier returned by an algorithm (f_{A,S}) with the empirical margin of another voting classifier (f). This change of classifiers results in a lower bound that does not have a direct bearing to the upper bound, which is phrased in terms of a single voting classifier. The discussion of Theorem 1 (see line 113) doesn't seem to acknowledge this difference or explain why this result can still be considered a matching lower bound. Theorem 2 does provide a lower bound which is directly comparable to known upper bounds, and is tight up to log factors and small powers. This theorem only holds for sample sizes that are slightly larger than what would be required to avoid triviality, m = (ln N/\theta^2)^(1+\Omega(1)). Strangely, this limitation is completely ignored in the text, although other easier limitations are discussed. The authors should address this limitation explicitly in their description of the result. The notation \Omega(1) in the power of m should be replaced by the actual number: the value here is of importance, since large values would make the result much weaker. It seems, from the supplementary material, that the actual value might be reasonable, however it should be explicitly spelled out. Almost none of the proofs are provided in the body of the paper. Instead, long explanations describing the proof parts that are based on standard lower-bounding techniques are provided. This is unfortunate, since the supplementary material seems to include some interesting proofs that could have been easily incorporated in the body of the paper instead of long explanations of the standard parts of the proof. The result is a paper which provides almost no interesting technical contribution, although such contribution does seem to exist --- in the supplementary material. The most interesting result and proof seem to be Lemma 3, which shows the existence of a relatively small set of hypotheses that admit voting classifiers with a given minimum margin if only a small number of the labels are negatively labeled. The proof of this lemma uses an interesting construction technique based on AdaBoost. However, like all other proofs, it is not presented in the body of the paper. Considering the significance and novelty of the result, I find the provided lower bound mildly interesting, though not surprising. Had the proof of Lemma 3 been provided in the body of the paper, along with some of the more interesting parts of the rest of the proofs, this might have made a more interesting contribution. The current organization leaves most of the actual contribution out of the paper. In fact, very little that can be technically verified is provided in the body of the paper and almost all the derivations are in the supplementary material. - line 110 "it is always possible" -- this should only be possible with the D whose existence is proved in theorem 1. - Sometimes lg is used, and sometimes ln. Please be consistent. - line 407 in the supplementary: "lebeling"-->"labeling" - Reference SFB+ includes an "et al" which is out of place. - The mention that proofs are "deferred to the full version of this work" repeats in many places, where in fact the proofs are provided in the supplementary material.

Reviewer 3

The authors provide (almost tight) lower bounds for the generalization errors for boosting based classifiers, in terms of the empirical margin, thus showing that the algorithm due to GZ13 is almost optimal when the only problem parameter is the margin. This work, hence, encourages a search for other data-dependent empirical quantities for the problem exhibiting stronger generalization. It is good to understand the problem well, but I fear that I do not see a lot of impact of this result on the way people currently think about boosting. All their lower bounds, construct a distribution D (which may depends on m, H and A). This is the standard flavor of lower bounds for the generalization error. However, the lower bounds developed are not fully satisfying. Theorem 1 does not suggests that the returned classifier which has a large test error, has small \theta-margin error on the training data. Theorem 2 does not suggest that the presented f which satisfies both the properties is the one returned by any “interesting” algorithm or is even computable from data. An ideal lower bound should look like - “No algorithm can return a classifier f which has small error (margin) on the training set, but a large test error for all data-distributions D.” Originality: The authors combine old ideas to prove lower bounds under different settings in a very clever way to get the required lower bound. The idea of lower bounding the population risk by the sum term (claim 4 - supplementary doc), which is further easier to lower bound with high probability, is also quite interesting. However, the techniques used to prove the lower bounds in each of the pieces are not new. Writing: The paper is overall well written. Using the same variable d to represent both the sparsity parameter and the data-set split (in lines 274-285 in the supplementary) is confusing, and I would recommend the authors to fix it. Line 289 also contains a small typo in the definition of \psi_1. ** After Author Feedback: Thank you for clarifying my confusions regarding Theorem 2. I find it quite interesting now. I would like to keep the same scores as Theorem 1 is still quite unsatisfying and the author responses did not really help.