NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3642
Title:Hypothesis Set Stability and Generalization

Reviewer 1

The article is clearly written presenting high-quality research with original results. I have two main criticisms: 1) Beyond the pattern A lot of the examples analyzed seem to follow a similar pattern: the algorithm that selects some data-dependent “base” hypothesis is uniformly stable, and then the predictor returned is some linear combination of these base hypothesis. Is that an essential construction which allows for the hypothesis set stability analysis? 2) Related work comparison to PAC Bayes The introduction discusses related work, including to data-dependent priors. I think the text here misrepresents slightly the nature of this work, first by lumping all these contributions together, and second, by suggesting that the PAC Bayes framework applies only to randomized classifiers. In both cases, the text muddies the relationship with Dziugaite and Roy, who also exploit differential privacy results to obtain bounds for data-dependent hypothesis classes: to see this, note that their bounds, when viewed from the Radmacher perspective of Kakade, Sridharan, and Tewari (2009), are working with (nonuniform bounds over) data-dependent hypothesis sets. In more detail, the authors cite a number of works, including Catoni (2007), Parrado-Hernandez et al. (2012), Lever et al. (2013) and Dziugaite and Roy (2018a and 2018b). In fact, almost all of these are examples of distribution-dependent but data-independent priors: Catoni, Parrado et al. and Lever specify distribution-dependent priors and then BOUND the KL term to derive PAC-Bayes bounds that are in terms of known computable quantities. In Catoni and Lever these bounds are distribution independent (but note that a data-dependent choice of the inverse temperature term can makes the bound data-dependent). Parrado et al give two bounds: The first “data-dependent” prior is not data dependent in the sense meant here: the empirical risk calculation EXCLUDES the same data used in the prior. Their second bound, however, uses data and concentration of measure (about the mean) to derive a data-dependent BOUND on the KL term. This allows them to use a distribution-dependent hypothesis class, which is related to the ideas here. In contrast, Dziugaite and Roy study DATA-dependent priors and then employ differential privacy techniques, like those used here, to derive valid statistical bounds based on PAC-Bayes bounds. Further, they then show how weaker-than-private mechanisms yield valid bounds. They give a particular application using a type of Wasserstein control. The DP underpinnings of their PAC Bayes bounds and those derived here deserves to be highlighted. More generous and specific references would likely be appreciated. Second, the idea that PAC-Bayes bounds apply only to randomized classifiers is somewhat naive and further misrepresents the extent to which data-dependent priors bear on the same problem studied here. “Derandomization” techniques used in PAC-Bayes are well known. The original application of PAC Bayes to the SVM solution (a single hypothesis) is the classical example of obtaining bounds on a single hypothesis using PAC Bayes bounds and margin. These ideas have been used recently by various authors, including Neyshabur et al 2018 (ICLR) and Nagarajan and Kolter (2018). Typos: Line 140: \bar H_m The word “sensitivity” seems to be reused multiple times in the examples. The authors might want to consider defining more officially. Is it any different from bounded differences? Line 236: The authors mention that their bound for bagging is “similar (but incomparable) to the one derived by Elisseeff et al. [2005]”. Could the authors expand a bit? Line 3 typo “results” *** Update after Author Response: Thank you for pointing out the Distillation Example. It still assumes that the predictors in the high complexity hypothesis class satisfy bounded differences (and the loss is Lipchitz, so it seems that one essentially gets stability again). Then the output hypothesis is selected from an L-inf ball around the complex data-dependent hypothesis. I don't think that this qualifies as a "breaking the pattern" example. However, assuming author's claim "standard uniform-stability argument would not necessarily apply here" is correct, this example is still interesting. I was also wondering whether the assumption made in line 294 (that f_S-f_S' is in H) is a standard one. In summary, I am not convinced that this theory is yet useful for a larger range of problems or is superior to some other data-dependent analysis. However, I think it is theoretically intriguing and further developments of it could become high-impact and thus should be accepted. I am hoping that the authors will implement promised changes (expanding on PAC-Bayes connections, comparing to Elisseeff et al. bounds, and being upfront about the limitations of current applications).

Reviewer 2

The theoretical results are novel and sound. The authors show several application problems for their theoretical framework. I encourage the authors to discuss a bit regarding impact/audience during the rebuttal phase. === AFTER REBUTTAL I am satisfied with the authors response.

Reviewer 3

This paper is clearly written and the results are interesting and mathematically sound. The unification of the complexity-based and stability-based analysis into learning with data-dependent hypothesis set seems a significant contribution. Under the assumption of data-dependent stability, the generalization ability was proved in Theorem 2. The bound in Theorem 2 was obtained by combining several theoretical techniques. Then, Theorem 2 applies to a wide range of learning algorithms. minor comments: - I guess that the invariance under the sample permutation is implicitly assumed for the correspondence from S to H_S, i.e., the permutation of the sample S does not affect H_S. The last equation on page 3 uses such a condition. Supplementary comment on the invariance might be helpful for readers. - line 117: The sentence "hat{R}_T(H_{S,T}) is the standard empirical Rademacher complexity" sounds slightly confusing, since H_{S,T} also depends on the sample T. In my understanding, the standard Rademacher complexity does not deal with such a hypothesis set. - The term "sensitive" was separately defined in Section 5.2 and 5.4 but not in Section 5.3. It may be good to show a general definition of the sensitiveness before Section 5.2. ----- after Author Response: Thanks for the response. All points in my comments are now clear.