NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5764 Fast-rate PAC-Bayes Generalization Bounds via Shifted Rademacher Processes

### Reviewer 1

This paper is clearly original since it provides a new proof of a well-known result (Catoni's bound, recovered up to constants), a new definition of flatness (h-flatness) of the landscape where the posterior concentrates and a new PAC-Bayes bound associated to the notion of flatness. This paper is a nice theoretical contribution to the PAC-Bayesian theory of learning. The details of the proofs, that are technically sound (use of symmetrization for deviation inequalities, theory of shifted empirical process, duality formula for the kullback-Leibler divergence, peeling with respect the values of the KL divergence), are deferred to the Appendix and condensed arguments of proofs are written in the main part. However, the comparison of the new bound (Theorem 4.3) with Catoni's bound remains at rather heuristic level. Indeed, it is proved that the new bound is not worst than Catoni's bound, but it is only assumed and not clearly demonstate that the new bound can actually be stronger than Catoni's bound in favorable cases. I have also a question in mind about the scheme of proof exposed in Section 3. To bypass the suboptimality of the bounded difference inequality, the paper advocates to use control by Rademacher quantities directly in the deviation inequalities. But what about Talagrand's type concentration inequalities for the supremum of the empirical process, that would allow (classically through a peeling argument) to take into account the localization of the variance of the classes (KL neighborhoods). This is the classical way to produce fast rates from empirical process theory, so I wonder if some adaptation could be worked out here in the PAC-Bayes framework? This paper is essentially well written and well organized. The notations are carfully introduced and many references to existing, connected works are provided. However, I have three remarks concerning the presentation, that are rather minor. Constant c does not appear in display (11) (that refers to Lemma A.2 in the supplementary file), on contrary to what is anounced (l. 158: "as long as c>0"). This should be fixed. Also, l. 175-176, the sentence that begins with "For fixed $h>0$ [...]" is unclear to me and should be rewritten. Finally, I think that Display (25) could help the reader to understand the notion of flatness if it would be placed right after Remark 4.2. I think that the proof of Proposition 3.1, the new bound of Theorem 4.3 and the definition of flatness provided in Definition 4.1 are interesting and important theoretical contributions, successfully investigating the links between two major lines of research, Rademacher complexities and PAC-Bayes bounds.

### Reviewer 2

The paper investigates the connection between PAC-Bayes and Rademacher complexities, two framework in statistical learning theory to upper bound the generalization error of predictors. The paper is quite pleasant to read, and clarity is remarkable. Detailed comments: * The paper investigates the binary classification case, with the loss taking values in $\{0,1\}$. Is there hope to transpose the present results to more general losses? It is unclear whether the technique used by the authors depends on the fact that the loss takes only two values -- or is bounded. There are also a few papers on PAC-Bayes with unbounded losses, see e.g. Alquier and Guedj (2018), "Simpler PAC-Bayesian bounds for hostile data", Machine Learning and references therein (note: slow rates). * Eq. (9) is reminiscent of the Legendre transform of the Kullback-Leibler divergence which has been long known and used in the PAC-Bayes literature. See for example Lemma 1 in the recent survey "A primer on PAC-Bayesian learning" (Guedj, 2019), https://arxiv.org/abs/1901.05353 Is there a connection here? * Lines 182-186: I disagree with that comment. Even if $\frac{c}{m}\sum_{i=1}^m \mathbb{E}_Q[f(z_i)-(1+h)\mathbb{E}_Q f(z_i)]^2$ is negligible, the right-hand side in (14) is still larger than the corresponding term in Catoni's bound, due to numerical constants (in the present paper: $\frac{4}{C}(3\mathbf{KL}(Q||P) + \log(1/\delta)) + 5$ compared to $(1+t)(\mathbf{KL}(Q||P)+\log(1/\deta))$ in Catoni's bound, where $t$ is arbitrarily close to 0). * Related work: a few PAC-Bayes references seem to be missing (see the afomentioned survey Guedj, 2019 and references therein). In particular, Tolstikhin and Seldin (2013), "PAC-Bayesian-empirical-Bernstein inequality", NIPS (PAC-Bayes bound with a variance term) seems relevant as the authors prove a PAC-Bayes bound with a variance term. I am unsure why the authors of the present paper insist that their notion of "flatness" is "\emph{not} related to the variance of the randomized classifier". While it is technically true, the terms are still quite similar and one would expect the actual values to be close when computing the bounds (especially when $h$ is small). The paragraph on lines 249-262 could be enriched with additional comments and comparisons. In particular, see Theorem 4 in Tolstikhin and Seldin (2013). * References: please remove "et al." for [3] and [39]. === after rebuttal === I thank the authors for taking the time to address my remarks in their rebuttal. After reading other reviews, rebuttal and engaging in discussion, my score remains unchanged. Nevertheless, I must say I am disappointed by some parts of the rebuttal. Point (5): I don't agree with the rebuttal and I think discarding a comparison with Tolstikhin and Seldin's bound on the ground that their term may be large whereas the authors' is zero is wrong: I see no evidence for that claim. Last but not least, see Alquier and Guedj (2018), propositions 1 and 2: PAC-Bayes bounds *can* be easily specified to the ERM case, contrary to what is written in the rebuttal. I still have an overall positive opinion on the paper but I sure would like to see a more thorough and rigorous comparison with existing bounds, both theoretically and numerically.