NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6604 PAC-Bayes Un-Expected Bernstein Inequality

### Reviewer 1

== SUMMARY == PAC-Bayesian risk bounds are considered by many to be the tightest existing analysis of generalization error. Despite this, there is still a gap between empirical test errors and analytic bounds, and this gap motivates further refinement of PAC-Bayes analysis. PAC-Bayes bounds come in various "flavors," the most popular one being the "little kl" bound, which bounds the KL divergence between Bernoulli distributions with parameters equal to the empirical risk and true risk, respectively. This form is the tightest, but considered by some as not easy to interpret. McAllester showed (COLT, 2003) how such bounds can be converted into a more interpretable bound on the difference of the empirical and true risks. One attractive facet of his bound is that when the empirical risk is small, the bound achieves the "fast" rate of O(1/n) -- thus capturing both realizable and non-realizable learning problems. Later, Tolstikhin & Seldin (TS; NIPS, 2013) derived a bound that accounts for the variance of the empirical losses. Since the empirical variance is often smaller than the empirical mean, their bound could be tighter than McAllester's (or others based on McAllester's analysis of the kl). Unfortunately, with all due respect to TS, their bound lacks the interpretability of McAllester's bound. (Of course, interpretability is subjective; this is simply this reviewer's opinion.) The current manuscript provides a bound that is similar in form to McAllester's, but includes a variance-like term. This term, denoted V_n in the paper, is the expected mean-squared error between a hypothesis drawn from the posterior and a reference hypothesis -- which could either be deterministic, or drawn from an alternate posterior -- trained on a subset of the data. To be precise, there are two such terms, arising from a split of the data. (Thanks to this splitting, the bound lends itself nicely to data-dependent priors, where you learn a prior on a subset of the data and use the rest to learn the posterior.) Unlike McAllester's bound, this one also has an irreducible term, V'_n, that in the worst case is O(1 / \sqrt{n}). The analysis follows from a modification of Bernstein's moment inequality where the X^2 variable is taken outside of its expectation -- hence the name, "un-expected Bernstein inequality." == PROS == The bound does appear to be tighter than existing bounds. The paper demonstrates this experimentally, which is helpful. It's not always easy to compare tightness just by looking at equations, so it's nice to see the claim supported by experiments. The form of the bound is, in my opinion, a bit more digestible than TS's bound. Though the variance-like quantity is not as interpretable as the sample variance, it is nice that the bound can be stated in one expression (i.e., without cases), and the paper gives an optimized form of the bound (Eq 1 combined with the Corollary of Theorem 1) which makes it easy to compare to existing bounds. == CONS == The irreducible term is disappointing. McAllester's bound does not have this term, so one would think it would be possible to remove it and achieve the fast rate. Perhaps the worst case upper bound on V'_n is too pessimistic? The experiments indicate that V'_n is on the order of the empirical risk, which could be O(1/n) under reasonable conditions. (McAllester's bound achieves the fast rate in this case.) While the bound seems to me to be more interpretable than TS's bound, it is not as interpretable as McAllester's. If you apply McAllester's kl analysis to a bound by Maurer, you get a bound that is, empirically, almost as tight as the current one, but more interpretable. If I were a practitioner looking for an "off-the-shelf" bound, I'd reach for the simpler one that is, practically speaking, about as tight. == PRESENTATION == The paper is very well written, which is much appreciated. The authors have taken care to make the theory accessible, starting with a generic bound in Eq 1, and a helpful corollary to interpret their main result. The presentation of the main theorem is a bit dense, but I suppose that is unavoidable. == DETAILED COMMENTS == Line 22: "Standard bounds all have an order ..." This is misleading. Most risk bounds have a term that is O(\sqrt{COMP_n / n}), but only a few have the empirical risk in that term. You can take any little-kl bound and convert it into a bound with that term, using McAllester's analysis. (I think there is also a risk bound due to Pollard with a similar term, but I can't find the reference.) The form of TS's empirical Bernstein bound is, as I understood it, much different from Eq 1. If the posterior satisfies a certain condition, you get the fast rate; if not, you get the variance-based term. I guess you could combine the terms with an indicator function of posterior -- is this what you mean? It would be helpful if you showed the TS bound in the form of Eq 1. (Apologies if this was somewhere in the appendix and I missed it.) How critical is data splitting to the bound? Is the bound meaningful if you take m = 0 (or m = n)? Lines 142 - 144: "... we could use an estimator that ... in the first term." This sentence was unclear to me. What are constants p and q in Eq 5? Are they used to split the data for cross-validation? How are p, q, m and n chosen in the experiments? Stability is mentioned in line 190, but it's not clear how stability plays into the bound. Why is it important that \hat{h}(Z_{m}) are close to \hat{h}(Z_{

### Reviewer 2

The bound is a bit harder to deal with because it is formulated as an oracle inequality, and this introduces additional complexity in terms of tuning (quantization and union bound). The paper also includes some empirical evaluation, which I didn't find particularly strong. An interesting part here would be a synthetic evaluation, however it is somewhat "fixed", in a sense that it does not vary relevant parameters of the experiment to show sensitivity of the variance proxy (and it's merits against another empirical Bernstein type of bounds). For instance, this can be done by varying the Bayes optimal predictor. Therefore it is impossible to judge whether the bound is empirically tighter compared to the prior work. Also, it is unclear how far we are from the ground truth --- Bayes error in the plots would definitely help. Finally, comparison to another PAC-Bayesian literature that introduces algorithmic stability ideas is missing. For instance: Rivasplata, Omar, et al. "PAC-Bayes bounds for stable algorithms with instance-dependent priors." Advances in Neural Information Processing Systems. 2018. == Post-rebuttal comments I would like to thank authors for their detailed response. Promised improvements will definitely make the paper better. There was also a number of issues raised during the discussion (see comments by the meta-reviewer). I agree that at the moment, the paper perhaps has too many "moving parts" and their effect should be ideally studied separately (that is, the effect of the "half-samples", biasing, new Bernstein-style bound). The work would be much more solid if this would be the case. If the paper is not accepted at this time, this is the main point for improvement. As for the conventions, I also agree with the meta-reviewer that "un-expected" indeed sounds a bit strange, and this could be changed regardless of acceptance.

### Reviewer 3

Originality: The paper provides a new PAC-Bayesian bound based on a notion of stability of the learning algorithm on two subsets of the sample. While the authors discuss links with other PAC-Bayesian bounds including Tolstikhin and Seldin's bound based on a different version of Bernstein's inequality, they fail to cite or compare their approach to Rivasplata et al. (NeurIPS 2018), PAC-Bayes bounds for stable algorithms with instance-dependent priors. This paper also combines the PAC-Bayes approach with the stability of the hypothesis learned by a learning algorithm on the sample, and also state that the empirical values can be much smaller than previous state-of-the-art PAC-Bayesian bounds. While I did not thoroughly compare the proposed bound with Rivasplata's bound, I feel that the paper can not be accepted unless the authors do such a comparison, in terms of theoretical properties and empirical results. Quality: The paper is complete and seem technically sound - all results are presented with complete proofs (and even a new notation to simplify some theorems and proofs). Empirical results on a toy dataset and some UCI datasets are provided to show the advantage of the bound when compared to other PAC-Bayesian bounds of the literature. Quality: Sections 1 to 4 are clearly written and are easy to follow even if the subject at hand is complex. However, the following sections leave me confused - there are many back-and-forth discussions and results that I feel should have been introduced earlier in the paper, giving me the impression that this paper was originally shorter with a bigger appendix with complete results. My conclusion on this matter is that I think the paper could be easier to follow if it was more polished (and possibly reorganized). Significance: The new Bernstein inequality seem significant and could be reused by other researches. However, the lack of comparison with Rivasplata's NeurIPS 2018 paper makes it hard to make a decision on the significance of the new PAC-Bayes bound or the authors.