NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6604
Title:PAC-Bayes Un-Expected Bernstein Inequality

The paper introduces three exciting ideas to the area of PAC-Bayesian analysis: (1) a new way of using "half-samples" to construct informed priors; (2) offsetting (biasing) the loss estimate by the loss of a reference hypothesis h_* to achieve "fast convergence rates" under Bernstein condition [even when the loss itself is bounded away from zero]; (3) a new form of Empirical Bernstein inequality, which is combined with PAC-Bayes to exploit low variance [the need in a new inequality and its advantages are not well explained]. The authors compare a bound based on combination of the three ideas with PAC-Bayes bound of Maurer (2004) and some other PAC-Bayes bounds, demonstrating superiority of the new approach. While the work is really exciting, the authors fail to clearly separate between the three major contributions. It is not shown how much each of the three novelties contribute to the success of the method. [I believe it's the biasing giving the most, rather than the new Empirical Bernstein. Biasing and informed priors can be easily combined with the bound of Tolstikhin & Seldin (2013) [TS] and this comparison should be added.] The relation of the new results to prior work is not sufficiently explored and the key ideas could be explained in a much simpler and cleaner way. And, please, do not call your bound Un-Expected Bernstein - see my explanation below. I have the following suggestions for improvement: Regarding informed priors: 1. Add a simple explanation of the idea of using "half-samples" for construction of informed priors and how it can be combined with existing bounds. I provide a one-line explanation below. 2. Add a comparison of Maurer's bound with uninformed prior, KL(P_n||P_0), with Maurer's bound with informed prior, KL(P_n||P_< m) + KL(P_n||P_\geq m). 3. Add a comparison of Maurer's bound with informed prior with the new version of PAC-Bayes Empirical Bernstein. 4. Ambroladze et al. use the first half of a sample to construct a prior, which is used for the second half of a sample. Let's call it a "forward" approach. Your construction is equivalent to taking an average of "forward" and "backward" approaches. It is elegant and more stable, but on average it is the same. [Therefore, I believe the improvement does not come from here.] Regarding biasing: 5. Add a simple explanation of the idea. I provide a one-line explanation below. 6. Biasing can be easily combined with PAC-Bayes Empirical Bernstein bound of TS [see the derivation below] and the need in derivation of a new bound and its advantages are not explained. I see why it is problematic to use informed priors in the final bound of TS, because of the square root, but I believe it is possible in the intermediate linear version of the bound in equation (15) in TS, see the derivation. And even if not, you can train the prior and bias on the first half of the sample and apply the bound of TS to the second half. As explained above, on average this is equivalent to what you obtain from your "forward" and "backward" split. This comparison should be added. Regarding the new form of Empirical Bernstein bound: 7. Calling the new bound "un-expected Bernstein" is very confusing. Replacing expected values with empirical ones is the main idea of Empirical Bernstein. The fact that you take X^2 out of the expectation is not what distinguishes your result from existing Empirical Bernstein inequalities (and there are several of them). All of them could be called "Un-expected Bernstein", but they were called "Empirical Bernstein" and I urge you to keep with the tradition. Otherwise, I can easily see people getting confused, starting rediscovering things, not comparing with the right baselines, missing relevant prior work, etc. Please, don't do that! 8. The comparison of the new Empirical Bernstein inequality with Maurer & Pontil (2009) should move from the appendix to the body of the paper. In the appendix you show that a relaxation of the new bound is weaker than Maurer & Pontil. So what are the advantages of the new bound? How they compare numerically? [It does not feel like the improvement is coming from here.] === Below I provide a one-line explanation of the idea of informed priors and how they can be incorporated into existing PAC-Bayes bounds, specifically Maurer's bound and PAC-Bayes bounds in a linear form. Then a one-line explanation of biasing and how it could be combined with the bound of TS. Then I give some minor comments and references. === One-line explanation of informed priors: Let L_1(h) be the empirical loss of h on the first half of the sample and L_2(h) on the second half. We then have that L_n(h) = (1/2) L_1(h) + (1/2) L_2(h) is the empirical loss on the full sample. Let P_1 be the prior constructed on the first half of the sample and P_2 on the second half. Let P_n be the posterior. Let E[L] = E_{h ~ P_n}[L(h)] be the expected loss when h is drawn according to P_n. Let E[L_1] = E_{h ~ P_n}[L_1(h)] and the same for E[L_2] and E[L_n]. Let X = \ln(2\sqrt{n}/\delta) be the logarithmic part of Maurer's bound. Then we have: kl(E[L]||E[L_n]) = kl((1/2)E[L] + (1/2)E[L]|| (1/2) E[L_1] + (1/2) E[L_2]) \leq (1/2) kl(E[L]||E[L_1]) + (1/2) kl(E[L]||E[L_2]) \leq (1/2) (KL(P_n||P_2) + X) / (n/2) + (1/2) (KL(P_n||P_1) + X) / (n/2) = (KL(P_n||P_2) + KL(P_n||P_1) + 2X) / n. Where: the first inequality is by Jensen's inequality and convexity of kl and the second is the Mauer's PAC-Bayes inequality. Maurer's bound is applicable, because the priors are built on a complimentary part of the sample, not the one used for estimating the empirical loss. The same kind of derivation can be applied to PAC-Bayes bounds in a linear form. === One-line explanation of biasing: Instead of bounding L(h) - L_n(h) directly, which is the standard way, you take a hypothesis h_*, which was trained on the other half of the sample and write L(h) - L_n(h) = L(h) - L(h_*) + L(h_*) - L_n(h_*) + L_n(h_*) - L_n(h). Then you use Empirical Bernstein (or PAC-Bayes Empirical Bernstein for distributions over H and you could use the bound of TS) to bound L(h) - L(h_*) + L_n(h_*) - L_n(h) and, for example, the kl bound to bound L(h_*) - L_n(h_*). The latter gives what you call the "irreducible term" (you work with just one hypothesis, which could also be a "randomized classifier" defined by a distribution over H). Strictly speaking, this term is not part of the PAC-Bayes bound: PAC-Bayes is used to bound "L(h) - L(h_*) + L_n(h_*) - L_n(h)" and a concentration inequality for a single hypothesis is used to bound "L(h_*) - L_n(h_*)". If you combine biasing with the bound of TS you would get the same separation. The term "L(h) - L(h_*) + L_n(h_*) - L_n(h)" concentrates well when there is a high degree of agreement in predictions of h and h_* and thus the variance is small. === Minor comments and references: Empirical Bernstein inequality has been introduced in Audibert, Munos, & Czepesvari, "Tuning bandit algorithms in stochastic environments.", ALT, 2007 and the name "Empirical Bernstein" was coined in Mnih, Czepesvari, & Audibert "Empirical Bernstein Stopping", ICML, 2008. I think these works deserve credit. An alternative form of Empirical Bernstein appears in Wintenberger, Optimal learning with Bernstein Online Aggregation, Machine Learning, 2017, based on an inequality from Cesa-Bianchi, Mansour, and Stoltz, Improved second-order bounds for prediction with expert advice, Machine Learning, 2007. The idea of applying PAC-Bayes to a cross-validation-type split of the data (including overlaps) appears in Thiemann, Igel, Wintenberger, and Seldin, A strongly quasiconvex PAC-Bayesian bound, ALT, 2017. Multiple overlapping splits could be directly used in your work to further improve stability. Your proof of change-of-measure inequality (Proposition 8) is actually very standard. I am not sure whether the use of ESI notation is beneficial. It's nice, but it makes things harder to follow for people who are not used to it.