Paper ID: 107
Title: PAC-Bayes-Empirical-Bernstein Inequality
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper present uniform convergence bounds which combines the PAC-Bayessian technique with empirical Bernstein inequalities. This allows the authors to improve upon the state-of-the-art under some regimes of parameters.

The paper takes a well studied problem (uniform convergence) and improves on top of the best known results. It is well presented and I like the fact that the authors clearly explain the regimes in which the new result is better than the current state-of-the-art and the regimes in which it is slightly worse. The presentation is very clear and easy to follow.

One place that I think the authors can improve is in exact definition of the empirical variance $E_\rho[V_n(h)]$. While $V_n(h)$ is defined in (5), how to take the expectation is not very clear.
Q2: Please summarize your review in 1-2 sentences
The authors address a classical problem in machine learning: uniform convergence, and improve upon the best known results.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
PAC-Bayes-Empirical Bernstein Inequality

The paper proposes a new PAC-Bayesian bound which has two specific features: a) it
uses empirical estimate of the variance of the Gibbs classiier and b) it combines
PAC-Bayesian techniques with concentration inequalities derived for self-bounding functions.
The new result is a "refinement" of a result by Seldin et al. on PAC-Bayesian bound
using the actual variance. Empirical results support the relevance of the approach.

The paper tackles an important question: that of having at hand accurate (PAC-Bayesian) bound
based on quantities that can be estimated from data. Here the authors propose a nice
approach to derive such a bound in the PAC-Bayesian framework. The technical results are
correct, the write-up is good and the empirical results show that there are situations where
the provided bound is indeed more accurate than other existing bounds.

It essentially a good and solid paper.

I have two questions:
- what would be the difficulty of going one step further and provide PAC-Bayesian bounds for, e.g. U-statistics, which are very important in the case of e.g. ranking, and for which there exist one the one hand empirical Bernstein inequalities and, on the other hand, dedicated PAC-Bayesian bounds ?
- more generally, what about non-IID settings ?
Q2: Please summarize your review in 1-2 sentences
Good paper introducing an original generalization bound that makes use of quantities computable from data. Nice writing and empirical evaluation.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper derives a new empircal PAC Bayesian bound by combining an existing (non-empircal) PAC Bayesian Berstein bound (i.e., involving the true variance of the loss values) with a PAC Bayesian analysis of the concentration of the empirical variance around its true value. This new bound has the advantage of being tighter when the empirical variance is small compared to the empirical loss. Experiments on real and empirical data with simple models compare the new bound with the usual empirical PAC Bayesian bound confirming the advantage.

While this is a well written paper that make a novel contribution to the PAC Bayesian literature it is a incremental and only marginally significant one. As the authors' claim, the newer bound is only really useful for analysing cases when there is a large mismatch between the model class and the data generating distribution.

Although the idea of applying a PAC Bayesian bound to the difference between empirical and true variances is original and addresses a shortcoming in the existing work, it appears that the steps from there were fairly routine, leaning heavily on techniques developed in other recent papers. That said, the definitions and proofs all seem to be correct and are logically organised.

Minor suggestions:

1. Put an article ("a" or "the") before all singular uses of "PAC-Bayes-Empirical-Berstein inequality" and the like. For example, the opening sentence of the abstract should be "We present a PAC-Bayes-Empirical-Bernstein inequality..."

2. You should define $KL(\rho\|\pi)$ or otherwise make clear in which direction the relative entropy is defined.
Q2: Please summarize your review in 1-2 sentences
A well motivated, clearly written, novel but modest contribution to the PAC Bayesian literature. I was not particularly surprised by any of the results or implications in this paper.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank the reviewers for their positive reviews, useful comments, and insightful questions.

Below we answer the questions raised by the reviewers. Reviewers are addressed by their number and their comments begin with “>”.

To Rev. 4
------------
> One place that I think the authors can improve is in exact definition of the empirical variance $E_\rho[V_n(h)]$. While $V_n(h)$ is defined in (5), how to take the expectation is not very clear.

The exact definition of $E_\rho[V_n(h)]$ is $E_{h\sim \rho}[V_n(h)]$, analogously to $E_\rho [L_n(h)]$. We thank the reviewer for pointing out that we forgot to put this definition and will add it to the revised manuscript.

To Rev. 5
------------
> what would be the difficulty of going one step further and provide PAC-­Bayesian bounds for, e.g. U­ statistics, which are very important in the case of e.g. ranking, and for which there exist on the one hand empirical Bernstein inequalities and, on the other hand, dedicated PAC­ Bayesian bounds?

We note that variance is second-order U-statistics and our bound in Theorem 3 treats this case. For some other U-statistics we would first of all need a bound on its moment generating function. This is definitely not a trivial task, but for many U-statistics we already have it (as it was in our case for the variance). Then it is possible to follow the general framework, as we did, up to equation (9). The main difficulty is the final step after the substitution of the bound on moment generating function in equation (9). At the moment we have no general recipe on how to finish the calculation after the substitution and we suspect that each case may require individual treatment. In our case it took 3 pages of calculations and we assume that for higher-order U-statistics it may be even more involved.

> more generally, what about non­ IID settings?

At the moment we are aware of two ways of extending IID to non-IID settings. One way is extension to martingales, as done by Seldin et. al. (2012). Our results can be directly extended to martingales in a similar way. The second way is picking an IID subsample from a dependent sample and working with the IID subsample (see Ralaivola et. al. (2010)). In this case the sample size n in the bound reduces to the size of IID subsample, but otherwise the results are identical.

To Rev. 7
------------
> Although the idea of applying a PAC Bayesian bound to the difference between empirical and true variances is original and addresses a shortcoming in the existing work, it appears that the steps from there were fairly routine, leaning heavily on techniques developed in other recent papers.

We are happy that things looked simple to the reviewer, but we would like to note that although we followed a general framework, putting all pieces together was not a “fairly routine” procedure and it actually took us 6 pages of non-trivial calculations that we moved to the supplementary material to ease on the reader.

We thank the reviewer for the minor language and technical suggestions that will be incorporated in the revised manuscript.