NIPS 2018
Sun Dec 2nd through Sat Dec 8th, 2018, at the Palais des Congrès de Montréal
Reviewer 1
**Summary and main remarks**

The manuscript investigates data-dependent PAC-Bayes priors. This is an area of great interest to the learning theory community: in PAC-Bayesian learning, most prior distributions do not rely on the data, and there has been some effort to leverage the information provided by the data to design more efficient / relevant priors. Classical PAC-Bayes bounds hold for any prior, and the crux is often to optimize a Kullback-Leibler (KL) divergence term between a pseudo-posterior (a Gibbs potential of the form $\exp(-\lambda R_n(\cdot))\pi(\cdot)$) and the prior. The manuscript starts with a very convincing and clear introduction to the problem, and builds upon Lever, Laviolette and Shawe-Taylor (2013). The intuition defended by the authors is that when using a data-dependent prior which is *robust* to changes in the data (loosely meaning that the prior does not crudely overfit the data), the resulting PAC-Bayesian bounds should be tighter than similar bounds obtained with a data-free prior. This is a clever direction, and the use of differential privacy to address it more formally appears very relevant to me. A second contribution of the manuscript is the use of SGLD (Stochastic Gradient Langevin Dynamics) to elicit such data-dependent priors (Section 5). This section closes with an important message: the approximation found by SGLD still yields a valid PAC-Bayesian bound (Corollary 5.4). This is reassuring for practitioners, as they benefit from the comforting PAC-Bayesian theory.

**Overall assessment**

I find the paper to be very well written, with clever contributions to PAC-Bayesian learning and a significant impact for the NIPS community. I have carefully checked the proofs and found no flaw. Since differential privacy is not my strongest suit, I have lowered my confidence score to 3. I recommend acceptance.

**Specific remarks**

- Some improper capitalization of words such as PAC, Bayesian, etc. in the references.
- Typo: end of page 4, "ogften" -> "often".
- With the authors' scheme of proofs, it is unclear to me whether the restriction to a bounded loss (by 1, for simplicity) is easy to relax. Since the PAC-Bayesian community has put some effort into deriving generalization bounds for unbounded / heavy-tailed / non-iid data, a comment in Section 2 (other related works) would be a nice addition to the manuscript. See for example Catoni, 2007 (already cited); Alquier and Guedj, 2018 (Simpler PAC-Bayesian bounds for hostile data, Machine Learning); and references therein.

[Rebuttal acknowledged.]
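For reference, the classical ingredients mentioned in the summary above can be sketched as follows; this is a generic McAllester-type statement for a loss bounded in $[0,1]$, with illustrative notation that is not taken from the manuscript. With probability at least $1-\delta$ over an i.i.d. sample of size $n$, simultaneously for all posteriors $\rho$,

$$
R(\rho) \;\le\; R_n(\rho) \;+\; \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
\qquad
\rho_\lambda(\mathrm{d}h) \;\propto\; \exp\!\big(-\lambda R_n(h)\big)\,\pi(\mathrm{d}h),
$$

where $\pi$ is the (data-independent) prior, $R$ and $R_n$ are the population and empirical risks, and $\rho_\lambda$ is the Gibbs pseudo-posterior, i.e. the minimizer of $\lambda\,\mathbb{E}_{h\sim\rho}[R_n(h)] + \mathrm{KL}(\rho\,\|\,\pi)$ over $\rho$. The manuscript's question is how much tighter such bounds become when $\pi$ itself is allowed to depend (privately) on the data.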
Reviewer 2
The authors provide PAC-Bayes bounds with data-dependent priors using differential privacy. They further discuss approximating the differentially private priors via stochastic gradient Langevin dynamics (SGLD), which has certain convergence properties in 2-Wasserstein distance. They then connect this approach to the study of generalization bounds for neural nets via PAC-Bayes, which seems interesting. However, it is not quite clear to me why this procedure is helpful for assessing generalization (e.g., for neural nets): empirically, we can measure the generalization gap directly, and theoretically the differential-privacy-based bound can be looser than the PAC-Bayes bound. Further discussion of this aspect would be helpful.

After rebuttal: I thank the authors for their efforts to address my queries. The answers are helpful, and I think the idea of connecting PAC-Bayes and differential privacy is interesting.
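For context on the approximation step discussed above, the SGLD iteration has the generic textbook form (this is the standard update, not necessarily the exact variant used in the paper):

$$
\theta_{k+1} \;=\; \theta_k \;-\; \eta\,\widehat{\nabla L}(\theta_k) \;+\; \sqrt{\tfrac{2\eta}{\beta}}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I),
$$

where $\widehat{\nabla L}$ is a minibatch estimate of the empirical-risk gradient, $\eta$ the step size and $\beta$ the inverse temperature. For small $\eta$ and many iterations, the iterates approximately sample the Gibbs-type distribution $\propto \exp(-\beta L(\theta))$, which is the differentially private object being approximated (in 2-Wasserstein distance).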
Reviewer 3
%% Summary %%

This paper develops new PAC-Bayes bounds with data-dependent priors. Distribution-dependent PAC-Bayesian bounds (especially ones with distribution-dependent priors) are by now well-known. However, as far as I am aware, the authors' bound is the first non-trivial PAC-Bayesian bound in which the prior is allowed to depend on the data. The key idea is the observation that a differentially private prior can be related to a prior that does not depend on the data, which allows (as the authors do in their first main result) a PAC-Bayesian bound with a data-dependent-yet-differentially-private prior to be derived from the standard one with a data-independent prior. The proof of the new bound rests on the connection between max information and differential privacy (versions of the latter imply bounds on versions of the former). A second key result, more relevant from the computational perspective, is that if a differentially private prior is close in 2-Wasserstein distance to some other (not necessarily differentially private) prior, then a PAC-Bayesian bound can again be developed using this latter data-dependent-but-not-differentially-private prior, at the cost of an additional error term that depends on the 2-Wasserstein distance. This result allows the authors to leverage a connection between stochastic gradient Langevin dynamics (which is computationally friendly) and Gibbs distributions (which are differentially private but computational nightmares).

In addition to the above bounds, the authors also perform an empirical study. I have to admit that I focused more on the theoretical guarantees, as I believe they are already interesting enough to warrant publication of the paper. Also, the authors did a poor job in writing Section 6, leaving vital details, such as figures they constantly refer to, in the appendix. I feel that this was not in the spirit of abiding by the 8-page limit, as the paper was not self-contained as a result. This was also confusing, as the authors clearly had extra space throughout the paper to include these two figures.

%% Reviewer's Expertise %%

I am an expert in PAC-Bayesian inequalities and also well-versed in recent developments related to differential privacy. I am less familiar with stochastic gradient Langevin dynamics but know the basics.

%% Detailed Comments %%

Aside from Section 6, I found this paper to be remarkably well-written and clear. I especially appreciated the key insight in the paragraph immediately after equation (1.1), explaining why the Lever et al. bound must often be vacuous for large values of $\tau$. I believe that Theorem 4.2 is a very interesting (and not so complicated) result which makes for a significant and original contribution. It should open up future directions in developing new PAC-Bayesian guarantees, and so this result alone, I feel, makes a compelling argument for accepting this paper. In addition, Theorem 5.3, which requires only a prior that is close in 2-Wasserstein distance to a differentially private one, further broadens the PAC-Bayesian toolbox in a significant and original way. That said, Theorem 5.3 has a major weakness: it forces us to give up on obtaining ``very high probability'' bounds, since (5.3) has a term which grows as $1/\delta'$ as the failure probability $\delta'$ decreases.
This is a crucial weakness which the authors should mention after Theorem 5.3, including a discussion of whether or not this issue is fundamental. I looked through the proofs of the main results, Theorems 4.2 and 5.3, and I believe the analysis is sound. I do think the authors made a typo in the definition of $g$ in Theorem 5.3; you should remove the negative sign (indeed, the logarithm in (5.4) is not well-defined if the negative sign stays!). I would have liked a more useful / detailed version of what is currently Corollary 5.4; it currently glosses over too many details to really say much. Since you have extra space in the paper, I recommend including a more explicit version of this corollary. You should also include around this point a citation to the Raginsky / Rakhlin / Telgarsky (2017) paper that you refer to in the appendix, and I recommend giving, at least in the appendix, a precise citation of which result from their paper you are using.

Minor comments:
- You never explain the notation for the squiggly arrow in the main paper. Please fix this; it is too important / widely used in the main text to be left to the appendix.
- On page 4, line 2, you say ``the above result''. I think you are referring to Definition 3.1, which is of course a definition, not a result, so you should change the text accordingly.
- In Theorem 3.2, you have one instance of $n$ which should be replaced by $m$ (see the inline math following the word ``Then'').
- In the proof of Lemma D.1, you should mention that the upper bound of total variation by KL divergence is Pinsker's inequality; ``standard results'' is a bit vague to inform the uninformed reader.

%% Update after author response %%

I've read the authors' rebuttal and their responses are satisfactory. I do hope you will highlight the weakness of the high-probability bound involving the $1/\delta'$ dependence.
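To make the $1/\delta'$ weakness flagged above concrete (the constants here are generic, not the paper's exact bound): standard PAC-Bayes terms scale as $\ln(1/\delta')$, so tightening the confidence level from $\delta' = 10^{-1}$ to $\delta' = 10^{-3}$ only costs

$$
\ln(10^{3}) - \ln(10) \;\approx\; 6.9 - 2.3 \;=\; 4.6,
$$

whereas a term proportional to $1/\delta'$ inflates by a factor of $10^{3}/10 = 100$ over the same change, which is why very-high-probability statements become impractical under (5.3).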