| Paper ID: | 3731 |
|---|---|
| Title: | Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks |

This work studies PAC-Bayes bound optimization in the setting of deep neural networks with binary activations. One of the stated contributions of the paper, showing how to optimize even though binary activations provide no naive derivative, is in fact a known technique in the literature on variational inference. This somewhat undermines the impact of the work, though importing these ideas into the PAC-Bayes community is nice. The other contribution is obtaining nonvacuous bounds, and here it is impressive to see such tight bounds.

I have a few issues to raise with the introduction, which I would like addressed in revisions.

First, the authors write: "Although informative, these results upper bound the prediction error of a (stochastic) neural network with perturbed weights, which is not the one used to predict in practice". I find this statement somewhat odd because, as far as I can tell, the present paper also doesn't resolve the gap between "the networks used in practice" and "the predictors for which the bounds hold". Another review points this out too. Yes, one obtains bounds for the weighted vote, not the Gibbs classifier, but people also don't use weighted votes in practice. (Though they may start to, given evidence that these help against adversaries.)

Second, the authors refer to Neyshabur et al. as work that presents bounds that depend on the architecture. Indeed they do, but their approach to obtaining a bound on the deterministic classifier produces a _completely_ vacuous bound for standard networks. The technique is potentially important, but, at present, the bound does not explain anything. So, based on the aesthetic principles of this work, it seems odd to me to bury this issue and to suggest that "spectral norms tell us something valuable about generalization". Do they? Unlikely.