NeurIPS 2019
Reviewer 1
This is a novel combination of existing techniques that appears well-formulated, with intriguing experimental results. In particular, this work leverages the strengths of stochastic gradient MCMC methods together with stochastic approximation to form an adaptive empirical Bayesian approach to learning the parameters and hyperparameters of a Bayesian neural network (BNN). My best understanding is that by optimizing the hyperparameters (rather than sampling them), this new method improves upon existing approaches, speeding up inference without sacrificing quality (especially in the model compression domain). Other areas of the BNN literature could be cited, but I think the authors were prudent not to distract the reader from the particular area of focus.

This work demonstrates considerable theoretical analysis and is supported by intriguing experimental evidence. In particular, applying the method to a simpler problem (Bayesian linear regression with p << n) shows posteriors matching the data-generating distributions, appropriately shrinks irrelevant variables, and attains better predictive performance than similar methods without the contributions of this paper. Similar predictive performance seems to hold for MNIST and FMNIST compared to SGHMC (considered by some to be a “gold standard” in learning BNN parameters), and in a compression setting on CIFAR10. My only complaint about the experimental section would be wanting to see more datasets (perhaps even some regression datasets other than the toy example, which was by no means trivial), but I think the paper is sufficient as-is.

The authors go to considerable lengths to demonstrate the theory of their contribution without distracting the reader from the main results, appropriately deferring lengthy proofs to supplemental material. I think that, coupled with the supplemental material, an expert would be able to reproduce the results in this paper, especially someone already familiar with SGLD/SGHMC methods. Sections 3.2-3.3 are a bit dense to read through, but considering the complexity of the approach and the desire for the work to be reproducible, I think this is appropriate.

I think the results on compression in MNIST and the uncertainty in the posterior for the toy example are important. They show the strength of this method (even in a neural network setting) and its ability to prune extraneous model parameters, which has positive implications for embedded devices and other lower-memory applications. I think future researchers could indeed apply this method as-is to certain models, although some hyperparameter guidance (e.g., \tau, the temperature) would be helpful for practitioners. Especially because of its theoretical guarantees, I think the results make the method a more compelling alternative to SGLD/SGHMC for Bayesian inference, especially where sparsity is concerned.

**post-author feedback** I appreciate the authors' response, especially the additional results on UCI regression datasets showing an improvement over SGHMC in particular. I also appreciate the authors' responses to the other reviewers regarding a clear problem statement and possible extensions of the work (larger networks, more structure in the sparsity).
Reviewer 2
The paper proposes a novel adaptive empirical Bayesian method to train sparse Bayesian neural networks. The proposed method works by alternately sampling the network parameters from the posterior distribution using stochastic gradient Markov Chain Monte Carlo (MCMC) and smoothly optimizing the hyperparameters of the prior distribution using stochastic approximation (SA). (A rough sketch of how I read this alternation is appended at the end of this review.)

Originality: the proposed sampling scheme enables learning BNNs of complex forms and seems novel. However, I am still unclear as to exactly what limitations of the previous related methods this work aimed to address, and what the key ingredient was that enabled such an advance.

Quality: I believe the work is technically sound and the model assumptions are clearly stated. However, the authors do not discuss the weaknesses of the method. Are there any caveats for practitioners due to violations of the assumptions given in Appendix B, or for any other reasons?

Clarity: the writing is highly technical and rather dense, which I understand is necessary for some parts. However, I believe the manuscript would be readable to a broader audience if Sections 2 and 3 were augmented with more intuitive explanations of the motivations and the proposed methods. Many details of the derivations could be moved to the appendix, and the resulting space could be used to highlight the key machinery that enables efficient inference and to develop intuition. Many terms and notations are not defined in the text (as raised in "other comments" below).

Significance: the empirical results support the practical utility of the method. I am not sure, however, whether the experiments on synthetic datasets support the theoretical insights presented in the paper. I believe that the method is quite complex and recommend that the authors release the code to maximize its impact.

Other comments:
- line 47-48, "over-parametrization invariably overfits the data and results in worse performance": over-parameterization seems to be very helpful for supervised learning of deep neural networks in practice. I have also seen a number of theoretical works showing the benefits of over-parametrization, e.g. [1].
- line 71: $\beta$ is never defined. It denotes the set of model parameters, right?
- line 149-150, "the convergence to the asymptotically correct distribution allows ... obtain better point estimates in non-convex optimization.": this is only true if the assumptions in Appendix B are satisfied, isn't it? How realistic are these assumptions in practice?
- line 1: MCMC is never defined: Markov Chain Monte Carlo.
- line 77: typo "gxc lobal" => "global".
- eq. 4: $\mathcal{N}$ and $\mathcal{L}$ are not defined. Normal and Laplace, I suppose. Please define them.
- Table 2: using the letter `a` to denote the difference between the models used is confusing.
- Too many acronyms are used.

References:
[1] Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. "A convergence theory for deep learning via over-parameterization." arXiv preprint arXiv:1811.03962 (2018).

----------------------------------------------------------------------
I am grateful that the authors have addressed most of the concerns about the paper, and have updated my score accordingly. I would like to recommend acceptance provided that the authors reflect the given clarifications in the paper.
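Appendix to my review: to make concrete how I read the alternating scheme described above, here is a minimal toy sketch. It is my own simplification, not the authors' algorithm: I use a Gaussian spike and Gaussian slab instead of the paper's Normal/Laplace mixture (eq. 4), a single global inclusion probability, full-batch gradients on a toy regression problem, and arbitrary values for the step sizes and prior scales; all names (`pi`, `rho`, `sigma_spike`, `sigma_slab`) are mine.

```python
# My own toy illustration (NOT the authors' exact algorithm): SGLD steps on the
# regression weights, alternated with a Robbins-Monro / stochastic-approximation
# update of spike-and-slab inclusion probabilities. Gaussian spike and slab are
# used for simplicity instead of the paper's Normal/Laplace mixture.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: only the first 3 of 20 features are relevant.
n, p = 200, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -3.0, 1.5]
y = X @ true_beta + 0.5 * rng.normal(size=n)

sigma_spike, sigma_slab, noise_var = 0.01, 1.0, 0.25  # illustrative, fixed here
beta = rng.normal(scale=0.1, size=p)   # weights, sampled by SGLD
rho = np.full(p, 0.5)                  # per-weight inclusion probs (SA estimate)
pi = 0.5                               # global prior inclusion prob (SA estimate)

for t in range(1, 5001):
    # --- sampling step: one SGLD update of the weights (full batch for brevity) ---
    lr = 1e-4
    grad_lik = X.T @ (y - X @ beta) / noise_var
    # EM-style expected log-prior gradient: each weight is shrunk with a precision
    # interpolated between slab and spike according to its inclusion probability.
    grad_prior = -(rho / sigma_slab**2 + (1 - rho) / sigma_spike**2) * beta
    beta += 0.5 * lr * (grad_lik + grad_prior) + np.sqrt(lr) * rng.normal(size=p)

    # --- stochastic-approximation step: update the inclusion probabilities ---
    log_slab = np.log(pi) - np.log(sigma_slab) - 0.5 * beta**2 / sigma_slab**2
    log_spike = np.log(1 - pi) - np.log(sigma_spike) - 0.5 * beta**2 / sigma_spike**2
    m = np.maximum(log_slab, log_spike)          # for numerical stability
    resp = np.exp(log_slab - m) / (np.exp(log_slab - m) + np.exp(log_spike - m))
    a_t = 1.0 / t**0.6                           # decaying Robbins-Monro step size
    rho = (1 - a_t) * rho + a_t * resp
    pi = np.clip((1 - a_t) * pi + a_t * resp.mean(), 0.05, 0.95)  # my own safeguard

print(np.round(rho, 2))  # should be near 1 for the first 3 features, near 0 elsewhere
```

Even this stripped-down version shows why optimizing the hyperparameters by SA, rather than sampling them, is attractive: the inclusion probabilities evolve smoothly while the weights see a progressively sharper shrinkage.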
Reviewer 3
This paper combines a spike-and-slab prior, SG-MCMC, and stochastic approximation to prune the structure of neural networks. The authors propose to use SA to optimize the meta-parameters, such as the spike-and-slab selection parameter gamma, and to use SG-MCMC to obtain the posterior distribution of the weights.

Right now the pruning seems to be done in a per-scalar fashion. It would be more interesting if the authors could study a more structured version of pruning, e.g. use the spike-and-slab prior to select which pathways to turn on and off, so we get a more structured sparsity pattern that can be sped up more easily (a rough sketch of what I have in mind is given below).

Most of the current experimental studies are focused on small neural networks. What would it take to scale the experimental results to bigger datasets and models? We could also argue that SG-MCMC-SA works better in the small-neural-network domain. Some discussion of this would be helpful.
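To illustrate the structured variant I am suggesting (this is my own sketch, not something in the paper), the per-scalar inclusion probability could be replaced by one spike-and-slab indicator per hidden unit, computed from the joint density of that unit's incoming weights; the function name, prior scales, and Gaussian spike/slab below are my own illustrative choices.

```python
# Sketch of group-level (per-unit) spike-and-slab responsibilities, as opposed to
# the per-scalar pruning in the paper. Purely illustrative.
import numpy as np

def group_responsibilities(W, pi, sigma_spike=0.01, sigma_slab=1.0):
    """W: (n_units, fan_in) weight matrix; one spike-and-slab indicator per row.

    Returns the posterior probability that each unit's whole weight vector came
    from the slab component, i.e. that the unit (pathway) should be kept.
    """
    fan_in = W.shape[1]
    sq_norms = np.sum(W**2, axis=1)
    # Joint log-density of each row under isotropic slab vs. spike Gaussians.
    log_slab = np.log(pi) - fan_in * np.log(sigma_slab) - 0.5 * sq_norms / sigma_slab**2
    log_spike = np.log(1 - pi) - fan_in * np.log(sigma_spike) - 0.5 * sq_norms / sigma_spike**2
    m = np.maximum(log_slab, log_spike)  # log-sum-exp trick for numerical stability
    return np.exp(log_slab - m) / (np.exp(log_slab - m) + np.exp(log_spike - m))

rng = np.random.default_rng(0)
W = np.vstack([rng.normal(0.0, 1.0, size=(2, 50)),    # two clearly active units
               rng.normal(0.0, 0.01, size=(3, 50))])  # three near-dead units
print(np.round(group_responsibilities(W, pi=0.5), 3))  # roughly [1. 1. 0. 0. 0.]
```

With group-level responsibilities like these, whole rows or filters could be dropped at once, which translates more directly into wall-clock speedups than unstructured per-scalar sparsity.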