{"title": "An Adaptive Empirical Bayesian Method for Sparse Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5563, "page_last": 5573, "abstract": "We propose a novel adaptive empirical Bayesian (AEB) method for sparse deep learning, where the sparsity is ensured via a class of self-adaptive spike-and-slab priors. The proposed method works by alternatively sampling from an adaptive hierarchical posterior distribution using stochastic gradient Markov Chain Monte Carlo (MCMC) and smoothly optimizing the hyperparameters using stochastic approximation (SA). The convergence of the proposed method to the asymptotically correct distribution is established under mild conditions. Empirical applications of the proposed method lead to the state-of-the-art performance on MNIST and Fashion MNIST with shallow convolutional neural networks (CNN) and the state-of-the-art compression performance on CIFAR10 with Residual Networks. The proposed method also improves resistance to adversarial attacks.", "full_text": "An Adaptive Empirical Bayesian Method for Sparse\n\nDeep Learning\n\nWei Deng\n\nXiao Zhang\n\nDepartment of Mathematics\n\nDepartment of Computer Science\n\nPurdue University\n\nWest Lafayette, IN 47907\ndeng106@purdue.edu\n\nPurdue University\n\nWest Lafayette, IN 47907\nzhang923@purdue.edu\n\nFaming Liang\n\nDepartment of Statistics\n\nPurdue University\n\nWest Lafayette, IN 47907\nfmliang@purdue.edu\n\nGuang Lin\n\nDepartments of Mathematics, Statistics\nand School of Mechanical Engineering\n\nPurdue University\n\nWest Lafayette, IN 47907\nguanglin@purdue.edu\n\nAbstract\n\nWe propose a novel adaptive empirical Bayesian (AEB) method for sparse deep\nlearning, where the sparsity is ensured via a class of self-adaptive spike-and-slab\npriors. 
The proposed method works by alternately sampling from an adaptive hierarchical posterior distribution using stochastic gradient Markov Chain Monte Carlo (MCMC) and smoothly optimizing the hyperparameters using stochastic approximation (SA). We further prove the convergence of the proposed method to the asymptotically correct distribution under mild conditions. Empirical applications of the proposed method lead to the state-of-the-art performance on MNIST and Fashion MNIST with shallow convolutional neural networks (CNN) and the state-of-the-art compression performance on CIFAR10 with Residual Networks. The proposed method also improves resistance to adversarial attacks.

1 Introduction

MCMC, known for its asymptotic properties, has not been fully investigated in deep neural networks (DNNs) due to its poor scalability on big data. Stochastic gradient Langevin dynamics (SGLD) [Welling and Teh, 2011], the first stochastic gradient MCMC (SG-MCMC) algorithm, tackled this issue by adding noise to the stochastic gradient, smoothing the transition between optimization and sampling and making MCMC scalable. Chen et al. [2014] proposed stochastic gradient Hamiltonian Monte Carlo (SGHMC), a second-order SG-MCMC method, which was shown to converge faster. In addition to modeling uncertainty, SG-MCMC also has remarkable non-convex optimization abilities. Raginsky et al. [2017] and Xu et al. [2018] proved that SGLD, the first-order SG-MCMC, is guaranteed to converge to an approximate global minimum of the empirical risk in finite time. Zhang et al. [2017] showed that SGLD hits an approximate local minimum of the population risk in polynomial time.
Mangoubi and Vishnoi [2018] further demonstrated that SGLD with simulated annealing has a higher chance of reaching the global minimum on a wider class of non-convex functions. However, all these analyses fail when the DNN has too many parameters: an over-specified model tends to have large prediction variance, resulting in poor generalization and over-fitting. Therefore, proper model selection is in demand in this situation.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

A standard way to deal with model selection is variable selection. Notably, best-subset variable selection based on the L0 penalty is conceptually ideal for sparsity detection but is computationally slow. Two alternatives emerged to approximate it. On the one hand, penalized likelihood approaches, such as Lasso [Tibshirani, 1994], induce sparsity due to the geometry that underlies the L1 penalty. To better handle highly correlated variables, the Elastic Net [Zou and Hastie, 2005] makes a compromise between the L1 and L2 penalties. On the other hand, spike-and-slab approaches to Bayesian variable selection originate from probabilistic considerations. George and McCulloch [1993] proposed building a continuous approximation of the spike-and-slab prior and sampling from the hierarchical Bayesian model using Gibbs sampling. This continuous relaxation inspired the efficient EM variable selection (EMVS) algorithm in linear models [Ročková and George, 2014, 2018].

Despite these advances in linear systems, model selection in DNNs has received less attention. Ghosh et al. [2018] proposed to use variational inference (VI) based on regularized horseshoe priors to obtain a compact model. Liang et al.
[2018] presented the theory of posterior consistency for Bayesian neural networks (BNNs) with Gaussian priors, and Ye and Sun [2018] applied a greedy elimination algorithm to conduct group model selection with the group Lasso penalty. Although these works only show the performance of shallow BNNs, the experimental methodologies imply the potential of model selection in DNNs. Louizos et al. [2017] studied scale mixtures of Gaussian priors and half-Cauchy scale priors for the hidden units of VGG models [Simonyan and Zisserman, 2014] and achieved good model compression performance on CIFAR10 [Krizhevsky, 2009] using VI. However, due to the limitations of VI in non-convex optimization, the compression is still not sparse enough and can be further optimized.

Over-parameterized DNNs often demand tremendous memory and heavy computational resources, which is impractical for smart devices. More critically, over-parameterization frequently over-fits the data and results in worse performance [Lin et al., 2017]. To ensure the efficiency of the sparse sampling algorithm without over-shrinkage in DNN models, we propose an AEB method that adaptively samples from a hierarchical Bayesian DNN model with spike-and-slab Gaussian-Laplace (SSGL) priors, where the priors are learned through optimization instead of sampling. The AEB method differs from the full Bayesian method in that the priors are inferred from the empirical data and the uncertainty of the priors is no longer considered, which speeds up the inference. To optimize the latent variables without affecting convergence to the asymptotically correct distribution, stochastic approximation (SA) [Benveniste et al., 1990], a standard method for adaptive sampling [Andrieu et al., 2005, Liang, 2010], is a natural fit for training the adaptive hierarchical Bayesian model.
As a result, the asymptotic property allows us to combine simulated annealing and/or parallel tempering to accelerate the non-convex learning.

In this paper, we propose a sparse Bayesian deep learning algorithm, SG-MCMC-SA, to adaptively learn hierarchical Bayes mixture models in DNNs. This algorithm makes four main contributions:

• We propose a novel AEB method to efficiently train hierarchical Bayesian mixture DNN models, where the parameters are learned through sampling while the priors are learned through optimization.
• We prove the convergence of this approach to the asymptotically correct distribution; the result can be further generalized to a class of adaptive sampling algorithms for estimating state-space models in deep learning.
• We apply this adaptive sampling algorithm to DNN compression problems for the first time, with potential extension to a variety of model compression problems.
• The method achieves the state of the art in terms of compression rates: 91.68% accuracy on CIFAR10 using only 27K parameters (90% sparsity) with Resnet20 [He et al., 2016].

2 Stochastic Gradient MCMC

We denote the set of model parameters by β, the learning rate at time k by ε^(k), and the entire dataset by D = {d_i}_{i=1}^N, where d_i = (x_i, y_i); L(β) denotes the log posterior. The mini-batch of data B is of size n with indices S = {s_1, s_2, ..., s_n}, where s_i ∈ {1, 2, ..., N}. The stochastic gradient ∇_β L̃(β) from a mini-batch B randomly sampled from D is used to approximate ∇_β L(β):

    ∇_β L̃(β) = ∇_β log P(β) + (N/n) Σ_{i∈S} ∇_β log P(d_i | β).    (1)

SGLD (no momentum) is formulated as follows:

    β^(k+1) = β^(k) + ε^(k) ∇_β L̃(β^(k)) + N(0, 2 ε^(k) τ^{−1}),    (2)

where τ > 0 denotes the inverse temperature.
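As a minimal, self-contained sketch of the SGLD update (2), here is one step on a toy standard-normal target; the full gradient of a Gaussian log-density stands in for the stochastic gradient of L̃, and the ε and τ values are illustrative, not the paper's settings:

```python
import numpy as np

def sgld_step(beta, grad_log_post, eps, tau=1.0, rng=None):
    # One SGLD update, Eq. (2): beta + eps * grad + N(0, 2 * eps / tau).
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, np.sqrt(2.0 * eps / tau), size=beta.shape)
    return beta + eps * grad_log_post(beta) + noise

# Toy target: log posterior of N(0, I), so grad log pi(beta) = -beta.
rng = np.random.default_rng(0)
beta = np.ones(3)
samples = []
for _ in range(5000):
    beta = sgld_step(beta, lambda b: -b, eps=0.01, tau=1.0, rng=rng)
    samples.append(beta.copy())
print(np.mean(samples[1000:], axis=0))  # roughly zero-mean samples
```

With τ = 1 the chain samples from the target; raising τ while shrinking ε concentrates the samples around the global optimum, as described above.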
It has been shown that SGLD asymptotically converges to a stationary distribution π(β|D) ∝ e^{τ L(β)} [Teh et al., 2016, Zhang et al., 2017]. As τ increases and ε decreases gradually, the solution tends towards the global optima with a higher probability. Another variant of SG-MCMC, SGHMC [Chen et al., 2014, Ma et al., 2015], proposes to generate samples as follows:

    dβ = r dt,
    dr = ∇_β L̃(β) dt − C r dt + N(0, 2 B τ^{−1} dt) + N(0, 2 (C − B̂) τ^{−1} dt),    (3)

where r is the momentum term, B̂ is an estimate of the stochastic gradient variance, and C is a user-specified friction term. Regarding the discretization of (3), we follow the numerical method proposed by Saatci and Wilson [2017] due to its convenience for importing parameter settings from SGD.

3 Empirical Bayesian via Stochastic Approximation

3.1 A hierarchical formulation with deep SSGL priors

Inspired by the hierarchical Bayesian formulation for sparse inference [George and McCulloch, 1993], we assume the weight β_lj in sparse layer l with index j follows the SSGL prior

    β_lj | σ², γ_lj ∼ (1 − γ_lj) L(0, σ v0) + γ_lj N(0, σ² v1),    (4)

where γ_lj ∈ {0, 1}, β_l ∈ R^{p_l}, σ² ∈ R, L(0, σ v0) denotes a Laplace distribution with mean 0 and scale σ v0, and N(0, σ² v1) denotes a normal distribution with mean 0 and variance σ² v1. The sparse layer can be the fully connected (FC) layers in a shallow CNN or the convolutional layers in a ResNet. If γ_lj = 0, the prior behaves like Lasso, which leads to a shrinkage effect; when γ_lj = 1, the L2 penalty dominates.
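For concreteness, the two branches of the SSGL mixture (4) can be evaluated directly; a small sketch, where the v0, v1, and σ values are illustrative rather than the paper's settings:

```python
import numpy as np

def ssgl_log_prior(beta, gamma, sigma, v0=0.1, v1=10.0):
    """Log-density of the SSGL prior, Eq. (4), for a single weight.

    gamma == 0: Laplace(0, sigma * v0)  -- spike, Lasso-like shrinkage
    gamma == 1: Normal(0, sigma^2 * v1) -- slab, L2-like penalty
    """
    if gamma == 0:
        b = sigma * v0
        return -np.log(2.0 * b) - np.abs(beta) / b
    var = sigma ** 2 * v1
    return -0.5 * np.log(2.0 * np.pi * var) - beta ** 2 / (2.0 * var)

# Near zero the spike dominates; in the tails the slab does.
print(ssgl_log_prior(0.01, 0, 1.0) > ssgl_log_prior(0.01, 1, 1.0))  # True
print(ssgl_log_prior(3.00, 0, 1.0) < ssgl_log_prior(3.00, 1, 1.0))  # True
```

This crossover is what drives the posterior inclusion probabilities ρ_lj computed later in Eq. (12).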
The likelihood follows

    π(B | β, σ²) = exp{ −Σ_{i∈S} (y_i − ψ(x_i; β))² / (2σ²) } / (2πσ²)^{n/2}           (regression),
    π(B | β, σ²) = Π_{i∈S} exp{ψ_{y_i}(x_i; β)} / Σ_{t=1}^K exp{ψ_t(x_i; β)}           (classification),    (5)

where ψ(x_i; β) is a linear or non-linear mapping, and y_i ∈ {1, 2, ..., K} is the response value of the i-th example. In addition, the variance σ² follows an inverse gamma prior π(σ²) = IG(ν/2, νλ/2). An i.i.d. Bernoulli prior is used for γ, namely π(γ_l | δ_l) = δ_l^{|γ_l|} (1 − δ_l)^{p_l − |γ_l|}, where δ_l ∈ R follows a Beta distribution π(δ_l) ∝ δ_l^{a−1} (1 − δ_l)^{b−1}. The use of a self-adaptive penalty enables the model to learn the level of sparsity automatically. Finally, our posterior follows

    π(β, σ², δ, γ | B) ∝ π(B | β, σ²)^{N/n} π(β | σ², γ) π(σ² | γ) π(γ | δ) π(δ).    (6)

3.2 Empirical Bayesian with approximate priors

To speed up the inference, we propose the AEB method, which samples β and optimizes σ², δ, γ, so that the uncertainty of the hyperparameters is not considered. Because the binary variable γ is hard to optimize directly, we consider optimizing the adaptive posterior E_{γ|·,D}[π(β, σ², δ, γ | D)]* instead. Due to the limited memory, which restricts us from sampling directly from D, we choose to sample β from E_{γ|·,D}[E_B[π(β, σ², δ, γ | B)]]†.
By Fubini's theorem and Jensen's inequality, we have

    log E_{γ|·,D}[E_B[π(β, σ², δ, γ | B)]] = log E_B[E_{γ|·,D}[π(β, σ², δ, γ | B)]]
        ≥ E_B[log E_{γ|·,D}[π(β, σ², δ, γ | B)]] ≥ E_B[E_{γ|·,D}[log π(β, σ², δ, γ | B)]].    (7)

*E_{γ|·,D}[·] is short for E_{γ|β^(k), σ^(k), δ^(k), D}[·].
†E_B[π(β, σ², δ, γ | B)] denotes ∫_D π(β, σ², δ, γ | B) dB.

Instead of tackling π(β, σ², δ, γ | D) directly, we propose to iteratively update the lower bound Q:

    Q(β, σ, δ | β^(k), σ^(k), δ^(k)) = E_B[E_{γ|·,D}[log π(β, σ², δ, γ | B)]].    (8)

Given (β^(k), σ^(k), δ^(k)) at the k-th iteration, we first sample β^(k+1) from Q, then optimize Q with respect to σ, δ and E_{γ_l|·,D} via SA, where E_{γ_l|·,D} is used since γ is treated as an unobserved variable. To make the computation easier, we decompose Q as follows:

    Q(β, σ, δ | β^(k), σ^(k), δ^(k)) = Q1(β, σ | β^(k), σ^(k), δ^(k)) + Q2(δ | β^(k), σ^(k), δ^(k)) + C.    (9)

Denote X and C as the sets of indices of the sparse and non-sparse layers, respectively.
We have:

    Q1(β, σ | β^(k), σ^(k), δ^(k)) = (N/n) log π(B | β)                                  [log likelihood]
        − Σ_{l∈C} Σ_{j∈p_l} β_lj² / (2σ_0²)                                             [non-sparse layers C]
        − Σ_{l∈X} Σ_{j∈p_l} [ κ_lj0 |β_lj| / σ + κ_lj1 β_lj² / (2σ²) ]                  [deep SSGL priors in sparse layers X]
        − ((p + ν + 2)/2) log(σ²) − νλ / (2σ²),                                          (10)

where κ_lj0 = E_{γ_l|·,D}[1 / (v0 (1 − γ_lj))] and κ_lj1 = E_{γ_l|·,D}[1 / (v1 γ_lj)], and

    Q2(δ_l | β_l^(k), δ_l^(k)) = Σ_{l∈X} [ Σ_{j∈p_l} ρ_lj log(δ_l / (1 − δ_l)) + (a − 1) log(δ_l) + (p_l + b − 1) log(1 − δ_l) ],    (11)

where ρ_lj = E_{γ_l|·,D} γ_lj, and ρ, κ, σ and δ are to be estimated in the next section.

3.3 Empirical Bayesian via stochastic approximation

To simplify the notation, we denote the vector (ρ, κ, σ, δ) by θ. Our interest is to obtain the optimal θ* based on the asymptotically correct distribution π(β, θ*). This implies that we need an estimate θ* that solves the fixed-point formulation ∫ g_{θ*}(β) π(β, θ*) dβ = θ* [Shimkin, 2011], where g_θ(β) is inspired by EMVS to obtain the optimal θ based on the current β. Define the random output g_θ(β) − θ as H(β, θ) and the mean field function h(θ) := E[H(β, θ)].
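The fixed-point formulation above can be illustrated with a generic SA recursion on a toy problem; the sampler, the function H, and the step-size constants below are stand-ins for illustration, not the paper's EMVS updates:

```python
import numpy as np

def stochastic_approximation(sample_beta, H, theta0, n_iter=5000, rng=None):
    # Generic SA recursion: theta <- theta + w_k * H(beta_k, theta),
    # with decaying step sizes w_k = (k + 100)^(-0.7).
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0
    for k in range(1, n_iter + 1):
        beta = sample_beta(theta, rng)      # draw beta given current theta
        w = (k + 100.0) ** -0.7             # decaying step size
        theta = theta + w * H(beta, theta)
    return theta

# Toy mean field h(theta) = E[beta - theta] = 2 - theta, so theta* = 2.
rng = np.random.default_rng(0)
theta = stochastic_approximation(lambda t, r: r.normal(2.0, 1.0),
                                 lambda b, t: b - t, theta0=0.0, rng=rng)
print(round(theta, 1))
```

The decaying step size averages out the sampling noise, so the iterates settle at the root of h(θ) even though only noisy evaluations H(β, θ) are available.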
The stochastic approximation algorithm can be used to solve the fixed-point iterations:

(1) Sample β^(k+1) from a transition kernel Π_{θ^(k)}(β), which yields the distribution π(β, θ^(k));
(2) Update θ^(k+1) = θ^(k) + ω^(k+1) H(β^(k+1), θ^(k)) = θ^(k) + ω^(k+1) (h(θ^(k)) + Ω^(k)),

where ω^(k+1) is the step size. The equilibrium point θ* is obtained when the distribution of β converges to the invariant distribution π(β, θ*). Stochastic approximation [Benveniste et al., 1990] differs from the Robbins–Monro algorithm in that sampling β from a transition kernel instead of a distribution introduces a Markov state-dependent noise Ω^(k) [Andrieu et al., 2005]. In addition, since the variational technique is only used to approximate the priors, and the exact likelihood does not change, the algorithm falls into the class of adaptive SG-MCMC rather than variational inference.

Regarding the update of g_θ(β) with respect to ρ, we denote the optimal ρ based on the current β and δ by ρ̃. Then ρ̃_lj^(k+1), the probability of β_lj being dominated by the L2 penalty, is

    ρ̃_lj^(k+1) = E_{γ_l|·,B} γ_lj = P(γ_lj = 1 | β_l^(k), δ_l^(k)) = a_lj / (a_lj + b_lj),    (12)

where a_lj = π(β_lj^(k) | γ_lj = 1) P(γ_lj = 1 | δ_l^(k)) and b_lj = π(β_lj^(k) | γ_lj = 0) P(γ_lj = 0 | δ_l^(k)). The choice of Bernoulli prior enables us to use P(γ_lj = 1 | δ_l^(k)) = δ_l^(k). Similarly, as to g_θ(β) w.r.t. κ, the optimal κ̃_lj0 and κ̃_lj1 based on the current ρ_lj are given by:

    κ̃_lj0 = E_{γ_l|·,B}[1 / (v0 (1 − γ_lj))] = (1 − ρ_lj) / v0;    κ̃_lj1 = E_{γ_l|·,B}[1 / (v1 γ_lj)] = ρ_lj / v1.    (13)

To optimize Q1 with respect to σ, denoting diag{κ_0li}_{i=1}^{p_l} by V_0l and diag{κ_1li}_{i=1}^{p_l} by V_1l, we have:

    σ̃^(k+1) = (R_b + √(R_b² + 4 R_a R_c)) / (2 R_a)    (regression),
    σ̃^(k+1) = (C_b + √(C_b² + 4 C_a C_c)) / (2 C_a)    (classification),    (14)

where R_a = N + Σ_{l∈X} p_l + ν, C_a = Σ_{l∈X} p_l + ν + 2, R_b = C_b = Σ_{l∈X} ||V_0l β_l^(k+1)||_1, R_c = I + J + νλ, C_c = J + νλ, I = (N/n) Σ_{i∈S} (y_i − ψ(x_i; β^(k+1)))², and J = Σ_{l∈X} ||V_1l^{1/2} β_l^(k+1)||².†

To optimize Q2, a closed-form update can be derived from Eq. (11) and Eq. (12) given batch data B:

    δ̃_l^(k+1) = argmax_{δ_l ∈ R} Q2(δ_l | β_l^(k), δ_l^(k)) = (Σ_{j=1}^{p_l} ρ_lj + a − 1) / (a + b + p_l − 2).    (15)

3.4 Pruning strategy

There are quite a few methods for pruning neural networks, including oracle pruning and the easy-to-use magnitude-based pruning [Molchanov et al., 2017]. Although magnitude-based unit pruning shows more computational savings [Gomez et al., 2018], it doesn't demonstrate robustness under coarser pruning [Han et al., 2016, Gomez et al., 2018].
Pruning based on the probability ρ is also popular in the Bayesian community, but achieving the target sparsity in sophisticated networks requires extra fine-tuning. We instead apply magnitude-based weight pruning in our ResNet compression experiments; the resulting algorithm is referred to as SGLD-SA and is detailed in Algorithm 1. The corresponding variant of SGHMC with SA is referred to as SGHMC-SA.

4 Convergence Analysis

The key to guaranteeing the convergence of the adaptive SGLD algorithm is to use Poisson's equation to analyze additive functionals. By decomposing the Markov state-dependent noise Ω into martingale difference sequences and perturbations, where the latter can be controlled by the regularity of the solution of Poisson's equation, we can guarantee the consistency of the latent variable estimators.

Theorem 1 (L2 convergence rate). For any α ∈ (0, 1], under the assumptions in Appendix B.1, the algorithm satisfies: there exists a constant λ and an optimum θ* such that

    E[||θ^(k) − θ*||²] ≤ λ k^{−α}.

SGLD with adaptive latent variables forms a sequence of inhomogeneous Markov chains, and the weak convergence of β to the target posterior is equivalent to the weak convergence of SGLD with biased gradient estimates. Inspired by Chen et al. [2015], we have:

Corollary 1. Under the assumptions in Appendix B.2, the random vector β^(k) from the adaptive transition kernel Π_{θ^(k−1)} converges weakly to the invariant distribution e^{τ L(β, θ*)} as ε → 0 and k → ∞.

The smooth optimization of the priors makes the algorithm robust to bad initialization and avoids entrapment in poor local optima.
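As a concrete sketch of the magnitude-based pruning step used in SGLD-SA (Algorithm 1), simplified to a single weight array; the function name and tie-breaking rule are illustrative choices:

```python
import numpy as np

def magnitude_prune(weights, sparse_rate):
    # Zero out the bottom-s fraction of weights by absolute magnitude.
    flat = np.abs(weights).ravel()
    k = int(np.floor(sparse_rate * flat.size))
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude as threshold (ties may prune slightly more).
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([0.10, -0.50, 0.01, 2.00, -0.03])
print(magnitude_prune(w, 0.4))  # the two smallest |w| are zeroed
```

In Algorithm 1 the sparse rate is annealed over iterations via s ← S(1 − D^{k/f}) rather than applied at a fixed level in one shot.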
In addition, the convergence to the asymptotically correct distribution allows us to combine simulated annealing to obtain better point estimates in non-convex optimization.

5 Experiments

5.1 Simulation of Large-p-Small-n Regression

We conduct linear regression experiments with a dataset containing n = 100 observations and p = 1000 predictors. N_p(0, Σ) is chosen to simulate the predictor values X (training set), where Σ = (Σ)_{i,j=1}^p with Σ_{i,j} = 0.6^{|i−j|}. Response values y are generated from Xβ + η, where β = (β1, β2, β3, 0, 0, ..., 0)′ and η ∼ N_n(0, 3 I_n). We assume β1 ∼ N(3, σ_c²), β2 ∼ N(2, σ_c²), β3 ∼ N(1, σ_c²), with σ_c = 0.2.

Algorithm 1 SGLD-SA with SSGL priors
  Initialize: β^(1), ρ^(1), κ^(1), σ^(1) and δ^(1) from scratch; set target sparse rates D, f and S
  for k ← 1 : k_max do
    Sampling
      β^(k+1) ← β^(k) + ε^(k) ∇_β Q(·|B^(k)) + N(0, 2 ε^(k) τ^{−1})
    Stochastic approximation for latent variables
      SA: ρ^(k+1) ← (1 − ω^(k+1)) ρ^(k) + ω^(k+1) ρ̃^(k+1) following Eq. (12)
      SA: κ^(k+1) ← (1 − ω^(k+1)) κ^(k) + ω^(k+1) κ̃^(k+1) following Eq. (13)
      SA: σ^(k+1) ← (1 − ω^(k+1)) σ^(k) + ω^(k+1) σ̃^(k+1) following Eq. (14)
      SA: δ^(k+1) ← (1 − ω^(k+1)) δ^(k) + ω^(k+1) δ̃^(k+1) following Eq. (15)
    if Pruning then
      Prune the bottom-s% lowest-magnitude weights
      Increase the sparse rate s ← S(1 − D^{k/f})
    end if
  end for

Table 1: Predictive errors in linear regression based on a test set considering different v0 and σ

MAE / MSE    v0=0.01, σ=2     v0=0.1, σ=2      v0=0.01, σ=1     v0=0.1, σ=1
SGLD-SA      1.89 / 5.56      1.72 / 5.64      1.48 / 3.51      1.54 / 4.42
SGLD-EM      3.49 / 19.31     2.23 / 8.22      2.23 / 19.28     2.07 / 6.94
SGLD         15.85 / 416.39   15.85 / 416.39   11.86 / 229.38   7.72 / 88.90

†The quadratic equation has only one unique positive root. ||·|| refers to the L2 norm, ||·||_1 represents the L1 norm.

We introduce some hyperparameters, but most of them are uninformative. We fix τ = 1, λ = 1, ν = 1, v1 = 10, δ = 0.5, b = p and set a = 1. The learning rate follows ε^(k) = 0.001 × k^{−1/3}, and the step size is given by ω^(k) = 10 × (k + 1000)^{−0.7}. We vary v0 and σ to show the robustness of SGLD-SA to different initializations. In addition, to show the superiority of the adaptive update, we compare SGLD-SA with the intuitive combination of EMVS and SGLD, referred to as SGLD-EM, which is equivalent to setting ω^(k) := 1 in SGLD-SA. To obtain the stochastic gradient, we randomly select 50 observations and calculate the numerical gradient. SGLD is sampled from the same hierarchical model without updating the latent variables. We simulate 500,000 samples from the posterior distribution, and also simulate a test set with 50 observations to evaluate the prediction.

As shown in Fig. 1(d), all three algorithms fit the training set very well; however, SGLD fails completely on the test set (Fig. 1(e)), indicating the over-fitting problem of SGLD without proper regularization when the latent variables are not updated. Fig. 1(f) shows that although SGLD-EM successfully identifies the right variables, the estimates are biased downward. The reason is that SGLD-EM fails to regularize the right variables with the L2 penalty, and L1 leads to a greater amount of shrinkage for β1, β2 and β3 (Fig. 1(a-c)), implying the importance of the adaptive update via SA in the stochastic optimization of the latent variables. In addition, from Fig. 1(a), Fig.
1(b) and Fig. 1(c), we see that SGLD-SA is the only algorithm among the three that quantifies the uncertainties of β1, β2 and β3, and it always gives the best prediction, as shown in Table 1. We notice that SGLD-SA is fairly robust to various hyperparameters. For the simulation of SGLD-SA in logistic regression and the evaluation of SGLD-SA on UCI datasets, we leave the results to Appendices C and D.

5.2 Classification with Auto-tuning Hyperparameters

The following experiments are based on non-pruning SG-MCMC-SA; the goal is to show that auto-tuned sparse priors are useful to avoid over-fitting. The posterior average is applied to each Bayesian model. We implement all the algorithms in Pytorch [Paszke et al., 2017]. The first DNN is a standard 2-Conv-2-FC CNN model of 670K parameters (see details in Appendix D.1).

The first set of experiments compares methods on the same model without using data augmentation (DA) or batch normalization (BN) [Ioffe and Szegedy, 2015]. We refer to the general CNN without dropout as Vanilla, and with a 50% dropout rate applied to the hidden units next to FC1 as Dropout.

(a) Posterior estimation of β1. (b) Posterior estimation of β2. (c) Posterior estimation of β3. (d) Training performance. (e) Testing performance. (f) Posterior mean vs truth.

Figure 1: Linear regression simulation when v0 = 0.1 and σ = 1.

Vanilla and Dropout models are trained with Adam [Kingma and Ba, 2014] and Pytorch default parameters (with learning rate 0.001). We use SGHMC as a benchmark method, as it is also sampling-based and has a close relationship with the popular momentum-based optimization approaches in DNNs. SGHMC-SA differs from SGHMC in that SGHMC-SA keeps updating the SSGL priors for the first FC layer while they are fixed in SGHMC. We set the training batch size n = 1000, a, b = p and ν, λ = 1000.
The hyperparameters for SGHMC-SA are set to v0 = 1, v1 = 0.1 and σ = 1 to regularize the over-fitted space. The learning rate is set to 5 × 10^{−7}, and the step size is ω^(k) = 1 × (k + 1000)^{−3/4}. We use a thinning factor of 500 to avoid a cumbersome system. A fixed temperature can also be powerful in escaping "shallow" local traps [Zhang et al., 2017]; our temperatures are set to τ = 1000 for MNIST and τ = 2500 for FMNIST.

The four CNN models are tested on the MNIST and Fashion MNIST (FMNIST) [Xiao et al., 2017] datasets. Performance of these models is shown in Table 2. Our SGHMC-SA outperforms SGHMC on both datasets. We notice that the posterior averages from SGHMC-SA and SGHMC obtain much better performance than Vanilla and Dropout. Without using either DA or BN, SGHMC-SA achieves 99.59%, which outperforms some state-of-the-art models, such as Maxout Network (99.55%) [Goodfellow et al., 2013] and pSGLD (99.55%) [Li et al., 2016]. On FMNIST, SGHMC-SA obtains 93.01% accuracy, outperforming all other competing models.

To further test the performance, we apply DA and BN to the following experiments (see details in Appendix D.2) and refer to the datasets as DA-MNIST and DA-FMNIST. All the experiments are conducted using a 2-Conv-BN-3-FC CNN of 490K parameters. Using this model, we obtain the state-of-the-art 99.75% on DA-MNIST (200 epochs) and 94.38% on DA-FMNIST (1000 epochs), as shown in Table 2. The results are noticeable, because the posterior average is conducted on only a single shallow CNN.

5.3 Defenses against Adversarial Attacks

Continuing with the setup in Sec. 5.2, the third set of experiments focuses on evaluating model robustness.
We apply the Fast Gradient Sign method [Goodfellow et al., 2014] to generate the adversarial examples.
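A minimal sketch of the attack, framework-agnostic: grad_loss_x stands for the gradient of the loss with respect to the input, and the eps value and clipping range are illustrative choices:

```python
import numpy as np

def fgsm_perturb(x, grad_loss_x, eps=0.1, clip=(0.0, 1.0)):
    # Fast Gradient Sign method: x_adv = clip(x + eps * sign(dL/dx)).
    x_adv = x + eps * np.sign(grad_loss_x)
    return np.clip(x_adv, clip[0], clip[1])

x = np.array([0.5, 0.9, 0.2])
g = np.array([1.0, -2.0, 0.5])      # hypothetical input gradient
print(fgsm_perturb(x, g, eps=0.2))  # [0.7 0.7 0.4]
```

Each input pixel moves by eps in the direction that increases the loss, which is the single-step perturbation evaluated in this section's robustness experiments.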
lb^llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllS
Table 2: Classification accuracy using shallow networks

DATASET    MNIST   DA-MNIST   FMNIST   DA-FMNIST
VANILLA    99.31   99.54      92.73    93.14
DROPOUT    99.38   99.56      92.81    93.35
SGHMC      99.47   99.63      92.88    94.29
SGHMC-SA   99.59   99.75      93.01    94.38

adversarial examples with a single gradient step, as in Papernot et al. [2016]'s study:

    x_adv ← x − ζ · sign(∇_x max_y log P(y | x)),

where ζ ∈ {0.1, 0.2, . . . , 0.5} controls the level of the adversarial attack.
Similar to the setup in Li and Gal [2017], we normalize the adversarial images by clipping them to the range [0, 1]. In Fig. 2(b) and Fig. 2(d), we see no significant difference among the four models in the early phase. As the degree of adversarial attack rises, the images become blurrier, as shown in Fig. 2(a) and Fig. 2(c). The performance of Vanilla decreases rapidly, reflecting its poor defense against adversarial attacks; Dropout performs better than Vanilla, but is still significantly worse than the sampling-based methods. The advantage of SGHMC-SA over SGHMC becomes more significant when ζ > 0.25. In the case of ζ = 0.5 on MNIST, where the images are hardly recognizable, both the Vanilla and Dropout models fail to identify the right images and their predictions are no better than random guesses. However, SGHMC-SA achieves accuracy roughly 11% higher than these two models and 1% higher than SGHMC, which demonstrates the robustness of SGHMC-SA.

Figure 2: Adversarial test accuracies based on adversarial images of different levels. (a) ζ = ...., (b) MNIST, (c) ζ = ...., (d) FMNIST.

5.4 Residual Network Compression

Our compression experiments are conducted on the CIFAR-10 dataset [Krizhevsky, 2009] with DA. SGHMC and the non-adaptive SGHMC-EM are chosen as baselines.
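Two ingredients of these experiments lend themselves to a short sketch: the one-gradient-step attack used in the adversarial evaluation above, and the magnitude-based pruning criterion used for the compression runs below. The sketch is ours, not the paper's code: it uses NumPy, substitutes a softmax-linear classifier for the CNN, and the function names are hypothetical.

```python
import numpy as np

def one_step_attack(x, W, b, zeta):
    """One-gradient-step attack: x_adv = clip(x - zeta * sign(grad_x max_y log P(y|x)), 0, 1).

    Toy stand-in for the CNN: P(y|x) = softmax(W x + b).
    For this model, d log p_y / dx = W[y] - p @ W.
    """
    logits = W @ x + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    y = int(np.argmax(p))           # max_y log P(y|x) is attained at the top class
    grad_x = W[y] - p @ W           # gradient of log p_y with respect to the input
    return np.clip(x - zeta * np.sign(grad_x), 0.0, 1.0)

def magnitude_prune_mask(weights, sparse_rate):
    """Boolean mask keeping the largest-magnitude weights; the smallest
    fraction `sparse_rate` of entries is zeroed out."""
    flat = np.abs(weights).ravel()
    k = int(sparse_rate * flat.size)
    if k == 0:
        return np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.abs(weights) > threshold
```

In the compression experiments the sparse rate passed to the pruning mask grows each epoch (faster early, slower late, following the pruning rule in Algorithm 1); that schedule is omitted here.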
Simulated annealing is used to enhance the non-convex optimization, and the methods with simulated annealing are referred to as A-SGHMC, A-SGHMC-EM and A-SGHMC-SA, respectively. We report the best point estimate.
We first use SGHMC to train a Resnet20 model and apply the magnitude-based criterion to prune the weights of all convolutional layers (except the very first one). All the following methods are evaluated under the same setup, except for different step sizes to learn the latent variables. The sparse training takes 1000 epochs with mini-batch size 1000. The learning rate starts from 2e-9† and is divided by 10 at the 700th and 900th epochs. We set the inverse temperature τ to 1000 and multiply τ by 1.005 every epoch. We fix ν = 1000 and λ = 1000 for the inverse gamma prior. v0 and v1 are tuned for each sparsity level to maximize performance. The smooth increase of the sparse rate follows the pruning rule in Algorithm 1, with D and the remaining schedule parameter set to 0.99 and 50, respectively. The increase of the sparse rate s is faster in the beginning and slower in the later phase to avoid destroying the network structure. The weight decay in the non-sparse layers, C, is set to 25.
As shown in Table 3, A-SGHMC-SA does not distinguish itself from A-SGHMC-EM and A-SGHMC when the sparse rate S is small, but outperforms the baselines given a large sparse rate. The pretrained model has accuracy 93.90%; however, the prediction performance can be improved to the state-of-the-art 94.27% with 50% sparsity. Most notably, we obtain 91.68% accuracy based on 27K parameters

†It is equivalent to setting the learning rate to 1e-4 when we do not multiply the likelihood by N/n.

Table 3: Resnet20 Compression on CIFAR10.
When S = 0.9, we fix v0 = 0.005, v1 = 1e-5; when S = 0.7, v0 = 0.1, v1 = 5e-5; when S = 0.5, v0 = 0.1, v1 = 5e-4; when S = 0.3, v0 = 0.5, v1 = 1e-3.

METHODS \ S    30%     50%     70%     90%
A-SGHMC        94.07   94.16   93.16   90.59
A-SGHMC-EM     94.18   94.19   93.41   91.26
SGHMC-SA       94.13   94.11   93.52   91.45
A-SGHMC-SA     94.23   94.27   93.74   91.68

(90% sparsity) in Resnet20. By contrast, targeted dropout obtained 91.48% accuracy based on 47K parameters (90% sparsity) of Resnet32 [Gomez et al., 2018], and BC-GHS achieves 91.0% accuracy based on 8M parameters (94.5% sparsity) of VGG models [Louizos et al., 2017]. We also notice that when simulated annealing is not used, as in SGHMC-SA, the performance decreases by 0.2% to 0.3%. When we use batch size 2000 and the inverse temperature schedule τ(k) = 20 × 1.01^k, A-SGHMC-SA still achieves roughly the same level, but the prediction of SGHMC-SA can be 1% lower than that of A-SGHMC-SA.

6 Conclusion

We propose a novel AEB method to adaptively sample from hierarchical Bayesian DNNs and optimize the spike-and-slab priors, which yields a class of scalable adaptive sampling algorithms for DNNs. We prove the convergence of this approach to the asymptotically correct distribution. By adaptively searching for and penalizing the over-fitted parameters, the proposed method achieves higher prediction accuracy than traditional SG-MCMC methods in both simulated examples and real applications, and shows more robustness towards adversarial attacks. Together with the magnitude-based weight pruning strategy and simulated annealing, the AEB-based method, A-SGHMC-SA, obtains the state-of-the-art performance in model compression.

Acknowledgments

We would like to thank Prof. Vinayak Rao, Dr. Yunfan Li and the reviewers for their insightful comments.
We acknowledge the support from the National Science Foundation (DMS-1555072, DMS-1736364, DMS-1821233 and DMS-1818674) and the GPU grant program from NVIDIA.

References

Christophe Andrieu, Éric Moulines, and Pierre Priouret. Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim., 44(1):283–312, 2005.

Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Berlin: Springer, 1990.

Changyou Chen, Nan Ding, and Lawrence Carin. On the Convergence of Stochastic Gradient MCMC Algorithms with High-order Integrators. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), pages 2278–2286, 2015.

Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), 2014.

Edward I. George and Robert E. McCulloch. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.

Soumya Ghosh, Jiayu Yao, and Finale Doshi-Velez. Structured Variational Learning of Bayesian Neural Networks with Horseshoe Priors. In Proc. of the International Conference on Machine Learning (ICML), 2018.

Aidan N. Gomez, Ivan Zhang, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. Targeted Dropout. In NIPS 2018 Workshop on Compact Deep Neural Networks with Industrial Applications, 2018.

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proc. of the International Conference on Machine Learning (ICML), pages III-1319–III-1327, 2013.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. ArXiv e-prints, December 2014.

Song Han, Huizi Mao, and William J. Dally.
Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. of the International Conference on Machine Learning (ICML), pages 448–456, 2015.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proc. of the International Conference on Learning Representation (ICLR), 2014.

Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, 2009.

Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks. In Proc. of the National Conference on Artificial Intelligence (AAAI), pages 1788–1794, 2016.

Yingzhen Li and Yarin Gal. Dropout inference in Bayesian neural networks with alpha-divergences. In Proc. of the International Conference on Machine Learning (ICML), 2017.

Faming Liang. Trajectory averaging for stochastic approximation MCMC algorithms. The Annals of Statistics, 38:2823–2856, 2010.

Faming Liang, Bochao Jia, Jingnan Xue, Qizhai Li, and Ye Luo. Bayesian Neural Networks for Selection of Drug Sensitive Genes. Journal of the American Statistical Association, 113(523):955–972, 2018.

Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime Neural Pruning. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), 2017.

Christos Louizos, Karen Ullrich, and Max Welling. Bayesian Compression for Deep Learning. In Proc.
of the Conference on Advances in Neural Information Processing Systems (NIPS), 2017.

Yi-An Ma, Tianqi Chen, and Emily B. Fox. A complete recipe for stochastic gradient MCMC. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), 2015.

Oren Mangoubi and Nisheeth K. Vishnoi. Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing. In Proc. of Conference on Learning Theory (COLT), 2018.

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning Convolutional Neural Networks for Resource Efficient Inference. In Proc. of the International Conference on Learning Representation (ICLR), 2017.

Nicolas Papernot, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Fartash Faghri, Alexander Matyasko, Karen Hambardzumyan, Yi-Lin Juang, Alexey Kurakin, Ryan Sheatsley, Abhibhav Garg, and Yen-Chen Lin. cleverhans v2.0.0: an adversarial machine learning library. ArXiv e-prints, October 2016.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. In Proc. of Conference on Learning Theory (COLT), June 2017.

Veronika Ročková and Edward I. George. EMVS: The EM Approach to Bayesian Variable Selection. Journal of the American Statistical Association, 109(506):828–846, 2014.

Veronika Ročková and Edward I. George. The Spike-and-Slab Lasso. Journal of the American Statistical Association, 113:431–444, 2018.

Yunus Saatci and Andrew G. Wilson. Bayesian GAN. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), pages 3622–3631, 2017.

Nahum Shimkin.
Introduction to Stochastic Approximation Algorithms, 2011. URL http://webee.technion.ac.il/shimkin/LCS11/ch5_SA.pdf.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

Yee Whye Teh, Alexandre Thiéry, and Sebastian Vollmer. Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17:1–33, 2016.

Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proc. of the International Conference on Machine Learning (ICML), pages 681–688, 2011.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv e-prints, August 2017.

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), December 2018.

Mao Ye and Yan Sun. Variable Selection via Penalized Neural Network: a Drop-Out-One Loss Approach. In Proc. of the International Conference on Machine Learning (ICML), volume 80, pages 5620–5629, 10–15 Jul 2018.

Yuchen Zhang, Percy Liang, and Moses Charikar. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proc. of Conference on Learning Theory (COLT), pages 1980–2022, 2017.

Hui Zou and Trevor Hastie. Regularization and Variable Selection via the Elastic Net.
Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.