{"title": "Bayesian Compression for Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3288, "page_last": 3298, "abstract": "Compression and computational efficiency in deep learning have become a problem of great significance. In this work, we argue that the most principled and effective way to attack this problem is by adopting a Bayesian point of view, where through sparsity inducing priors we prune large parts of the network. We introduce two novelties in this paper: 1) we use hierarchical priors to prune nodes instead of individual weights, and 2) we use the posterior uncertainties to determine the optimal fixed point precision to encode the weights. Both factors significantly contribute to achieving the state of the art in terms of compression rates, while still staying competitive with methods designed to optimize for speed or energy efficiency.", "full_text": "Bayesian Compression for Deep Learning\n\nChristos Louizos\n\nUniversity of Amsterdam\nTNO Intelligent Imaging\n\nc.louizos@uva.nl\n\nKaren Ullrich\n\nUniversity of Amsterdam\n\nk.ullrich@uva.nl\n\nMax Welling\n\nUniversity of Amsterdam\n\nCIFAR\u2217\n\nm.welling@uva.nl\n\nAbstract\n\nCompression and computational ef\ufb01ciency in deep learning have become a problem\nof great signi\ufb01cance. In this work, we argue that the most principled and effective\nway to attack this problem is by adopting a Bayesian point of view, where through\nsparsity inducing priors we prune large parts of the network. We introduce two\nnovelties in this paper: 1) we use hierarchical priors to prune nodes instead of\nindividual weights, and 2) we use the posterior uncertainties to determine the\noptimal \ufb01xed point precision to encode the weights. 
Both factors significantly contribute to achieving the state of the art in terms of compression rates, while still staying competitive with methods designed to optimize for speed or energy efficiency.

1 Introduction

While deep neural networks have become extremely successful in a wide range of applications, often exceeding human performance, they remain difficult to apply in many real world scenarios. For instance, making billions of predictions per day comes with substantial energy costs given the energy consumption of common Graphical Processing Units (GPUs). Also, real-time predictions are often about a factor of 100 away in terms of speed from what deep NNs can deliver, and sending NNs with millions of parameters through band-limited channels is still impractical. As a result, running them on hardware-limited devices such as smart phones, robots or cars requires substantial improvements on all of these issues. For all those reasons, compression and efficiency have become a topic of interest in the deep learning community.

While all of these issues are certainly related, compression and performance optimizing procedures might not always be aligned. As an illustration, consider the convolutional layers of AlexNet, which account for only 4% of the parameters but 91% of the computation [65]. Compressing these layers will not contribute much to the overall memory footprint.

There is a variety of approaches to address these problem settings. However, most methods have the common strategy of reducing both the neural network structure and the effective fixed point precision for each weight. A justification for the former is the finding that NNs suffer from significant parameter redundancy [14].
Methods in this line of thought are network pruning, where unnecessary connections are being removed [38, 24, 21], or student-teacher learning where a large network is used to train a significantly smaller network [5, 26].

From a Bayesian perspective, network pruning and reducing bit precision for the weights are aligned with achieving high accuracy, because Bayesian methods search for the optimal model structure (which leads to pruning with sparsity inducing priors), and reward uncertain posteriors over parameters through the bits-back argument [27] (which leads to removing insignificant bits). This relation is made explicit in the MDL principle [20], which is known to be related to Bayesian inference.

*Canadian Institute For Advanced Research.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper we will use the variational Bayesian approximation for Bayesian inference, which has also been explicitly interpreted in terms of model compression [27]. By employing sparsity inducing priors for hidden units (and not individual weights) we can prune neurons including all their ingoing and outgoing weights. This avoids more complicated and inefficient coding schemes needed for pruning or vector quantizing individual weights. As an additional Bayesian bonus we can use the variational posterior uncertainty to assess which bits are significant and remove the ones which fluctuate too much under approximate posterior sampling. From this we derive the optimal fixed point precision per layer, which is still practical on chip.

2 Variational Bayes and Minimum Description Length

A fundamental theorem in information theory is the minimum description length (MDL) principle [20]. It relates to compression directly in that it defines the best hypothesis to be the one that communicates the sum of the model (complexity cost L_C) and the data misfit (error cost L_E) with the minimum number of bits [57, 58]. It is well understood that variational inference can be reinterpreted from an MDL point of view [54, 69, 27, 29, 19]. More specifically, assume that we are presented with a dataset D that consists of N input-output pairs {(x_1, y_1), . . . , (x_N, y_N)}. Let p(D|w) = ∏_{i=1}^{N} p(y_i|x_i, w) be a parametric model, e.g. a deep neural network, that maps inputs x to their corresponding outputs y using parameters w governed by a prior distribution p(w). In this scenario, we wish to approximate the intractable posterior distribution p(w|D) = p(D|w)p(w)/p(D) with a fixed-form approximate posterior q_φ(w) by optimizing the variational parameters φ according to:

L(φ) = E_{q_φ(w)}[log p(D|w)] + E_{q_φ(w)}[log p(w)] + H(q_φ(w)),   (1)

where the first term is the error cost L_E and the last two terms together form the complexity cost L_C; H(·) denotes the entropy and L(φ) is known as the evidence-lower-bound (ELBO) or negative variational free energy. As indicated in eq. 1, L(φ) naturally decomposes into a minimum cost for communicating the targets {y_n}_{n=1}^{N} under the assumption that the sender and receiver agreed on a prior p(w) and that the receiver knows the inputs {x_n}_{n=1}^{N} and the form of the parametric model.

By using sparsity inducing priors for groups of weights that feed into a neuron, the Bayesian mechanism will start pruning hidden units that are not strictly necessary for prediction, thus achieving compression. But there is also a second mechanism by which Bayes can help us compress.
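To make the decomposition in eq. 1 concrete, the following numpy sketch (a toy conjugate-Gaussian model of our own choosing, not from the paper) estimates the error cost L_E by Monte Carlo, computes the complexity cost L_C analytically, and checks the bound against the exact log-evidence, which this model admits in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i ~ N(w, 1) with prior w ~ N(0, 1) and
# approximate posterior q(w) = N(mu, sigma^2).
y = rng.normal(1.5, 1.0, size=20)
N = y.size

def elbo(mu, sigma, n_samples=100_000):
    """Monte Carlo estimate of L(phi) = L_E + L_C (cf. eq. 1)."""
    w = rng.normal(mu, sigma, size=n_samples)                        # w ~ q(w)
    sq = ((y[None, :] - w[:, None]) ** 2).sum(axis=1)
    L_E = (-0.5 * N * np.log(2 * np.pi) - 0.5 * sq).mean()           # E_q[log p(D|w)]
    cross = -0.5 * np.log(2 * np.pi) - 0.5 * (mu ** 2 + sigma ** 2)  # E_q[log p(w)]
    H = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)                  # entropy H(q)
    return L_E + cross + H                                           # L_C = cross + H

# Exact posterior N(m, v) and log-evidence for this conjugate model;
# the ELBO attains log_Z when q equals the exact posterior.
v = 1.0 / (1.0 + N)
m = v * y.sum()
log_Z = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(1.0 + N)
         - 0.5 * (y @ y - y.sum() ** 2 / (1.0 + N)))
```

With q set to the exact posterior the estimate matches log_Z up to Monte Carlo noise; any other choice of (mu, sigma) gives a strictly lower value, which is the sense in which optimizing eq. 1 trades off data misfit against model complexity.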
By explicitly entertaining noisy weight encodings through q_φ(w) we can benefit from the bits-back argument [27, 29] due to the entropy term; this is in contrast to infinitely precise weights, which lead to H(δ(w)) = −∞². Nevertheless, in practice the data misfit term L_E is intractable for neural network models under a noisy weight encoding, so as a solution Monte Carlo integration is usually employed. Continuous q_φ(w) allow for the reparametrization trick [34, 56]. Here, we replace sampling from q_φ(w) by a deterministic function of the variational parameters φ and random samples from a noise variable ε:

L(φ) = E_{p(ε)}[log p(D|f(φ, ε))] + E_{q_φ(w)}[log p(w)] + H(q_φ(w)),   (2)

where w = f(φ, ε). By applying this trick, we obtain unbiased stochastic gradients of the ELBO with respect to the variational parameters φ, thus resulting in a standard optimization problem that is fit for stochastic gradient ascent. The efficiency of the gradient estimator resulting from eq. 2 can be further improved for neural networks by utilizing local reparametrizations [35] (which we will use in our experiments); they provide variance reduction in an efficient way by locally marginalizing the weights at each layer and instead sampling the distribution of the pre-activations.

3 Related Work

One of the earliest ideas and most direct approaches to tackling efficiency is pruning. Originally introduced by [38], pruning has recently been demonstrated to be applicable to modern architectures [25, 21]. It has been demonstrated that an overwhelming amount of up to 99.5% of parameters can be pruned in common architectures. There have been quite a few encouraging results obtained by (empirical) Bayesian approaches that employ weight pruning [19, 7, 50, 67, 49].
Nevertheless, weight pruning is in general inefficient for compression since the matrix format of the weights is not taken into consideration; therefore the Compressed Sparse Column (CSC) format has to be employed. Moreover, note that in conventional CNNs most flops are used by the convolution operation. Inspired by this observation, several authors proposed pruning schemes that take these considerations into account [70, 71] or even go as far as efficiency aware architectures to begin with [31, 15, 30]. From the Bayesian viewpoint, similar pruning schemes have been explored in [45, 51, 37, 33].

Given an optimal architecture, NNs can further be compressed by quantization. More precisely, there are two common techniques. First, the set of accessible weights can be reduced drastically. As an extreme example, [13, 46, 55, 72] and [11] trained NNs to use only binary or ternary weights with floating point gradients. This approach, however, is in need of significantly more parameters than its ordinary counterparts. Work by [18] explores various techniques beyond binary quantization: k-means quantization, product quantization and residual quantization. Later studies extend this set to optimal fixed point [42] and hashing quantization [10]. [25] apply k-means clustering and consequent center training. From a practical point of view, however, all these are fairly impractical at test time. For the computation of each feature map in a net, the original weight matrix must be reconstructed from the indexes in the matrix and a codebook that contains all the original weights. This is an expensive operation, and this is why some studies propose a different approach than set quantization. Precision quantization simply reduces the bit size per weight.

²In practice this term is a large constant determined by the weight precision.
This has a great advantage over set quantization at inference time since feature maps can simply be computed with lower precision weights. Several studies show that this has little to no effect on network accuracy when using 16-bit weights [47, 22, 12, 68, 9]. Somewhat orthogonal to the above discussion, but certainly relevant, are approaches that customize the implementation of CNNs for hardware limited devices [30, 4, 60].

4 Bayesian compression with scale mixtures of normals

Consider the following prior over a parameter w where its scale z is governed by a distribution p(z):

z ∼ p(z);    w ∼ N(w; 0, z²),   (3)

with z² serving as the variance of the zero-mean normal distribution over w. By treating the scales of w as random variables we can recover marginal prior distributions over the parameters that have heavier tails and more mass at zero; this subsequently biases the posterior distribution over w to be sparse. This family of distributions is known as scale-mixtures of normals [6, 2] and it is quite general, as a lot of well known sparsity inducing distributions are special cases.

One example of the aforementioned framework is the spike-and-slab distribution [48], the gold standard for sparse Bayesian inference. Under the spike-and-slab, the mixing density of the scales is a Bernoulli distribution, thus the marginal p(w) has a delta "spike" at zero and a continuous "slab" over the real line. Unfortunately, this prior leads to a computationally expensive inference since we have to explore a space of 2^M models, where M is the number of the model parameters. Dropout [28, 64], one of the most popular regularization techniques for neural networks, can be interpreted as positing a spike and slab distribution over the weights where the variance of the "slab" is zero [17, 43]. Another example is the Laplace distribution, which arises by considering p(z²) = Exp(λ).
The mode of the posterior distribution under a Laplace prior is known as the Lasso [66] estimator and has been previously used for sparsifying neural networks in [70, 59]. While computationally simple, the Lasso estimator is prone to "shrinking" large signals [8] and only provides point estimates about the parameters. As a result it does not provide uncertainty estimates, it can potentially overfit and, according to the bits-back argument, is inefficient for compression.

For these reasons, in this paper we will tackle the problem of compression and efficiency in neural networks by adopting a Bayesian treatment and inferring an approximate posterior distribution over the parameters under a scale mixture prior. We will consider two choices for the prior over the scales p(z): the hyperparameter free log-uniform prior [16, 35] and the half-Cauchy prior, which results in a horseshoe [8] distribution. Both of these distributions correspond to a continuous relaxation of the spike-and-slab prior, and we provide a brief discussion on their shrinkage properties in Appendix C.

4.1 Reparametrizing variational dropout for group sparsity

One potential choice for p(z) is the improper log-uniform prior [35]: p(z) ∝ |z|⁻¹. It turns out that we can recover the log-uniform prior over the weights w if we marginalize over the scales z:

p(w) ∝ ∫ (1/|z|) N(w|0, z²) dz = 1/|w|.   (4)

This alternative parametrization of the log-uniform prior is known in the statistics literature as the normal-Jeffreys prior and has been introduced by [16]. This formulation allows us to "couple" the scales of weights that belong to the same group (e.g.
neuron or feature map), by simply sharing the corresponding scale variable z in the joint prior³:

p(W, z) ∝ ∏_{i}^{A} (1/|z_i|) ∏_{i,j}^{A,B} N(w_ij|0, z_i²),   (5)

where W is the weight matrix of a fully connected neural network layer, with A being the dimensionality of the input and B the dimensionality of the output. Now consider performing variational inference with a joint approximate posterior parametrized as follows:

q_φ(W, z) = ∏_{i=1}^{A} N(z_i|μ_{z_i}, μ²_{z_i} α_i) ∏_{i,j}^{A,B} N(w_ij|z_i μ_ij, z_i² σ²_ij),   (6)

where α_i is the dropout rate [64, 35, 49] of the given group. As explained in [35, 49], the multiplicative parametrization of the approximate posterior over z suffers from high variance gradients; therefore we will follow [49] and re-parametrize it in terms of σ²_{z_i} = μ²_{z_i} α_i, and optimize w.r.t. σ²_{z_i}. The lower bound under this prior and approximate posterior becomes:

L(φ) = E_{q_φ(z) q_φ(W|z)}[log p(D|W)] − E_{q_φ(z)}[KL(q_φ(W|z)||p(W|z))] − KL(q_φ(z)||p(z)).   (7)

Under this particular variational posterior parametrization the negative KL-divergence from the conditional prior p(W|z) to the approximate posterior q_φ(W|z) is independent of z, since the z_i² factors cancel:

KL(q_φ(W|z)||p(W|z)) = (1/2) ∑_{i,j}^{A,B} ( log(z_i² / (z_i² σ²_ij)) + (z_i² σ²_ij)/z_i² + (z_i² μ²_ij)/z_i² − 1 ) = (1/2) ∑_{i,j}^{A,B} ( −log σ²_ij + σ²_ij + μ²_ij − 1 ).   (8)

This independence can be better understood if we consider a non-centered parametrization of the prior [53]. More specifically, consider reparametrizing the weights as w̃_ij = w_ij / z_i; this will then result in p(W|z)p(z) = p(W̃)p(z), where p(W̃) = ∏_{i,j} N(w̃_ij|0, 1) and W = diag(z) W̃. Now if we perform variational inference under the p(W̃)p(z) prior with an approximate posterior that has the form of q_φ(W̃, z) = q_φ(W̃) q_φ(z), with q_φ(W̃) = ∏_{i,j} N(w̃_ij|μ_ij, σ²_ij), then we see that we arrive at the same expressions for the negative KL-divergence from the prior to the approximate posterior. Finally, the negative KL-divergence from the normal-Jeffreys scale prior p(z) to the Gaussian variational posterior q_φ(z) depends only on the "implied" dropout rate, α_i = σ²_{z_i}/μ²_{z_i}, and takes the following form [49]:

−KL(q_φ(z)||p(z)) ≈ ∑_{i}^{A} ( k₁ σ(k₂ + k₃ log α_i) − 0.5 m(−log α_i) − k₁ ),   (9)

where σ(·), m(·) are the sigmoid and softplus functions respectively⁴ and k₁ = 0.63576, k₂ = 1.87320, k₃ = 1.48695. We can now prune entire groups of parameters by simply specifying a threshold for the variational dropout rate of the corresponding group, e.g. log α_i = (log σ²_{z_i} − log μ²_{z_i}) ≥ t. It should be mentioned that this prior parametrization readily allows for a more flexible marginal posterior over the weights, as we now have a compound distribution, q_φ(W) = ∫ q_φ(W|z) q_φ(z) dz; this is in contrast to the original parametrization and the Gaussian approximations employed by [35, 49].

³Strictly speaking the result of eq. 4 only holds when each weight has its own scale and not when that scale is shared across multiple weights. Nevertheless, in practice we obtain a prior that behaves in a similar way, i.e.
it biases the variational posterior to be sparse.

⁴σ(x) = (1 + exp(−x))⁻¹, m(x) = log(1 + exp(x))

Furthermore, this approach generalizes the low variance additive parametrization of variational dropout proposed for weight sparsity at [49] to group sparsity (which was left as an open question at [49]) in a principled way.

At test time, in order to have a single feedforward pass we replace the distribution over W at each layer with a single weight matrix, the masked variational posterior mean:

Ŵ = diag(m) E_{q(z)q(W̃)}[diag(z) W̃] = diag(m ⊙ μ_z) M_W,   (10)

where m is a binary mask determined according to the group variational dropout rate and M_W are the means of q_φ(W̃). We further use the variational posterior marginal variances⁵ for this particular posterior approximation:

V(w_ij)_NJ = σ²_{z_i}(σ²_ij + μ²_ij) + σ²_ij μ²_{z_i},   (11)

to assess the bit precision of each weight in the weight matrix. More specifically, we employed the mean variance across the weight matrix Ŵ to compute the unit round off necessary to represent the weights. This method will give us the amount of significant bits, and by adding 3 exponent and 1 sign bits we arrive at the final bit precision for the entire weight matrix Ŵ⁶. We provide more details at Appendix B.

4.2 Group horseshoe with half-Cauchy scale priors

Another choice for p(z) is a proper half-Cauchy distribution: C⁺(0, s) = 2(sπ(1 + (z/s)²))⁻¹; it induces a horseshoe prior [8] distribution over the weights, which is a well known sparsity inducing prior in the statistics literature.
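As a quick empirical illustration of this shrinkage behaviour, the following numpy sketch (our own, not from the paper) samples weights under a unit half-Cauchy local scale and compares them with standard Gaussian draws; the horseshoe draws exhibit both more mass near zero and much heavier tails:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Non-centered horseshoe draws (local part only, global scale s = 1 for
# simplicity): w = w_tilde * z_tilde, w_tilde ~ N(0, 1), z_tilde ~ C+(0, 1).
z_tilde = np.abs(rng.standard_cauchy(n))
w_hs = rng.normal(size=n) * z_tilde
w_gauss = rng.normal(size=n)

frac_small_hs = np.mean(np.abs(w_hs) < 0.1)   # extra mass near zero
frac_small_g = np.mean(np.abs(w_gauss) < 0.1)
frac_big_hs = np.mean(np.abs(w_hs) > 10.0)    # heavy tails let signals escape
frac_big_g = np.mean(np.abs(w_gauss) > 10.0)
```

This is exactly the "global-local" behaviour exploited for group pruning: most scales collapse towards zero while a few groups remain large.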
More formally, the prior hierarchy over the weights is expressed as (in a non-centered parametrization):

s ∼ C⁺(0, τ₀);    z̃_i ∼ C⁺(0, 1);    w̃_ij ∼ N(0, 1);    w_ij = w̃_ij z̃_i s,   (12)

where τ₀ is the free parameter that can be tuned for specific desiderata. The idea behind the horseshoe is that of "global-local" shrinkage; the global scale variable s pulls all of the variables towards zero, whereas the heavy tailed local variables z̃_i can compensate and allow for some weights to escape. Instead of directly working with the half-Cauchy priors we will employ a decomposition of the half-Cauchy that relies upon (inverse) gamma distributions [52], as this will allow us to compute the negative KL-divergence from the scale prior p(z) to an approximate log-normal scale posterior q_φ(z) in closed form (the derivation is given in Appendix D). More specifically, we have that the half-Cauchy prior can be expressed in a non-centered parametrization as:

p(β̃) = IG(0.5, 1);    p(α̃) = G(0.5, k²);    z² = α̃ β̃,   (13)

where IG(·,·), G(·,·) correspond to the inverse Gamma and Gamma distributions in the scale parametrization, and z follows a half-Cauchy distribution with scale k. Therefore we will re-express the whole hierarchy as:

s_b ∼ IG(0.5, 1);  s_a ∼ G(0.5, τ₀²);  β̃_i ∼ IG(0.5, 1);  α̃_i ∼ G(0.5, 1);  w̃_ij ∼ N(0, 1);  w_ij = w̃_ij √(s_a s_b α̃_i β̃_i).   (14)

It should be mentioned that the improper log-uniform prior is the limiting case of the horseshoe prior when the shapes of the (inverse) Gamma hyperpriors on α̃_i, β̃_i go to zero [8]. In fact, several well known shrinkage priors can be expressed in this form by altering the shapes of the (inverse) Gamma hyperpriors [3].
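The decomposition in eq. 13 can be checked numerically. The sketch below (ours; it assumes the scale parametrization of the Gamma family, with the inverse Gamma sampled as the reciprocal of a Gamma) draws z = √(α̃ β̃) and compares its empirical quantiles against those of a half-Cauchy C⁺(0, k), whose p-quantile is k·tan(πp/2):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 500_000, 2.0

# Eq. 13: alpha ~ G(1/2, scale k^2), beta ~ IG(1/2, scale 1),
# then z = sqrt(alpha * beta) should follow C+(0, k).
alpha = rng.gamma(shape=0.5, scale=k ** 2, size=n)
beta = 1.0 / rng.gamma(shape=0.5, scale=1.0, size=n)  # IG(1/2, 1) via 1/Gamma
z = np.sqrt(alpha * beta)

# Reference quantiles of the half-Cauchy C+(0, k)
med_true = k * np.tan(np.pi * 0.25)    # median = k
q75_true = k * np.tan(np.pi * 0.375)   # 75th percentile ≈ 2.414 k
med_emp = np.median(z)
q75_emp = np.quantile(z, 0.75)
```

Intuitively this works because √α̃ is half-normal with scale ∝ k and √(1/β̃) is half-normal with unit-type scale, and the ratio of two half-normals is half-Cauchy.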
For the variational posterior we will employ the following mean field approximation:

q_φ(s_b, s_a, β̃) = LN(s_b|μ_{s_b}, σ²_{s_b}) LN(s_a|μ_{s_a}, σ²_{s_a}) ∏_{i}^{A} LN(β̃_i|μ_{β̃_i}, σ²_{β̃_i}),   (15)

q_φ(α̃, W̃) = ∏_{i}^{A} LN(α̃_i|μ_{α̃_i}, σ²_{α̃_i}) ∏_{i,j}^{A,B} N(w̃_ij|μ_{w̃_ij}, σ²_{w̃_ij}),   (16)

where LN(·,·) is a log-normal distribution. It should be mentioned that a similar form of non-centered variational inference for the horseshoe has also been successfully employed for undirected models at [32]. Notice that we can also apply local reparametrizations [35] when we are sampling √(α̃_i β̃_i) and √(s_a s_b) by exploiting properties of the log-normal distribution⁷ and thus forming the implied:

z̃_i = √(α̃_i β̃_i) ∼ LN(μ_{z̃_i}, σ²_{z̃_i});    s = √(s_a s_b) ∼ LN(μ_s, σ²_s);   (17)

μ_{z̃_i} = ½(μ_{α̃_i} + μ_{β̃_i});  σ²_{z̃_i} = ¼(σ²_{α̃_i} + σ²_{β̃_i});  μ_s = ½(μ_{s_a} + μ_{s_b});  σ²_s = ¼(σ²_{s_a} + σ²_{s_b}).   (18)

As a threshold rule for group pruning we will use the negative log-mode⁸ of the local log-normal r.v. z_i = s z̃_i, i.e. prune when (σ²_{z_i} − μ_{z_i}) ≥ t, with μ_{z_i} = μ_{z̃_i} + μ_s and σ²_{z_i} = σ²_{z̃_i} + σ²_s. This ignores dependencies among the z_i elements induced by the common scale s, but nonetheless we found that it works well in practice. Similarly with the group normal-Jeffreys prior, we will replace the distribution over W at each layer with the masked variational posterior mean during test time:

Ŵ = diag(m) E_{q(z)q(W̃)}[diag(z) W̃] = diag(m ⊙ exp(μ_z + ½σ²_z)) M_W,   (19)

where m is a binary mask determined according to the aforementioned threshold, M_W are the means of q(W̃) and μ_z, σ²_z are the means and variances of the local log-normals over z_i. Furthermore, similarly to the group normal-Jeffreys approach, we will use the variational posterior marginal variances:

V(w_ij)_HS = (exp(σ²_{z_i}) − 1) exp(2μ_{z_i} + σ²_{z_i})(σ²_ij + μ²_ij) + σ²_ij exp(2μ_{z_i} + σ²_{z_i}),   (20)

to compute the final bit precision for the entire weight matrix Ŵ.

⁵V(w_ij) = V(z_i w̃_ij) = V(z_i)(E[w̃_ij]² + V(w̃_ij)) + V(w̃_ij) E[z_i]².

⁶Notice that the fact that we are using mean-field variational approximations (which we chose for simplicity) can potentially underestimate the variance, thus lead to higher bit precisions for the weights. We leave the exploration of more involved posteriors for future work.

5 Experiments

We validated the compression and speed-up capabilities of our models on the well-known architectures of LeNet-300-100 [39] and LeNet-5-Caffe⁹ on MNIST [40] and, similarly with [49], VGG [61]¹⁰ on CIFAR 10 [36]. The groups of parameters were constructed by coupling the scale variables for each filter for the convolutional layers and for each input neuron for the fully connected layers. We provide the algorithms that describe the forward pass using local reparametrizations for fully connected and convolutional layers with each of the employed approximate posteriors at appendix F.
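The group pruning rule of Section 4.2 can be sketched in a few lines (our own numpy sketch with hypothetical helper names; the arrays hold the variational log-normal parameters per group):

```python
import numpy as np

def group_scales(mu_a, sig2_a, mu_b, sig2_b, mu_sa, sig2_sa, mu_sb, sig2_sb):
    """Log-normal parameters of z_i = s * sqrt(alpha_i * beta_i) (cf. eqs. 17-18)."""
    mu_z = 0.5 * (mu_a + mu_b) + 0.5 * (mu_sa + mu_sb)
    sig2_z = 0.25 * (sig2_a + sig2_b) + 0.25 * (sig2_sa + sig2_sb)
    return mu_z, sig2_z

def prune_and_mask(mu_z, sig2_z, M_W, t=3.0):
    """Keep groups whose negative log-mode (sig2_z - mu_z) is below t,
    then form the masked posterior mean of eq. 19."""
    keep = (sig2_z - mu_z) < t
    scale = keep * np.exp(mu_z + 0.5 * sig2_z)   # m ⊙ E[z_i]
    return keep, scale[:, None] * M_W

# Example: one healthy group and one whose posterior scale has collapsed
mu_z, sig2_z = group_scales(np.array([0.0, -10.0]), np.array([0.1, 0.1]),
                            np.array([0.0, -10.0]), np.array([0.1, 0.1]),
                            0.0, 0.1, 0.0, 0.1)
keep, W_hat = prune_and_mask(mu_z, sig2_z, np.ones((2, 3)))
# keep[0] is True (group survives); keep[1] is False (its row of W_hat is zeroed)
```

The threshold t is the manually inspected cut between the "signal" and "noise" clusters mentioned above; the value 3.0 here is purely illustrative.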
For the\nhorseshoe prior we set the scale \u03c40 of the global half-Cauchy prior to a reasonably small value, e.g.\n\u03c40 = 1e \u2212 5. This further increases the prior mass at zero, which is essential for sparse estimation\nand compression. We also found that constraining the standard deviations as described at [44] and\n\u201cwarm-up\" [62] helps in avoiding bad local optima of the variational objective. Further details about\nthe experimental setup can be found at Appendix A. Determining the threshold for pruning can be\neasily done with manual inspection as usually there are two well separated clusters (signal and noise).\nWe provide a sample visualization at Appendix E.\n\n5.1 Architecture learning & bit precisions\n\nWe will \ufb01rst demonstrate the group sparsity capabilities of our methods by illustrating the learned\narchitectures at Table 1, along with the inferred bit precision per layer. As we can observe, our\nmethods infer signi\ufb01cantly smaller architectures for the LeNet-300-100 and LeNet-5-Caffe, compared\nto Sparse Variational Dropout, Generalized Dropout and Group Lasso. Interestingly, we observe\nthat for the VGG network almost all of big 512 feature map layers are drastically reduced to around\n10 feature maps whereas the initial layers are mostly kept intact. Furthermore, all of the Bayesian\nmethods considered require far fewer than the standard 32 bits per-layer to represent the weights,\nsometimes even allowing for 5 bit precisions.\n\n7The product of log-normal r.v.s is another log-normal and a power of a log-normal r.v. is another log-normal.\n8Empirically, it slightly better separates the scales compared to the negative log-mean \u2212(\u00b5zi + 0.5\u03c32\n9https://github.com/BVLC/caffe/tree/master/examples/mnist\n10The adapted CIFAR 10 version described at http://torch.ch/blog/2015/07/30/cifar.html.\n\nzi ).\n\n6\n\n\fTable 1: Learned architectures with Sparse VD [49], Generalized Dropout (GD) [63] and Group\nLasso (GL) [70]. 
Bayesian Compression (BC) with group normal-Jeffreys (BC-GNJ) and group horseshoe (BC-GHS) priors correspond to the proposed models. We show the amount of neurons left after pruning along with the average bit precisions for the weights at each layer.

Network & size                     Method     Pruned architecture                                Bit-precision
LeNet-300-100                      Sparse VD  512-114-72                                         8-11-14
784-300-100                        BC-GNJ     278-98-13                                          8-9-14
                                   BC-GHS     311-86-14                                          13-11-10
LeNet-5-Caffe                      Sparse VD  14-19-242-131                                      13-10-8-12
20-50-800-500                      GD         7-13-208-16                                        -
                                   GL         3-12-192-500                                       -
                                   BC-GNJ     8-13-88-13                                         18-10-7-9
                                   BC-GHS     5-10-76-16                                         10-10-14-13
VGG                                BC-GNJ     63-64-128-128-245-155-63-26-24-20-14-12-11-11-15   10-10-10-10-8-8-8-5-5-5-5-5-6-7-11
(2×64)-(2×128)-(3×256)-(8×512)     BC-GHS     51-62-125-128-228-129-38-13-9-6-5-6-6-6-20         11-12-9-14-10-8-5-5-6-6-6-8-11-17-10

5.2 Compression Rates

For the actual compression task we compare our method to current work in three different scenarios: (i) compression achieved only by pruning; here, for non-group methods we use the CSC format to store parameters; (ii) compression based on the former but with reduced bit precision per layer (only for the weights); and (iii) the maximum compression rate as proposed by [25]. We believe these to be relevant scenarios because (i) can be applied with already existing frameworks such as Tensorflow [1], and (ii) is a practical scheme given that upcoming GPUs and frameworks will be designed to work with low and mixed precision arithmetics [41, 23]. For (iii), we perform k-means clustering on the weights with k=32 and consequently store a weight index that points to a codebook of available weights. Note that the latter achieves the highest compression rate but is fairly impractical at test time since the original matrix needs to be restored for each layer.

Table 2: Compression results for our methods. "DC" corresponds to the Deep Compression method introduced at [25], "DNS" to the method of [21] and "SWS" to the Soft-Weight Sharing of [67]. Numbers marked with * are best case guesses. Compression rates are shown as rate (error %); the three rightmost columns correspond to scenarios (i)-(iii).

Model (Original Error %)  Method     |w≠0|/|w| %   Pruning     Fast Prediction   Maximum Compression
LeNet-300-100 (1.6)       DC         8.0           6 (1.6)     -                 40 (1.6)
                          DNS        1.8           28* (2.0)   -                 -
                          SWS        4.3           12* (1.9)   -                 64 (1.9)
                          Sparse VD  2.2           21 (1.8)    84 (1.8)          113 (1.8)
                          BC-GNJ     10.8          9 (1.8)     36 (1.8)          58 (1.8)
                          BC-GHS     10.6          9 (1.8)     23 (1.9)          59 (2.0)
LeNet-5-Caffe (0.9)       DC         8.0           6* (0.7)    -                 39 (0.7)
                          DNS        0.9           55* (0.9)   -                 108 (0.9)
                          SWS        0.5           100* (1.0)  -                 162 (1.0)
                          Sparse VD  0.7           63 (1.0)    228 (1.0)         365 (1.0)
                          BC-GNJ     0.9           108 (1.0)   361 (1.0)         573 (1.0)
                          BC-GHS     0.6           156 (1.0)   419 (1.0)         771 (1.0)
VGG (8.4)                 BC-GNJ     6.7           14 (8.6)    56 (8.8)          95 (8.6)
                          BC-GHS     5.5           18 (9.0)    59 (9.0)          116 (9.2)

As we can observe at Table 2, our methods are competitive with the state-of-the-art for LeNet-300-100 while offering significantly better compression rates on the LeNet-5-Caffe architecture, without any loss in accuracy. Do note that group sparsity and weight sparsity can be combined so as to further prune some weights when a particular group is not removed; thus we can potentially further boost compression performance at e.g. LeNet-300-100.
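The back-of-the-envelope arithmetic behind scenarios (i) and (ii) can be sketched as follows (our own illustrative helper; it deliberately ignores the sparse-index and codebook overhead that the actual measurements account for, and the weight counts are made up):

```python
def compression_rate(n_orig, n_kept, bits=32, orig_bits=32):
    """Compression factor vs. a dense n_orig-weight matrix at orig_bits each.
    Rough sketch only: ignores CSC index and codebook storage."""
    return (n_orig * orig_bits) / (n_kept * bits)

# Scenario (i): keep 1% of the weights, still stored at 32 bits
rate_pruned = compression_rate(1_000_000, 10_000)           # -> 100.0
# Scenario (ii): same pruning, weights stored at 5 bits per weight
rate_low_bit = compression_rate(1_000_000, 10_000, bits=5)  # -> 640.0
```

This illustrates why the learned per-layer bit precisions of Table 1 multiply the gains from pruning alone, and why group pruning is convenient: the surviving weights form a smaller dense matrix, so no per-weight index needs to be stored at all.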
For the VGG network we observe that training from a random initialization\nyielded consistently less accuracy (around 1%-2% less) compared to initializing the means of the\napproximate posterior from a pretrained network, similarly with [49], thus we only report the latter\nresults11. After initialization we trained the VGG network regularly for 200 epochs using Adam with\nthe default hyperparameters. We observe a small drop in accuracy for the \ufb01nal models when using\nthe deterministic version of the network for prediction, but nevertheless averaging across multiple\nsamples restores the original accuracy. Note, that in general we can maintain the original accuracy on\nVGG without sampling by simply \ufb01netuning with a small learning rate, as done at [49]. This will\nstill induce (less) sparsity but unfortunately it does not lead to good compression as the bit precision\nremains very high due to not appropriately increasing the marginal variances of the weights.\n\n5.3 Speed and energy consumption\n\nWe demonstrate that our method is competitive with [70], denoted as GL, a method that explicitly\nprunes convolutional kernels to reduce compute time. We measure the time and energy consumption\nof one forward pass of a mini-batch with batch size 8192 through LeNet-5-Caffe. We average over 104\nforward passes and all experiments were run with Tensor\ufb02ow 1.0.1, cuda 8.0 and respective cuDNN.\nWe apply 16 CPUs run in parallel (CPU) or a Titan X (GPU). Note that we only use the pruned\narchitecture as lower bit precision would further increase the speed-up but is not implementable in\nany common framework. Further, all methods we compare to in the latter experiments would barely\nshow an improvement at all since they do not learn to prune groups but only parameters. In \ufb01gure 1\nwe present our results. 
As expected, the largest effect on the speed-up is caused by GPU usage; both our models and the best competing models reach a speed-up factor of around 8×. We can further save about 3× in energy costs by applying our architecture instead of the original one on a GPU. For larger networks the speed-up is even higher: for the VGG experiments with batch size 256 we obtain a speed-up factor of 51×.

Figure 1: Left: Average time a batch of 8192 samples takes to pass through LeNet-5-Caffe. Numbers on top of the bars represent the speed-up factor relative to the CPU implementation of the original network. Right: Energy consumption of the GPU for the same process (when run on GPU).

6 Conclusion

We introduced Bayesian compression, a way to tackle efficiency and compression in deep neural networks in a unified and principled way. Our proposed methods allow for theoretically principled compression of neural networks and improved energy efficiency with reduced computation, while naturally learning the bit precision for each weight. This serves as a strong argument in favor of Bayesian methods for neural networks when we are concerned with compression and speed-up.

11 We also tried to finetune the same network with Sparse VD, but unfortunately it increased the error considerably (around 3% extra error), therefore we do not report those results.

Acknowledgments
We would like to thank Dmitry Molchanov, Dmitry Vetrov, Klamer Schutte and Dennis Koelma for valuable discussions and feedback. This research was supported by TNO, NWO and Google.

References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society.
Series B (Methodological), pages 99–102, 1974.

[3] A. Armagan, M. Clyde, and D. B. Dunson. Generalized beta mixtures of Gaussians. In Advances in Neural Information Processing Systems, pages 523–531, 2011.

[4] E. Azarkhish, D. Rossi, I. Loi, and L. Benini. Neurostream: Scalable and energy efficient deep learning with smart memory cubes. arXiv preprint arXiv:1701.06420, 2017.

[5] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.

[6] E. Beale, C. Mallows, et al. Scale mixing of symmetric distributions with zero means. The Annals of Mathematical Statistics, 30(4):1145–1151, 1959.

[7] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015.

[8] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.

[9] S. Chai, A. Raghavan, D. Zhang, M. Amer, and T. Shields. Low precision neural networks using subband decomposition. arXiv preprint arXiv:1703.08595, 2017.

[10] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural networks. arXiv preprint arXiv:1506.04449, 2015.

[11] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.

[12] M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.

[13] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3105–3113, 2015.

[14] M. Denil, B.
Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.

[15] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference complexity. arXiv preprint arXiv:1703.08651, 2017.

[16] M. A. Figueiredo. Adaptive sparseness using Jeffreys' prior. Advances in Neural Information Processing Systems, 1:697–704, 2002.

[17] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. ICML, 2016.

[18] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. ICLR, 2015.

[19] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.

[20] P. D. Grünwald. The minimum description length principle. MIT Press, 2007.

[21] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379–1387, 2016.

[22] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 392, 2015.

[23] P. Gysel. Ristretto: Hardware-oriented approximation of convolutional neural networks. Master's thesis, University of California, 2016.

[24] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[25] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016.

[26] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[27] G. E. Hinton and D.
Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.

[28] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[29] A. Honkela and H. Valpola. Variational learning and bits-back coding: an information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810, 2004.

[30] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[31] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. ICLR, 2017.

[32] J. B. Ingraham and D. S. Marks. Bayesian sparsity for intractable distributions. arXiv preprint arXiv:1602.03807, 2016.

[33] T. Karaletsos and G. Rätsch. Automatic relevance determination for deep generative models. arXiv preprint arXiv:1505.07765, 2015.

[34] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

[35] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparametrization trick. Advances in Neural Information Processing Systems, 2015.

[36] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.

[37] N. D. Lawrence. Note relevance determination. In Neural Nets WIRN Vietri-01, pages 128–133. Springer, 2002.

[38] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In NIPS, volume 2, pages 598–605, 1989.

[39] Y.
LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[40] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.

[41] D. D. Lin and S. S. Talathi. Overcoming challenges in fixed point training of deep convolutional networks. Workshop ICML, 2016.

[42] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks. arXiv preprint arXiv:1511.06393, 2015.

[43] C. Louizos. Smart regularization of deep architectures. Master's thesis, University of Amsterdam, 2015.

[44] C. Louizos and M. Welling. Multiplicative normalizing flows for variational Bayesian neural networks. ArXiv e-prints, Mar. 2017.

[45] D. J. MacKay. Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.

[46] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.

[47] P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha. Deep neural networks are robust to weight binarization and other non-linear distortions. arXiv preprint arXiv:1606.01981, 2016.

[48] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.

[49] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.

[50] E. Nalisnick, A. Anandkumar, and P. Smyth. A scale mixture perspective of multiplicative noise in neural networks. arXiv preprint arXiv:1506.03208, 2015.

[51] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[52] S. E.
Neville, J. T. Ormerod, M. Wand, et al. Mean field variational Bayes for continuous sparse signal shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1):1113–1151, 2014.

[53] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, pages 59–73, 2007.

[54] C. Peterson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019, 1987.

[55] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[56] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1278–1286, 2014.

[57] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.

[58] J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, pages 1080–1100, 1986.

[59] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016.

[60] S. Shi and X. Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv preprint arXiv:1704.07724, 2017.

[61] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.

[62] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. arXiv preprint arXiv:1602.02282, 2016.

[63] S. Srinivas and R. V. Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.

[64] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R.
Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[65] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039, 2017.

[66] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[67] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression. ICLR, 2017.

[68] G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.

[69] C. S. Wallace. Classification by minimum-message-length inference. In International Conference on Computing and Information, pages 72–81. Springer, 1990.

[70] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.

[71] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. CVPR, 2017.

[72] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. ICLR, 2017.