{"title": "Practical Variational Inference for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2348, "page_last": 2356, "abstract": "Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus.", "full_text": "Practical Variational Inference for Neural Networks\n\nAlex Graves\n\nDepartment of Computer Science\n\nUniversity of Toronto, Canada\ngraves@cs.toronto.edu\n\nAbstract\n\nVariational methods have been previously explored as a tractable approximation\nto Bayesian inference for neural networks. However the approaches proposed so\nfar have only been applicable to a few simple network architectures. This paper\nintroduces an easy-to-implement stochastic variational method (or equivalently,\nminimum description length loss function) that can be applied to most neural net-\nworks. Along the way it revisits several common regularisers from a variational\nperspective. It also provides a simple pruning heuristic that can both drastically re-\nduce the number of network weights and lead to improved generalisation. 
Experimental results are provided for a hierarchical multidimensional recurrent neural\nnetwork applied to the TIMIT speech corpus.\n\n1 Introduction\n\nIn the eighteen years since variational inference was first proposed for neural networks [10] it has not\nseen widespread use. We believe this is largely due to the difficulty of deriving analytical solutions\nto the required integrals over the variational posteriors. Such solutions are complicated for even\nthe simplest network architectures, such as radial basis networks [2] and single layer feedforward\nnetworks with linear outputs [10, 1, 14], and are generally unavailable for more complex systems.\nThe approach taken here is to forget about analytical solutions and search instead for variational\ndistributions whose expectation values (and derivatives thereof) can be efficiently approximated with\nnumerical integration. While it may seem perverse to replace one intractable integral (over the true\nposterior) with another (over the variational posterior), the point is that the variational posterior is far\neasier to draw probable samples from, and correspondingly more amenable to numerical methods.\nThe result is a stochastic method for variational inference with a diagonal Gaussian posterior that can\nbe applied to any differentiable log-loss parametric model, which includes most neural networks (see footnote 1).\nVariational inference can be reformulated as the optimisation of a Minimum Description Length\n(MDL; [21]) loss function; indeed it was in this form that variational inference was first considered\nfor neural networks. One advantage of the MDL interpretation is that it leads to a clear separation\nbetween prediction accuracy and model complexity, which can help to both analyse and optimise the\nnetwork. 
Another benefit is that recasting inference as optimisation makes it easier to implement\nin existing, gradient-descent-based neural network software.\n\nFootnote 1: An important exception are energy-based models such as restricted Boltzmann machines [24] whose log-loss is intractable.\n\n2 Neural Networks\n\nFor the purposes of this paper a neural network is a parametric model that assigns a conditional\nprobability Pr(D|w) to some dataset D, given a set $w = \\{w_i\\}_{i=1}^{W}$ of real-valued parameters, or\nweights. The elements (x, y) of D, each consisting of an input x and a target y, are assumed to be\ndrawn independently from a joint distribution p(x, y) (see footnote 2). The network loss LN (w,D) is defined as\nthe negative log probability of the data given the weights:\n\n$$L^N(w, D) = -\\ln \\Pr(D|w) = -\\sum_{(x,y) \\in D} \\ln \\Pr(y|x, w) \\tag{1}$$\n\nThe logarithm could be taken to any base, but to avoid confusion we will use the natural logarithm\nln throughout. We assume that the partial derivatives of LN (w,D) with respect to the network\nweights can be efficiently calculated (using, for example, backpropagation or backpropagation\nthrough time [22]).\n\n3 Variational Inference\n\nPerforming Bayesian inference on a neural network requires the posterior distribution of the network\nweights given the data. If the weights have a prior probability P (w|\u03b1) that depends on some\nparameters \u03b1, the posterior can be written Pr(w|D, \u03b1). Unfortunately, for most neural networks\nPr(w|D, \u03b1) cannot be calculated analytically, or even efficiently sampled from. Variational inference\naddresses this problem by approximating Pr(w|D, \u03b1) with a more tractable distribution\nQ(w|\u03b2). 
The approximation is fitted by minimising the variational free energy F with respect to\nthe parameters \u03b2, where\n\n$$F = -\\left\\langle \\ln \\left[ \\frac{\\Pr(D|w) P(w|\\alpha)}{Q(w|\\beta)} \\right] \\right\\rangle_{w \\sim Q(\\beta)} \\tag{2}$$\n\nand for some function g of a random variable x with distribution p(x), $\\langle g \\rangle_{x \\sim p}$ denotes the expectation\nof g over p. A fully Bayesian approach would infer the prior parameters \u03b1 from a hyperprior;\nhowever in this paper they are found by simply minimising F with respect to \u03b1 as well as \u03b2.\n\n4 Minimum Description Length\n\nF can be reinterpreted as a minimum description length loss function [12] by rearranging Eq. (2)\nand substituting in from Eq. (1) to get\n\n$$F = \\left\\langle L^N(w, D) \\right\\rangle_{w \\sim Q(\\beta)} + D_{KL}(Q(\\beta) \\| P(\\alpha)), \\tag{3}$$\n\nwhere DKL(Q(\u03b2)||P (\u03b1)) is the Kullback-Leibler divergence between Q(\u03b2) and P (\u03b1). Shannon\u2019s\nsource coding theorem [23] tells us that the first term on the right hand side of Eq. (3) is a lower\nbound on the expected amount of information (measured in nats, due to the use of natural logarithms)\nrequired to transmit the targets in D to a receiver who knows the inputs, using the outputs\nof a network whose weights are sampled from Q(\u03b2). Since this term decreases as the network\u2019s\nprediction accuracy increases, we identify it as the error loss LE(\u03b2,D):\n\n$$L^E(\\beta, D) = \\left\\langle L^N(w, D) \\right\\rangle_{w \\sim Q(\\beta)} \\tag{4}$$\n\nShannon\u2019s bound can almost be achieved in practice using arithmetic coding [26]. The second term\non the right hand side of Eq. (3) is the expected number of nats required by a receiver who knows\nP (\u03b1) to pick a sample from Q(\u03b2). 
Since this term measures the cost of \u2018describing\u2019 the network\nweights to the receiver, we identify it as the complexity loss LC(\u03b1, \u03b2):\n\n$$L^C(\\alpha, \\beta) = D_{KL}(Q(\\beta) \\| P(\\alpha)) \\tag{5}$$\n\nLC(\u03b1, \u03b2) can be realised with bits-back coding [25, 10]. Although originally conceived as a thought\nexperiment, bits-back coding has been used for an actual compression scheme [5]. Putting the terms\ntogether, F can be rephrased as an MDL loss function L(\u03b1, \u03b2,D) that measures the total number of\nnats required to transmit the training targets using the network, given \u03b1 and \u03b2:\n\n$$L(\\alpha, \\beta, D) = L^E(\\beta, D) + L^C(\\alpha, \\beta) \\tag{6}$$\n\nThe network is then trained on D by minimising L(\u03b1, \u03b2,D) with respect to \u03b1 and \u03b2, just like\nan ordinary neural network loss function. One advantage of using a transmission cost as a loss\nfunction is that we can immediately determine whether the network has compressed the targets past\na reasonable benchmark (such as that given by an off-the-shelf compressor). If it has, we can be\nfairly certain that the network is learning underlying patterns in the data and not simply memorising\nthe training set. We would therefore expect it to generalise well to new data. In practice we have\nfound that as long as significant compression is taking place, decreasing L(\u03b1, \u03b2,D) on the training\nset does not increase LE(\u03b2,D) on the test set, and it is therefore unnecessary to sacrifice any training\ndata for early stopping.\n\nFootnote 2: Unsupervised learning can be treated as a special case where x = \u2205.\n\nTwo transmission costs were ignored in the above discussion. One is the cost of transmitting the\nmodel with w unspecified (for example software that implements the network architecture, the training\nalgorithm etc.). The other is the cost of transmitting the prior. 
If either of these is used to encode\na significant amount of information about D, the MDL principle will break down and the generalisation\nguarantees that come with compression will be lost. The easiest way to prevent this is to keep\nboth costs very small compared to D. In particular the prior should not contain too many parameters.\n\n5 Choice of Distributions\n\nWe now derive the form of LE(\u03b2,D) and LC(\u03b1, \u03b2) for various choices of Q(\u03b2) and P (\u03b1). We also\nderive the gradients of LE(\u03b2,D) and LC(\u03b1, \u03b2) with respect to \u03b2 and the optimal values of \u03b1 given\n\u03b2. All continuous distributions are implicitly assumed to be quantised at some very fine resolution,\nand we will limit ourselves to diagonal posteriors of the form $Q(\\beta) = \\prod_{i=1}^{W} q_i(\\beta_i)$, meaning that\n$L^C(\\alpha, \\beta) = \\sum_{i=1}^{W} D_{KL}(q_i(\\beta_i) \\| P(\\alpha))$.\n\n5.1 Delta Posterior\n\nPerhaps the simplest nontrivial distribution for Q(\u03b2) is a delta distribution that assigns probability\n1 to a particular set of weights w and 0 to all other weights. In this case \u03b2 = w, LE(\u03b2,D) =\nLN (w,D) and LC(\u03b1, \u03b2) = LC(\u03b1, w) = \u2212ln P (w|\u03b1) + C, where C is a constant that depends\nonly on the discretisation of Q(\u03b2). Although C has no effect on the gradient used for training, it is\nusually large enough to ensure that the network cannot compress the data using the coding scheme\ndescribed in the previous section (see footnote 3). 
If the prior is uniform, and all realisable weight values are equally\nlikely then LC(\u03b1, \u03b2) is a constant and we recover ordinary maximum likelihood training.\n\nIf the prior is a Laplace distribution then $\\alpha = \\{\\mu, b\\}$,\n\n$$P(w|\\alpha) = \\prod_{i=1}^{W} \\frac{1}{2b} \\exp\\left(-\\frac{|w_i - \\mu|}{b}\\right)$$\n\nand\n\n$$L^C(\\alpha, w) = W \\ln 2b + \\frac{1}{b} \\sum_{i=1}^{W} |w_i - \\mu| + C \\implies \\frac{\\partial L^C(\\alpha, w)}{\\partial w_i} = \\frac{\\mathrm{sgn}(w_i - \\mu)}{b} \\tag{7}$$\n\nIf \u00b5 = 0 and b is fixed, this is equivalent to ordinary L1 regularisation. However we can instead\ndetermine the optimal prior parameters $\\hat{\\alpha}$ for w as follows: $\\hat{\\mu} = \\mu_{1/2}(w)$ (the median weight value)\nand $\\hat{b} = \\frac{1}{W} \\sum_{i=1}^{W} |w_i - \\hat{\\mu}|$.\n\nIf the prior is Gaussian then $\\alpha = \\{\\mu, \\sigma^2\\}$,\n\n$$P(w|\\alpha) = \\prod_{i=1}^{W} \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{(w_i - \\mu)^2}{2\\sigma^2}\\right)$$\n\nand\n\n$$L^C(\\alpha, w) = W \\ln\\left(\\sqrt{2\\pi\\sigma^2}\\right) + \\frac{1}{2\\sigma^2} \\sum_{i=1}^{W} (w_i - \\mu)^2 + C \\implies \\frac{\\partial L^C(\\alpha, w)}{\\partial w_i} = \\frac{w_i - \\mu}{\\sigma^2} \\tag{8}$$\n\nWith \u00b5 = 0 and $\\sigma^2$ fixed this is equivalent to L2 regularisation (also known as weight decay for\nneural networks). The optimal $\\hat{\\alpha}$ given w are $\\hat{\\mu} = \\frac{1}{W} \\sum_{i=1}^{W} w_i$ and $\\hat{\\sigma}^2 = \\frac{1}{W} \\sum_{i=1}^{W} (w_i - \\hat{\\mu})^2$.\n\n5.2 Gaussian Posterior\n\nA more interesting distribution for Q(\u03b2) is a diagonal Gaussian. 
In this case each weight requires a\nseparate mean and variance, so $\\beta = \\{\\mu, \\sigma^2\\}$ with the mean vector \u00b5 and variance vector $\\sigma^2$ both\nthe same size as w. For a general network architecture we cannot compute either LE(\u03b2,D) or its\nderivatives exactly, so we resort to sampling. Applying Monte-Carlo integration to Eq. (4) gives\n\n$$L^E(\\beta, D) \\approx \\frac{1}{S} \\sum_{k=1}^{S} L^N(w^k, D) \\tag{9}$$\n\nwith $w^k$ drawn independently from Q(\u03b2).\n\nFootnote 3: The floating point resolution of the computer architecture used to train the network could in principle be\nused to upper-bound the discretisation constant, and hence the compression; but in practice the bound would\nbe prohibitively high.\n\nA combination of the Gaussian characteristic function\nand integration by parts can be used to derive the following identities for the derivatives of multivariate\nGaussian expectations [18]:\n\n$$\\nabla_{\\mu} \\langle V(a) \\rangle_{a \\sim N} = \\langle \\nabla_a V(a) \\rangle_{a \\sim N}, \\qquad \\nabla_{\\Sigma} \\langle V(a) \\rangle_{a \\sim N} = \\frac{1}{2} \\langle \\nabla_a \\nabla_a V(a) \\rangle_{a \\sim N} \\tag{10}$$\n\nwhere N is a multivariate Gaussian with mean vector \u00b5 and covariance matrix \u03a3, and V is an\narbitrary function of a. Differentiating Eq. (4) and applying these identities yields\n\n$$\\frac{\\partial L^E(\\beta, D)}{\\partial \\mu_i} = \\left\\langle \\frac{\\partial L^N(w, D)}{\\partial w_i} \\right\\rangle_{w \\sim Q(\\beta)} \\approx \\frac{1}{S} \\sum_{k=1}^{S} \\frac{\\partial L^N(w^k, D)}{\\partial w_i} \\tag{11}$$\n\n$$\\frac{\\partial L^E(\\beta, D)}{\\partial \\sigma_i^2} = \\frac{1}{2} \\left\\langle \\frac{\\partial^2 L^N(w, D)}{\\partial w_i^2} \\right\\rangle_{w \\sim Q(\\beta)} \\approx \\frac{1}{2} \\left\\langle \\left[ \\frac{\\partial L^N(w, D)}{\\partial w_i} \\right]^2 \\right\\rangle_{w \\sim Q(\\beta)} \\approx \\frac{1}{2S} \\sum_{k=1}^{S} \\left[ \\frac{\\partial L^N(w^k, D)}{\\partial w_i} \\right]^2 \\tag{12}$$\n\nwhere the first approximation in Eq. (12) comes from substituting the negative diagonal of the empirical\nFisher information matrix for the diagonal of the Hessian. 
This approximation is exact if the\nconditional distribution Pr(D|w) matches the empirical distribution of D (i.e. if the network perfectly\nmodels the data); we would therefore expect it to improve as LE(\u03b2,D) decreases. For simple\nnetworks whose second derivatives can be calculated efficiently the approximation is unnecessary\nand the diagonal Hessian can be sampled instead.\nA simplification of the above distribution is to consider the variances of Q(\u03b2) fixed and optimise\nonly the means. Then the sampling used to calculate the derivatives in Eq. (11) is equivalent to\nadding zero-mean, fixed-variance Gaussian noise to the network weights during training. In particular,\nif the prior P (\u03b1) is uniform and a single weight sample is taken for each element of D,\nthen minimising L(\u03b1, \u03b2,D) is identical to minimising LN (w,D) with weight noise or synaptic\nnoise [13]. Note that the quantisation of the uniform prior adds a large constant to LC(\u03b1, \u03b2), making\nit unfeasible to compress the data with our MDL coding scheme; in practice early stopping is\nrequired to prevent overfitting when training with weight noise.\nIf the prior is Gaussian then $\\alpha = \\{\\mu, \\sigma^2\\}$ and\n\n$$L^C(\\alpha, \\beta) = \\sum_{i=1}^{W} \\left[ \\ln \\frac{\\sigma}{\\sigma_i} + \\frac{1}{2\\sigma^2} \\left[ (\\mu_i - \\mu)^2 + \\sigma_i^2 - \\sigma^2 \\right] \\right] \\tag{13}$$\n\n$$\\implies \\frac{\\partial L^C(\\alpha, \\beta)}{\\partial \\mu_i} = \\frac{\\mu_i - \\mu}{\\sigma^2}, \\qquad \\frac{\\partial L^C(\\alpha, \\beta)}{\\partial \\sigma_i^2} = \\frac{1}{2} \\left[ \\frac{1}{\\sigma^2} - \\frac{1}{\\sigma_i^2} \\right] \\tag{14}$$\n\nThe optimal prior parameters $\\hat{\\alpha}$ given \u03b2 are\n\n$$\\hat{\\mu} = \\frac{1}{W} \\sum_{i=1}^{W} \\mu_i, \\qquad \\hat{\\sigma}^2 = \\frac{1}{W} \\sum_{i=1}^{W} \\left[ \\sigma_i^2 + (\\mu_i - \\hat{\\mu})^2 \\right] \\tag{15}$$\n\nIf a Gaussian prior is used with the fixed variance \u2018weight noise\u2019 posterior described above, 
it is still\npossible to choose the optimal prior parameters for each \u03b2. This requires only a slight modification\nof standard weight-noise training, with the derivatives on the left of Eq. (14) added to the weight\ngradient and \u03b1 optimised after every weight update. But because the prior is no longer uniform the\nnetwork is able to compress the data, making it feasible to dispense with early stopping.\nThe terms in the sum on the right hand side of Eq. (13) are the complexity costs of individual\nnetwork weights. These costs give valuable insight into the internal structure of the network, since\n(with a limited budget of bits to spend) the network will assign more bits to more important weights.\nImportance can be used, for example, to prune away spurious weights [15] or determine which\ninputs are relevant [16].\n\n6 Optimisation\n\nIf the derivatives of LE(\u03b2,D) are stochastic, we require an optimiser that can tolerate noisy gradient\nestimates. Steepest descent with momentum [19] and RPROP [20] both work well in practice.\nAlthough stochastic derivatives should in principle be estimated using the same weight samples for\nthe entire dataset, it is in practice much more efficient to pick different weight samples for each\n(x, y) \u2208 D. If both the prior and posterior are Gaussian this yields\n\n$$\\frac{\\partial L(\\alpha, \\beta, D)}{\\partial \\mu_i} \\approx \\frac{\\mu_i - \\mu}{\\sigma^2} + \\sum_{(x,y) \\in D} \\frac{1}{S} \\sum_{k=1}^{S} \\frac{\\partial L^N(w^k, x, y)}{\\partial w_i} \\tag{16}$$\n\n$$\\frac{\\partial L(\\alpha, \\beta, D)}{\\partial \\sigma_i^2} \\approx \\frac{1}{2} \\left[ \\frac{1}{\\sigma^2} - \\frac{1}{\\sigma_i^2} \\right] + \\sum_{(x,y) \\in D} \\frac{1}{2S} \\sum_{k=1}^{S} \\left[ \\frac{\\partial L^N(w^k, x, y)}{\\partial w_i} \\right]^2 \\tag{17}$$\n\nwhere $L^N(w^k, x, y) = -\\ln \\Pr(y|x, w^k)$ and a separate set of S weight samples $\\{w^k\\}_{k=1}^{S}$ is drawn\nfrom Q(\u03b2) for each (x, y). 
For large datasets it is usually sufficient to set S = 1; however performance\ncan in some cases be substantially improved by using more samples, at the cost of longer\ntraining times.\nIf the data is divided into B equally-sized batches such that $D = \\{b_j\\}_{j=1}^{B}$, and an \u2018online\u2019 optimiser\nis used, with the parameters updated after each batch gradient calculation, the following online loss\nfunction (and corresponding derivatives) should be employed:\n\n$$L(\\alpha, \\beta, b_j) = \\frac{1}{B} L^C(\\alpha, \\beta) + L^E(\\beta, b_j) \\tag{18}$$\n\nNote the 1/B factor for the complexity loss. This is because the weights (to which the complexity\ncost applies) are only transmitted once for the entire dataset, whereas the error cost must be\ntransmitted separately for each batch.\nDuring training, the prior parameters \u03b1 should be set to their optimal values after every update to\n\u03b2. For more complex priors where the optimal \u03b1 cannot be found in closed form (such as mixture\ndistributions), \u03b1 and \u03b2 can instead be optimised simultaneously with gradient descent [17, 10].\nIdeally a trained network should be evaluated on some previously unseen input x' using the expected\ndistribution $\\langle \\Pr(.|x', w) \\rangle_{w \\sim Q(\\beta)}$. However the maximum a posteriori approximation Pr(.|x', w\u2217),\nwhere w\u2217 is the mode of Q(\u03b2), appears to work well in practice (at least for diagonal Gaussian\nposteriors). This is equivalent to removing weight noise during testing.\n\n7 Pruning\n\nRemoving weights from a neural network (a process usually referred to as pruning) has been repeatedly\nproposed as a means of reducing complexity and thereby improving generalisation [15, 7].\nThis would seem redundant for variational inference, which automatically limits the network complexity. 
However pruning can reduce the computational cost and memory demands of the network.\nFurthermore we have found that if the network is retrained after pruning, the final performance can\nbe improved. A possible explanation is that pruning reduces the noise in the gradient estimates\n(because the pruned weights are not sampled) without increasing network complexity.\nWeights w that are more probable under Q(\u03b2) tend to give lower LN (w,D) and pruning a weight\nis equivalent to fixing it to zero. These two facts suggest a pruning heuristic where a weight is\nremoved if its probability density at zero is sufficiently high under Q(\u03b2). For a diagonal posterior\nwe can define the relative probability of each wi at zero as the density of qi(\u03b2i) at zero divided by\nthe density of qi(\u03b2i) at its mode. We can then define a pruning heuristic by removing all weights\nwhose relative probability at zero exceeds some threshold \u03b3, with 0 \u2264 \u03b3 \u2264 1. If qi(\u03b2i) is Gaussian\nthis yields\n\n$$\\exp\\left(-\\frac{\\mu_i^2}{2\\sigma_i^2}\\right) > \\gamma \\implies \\left| \\frac{\\mu_i}{\\sigma_i} \\right| < \\lambda \\tag{19}$$\n\nwhere we have used the reparameterisation $\\lambda = \\sqrt{-2 \\ln \\gamma}$, with \u03bb \u2265 0. If \u03bb = 0 no weights\nare pruned. As \u03bb grows the amount of pruning increases, and the probability of the pruned weight\nvector under Q(\u03b2) (and therefore the likely network performance) decreases. A good rule of thumb\nfor how high \u03bb can safely be set is the point at which the pruned weights become less probable than\nan average weight sampled from qi(\u03b2i). For a Gaussian this is\n\n$$\\lambda = \\sqrt{2 \\ln \\sqrt{2}} \\approx 0.83 \\tag{20}$$\n\nIf the network is retrained after pruning, the cost of transmitting which weights have been removed\nshould in principle be added to LC(\u03b1, \u03b2) (since this information could be used to overfit the training\ndata). However the extra cost does not depend on the network parameters, and can therefore be\nignored for the purposes of optimisation.\nWhen a Gaussian prior is used its mean tends to be near zero. This implies that \u2018cheaper\u2019 weights,\nwhere qi(\u03b2i) \u2248 P (\u03b1), have high relative probability at zero and are thus more likely to be pruned.\n\nFigure 1: Two representations of a TIMIT utterance (\u201cIn wage negotiations the industry bargains as a unit with a single union.\u201d). Note the lower resolution and greater\ndecorrelation of the MFC coefficients (top) compared to the spectrogram (bottom).\n\n8 Experiments\n\nWe tested all the combinations of posterior and prior described in Section 5 on a hierarchical multidimensional\nrecurrent neural network [9] trained to do phoneme recognition on the TIMIT speech\ncorpus [4]. We also assessed the pruning heuristic from Section 7 by applying it with various thresholds\nto a trained network and observing the impact on performance and network size.\nTIMIT is a popular phoneme recognition benchmark. The core training and test sets (which we used\nfor our experiments) contain respectively 3696 and 192 phonetically transcribed utterances. We\ndefined a validation set by randomly selecting 184 sequences from the training set. The reduced set\nof 39 phonemes [6] was used during both training and testing. The audio data was presented to the\nnetwork in the form of spectrogram images. One such image is contrasted with the mel-frequency\ncepstrum representation used for most speech recognition systems in Fig. 
1.\nHierarchical multidimensional recurrent neural networks containing Long Short-Term Memory [11]\nhidden layers and a CTC output layer [8] have proven effective for offline handwriting recognition\n[9]. The same architecture is employed here, with a spectrogram in place of a handwriting\nimage, and phoneme labels in place of characters. Since the network scans through the spectrogram\nin all directions, both vertical and horizontal correlations can be captured.\n\nFigure 2 (panels: Adaptive weight noise; Adaptive prior weight noise; Weight noise; Maximum likelihood): Error curves for four networks during training. The green, blue and red curves correspond\nto the average per-sequence error loss LE(\u03b2,D) on the training, test and validation sets\nrespectively. Adaptive weight noise does not overfit, and normal weight noise overfits much more\nslowly than maximum likelihood. Adaptive weight noise led to longer training times and noisier\nerror curves.\n\nTable 1: Results for different priors and posteriors. All distribution parameters were learned by\nthe network unless fixed values are specified. \u2018Error\u2019 is the phoneme error rate on the core test set\n(total edit distance between the network transcriptions and the target transcriptions, multiplied by\n100). \u2018Epochs\u2019 is the number of passes through the training set after which the error was recorded.\n\u2018Ratio\u2019 is the compression ratio of the training set transcription targets relative to a uniform code\nover the 39 phoneme labels (\u2248 5.3 bits per phoneme); this could only be calculated for the networks\nwith Gaussian priors and posteriors.\n\nName | Posterior | Prior | Error | Epochs | Ratio\nAdaptive L1 | Delta | Laplace | 49.0 | 7 | \u2013\nAdaptive L2 | Delta | Gauss | 35.1 | 421 | \u2013\nAdaptive mean L2 | Delta | Gauss \u03c32 = 0.1 | 28.0 | 53 | \u2013\nL2 | Delta | Gauss \u00b5 = 0, \u03c32 = 0.1 | 27.4 | 59 | \u2013\nMaximum likelihood | Delta | Uniform | 27.1 | 44 | \u2013\nL1 | Delta | Laplace \u00b5 = 0, b = 1/12 | 26.0 | 545 | \u2013\nAdaptive mean L1 | Delta | Laplace b = 1/12 | 25.4 | 765 | \u2013\nWeight noise | Gauss \u03c3i = 0.075 | Uniform | 25.4 | 220 | \u2013\nAdaptive prior weight noise | Gauss \u03c3i = 0.075 | Gauss | 24.7 | 260 | 0.542\nAdaptive weight noise | Gauss | Gauss | 23.8 | 384 | 0.286\n\nThe network topology was identical for all experiments. It was the same as that of the handwriting\nrecognition network in [9] except that the dimensions of the three subsampling windows used to\nprogressively decrease resolution were now 2 \u00d7 4, 2 \u00d7 4 and 1 \u00d7 4, and the CTC layer now contained\n40 output units (one for each phoneme, plus an extra for \u2018blank\u2019). This gave a total of 15 layers,\n1306 units (not counting the inputs or bias), and 139,536 weights. All network parameters were\ntrained with online steepest descent (weight updates after every sequence) using a learning rate of\n$10^{-4}$ and a momentum of 0.9. For the networks with stochastic derivatives (i.e. those with Gaussian\nposteriors) a single weight sample was drawn for each sequence. Prefix search CTC decoding [8]\nwas used to transcribe the test set, with probability threshold 0.995. When parameters in the posterior\nor prior were fixed, the best value was found empirically. All networks were initialised with\nrandom weights (or random weight means if the posterior was Gaussian), chosen from a Gaussian\nwith mean 0, standard deviation 0.1. For the adaptive Gaussian posterior, the standard deviations\nof the weights were initialised to 0.075 then optimised during training; this ensured that the variances\n(which are the standard deviations squared) remained positive. 
The networks with Gaussian\nposteriors and priors did not require early stopping and were trained on all 3696 utterances in the\ntraining set; all other networks used the validation set for early stopping and hence were trained on\n3512 utterances. These were also the only networks for which the transmission cost of the network\nweights could be measured (since it did not depend on the quantisation of the posterior or prior).\nThe networks were evaluated on the test set using the parameters giving lowest LE(\u03b2,D) on the\ntraining set (or validation set if present). All experiments were stopped after 100 training epochs\nwith no improvement in either L(\u03b1, \u03b2,D), LE(\u03b2,D) or the number of transcription errors on the\ntraining or validation set. The reason for such conservative stopping criteria was that the error curves\nof some of the networks were extremely noisy (see Fig. 2).\nTable 1 shows the results for the different posteriors and priors. L2 regularisation was no better\nthan unregularised maximum likelihood, while L1 gave a slight improvement; this is consistent\nwith our previous experience of recurrent neural networks. The fully adaptive L1 and L2 networks\nperformed very badly, apparently because the priors became excessively narrow (\u03c32 \u2248 0.003 for\nL2 and b \u2248 0.002 for L1). L1 with \ufb01xed variance and adaptive mean was somewhat better than L1\nwith mean \ufb01xed at 0 (although the adaptive mean was very close to zero, settling around 0.0064).\nThe networks with Gaussian posteriors outperformed those with delta posteriors, with the best score\nobtained using a fully adaptive posterior.\nTable 2 shows the effect of pruning on the trained \u2018adaptive weight noise\u2019 network from Table 1.\nThe pruned networks were retrained using the same optimisation as before, with the error recorded\nbefore and after retraining. 
As well as being highly effective at removing weights, pruning led to\nimproved performance following retraining in some cases. Notice the slow increase in initial error\nup to \u03bb = 0.5 and sharp rise thereafter; this is consistent with the \u2018safe\u2019 threshold of \u03bb \u2248 0.83\nmentioned in Section 7. The lowest final phoneme error rate of 23.3 would until recently have been\nthe best recorded on TIMIT; however the application of deep belief networks has now improved the\nbenchmark to 20.5 [3].\n\nTable 2: Effect of Network Pruning. \u2018\u03bb\u2019 is the threshold used for pruning. \u2018Weights\u2019 is the number\nof weights left after pruning and \u2018Percent\u2019 is the same figure expressed as a percentage of the original\nweights. \u2018Initial Error\u2019 is the test error immediately after pruning and \u2018Retrain Error\u2019 is the test error\nfollowing \u2018Retrain Epochs\u2019 of subsequent retraining. \u2018Bits/weight\u2019 is the average bit cost (as defined\nin Eq. (13)) of the unpruned weights.\n\n\u03bb | Weights | Percent | Initial error | Retrain error | Retrain Epochs | Bits/weight\n0 | 139,536 | 100% | 23.8 | 23.8 | 0 | 0.53\n0.01 | 107,974 | 77.4% | 23.8 | 24.0 | 972 | 0.72\n0.05 | 63,079 | 45.2% | 23.9 | 23.5 | 35 | 1.15\n0.1 | 52,984 | 37.9% | 23.9 | 23.3 | 351 | 1.40\n0.2 | 43,182 | 30.9% | 23.9 | 23.7 | 740 | 1.82\n0.5 | 31,120 | 22.3% | 24.0 | 23.3 | 125 | 2.21\n1 | 22,806 | 16.3% | 24.5 | 24.1 | 403 | 3.19\n2 | 16,029 | 11.5% | 28.0 | 24.5 | 335 | 3.55\n\nFigure 3 (axis labels: input gates, H forget gates, V forget gates, cells, output gates; cells): Weight costs in a 2D LSTM recurrent connection. Each dot corresponds to a weight;\nthe lighter the colour the more bits the weight costs. The vertical axis shows the LSTM cell the\nweight comes from; the horizontal axis shows the LSTM unit the weight goes to. Note the low cost of\nthe \u2018V forget gates\u2019 (these mediate vertical correlations between frequency bands in the spectrogram,\nwhich are apparently less important to transcription than horizontal correlations between timesteps);\nthe high cost of the \u2018cells\u2019 (LSTM\u2019s main processing units); the bright horizontal and vertical bands\n(corresponding to units with \u2018important\u2019 outputs and inputs respectively); and the bright diagonal\nthrough the cells (corresponding to self connections).\n\nFigure 4: The \u2018cell\u2019 weights from Fig. 3 pruned at different thresholds. Black dots are pruned\nweights, white dots are remaining weights. \u2018Cheaper\u2019 weights tend to be removed first as \u03bb grows.\n\nAcknowledgements\n\nI would like to thank Geoffrey Hinton, Christian Osendorfer, Justin Bayer and Thomas R\u00fcckstie\u00df\nfor helpful discussions and suggestions. Alex Graves is a Junior Fellow of the Canadian Institute for\nAdvanced Research.\n\nReferences\n[1] D. Barber and C. M. Bishop. Ensemble learning in Bayesian neural networks, pages 215\u2013237. Springer-Verlag, Berlin, 1998.\n\n[2] D. Barber and B. Schottky. Radial basis functions: A bayesian treatment. In NIPS, 1997.\n[3] G. E. Dahl, M. Ranzato, A. rahman Mohamed, and G. Hinton. Phone recognition with the mean-covariance\nrestricted boltzmann machine. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel,\nand A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 469\u2013477. 2010.\n\n[4] DARPA-ISTO. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), speech disc\ncd1-1.1 edition, 1990.\n\n[5] B. J. Frey. 
Graphical models for machine learning and digital communication. MIT Press, Cambridge, MA, USA, 1998.\n[6] K.-F. Lee and H.-W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.\n[7] C. L. Giles and C. W. Omlin. Pruning recurrent neural networks for improved generalization performance. IEEE Transactions on Neural Networks, 5:848\u2013851, 1994.\n[8] A. Graves, S. Fern\u00e1ndez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, ICML 2006, Pittsburgh, USA, 2006.\n[9] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, pages 545\u2013552, 2008.\n[10] G. E. Hinton and D. van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In COLT, pages 5\u201313, 1993.\n[11] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735\u20131780, 1997.\n[12] A. Honkela and H. Valpola. Variational learning and bits-back coding: An information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15:800\u2013810, 2004.\n[13] K.-C. Jim, C. Giles, and B. Horne. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424\u20131438, Nov. 1996.\n[14] N. D. Lawrence. Variational Inference in Probabilistic Models. PhD thesis, University of Cambridge, 2000.\n[15] Y. Le Cun, J. Denker, and S. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 598\u2013605. Morgan Kaufmann, San Mateo, CA, 1990.\n[16] D. J. C. MacKay. 
Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Neural Computation, 1995.\n[17] S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight sharing. Neural Computation, 4:173\u2013193, 1992.\n[18] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786\u2013792, 2009.\n[19] D. Plaut, S. Nowlan, and G. E. Hinton. Experiments on learning by back propagation. Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1986.\n[20] M. Riedmiller and T. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In International Symposium on Neural Networks, 1993.\n[21] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465\u2013471, 1978.\n[22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors, pages 696\u2013699. MIT Press, Cambridge, MA, USA, 1988.\n[23] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27, 1948.\n[24] P. Smolensky. Information processing in dynamical systems: foundations of harmony theory, pages 194\u2013281. MIT Press, Cambridge, MA, USA, 1986.\n[25] C. S. Wallace. Classification by minimum-message-length inference. In Proceedings of the International Conference on Advances in Computing and Information, ICCI\u201990, pages 72\u201381, New York, NY, USA, 1990. Springer-Verlag New York, Inc.\n[26] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30:520\u2013540, June 1987.\n", "award": [], "sourceid": 1263, "authors": [{"given_name": "Alex", "family_name": "Graves", "institution": null}]}