{"title": "Inherent Weight Normalization in Stochastic Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3291, "page_last": 3302, "abstract": "Multiplicative stochasticity such as Dropout improves the robustness and gener-\nalizability deep neural networks. Here, we further demonstrate that always-on\nmultiplicative stochasticity combined with simple threshold neurons provide a suf-\nficient substrate for deep learning machines. We call such models Neural Sampling Machines (NSM). We find that the probability of activation of the NSM exhibits a self-normalizing property that mirrors Weight Normalization, a previously studied mechanism that fulfills many of the features of Batch Normalization in an online fashion. The normalization of activities during training speeds up convergence by preventing internal covariate shift caused by changes in the distribution of inputs. The always-on stochasticity of the NSM confers the following advantages: the network is identical in the inference and learning phases, making the NSM a suitable substrate for continual learning, it can exploit stochasticity inherent to a physical substrate such as analog non-volatile memories for in memory computing, and it is suitable for Monte Carlo sampling, while requiring almost exclusively addition and comparison operations. We demonstrate NSMs on standard classification benchmarks (MNIST and CIFAR) and event-based classification benchmarks (N-MNIST and DVS Gestures). 
Our results show that NSMs perform comparably or better than conventional artificial neural networks with the same architecture.", "full_text": "Inherent Weight Normalization in Stochastic Neural Networks

Georgios Detorakis, Department of Cognitive Sciences, University of California Irvine, Irvine, CA 92697, gdetorak@uci.edu

Sourav Dutta, Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556 USA, sdutta4@nd.edu

Abhishek Khanna, Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556 USA, akhanna@nd.edu

Matthew Jerry, Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556 USA, mjerry@alumni.nd.edu

Suman Datta, Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556 USA, sdatta@nd.edu

Emre Neftci, Department of Cognitive Sciences and Department of Computer Science, University of California Irvine, Irvine, CA 92697, eneftci@uci.edu

Abstract

Multiplicative stochasticity such as Dropout improves the robustness and generalizability of deep neural networks. Here, we further demonstrate that always-on multiplicative stochasticity combined with simple threshold neurons provides a sufficient substrate for deep neural networks. We call such models Neural Sampling Machines (NSM). We find that the probability of activation of the NSM exhibits a self-normalizing property that mirrors Weight Normalization, a previously studied mechanism that fulfills many of the features of Batch Normalization in an online fashion.
The normalization of activities during training speeds up convergence by preventing the internal covariate shift caused by changes in the input distribution. The always-on stochasticity of the NSM confers the following advantages: the network is identical in the inference and learning phases, making the NSM suitable for online learning; it can exploit stochasticity inherent to a physical substrate, such as analog non-volatile memories for in-memory computing; and it is suitable for Monte Carlo sampling, while requiring almost exclusively addition and comparison operations. We demonstrate NSMs on standard classification benchmarks (MNIST and CIFAR) and event-based classification benchmarks (N-MNIST and DVS Gestures). Our results show that NSMs perform comparably or better than conventional artificial neural networks with the same architecture.

1 Introduction

Stochasticity is a valuable resource for computations in biological and artificial neural networks [9, 32, 2]. It affects neural networks in many different ways, including (i) escaping local minima during learning and inference [1], (ii) stochastic regularization in neural networks [21, 52], (iii) Bayesian inference approximation with Monte Carlo sampling [9, 16], (iv) stochastic facilitation [31], and (v) energy efficiency in computation and communication [28, 19].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In artificial neural networks, multiplicative noise is applied as random variables that multiply network weights or neural activities (e.g. Dropout). In the brain, multiplicative noise is apparent in the probabilistic nature of neural activations [19] and their synaptic quantal release [8, 51].
Analog non-volatile memories for in-memory computing, such as resistive RAMs, ferroelectric devices or phase-change materials [54, 23, 15], exhibit a wide variety of stochastic behaviors [42, 34, 54, 33], including set/reset variability [3] and random telegraph noise [4]. In crossbar arrays of non-volatile memory devices designed for vector-matrix multiplication (e.g. where weights are stored in the resistive or ferroelectric states), such stochasticity manifests itself as multiplicative noise.

Motivated by the ubiquity of multiplicative noise in the physics of artificial and biological computing substrates, we explore here Neural Sampling Machines (NSMs): a class of neural networks with binary threshold neurons that rely almost exclusively on multiplicative noise as a resource for inference and learning. We highlight a striking self-normalizing effect in the NSM that fulfills a role similar to that of Weight Normalization during learning [47]. This normalizing effect prevents internal covariate shift as with Batch Normalization [22], stabilizes the weight distributions during learning, and confers rejection of common-mode fluctuations in the weights of each neuron.

We demonstrate the NSM on a wide variety of classification tasks, including classical benchmarks and neuromorphic, event-based benchmarks. The simplicity of the NSM and its distinct advantages make it an attractive model for hardware implementations using non-volatile memory devices. While stochasticity there is typically viewed as a disadvantage, the NSM has the potential to exploit it. In this case, the forward pass in the NSM simply boils down to weight memory lookups, additions, and comparisons.

1.1 Related Work

The NSM is a stochastic neural network with discrete binary units and is thus closely related to Binary Neural Networks (BNNs). BNNs have the objective of reducing the computational and memory footprint of deep neural networks at run-time [14, 44].
This is achieved by using binary weights and simple activation functions that require only bit-wise operations.

Contrary to BNNs, the NSM is stochastic during both inference and learning. Stochastic neural networks are argued to be useful in learning multi-modal distributions and conditional computations [7, 50] and for encoding uncertainty [16].

Dropout and Dropconnect techniques randomly mask a subset of the neurons and the connections during train-time, for regularization and to prevent feature co-adaptation [21, 52]. These techniques continue to be used for training modern deep networks. Dropout during inference time can be viewed as approximate Bayesian inference in deep Gaussian processes [16], and this technique is referred to as Monte Carlo (MC) Dropout. NSMs are closely related to MC Dropout, with the exception that the activation function is stochastic and the neurons are binary. Similarly to MC Dropout, the "always-on" stochasticity of NSMs can in principle be articulated as an MC integration over an equivalent Gaussian process posterior approximation, fitting the predictive mean and variance of the data. MC Dropout can be used for active learning in deep neural networks, whereby a learner selects or influences the training dataset in a way that optimally minimizes a learning criterion [16, 12].

Taken together, the NSM can be viewed as a combination of stochastic neural networks, Dropout and BNNs. While stochastic activations in the binarization function are argued to be inefficient due to the generation of random bits, stochasticity in the NSM requires only one random bit per pass per neuron or per connection. A different approach for achieving zero mean and unit variance is the self-normalizing neural network proposed in [25]. There, an activation function in non-binary, deterministic networks is constructed mathematically so that outputs are normalized.
In contrast, in the NSM unit, normalization in the sense of [47] emerges from the multiplicative noise as a by-product of the central limit theorem. This establishes a connection between exploiting the physics of hardware systems and recent deep learning techniques, while achieving good accuracy on benchmark classification tasks. Such a connection is highly significant for the devices community, as it implies a simple circuit (threshold operations and crossbars) that can exploit (rather than mitigate) device non-idealities such as read stochasticity.

In recurrent neural networks, stochastic synapses were shown to behave as stochastic counterparts of Hopfield networks [38], but where stochasticity is caused by multiplicative noise at the synapses (rather than logistic noise as in Boltzmann machines). These were shown to surpass the performance of equivalent machine learning algorithms [20, 36] on certain benchmark tasks.

1.2 Our Contribution

In this article, we demonstrate multi-layer and convolutional neural networks employing NSM layers in GPU simulations, and compare them with their equivalent deterministic neural networks. We articulate the NSM's self-normalizing effect as a statistical equivalent of Weight Normalization. Our results indicate that a neuron model equipped with a hard-firing threshold (i.e., a Perceptron) and stochastic neurons and synapses:

• Is a sufficient resource for stochastic, binary deep neural networks.
• Naturally performs weight normalization.
• Can outperform standard artificial neural networks of comparable size.

The always-on stochasticity gives the NSM distinct advantages compared to traditional deep neural networks or binary neural networks: the shared forward passes for training and inference in the NSM are consistent with the requirements of online learning, since an NSM implements weight normalization, which is not based on batches [47].
This enables simple implementations of neural networks with emerging devices. Additionally, we show that the NSM provides robustness to fluctuations and to fixed precision of the weights during learning.

1.3 Applications

During inference, the binary nature of the NSM equipped with blank-out noise makes it largely multiplication-free. As with the Binarynet [13] or XNORnet [44], we speculate that it can be most advantageous in terms of energy efficiency on dedicated hardware.

The NSM is of interest for hardware implementations in memristive crossbar arrays, as threshold units are straightforward to implement in CMOS and binary inputs mitigate read and write non-idealities in emerging non-volatile memory devices while reducing communication bandwidth [54]. Furthermore, multiplicative stochasticity in the NSM is consistent with the stochastic properties of emerging nanodevices [42, 34]. Exploiting the physics of nanodevices for generating stochasticity can lead to significant improvements in embedded, dedicated deep learning machines.

2 Methods

2.1 Neural Sampling Machines (NSM)

We formulate the NSM as a stochastic neural network model that exploits the properties of multiplicative noise to perform inference and learning. For mathematical tractability, we focus on threshold (sign) units, where sgn : R → {−1, 1},

$$z_i = \operatorname{sgn}(u_i) = \begin{cases} 1 & \text{if } u_i \geq 0 \\ -1 & \text{if } u_i < 0 \end{cases} \qquad (1)$$

where ui is the pre-activation of neuron i given by

$$u_i = \sum_{j=1}^{N} \xi_{ij} w_{ij} z_j + b_i + \eta_i, \qquad (2)$$

where ξij and ηi represent multiplicative and additive noise terms, respectively. Both ξ and η are independent and identically distributed (iid) random variables. wij is the weight of the connection between neurons i and j, bi is a bias term, and N is the number of input connections (fan-in) to neuron i.
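The threshold unit of equations (1)-(2) can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the function name `nsm_unit`, the default arguments, and the toy fan-in of 100 are our own choices.

```python
import numpy as np

def nsm_unit(w, z, b=0.0, xi=None, eta=0.0):
    """One NSM threshold neuron: z_i = sgn(u_i) with
    u_i = sum_j xi_ij * w_ij * z_j + b_i + eta_i (eqs. 1-2)."""
    xi = np.ones_like(w) if xi is None else xi   # no multiplicative noise by default
    u = np.dot(xi * w, z) + b + eta              # pre-activation u_i
    return 1.0 if u >= 0 else -1.0               # hard sgn threshold

rng = np.random.default_rng(0)
w = rng.standard_normal(100)            # fan-in N = 100
z = np.sign(rng.standard_normal(100))   # binary inputs in {-1, +1}
# blank-out multiplicative noise: each synapse transmits with probability p = 0.5
out = nsm_unit(w, z, xi=(rng.random(100) < 0.5).astype(float))
```

With ξ fixed to 1 and η = 0 the unit reduces to a deterministic Perceptron, sgn(w · z + b); the stochastic forward pass needs only one random bit per connection, plus additions and a comparison.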
Note that multiplicative noise can be introduced at the synapse (ξij) or at the neuron (ξi). Since the neuron is a threshold unit, it follows that P(zi = 1|z) = P(ui ≥ 0|z). Thus, the probability that unit i is active given the network state is equal to one minus the cumulative distribution function of ui.

Figure 1: Blank-out synapse with scaling factors. Weights are accumulated on ui as a sum of a deterministic term scaled by ai (filled discs) and a stochastic term with fixed blank-out probability p (empty discs).

Assuming independent random variables ui, the central limit theorem indicates that the probability of the neuron firing is P(zi = 1|z) = 1 − Φ(ui|z) (where Φ is the cumulative distribution function of the normal distribution), and more precisely

$$P(z_i = 1 \mid z) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{E(u_i \mid z)}{\sqrt{2\,\mathrm{Var}(u_i \mid z)}}\right)\right), \qquad (3)$$

where E(ui) and Var(ui) are the expectation and variance of state ui.

In the case where only independent additive noise is present, equation (2) is rewritten as $u_i = \sum_{j=1}^{N} w_{ij} z_j + b_i + \eta_i$, and the expectation and variance are given by $E(u_i \mid z) = \sum_{j=1}^{N} w_{ij} z_j + b_i + E(\eta)$ and $\mathrm{Var}(u_i \mid z) = \mathrm{Var}(\eta)$, respectively. In this case, equation (3) is a sigmoidal neuron with an erf activation function with constant bias E(η) and constant slope set by Var(η). Thus, besides the sigmoidal activation function, the additive noise case does not endow the network with any extra properties.

In the case of multiplicative noise, equation (2) becomes $u_i = \sum_{j=1}^{N} \xi_{ij} w_{ij} z_j + b_i$, and its expectation and variance are given by $E(u_i \mid z) = E(\xi) \sum_{j=1}^{N} w_{ij} z_j$ and $\mathrm{Var}(u_i \mid z) = \mathrm{Var}(\xi) \sum_{j=1}^{N} w_{ij}^2$, respectively. In this derivation, we have used the fact that the square of a sign function is a constant function (sgn²(x) = 1). In contrast to the additive noise case, Var(ui|z) is proportional to the square of the input weight parameters.
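Equation (3) can be checked by simulation. Below is a small sanity-check sketch of our own (not from the paper; the sample counts and σ are arbitrary) comparing the closed-form activation probability to a Monte Carlo estimate under multiplicative Gaussian noise ξ ∼ N(1, σ²). In this Gaussian case the pre-activation is exactly Gaussian, so the erf expression is exact rather than only a central-limit approximation.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n, sigma = 200, 0.5
w = rng.standard_normal(n) / sqrt(n)    # weights of neuron i
z = np.sign(rng.standard_normal(n))     # binary network state in {-1, +1}

# Closed form (eq. 3): E(u|z) = w.z since E(xi) = 1, Var(u|z) = sigma^2 * sum_j w_j^2
mu = float(w @ z)
var = sigma**2 * float(np.sum(w**2))
p_theory = 0.5 * (1.0 + erf(mu / sqrt(2.0 * var)))

# Monte Carlo estimate of P(u >= 0 | z) with xi ~ N(1, sigma^2) per synapse
xi = rng.normal(1.0, sigma, size=(50_000, n))
p_mc = float(np.mean((xi * w) @ z >= 0.0))
```

The two estimates agree to within Monte Carlo error, which illustrates why the firing probability depends on the weights only through E(u|z) and Var(u|z).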
The probability of neurons being active becomes:

$$P(z_i = 1 \mid z) = \frac{1}{2}\left(1 + \operatorname{erf}(v_i \cdot z)\right), \qquad \text{with } v_i = \beta_i \frac{w_i}{\|w_i\|_2}, \qquad (4)$$

where βi here is a variable that captures the parameters of the noise process ξi. In the denominator, we have used the identity $\sqrt{\sum_j w_{ij}^2 z_j^2} = \sqrt{\sum_j w_{ij}^2} = \|w_i\|_2$, where ‖·‖₂ denotes the L2 norm of the weights of neuron i. This term has a normalizing effect on the activation function, similar to weight normalization, as discussed below. Note that the self-normalizing effect is not specific to the distribution of the chosen random variables, and holds as long as the random variables are iid.

One consequence of multiplicative noise here is that any positive scaling factor applied to wi is canceled out by the norm. To counter this problem and control βi without changing the distribution governing ξ, the NSM introduces a factor ai in the preactivation's equation:

$$u_i = \sum_{j=1}^{N} (\xi_{ij} + a_i) w_{ij} z_j + b_i. \qquad (5)$$

Thanks to the binary nature of zi, equation (5) is multiplication-free except for the term involving ai. Since ai is defined per neuron, the multiplication operation is only performed once per neuron and time step. In this article, we focus on two relevant cases of noise: Gaussian noise with mean 1 and variance σ², ξij ∼ N(1, σ²), and Bernoulli (blank-out) noise ξij ∼ Bernoulli(p), with parameter p.

From now on we focus only on the multiplicative noise case.

Gaussian Noise. In the case of multiplicative Gaussian noise, ξ in equation (5) is a Gaussian random variable ξ ∼ N(1, σ²).
This means that the expectation and variance are $E(u_i \mid z) = (1 + a_i) \sum_j w_{ij} z_j$ and $\mathrm{Var}(u_i \mid z) = \sigma^2 \sum_j w_{ij}^2$, respectively. Hence, $\beta_i = \frac{1 + a_i}{\sqrt{2\sigma^2}}$.

Bernoulli (Blank-out) Noise. Bernoulli ("blank-out") noise can be interpreted as a Dropout mask on the neurons or a Dropconnect mask on the synaptic weights (see Fig 1), where ξij ∈ {0, 1} in equation (5) becomes a Bernoulli random variable with parameter p. Since the ξij are independent, for a given z, a sufficiently large fan-in, and 0 < p < 1, the sums in equation (5) are Gaussian-distributed with means and variances $E(u_i \mid z) = (p + a_i) \sum_j w_{ij} z_j$ and $\mathrm{Var}(u_i \mid z) = p(1-p) \sum_j w_{ij}^2$, respectively. Therefore we obtain $\beta_i = \frac{p + a_i}{\sqrt{2p(1-p)}}$.

We observed empirically that whether the neuron is stochastic or the synapse is stochastic did not significantly affect the results.

2.2 NSMs Implement Weight Normalization

The key idea in weight normalization [47] is to normalize unit activity by reparameterizing the weight vectors. The reparameterization used there has the form $v_i = \beta_i \frac{w_i}{\|w_i\|}$. This is exactly the form obtained by introducing multiplicative noise in neurons (equation (4)), suggesting that NSMs inherently perform weight normalization in the sense of [47]. The authors argue that decoupling the magnitude and the direction of the weight vectors speeds up convergence and confers many of the features of batch normalization.
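The self-normalized form $v_i = \beta_i w_i / \|w_i\|$ with the blank-out value $\beta_i = (p + a_i)/\sqrt{2p(1-p)}$ can be checked numerically. The sketch below is our own illustration (the fan-in, p and a values are arbitrary): it compares the closed-form probability of equation (4) against a Monte Carlo simulation of the blank-out pre-activation of equation (5).

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
n, p, a = 500, 0.5, 0.5                 # fan-in, blank-out probability, scale factor a_i
w = rng.standard_normal(n)
z = np.sign(rng.standard_normal(n))     # binary network state in {-1, +1}

# Closed form (eq. 4): P(z_i = 1|z) = 0.5*(1 + erf(beta * w.z / ||w||))
beta = (p + a) / sqrt(2.0 * p * (1.0 - p))
p_theory = 0.5 * (1.0 + erf(beta * float(w @ z) / float(np.linalg.norm(w))))

# Monte Carlo over u = sum_j (xi_ij + a) w_ij z_j with xi_ij ~ Bernoulli(p)  (eq. 5)
hits, chunks, chunk = 0, 50, 1000
for _ in range(chunks):
    xi = (rng.random((chunk, n)) < p).astype(float)   # one random bit per synapse
    hits += int(np.sum(((xi + a) * w) @ z >= 0.0))
p_mc = hits / (chunks * chunk)
```

The agreement here rests on the central limit theorem, so it improves with fan-in; for the small fan-ins of early convolutional layers the Gaussian approximation is coarser.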
To achieve weight normalization effectively, gradient descent is performed with respect to the scalars β (which are themselves parameterized with ai) in addition to the weights w:

$$\partial_{\beta_i} L = \frac{\sum_j w_{ij}\, \partial_{v_{ij}} L}{\|w_i\|}, \qquad (6)$$

$$\partial_{w_{ij}} L = \frac{\beta_i}{\|w_i\|} \partial_{v_{ij}} L - \frac{\beta_i w_{ij}}{\|w_i\|^2} \partial_{\beta_i} L. \qquad (7)$$

2.3 NSM Training Procedure

Neural sampling machines (and stochastic neural networks in general) are challenging to train because errors cannot be directly back-propagated through stochastic nodes. This difficulty is compounded by the fact that the neuron state is a discrete random variable, and as such the standard reparametrization trick is not directly applicable [17]. Under these circumstances, unbiased estimators resort to minimizing expected costs through the family of REINFORCE algorithms [53, 7, 35] (also called score function estimators and likelihood-ratio estimators). Such algorithms have general applicability, but their gradient estimators have impractically high variance and require multiple passes in the network to estimate them [43]. Straight-through estimators ignore the non-linearity altogether [7], but result in networks with low performance. Several works have introduced methods to overcome this issue, such as discrete variational autoencoders [46], bias reduction techniques for the REINFORCE algorithm [18], the Concrete distribution approach (smooth relaxations of discrete random variables) [30], and other reparameterization tricks [48].

Striving for simplicity, here we propagate gradients through the neurons' activation probability function. This approach theoretically comes at a cost in accuracy because the rule is a biased estimate of the gradient of the loss, since the gradients are estimated using the activation probability. However, it is more efficient than REINFORCE algorithms as it uses the information provided by the gradient back-propagation algorithm.
In practice, we find that, provided adequate initialization, the gradients are well behaved and yield good performance while being able to leverage existing automatic differentiation capabilities of software libraries (e.g. gradients in Pytorch [40]). In the implementation of NSMs, probabilities are only computed for the gradients in the backward pass, while only binary states are propagated in the forward pass (see SI 4.2).

To assess the impact of this bias, we compare the above training method with Concrete relaxation, which is unbiased [30]. The NSM network is compatible with the binary case of Concrete relaxation. We trained the NSM using BinConcrete units on the MNIST data set (test error rate: 0.78%), and observed that the angles between the gradients of the proposed NSM and BinConcrete are close (see SI 4.11).

Unless otherwise stated, and similarly to [47], we use a data-dependent initialization of the magnitude parameters β and the bias parameters over one batch of 100 training samples, such that the preactivations to each layer have zero mean and unit variance over that batch:

$$\beta \leftarrow \frac{1}{\sigma}, \qquad b \leftarrow -\frac{\mu \|w\| \sqrt{2\,\mathrm{Var}(\xi)}}{\sigma}, \qquad (8)$$

where µ and σ are feature-wise means and standard deviations estimated over the minibatch. For all classification experiments, we used the cross-entropy loss $L^n = -\sum_i t_i^n \log p_i^n$, where n indexes the data sample and pi is the Softmax output. All simulations were performed using Pytorch [40]. All NSM layers were built as custom Pytorch layers (for more details about simulations see SI 4.8).1

Table 1: Classification error on the permutation invariant MNIST task (test set). Error is estimated by averaging test errors over 100 samples (for NSMs) and over the 50 last epochs.

Data set    Network                            Test error
PI MNIST    NSM 784-300-300-300-10             1.36%
PI MNIST    StNN 784-300-300-300-10            1.47%
PI MNIST    NSM scaled 784-300-300-300-10      1.38%

3 Experiments

3.1 Multi-layer NSM Outperforms Standard Stochastic Neural Networks in Speed and Accuracy

In order to characterize the classification abilities of the NSM, we trained a fully connected network on the MNIST handwritten digit image database for digit classification. The network consisted of three fully-connected layers of size 300 and a Softmax layer for 10-way classification, and all Bernoulli process parameters were set to p = .5. The NSM was trained using back-propagation and a softmax layer with cross-entropy loss and minibatches of size 100. As a baseline for comparison, we used the stochastic neural network (StNN) presented in [27] without biases, with a sigmoid activation probability Psig(zi = 1|z) = sigmoid(wi · z).

The results of this experiment are shown in Table 1. The 15th, 50th and 85th percentiles of the input distributions to the last hidden layer during training are shown in Fig. 2. The evolution of the distribution in the NSM case is more stable, suggesting that NSMs indeed prevent internal covariate shift.

Both the speed of convergence and the accuracy within 200 iterations are higher in the NSM compared to the StNN. The higher performance of the NSM is achieved using inference dynamics that are simpler than the StNN's (a sign activation function compared to a sigmoid activation function) and using binary random variables.

3.2 Robustness to Weight Fluctuations

The decoupling of the weight matrix as in $v_i = \beta_i \frac{w_i}{\|w_i\|}$ introduces several additional advantages in learning machines. During learning, the distribution of the weights for a layer tends to remain more stable in the NSM compared to the StNN (SI Fig. 4).
This feature can be exploited to mitigate saturation at the boundaries of fixed-range weight representations (e.g. in fixed-point representations or memristors). Another subtle advantage from an implementation point of view is that the probabilities are invariant to positive scaling of the weights, i.e. $\frac{\alpha w_i}{\|\alpha w_i\|} = \frac{w_i}{\|w_i\|}$ for α > 0. Table 1 shows that an NSM with weights multiplied by a constant factor .1 during inference (called NSM scaled in the table) did not significantly change the classification accuracy. This suggests that the NSM can be robust to common-mode fluctuations that may affect the rows of the weight matrix. Note that this property does not hold for ANNs with standard activation functions (relu, sigmoid, tanh), and the network performance is lost by such scaling (for more details see SI 4.5).

1https://github.com/nmi-lab/neural_sampling_machines

Figure 2: NSM mitigates internal covariate shift. 15th, 50th and 85th percentiles of the input distribution to the last hidden layer (similarly to Fig. 1 in [22]). The internal covariate shift is visible in the StNN as the input distributions change significantly during learning. The self-normalizing effect in the NSM performs weight normalization, which is known to mitigate this shift and speed up learning. Each iteration corresponds to one mini-batch update (100 data samples per mini-batch, 20000 data samples total).

3.3 Supervised Classification Experiments: MNIST Variants

We validate the effectiveness of NSMs in supervised classification experiments on the MNIST [26], EMNIST [11], N-MNIST [39], and DVS Gestures data sets (see Methods) using a convolutional architecture. For all data sets, the inputs were converted to −1/+1 binary values in a deterministic fashion using the function defined in equation (1).
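The scale invariance behind the NSM scaled experiment of Section 3.2 follows directly from equation (4) and is easy to reproduce. The sketch below is our own illustration (the vectors, the factor 0.1, and the function name `nsm_prob` are arbitrary choices): rescaling a row of weights leaves the NSM activation probability unchanged, while a sigmoid unit's output shifts.

```python
import numpy as np
from math import erf

def nsm_prob(w, z, beta=1.0):
    """NSM activation probability, eq. (4): 0.5*(1 + erf(beta * w.z / ||w||))."""
    return 0.5 * (1.0 + erf(beta * float(np.dot(w, z)) / float(np.linalg.norm(w))))

w = np.array([0.3, -1.2, 0.7, 2.0])
z = np.array([1.0, -1.0, 1.0, 1.0])      # binary state, w.z = 4.2

p_full = nsm_prob(w, z)
p_scaled = nsm_prob(0.1 * w, z)          # rescaled row: probability unchanged

sig_full = 1.0 / (1.0 + np.exp(-float(w @ z)))
sig_scaled = 1.0 / (1.0 + np.exp(-0.1 * float(w @ z)))   # sigmoid is not invariant
```

The cancellation happens because both the numerator and the L2 norm in the denominator scale linearly with α, which is what confers the rejection of common-mode weight fluctuations.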
For the MNIST variants we trained all the networks for 200 epochs, presenting the entire dataset at each epoch. To test the accuracy of the networks we used the entire test dataset, sampling each minibatch 100 times.

NSM models with Gaussian noise (gNSM) and Bernoulli noise (bNSM) converged to similar or better accuracy compared to the architecturally equivalent deterministic models. The results for MNIST, EMNIST and N-MNIST are given in Table 2, where we compare with the counterpart deterministic convolutional neural network (see Table 6 in the SI). In addition, we compared with a binary (sign non-linearity) deterministic network (BD), a binary deterministic network with weight normalization (wBD), a stochastic network (noisy rectifier [7]) (SN), and a deterministic binary network (BN). We trained the first three networks using a Straight-Through Estimator [7] (STE) and the latter one using the erf function in the backward pass only (i.e., gradients computed on the erf function). The architecture of all four networks is the same as the NSM's. The training process was the same as for the NSM networks, and the results are given in Table 2. From these results we conclude that the NSM training procedure provides better performance than the STE and than normalization of binary deterministic networks trained in a similar way as the NSM (e.g., BN).

3.4 Supervised Classification Experiments: CIFAR10/100

We tested the NSM on the CIFAR10 and CIFAR100 datasets of natural images. We used the model architecture described in [47] and added an extra input convolutional layer to convert RGB intensities into binary values. The NSM non-linearities are sign functions given by equation (1). We used the Adam [24] optimizer with initial learning rate 0.0003, and we trained for 200 epochs using a batch size of 100 over the entire CIFAR10/100 data sets (50K/10K images for training and testing, respectively).
The test error was computed after each epoch by running each batch 100 times (MC samples) with different seeds; classification was thus made on the average over the MC samples. After 100 epochs we started decaying the learning rate linearly and we changed the first moment from 0.9 to 0.5. The results are given in Table 3. For the NSM networks we tried two different types of initialization. First, we initialized the weights with the values of the already-trained deterministic network weights. Second, in order to verify that the initialization does not dramatically affect the training, we initialized the NSM without using any pre-trained weights. In both cases the performance of the NSM was similar, as indicated in Table 3. We compared with the counterpart deterministic implementation using the exact same parameters and the same additional input convolutional layer.

Table 2: (Top) Classification error on MNIST datasets. Error is estimated by averaging test errors over 100 samples (for NSMs), 5 runs, and over the 10 last epochs. Prefix: d-deterministic, b-Bernoulli, g-Gaussian. (Bottom) Comparison of networks on the MNIST classification task. The NSM variations Bernoulli (bNSM) and Gaussian (gNSM) are compared with an NSM trained with a Straight-Through Estimator instead of the proposed training algorithm, a deterministic binary (sign non-linearity) network (BD), a BD with weight normalization enabled (wBD), a stochastic network (noisy rectifier) (SN) and a binary network (BN). For more details see section 3.3 in the main text.

Dataset    dCNN      bNSM      gNSM
MNIST      0.880%    0.775%    0.805%
EMNIST     6.938%    6.185%    6.256%
NMNIST     0.927%    0.689%    0.701%

Model    bNSM     gNSM     bNSM (STE)    BD      wBD     SN      BN
Error    0.775    0.805    2.13          3.11    2.72    2.05    1.10

Table 3: Classification error on CIFAR10/CIFAR100. Error is estimated by sampling each mini-batch 100 times (MC samples) and finally averaging over all 100 samples (for NSMs), 5 runs and over the 10 last epochs. Prefix: d-deterministic, b-blank-out, g-Gaussian. The * indicates a network that has not been initialized with pre-trained weights (see main text).

Dataset        Model     Error
CIFAR10/100    bNSM      9.98% / 34.85%
CIFAR10/100    gNSM      10.35% / 34.84%
CIFAR10/100    dCNN      10.47% / 34.37%
CIFAR10/100    bNSM*     9.94% / 35.19%
CIFAR10/100    gNSM*     9.81% / 34.93%

Table 4: Classification error on the DVS Gestures data set. Error is estimated by averaging test errors over 100 samples and over the 10 last epochs. Prefix: d-deterministic, b-blank-out, g-Gaussian.

Dataset         Model       Error
DVS Gestures    IBM EEDN    8.23%
DVS Gestures    bNSM        8.56%
DVS Gestures    gNSM        8.83%
DVS Gestures    dCNN        9.16%

3.5 Supervised Classification Experiments: DVS Gestures

Binary neural networks such as the NSM are particularly suitable for discrete or binary data. Neuromorphic sensors such as Dynamic Vision Sensors (DVS) that output streams of events fall into this category and can transduce visual or auditory spatiotemporal patterns into parallel, microsecond-precise streams of events [29].

Amir et al. recorded the DVS Gesture data set using a Dynamic Vision Sensor (DVS), comprising 1342 instances of a set of 11 hand and arm gestures, collected from 29 subjects under 3 different lighting conditions.
Unlike standard imagers, the DVS records streams of events that signal the temporal intensity changes at each of its 128 × 128 pixels. The unique features of each gesture are embedded in the stream of events. To process these streams, we closely follow the pre-processing in [5], where event streams were downsized to 64 × 64 and binned in frames of 16 ms. The input of the neural network was formed by 6 frames (channels), and only ON (positive polarity) events were used. Similarly to [5], 23 subjects are used for the training set, and the remaining 6 subjects are reserved for testing. We note that the network used in this work is much smaller than the one used in [5].

We adapted a model based on the all-convolutional networks of [49]. Compared to the original model, our adaptation includes an additional group of three convolutions and one pooling layer, to account for the larger image size compared to the CIFAR10 data set used in [49], and a number of output classes that matches that of the DVS Gestures data set (11 classes). See SI Tab. 7 for a detailed listing of the layers. We trained the network for 200 epochs using a batch size of 100. For the NSM network we initialized the weights using the converged weights of the deterministic network. This makes learning more robust and leads to faster convergence.

We find that the smaller models of [49] (in terms of layers and number of neurons) are faster to train and perform equally well when executed on GPU compared to the EEDN used in [5]. The models reported in Amir et al. were optimized for implementation in digital neuromorphic hardware, which strongly constrains weights, connectivity and neural activation functions in favor of energy efficiency.

4 Conclusions

Stochasticity is a powerful mechanism for improving the computational features of neural networks, including regularization and Monte Carlo sampling.
This work builds on the regularization effect of stochasticity in neural networks, and demonstrates that it naturally induces a normalizing effect on the activation function. Normalization is a powerful feature used in most modern deep neural networks [22, 45, 47], and mitigates internal covariate shift. Interestingly, this normalization effect may provide an alternative mechanism for divisive normalization in biological neural networks [10].

Our results demonstrate that NSMs can (i) outperform standard stochastic networks on standard machine learning benchmarks in convergence speed and accuracy, and (ii) perform close to deterministic feed-forward networks when the data is of a discrete nature. This is achieved using strictly simpler inference dynamics that are well suited for emerging nanodevices, and argues strongly in favor of exploiting stochasticity in the devices for deep learning. Several implementation advantages accrue from this approach: it is an online alternative to batch normalization and dropout, it mitigates saturation at the boundaries of fixed-range weight representations, and it confers robustness against certain spurious fluctuations affecting the rows of the weight matrix.

Although feed-forward passes in networks can be implemented free of multiplications, the weight update rule is more involved, as it requires multiplications, calculating the row-wise L2-norms of the weight matrices, and the derivatives of the erf function. However, these terms are shared across all connections fanning into a neuron, such that the overhead in computing them is reasonably small. Furthermore, based on existing work, we speculate that approximating the learning rule either by hand [37] or automatically [6] can lead to near-optimal learning performance while being implemented with simple primitives.

References

[1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines.
Cognitive Science, 9(1):147–169, 1985.

[2] Maruan Al-Shedivat, Rawan Naous, Gert Cauwenberghs, and Khaled Nabil Salama. Memristors empower spiking neurons with stochasticity. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 5(2):242–253, 2015.

[3] Stefano Ambrogio, Simone Balatti, Antonio Cubeta, Alessandro Calderoni, Nirmal Ramaswamy, and Daniele Ielmini. Statistical fluctuations in HfOx resistive-switching memory: Part I – Set/reset variability. IEEE Transactions on Electron Devices, 61(8):2912–2919, 2014.

[4] Stefano Ambrogio, Simone Balatti, Antonio Cubeta, Alessandro Calderoni, Nirmal Ramaswamy, and Daniele Ielmini. Statistical fluctuations in HfOx resistive-switching memory: Part II – Random telegraph noise. IEEE Transactions on Electron Devices, 61(8):2920–2927, 2014.

[5] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7243–7252, 2017.

[6] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

[7] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[8] Tiago Branco and Kevin Staras. The probability of neurotransmitter release: variability and feedback control at single synapses. Nature Reviews Neuroscience, 10(5):373, 2009.

[9] L. Buesing, J. Bill, B. Nessler, and W. Maass.
Neural dynamics as sampling: A model for stochastic computation in recurrent networks of spiking neurons. PLoS Computational Biology, 7(11):e1002211, 2011.

[10] Matteo Carandini and David J Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13(1):51–62, 2012.

[11] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.

[12] David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. In Advances in Neural Information Processing Systems, pages 705–712, 1995.

[13] Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[14] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[15] S. B. Eryilmaz, E. Neftci, S. Joshi, S. Kim, M. BrightSky, H. L. Lung, C. Lam, G. Cauwenberghs, and H. S. P. Wong. Training a probabilistic graphical model with resistive switching electronic synapses. IEEE Transactions on Electron Devices, 63(12):5004–5011, Dec 2016.

[16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.

[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[18] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.

[19] Julia J Harris, Renaud Jolivet, and David Attwell. Synaptic energy use and supply. Neuron, 75(5):762–777, 2012.

[20] Geoffrey E Hinton.
Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[21] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[23] Matthew Jerry, Pai-Yu Chen, Jianchi Zhang, Pankaj Sharma, Kai Ni, Shimeng Yu, and Suman Datta. Ferroelectric FET analog synapse for acceleration of deep neural network training. In 2017 IEEE International Electron Devices Meeting (IEDM), pages 6.2.1–6.2.4. IEEE, 2017.

[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[25] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971–980, 2017.

[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[27] Dong-Hyun Lee, Saizheng Zhang, Antoine Biard, and Yoshua Bengio. Target propagation. arXiv preprint arXiv:1412.7525, 2014.

[28] William B Levy and Robert A Baxter. Energy-efficient neuronal computation via quantal synaptic failures. The Journal of Neuroscience, 22(11):4746–4755, 2002.

[29] S.-C. Liu and T. Delbruck. Neuromorphic sensory systems. Current Opinion in Neurobiology, 20(3):288–295, 2010.

[30] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

[31] Mark D McDonnell and Lawrence M Ward.
The benefits of noise in neural systems: bridging theory and experiment. Nature Reviews Neuroscience, 12(7), 2011.

[32] Rubén Moreno-Bote. Poisson-like spiking in circuits with probabilistic synapses. PLoS Computational Biology, 10(7):e1003522, 2014.

[33] Halid Mulaosmanovic, Thomas Mikolajick, and Stefan Slesazeck. Random number generation based on ferroelectric switching. IEEE Electron Device Letters, 39(1):135–138, 2017.

[34] Rawan Naous, Maruan AlShedivat, Emre Neftci, Gert Cauwenberghs, and Khaled Nabil Salama. Memristor-based neural networks: Synaptic versus neuronal stochasticity. AIP Advances, 6(11):111304, 2016.

[35] Radford M Neal. Learning stochastic feedforward networks. Technical report, Department of Computer Science, University of Toronto, 1990.

[36] E. Neftci, S. Das, B. Pedroni, K. Kreutz-Delgado, and G. Cauwenberghs. Event-driven contrastive divergence for spiking neuromorphic systems. Frontiers in Neuroscience, 7(272), Jan 2014.

[37] Emre Neftci, Charles Augustine, Somnath Paul, and Georgios Detorakis. Event-driven random back-propagation: Enabling neuromorphic deep learning machines. In 2017 IEEE International Symposium on Circuits and Systems, May 2017.

[38] Emre O Neftci, Bruno Umbria Pedroni, Siddharth Joshi, Maruan Al-Shedivat, and Gert Cauwenberghs. Stochastic synapses enable efficient brain-inspired learning machines. Frontiers in Neuroscience, 10(241), 2016.

[39] Garrick Orchard, Ajinkya Jayawant, Gregory K. Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience, 9, Nov 2015.

[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch.
2017.

[41] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[42] Damien Querlioz, Olivier Bichler, Adrien Francis Vincent, and Christian Gamrat. Bioinspired programming of memory devices for implementing an inference engine. Proceedings of the IEEE, 103(8):1398–1416, 2015.

[43] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.

[44] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[45] Mengye Ren, Renjie Liao, Raquel Urtasun, Fabian H Sinz, and Richard S Zemel. Normalizing the normalizers: Comparing and extending network normalization schemes. arXiv preprint arXiv:1611.04520, 2016.

[46] Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.

[47] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.

[48] Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. arXiv preprint arXiv:1710.07739, 2017.

[49] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[50] Yichuan Tang and Ruslan Salakhutdinov. A new learning algorithm for stochastic feedforward neural nets. In ICML 2013 Workshop on Challenges in Representation Learning, 2013.

[51] B Walmsley, FR Edwards, and DJ Tracey.
The probabilistic nature of synaptic transmission at a mammalian excitatory central synapse. Journal of Neuroscience, 7(4):1037–1046, 1987.

[52] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.

[53] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[54] S. Yu, Z. Li, P. Chen, H. Wu, B. Gao, D. Wang, W. Wu, and H. Qian. Binary neural network with 16 Mb RRAM macro chip for classification and online training. In 2016 IEEE International Electron Devices Meeting (IEDM), pages 16.2.1–16.2.4, Dec 2016.