{"title": "BinaryConnect: Training Deep Neural Networks with binary weights during propagations", "book": "Advances in Neural Information Processing Systems", "page_first": 3123, "page_last": 3131, "abstract": "Deep Neural Networks (DNN) have achieved state-of-the-art results in a wide range of tasks, with the best results obtained with large training sets and large models. In the past, GPUs enabled these breakthroughs because of their greater computational speed. In the future, faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on low-power devices. As a result, there is much interest in research and development of dedicated hardware for Deep Learning (DL). Binary weights, i.e., weights which are constrained to only two possible values (e.g. -1 or 1), would bring great benefits to specialized DL hardware by replacing many multiply-accumulate operations by simple accumulations, as multipliers are the most space and power-hungry components of the digital implementation of neural networks. We introduce BinaryConnect, a method which consists in training a DNN with binary weights during the forward and backward propagations, while retaining precision of the stored weights in which gradients are accumulated. 
Like other dropout schemes, we show that BinaryConnect acts as a regularizer and we obtain near state-of-the-art results with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN.", "full_text": "BinaryConnect: Training Deep Neural Networks with binary weights during propagations\n\nMatthieu Courbariaux\n\u00c9cole Polytechnique de Montr\u00e9al\nmatthieu.courbariaux@polymtl.ca\n\nYoshua Bengio\nUniversit\u00e9 de Montr\u00e9al, CIFAR Senior Fellow\nyoshua.bengio@gmail.com\n\nJean-Pierre David\n\u00c9cole Polytechnique de Montr\u00e9al\njean-pierre.david@polymtl.ca\n\nAbstract\n\nDeep Neural Networks (DNN) have achieved state-of-the-art results in a wide range of tasks, with the best results obtained with large training sets and large models. In the past, GPUs enabled these breakthroughs because of their greater computational speed. In the future, faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on low-power devices. As a result, there is much interest in research and development of dedicated hardware for Deep Learning (DL). Binary weights, i.e., weights which are constrained to only two possible values (e.g. -1 or 1), would bring great benefits to specialized DL hardware by replacing many multiply-accumulate operations by simple accumulations, as multipliers are the most space and power-hungry components of the digital implementation of neural networks. We introduce BinaryConnect, a method which consists in training a DNN with binary weights during the forward and backward propagations, while retaining precision of the stored weights in which gradients are accumulated. 
Like other dropout schemes, we show that BinaryConnect acts as a regularizer and we obtain near state-of-the-art results with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN.\n\n1 Introduction\n\nDeep Neural Networks (DNN) have substantially pushed the state-of-the-art in a wide range of tasks, especially in speech recognition [1, 2] and computer vision, notably object recognition from images [3, 4]. More recently, deep learning is making important strides in natural language processing, especially statistical machine translation [5, 6, 7]. Interestingly, one of the key factors that enabled this major progress has been the advent of Graphics Processing Units (GPUs), with speed-ups on the order of 10 to 30-fold, starting with [8], and similar improvements with distributed training [9, 10]. Indeed, the ability to train larger models on more data has enabled the kind of breakthroughs observed in the last few years. Today, researchers and developers designing new deep learning algorithms and applications often find themselves limited by computational capability. This, along with the drive to put deep learning systems on low-power devices (unlike GPUs), is greatly increasing the interest in research and development of specialized hardware for deep networks [11, 12, 13].\n\nMost of the computation performed during training and application of deep networks involves the multiplication of a real-valued weight by a real-valued activation (in the recognition or forward propagation phase of the back-propagation algorithm) or gradient (in the backward propagation phase of the back-propagation algorithm). This paper proposes an approach called BinaryConnect to eliminate the need for these multiplications by forcing the weights used in these forward and backward propagations to be binary, i.e. constrained to only two values (not necessarily 0 and 1). 
We show that state-of-the-art results can be achieved with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN.\n\nWhat makes this workable are two ingredients:\n\n1. Sufficient precision is necessary to accumulate and average a large number of stochastic gradients, but noisy weights (and we can view discretization into a small number of values as a form of noise, especially if we make this discretization stochastic) are quite compatible with Stochastic Gradient Descent (SGD), the main type of optimization algorithm for deep learning. SGD explores the space of parameters by making small and noisy steps and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to keep sufficient resolution for these accumulators, which at first sight suggests that high precision is absolutely required. [14] and [15] show that randomized or stochastic rounding can be used to provide unbiased discretization. [14] have shown that SGD requires weights with a precision of at least 6 to 8 bits and [16] successfully train DNNs with 12-bit dynamic fixed-point computation. Besides, the estimated precision of the brain's synapses varies between 6 and 12 bits [17].\n\n2. Noisy weights actually provide a form of regularization which can help to generalize better, as previously shown with variational weight noise [18], Dropout [19, 20] and DropConnect [21], which add noise to the activations or to the weights. For instance, DropConnect [21], which is closest to BinaryConnect, is a very efficient regularizer that randomly substitutes half of the weights with zeros during propagations. 
What these previous works show is that only the expected value of the weight needs to have high precision, and that noise can actually be beneficial.\n\nThe main contributions of this article are the following.\n\n\u2022 We introduce BinaryConnect, a method which consists in training a DNN with binary weights during the forward and backward propagations (Section 2).\n\u2022 We show that BinaryConnect is a regularizer and we obtain near state-of-the-art results on the permutation-invariant MNIST, CIFAR-10 and SVHN (Section 3).\n\u2022 We make the code for BinaryConnect available (https://github.com/MatthieuCourbariaux/BinaryConnect).\n\n2 BinaryConnect\n\nIn this section we give a more detailed view of BinaryConnect, considering which two values to choose, how to discretize, how to train and how to perform inference.\n\n2.1 +1 or \u22121\n\nApplying a DNN mainly consists of convolutions and matrix multiplications. The key arithmetic operation of DL is thus the multiply-accumulate operation. Artificial neurons are basically multiply-accumulators computing weighted sums of their inputs.\n\nBinaryConnect constrains the weights to either +1 or \u22121 during propagations. As a result, many multiply-accumulate operations are replaced by simple additions (and subtractions). This is a huge gain, as fixed-point adders are much less expensive both in terms of area and energy than fixed-point multiply-accumulators [22].\n\n2.2 Deterministic vs stochastic binarization\n\nThe binarization operation transforms the real-valued weights into the two possible values. A very straightforward binarization operation would be based on the sign function:\n\nwb = +1 if w \u2265 0, \u22121 otherwise, (1)\n\nwhere wb is the binarized weight and w the real-valued weight. 
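As a concrete sketch, the sign-based binarization of Eq. (1) amounts to a single vectorized operation. The following NumPy snippet is our illustration (the function name is ours, not from the paper):

```python
import numpy as np

def binarize_deterministic(w):
    """Deterministic binarization (Eq. 1): +1 where w >= 0, -1 otherwise."""
    return np.where(w >= 0, 1.0, -1.0)
```

Applied element-wise to a weight matrix, this maps every real-valued weight to +1 or -1 before the forward and backward passes.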
Although this is a deterministic operation, averaging this discretization over the many input weights of a hidden unit could compensate for the loss of information. An alternative that allows a finer and more correct averaging process to take place is to binarize stochastically:\n\nwb = +1 with probability p = \u03c3(w), \u22121 with probability 1 \u2212 p, (2)\n\nwhere \u03c3 is the \u201chard sigmoid\u201d function:\n\n\u03c3(x) = clip((x + 1)/2, 0, 1) = max(0, min(1, (x + 1)/2)). (3)\n\nWe use such a hard sigmoid rather than the soft version because it is far less computationally expensive (both in software and specialized hardware implementations) and yielded excellent results in our experiments. It is similar to the \u201chard tanh\u201d non-linearity introduced by [23]. It is also piece-wise linear and corresponds to a bounded form of the rectifier [24].\n\n2.3 Propagations vs updates\n\nLet us consider the different steps of back-propagation with SGD updates and whether it makes sense, or not, to discretize the weights at each of these steps.\n\n1. Given the DNN input, compute the unit activations layer by layer, leading to the top layer which is the output of the DNN, given its input. This step is referred to as the forward propagation.\n\n2. Given the DNN target, compute the training objective\u2019s gradient w.r.t. each layer\u2019s activations, starting from the top layer and going down layer by layer until the first hidden layer. This step is referred to as the backward propagation or backward phase of back-propagation.\n\n3. Compute the gradient w.r.t. each layer\u2019s parameters and then update the parameters using their computed gradients and their previous values. This step is referred to as the parameter update.\n\nAlgorithm 1 SGD training with BinaryConnect. C is the cost function for the minibatch and the functions binarize(w) and clip(w) specify how to binarize and clip the weights. 
L is the number of layers.\n\nRequire: a minibatch of (inputs, targets), previous parameters wt\u22121 (weights) and bt\u22121 (biases), and learning rate \u03b7.\nEnsure: updated parameters wt and bt.\n\n1. Forward propagation:\n   wb \u2190 binarize(wt\u22121)\n   For k = 1 to L, compute ak knowing ak\u22121, wb and bt\u22121\n2. Backward propagation:\n   Initialize the output layer\u2019s activations gradient \u2202C/\u2202aL\n   For k = L to 2, compute \u2202C/\u2202ak\u22121 knowing \u2202C/\u2202ak and wb\n3. Parameter update:\n   Compute \u2202C/\u2202wb and \u2202C/\u2202bt\u22121 knowing \u2202C/\u2202ak and ak\u22121\n   wt \u2190 clip(wt\u22121 \u2212 \u03b7 \u2202C/\u2202wb)\n   bt \u2190 bt\u22121 \u2212 \u03b7 \u2202C/\u2202bt\u22121\n\nA key point to understand with BinaryConnect is that we only binarize the weights during the forward and backward propagations (steps 1 and 2) but not during the parameter update (step 3), as illustrated in Algorithm 1. Keeping good-precision weights during the updates is necessary for SGD to work at all. These parameter changes are tiny by virtue of being obtained by gradient descent, i.e., SGD performs a large number of almost infinitesimal changes in the direction that most improves the training objective (plus noise). 
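To make Algorithm 1 concrete, here is a minimal NumPy sketch of one BinaryConnect update step for a toy single linear layer with a squared loss. This is our illustration only (the function names, the toy layer and the loss are ours; the paper's experiments use hinge losses, deeper networks and momentum/ADAM):

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_sigmoid(x):
    # sigma(x) = clip((x + 1) / 2, 0, 1), Eq. (3)
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_stochastic(w):
    # wb = +1 with probability sigma(w), -1 otherwise, Eq. (2)
    return np.where(rng.random(w.shape) < hard_sigmoid(w), 1.0, -1.0)

def sgd_step(w_real, x, y, lr=0.01):
    """One BinaryConnect SGD step for y_hat = x @ wb with a squared loss.

    Binary weights wb are used for the forward and backward passes
    (steps 1 and 2); the update is accumulated in the real-valued
    weights, which are then clipped to [-1, 1] (step 3 of Algorithm 1).
    """
    wb = binarize_stochastic(w_real)       # step 1: binarize
    y_hat = x @ wb                         # forward pass uses wb
    grad_out = 2.0 * (y_hat - y) / len(y)  # dC/dy_hat for mean squared error
    grad_wb = x.T @ grad_out               # step 2: gradient w.r.t. wb
    return np.clip(w_real - lr * grad_wb, -1.0, 1.0)  # step 3: update + clip
```

Note that the gradient is computed with respect to the binary weights wb, but the update and the clipping are applied to the real-valued weights, which act as the accumulator.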
One way to picture all this is to hypothesize that what matters most at the end of training is the sign of the weights, w\u2217, but that in order to figure it out, we perform a lot of small changes to a continuous-valued quantity w, and only at the end consider its sign:\n\nw\u2217 = sign(\u2211t gt), (4)\n\nwhere gt is a noisy estimator of \u2202C(f(xt, wt\u22121, bt\u22121), yt)/\u2202wt\u22121, where C(f(xt, wt\u22121, bt\u22121), yt) is the value of the objective function on (input, target) example (xt, yt), wt\u22121 are the previous weights and w\u2217 is the final discretized value of the weights.\n\nAnother way to conceive of this discretization is as a form of corruption, and hence as a regularizer, and our empirical results confirm this hypothesis. In addition, we can make the discretization errors on different weights approximately cancel each other while keeping a lot of precision by randomizing the discretization appropriately. We propose a form of randomized discretization that preserves the expected value of the discretized weight.\n\nHence, at training time, BinaryConnect randomly picks one of two values for each weight, for each minibatch, for both the forward and backward propagation phases of backprop. However, the SGD update is accumulated in a real-valued variable storing the parameter.\n\nAn interesting analogy to understand BinaryConnect is the DropConnect algorithm [21]. Just like BinaryConnect, DropConnect only injects noise into the weights during the propagations. 
Whereas DropConnect\u2019s noise is added Gaussian noise, BinaryConnect\u2019s noise is a binary sampling process. In both cases the corrupted value has the clean original value as its expected value.\n\n2.4 Clipping\n\nSince the binarization operation is not influenced by variations of the real-valued weights w when their magnitude is beyond the binary values \u00b11, and since it is a common practice to bound weights (usually the weight vector) in order to regularize them, we have chosen to clip the real-valued weights to the [\u22121, 1] interval right after the weight updates, as per Algorithm 1. The real-valued weights would otherwise grow very large without any impact on the binary weights.\n\n2.5 A few more tricks\n\nOptimization      | No learning rate scaling | Learning rate scaling\nSGD               | \u2014                       | 11.45%\nNesterov momentum | 15.65%                  | 11.30%\nADAM              | 12.81%                  | 10.47%\n\nTable 1: Test error rates of a (small) CNN trained on CIFAR-10 depending on the optimization method and on whether the learning rate is scaled with the weights initialization coefficients from [25].\n\nWe use Batch Normalization (BN) [26] in all of our experiments, not only because it accelerates the training by reducing internal covariate shift, but also because it reduces the overall impact of the weights\u2019 scale. Moreover, we use the ADAM learning rule [27] in all of our CNN experiments. Last but not least, we scale the weights\u2019 learning rates with the weights initialization coefficients from [25] when optimizing with ADAM, and with the squares of those coefficients when optimizing with SGD or Nesterov momentum [28]. Table 1 illustrates the effectiveness of those tricks.\n\n2.6 Test-Time Inference\n\nUp to now we have introduced different ways of training a DNN with on-the-fly weight binarization. What are reasonable ways of using such a trained network, i.e., performing test-time inference on new examples? We have considered three reasonable alternatives:\n\n1. 
Use the resulting binary weights wb (this makes most sense with the deterministic form of BinaryConnect).\n\n2. Use the real-valued weights w, i.e., the binarization only helps to achieve faster training but not faster test-time performance.\n\n3. In the stochastic case, many different networks can be sampled by sampling a wb for each weight according to Eq. 2. The ensemble output of these networks can then be obtained by averaging the outputs from individual networks.\n\nWe use the first method with the deterministic form of BinaryConnect. As for the stochastic form of BinaryConnect, we focused on the training advantage and used the second method in the experiments, i.e., test-time inference using the real-valued weights. This follows the practice of Dropout methods, where at test-time the \u201cnoise\u201d is removed.\n\nMethod                      | MNIST         | CIFAR-10 | SVHN\nNo regularizer              | 1.30 \u00b1 0.04% | 10.64%   | 2.44%\nBinaryConnect (det.)        | 1.29 \u00b1 0.08% | 9.90%    | 2.30%\nBinaryConnect (stoch.)      | 1.18 \u00b1 0.04% | 8.27%    | 2.15%\n50% Dropout                 | 1.01 \u00b1 0.04% |          |\nMaxout Networks [29]        | 0.94%         | 11.68%   | 2.47%\nDeep L2-SVM [30]            | 0.87%         |          |\nNetwork in Network [31]     |               | 10.41%   | 2.35%\nDropConnect [21]            |               |          | 1.94%\nDeeply-Supervised Nets [32] |               | 9.78%    | 1.92%\n\nTable 2: Test error rates of DNNs trained on MNIST (no convolution and no unsupervised pretraining), CIFAR-10 (no data augmentation) and SVHN, depending on the method. We see that in spite of using only a single bit per weight during propagation, performance is not worse than that of ordinary (no regularizer) DNNs; it is actually better, especially with the stochastic version, suggesting that BinaryConnect acts as a regularizer.\n\nFigure 1: Features of the first layer of an MLP trained on MNIST depending on the regularizer. 
From left to right: no regularizer, deterministic BinaryConnect, stochastic BinaryConnect and Dropout.\n\n3 Benchmark results\n\nIn this section, we show that BinaryConnect acts as a regularizer and we obtain near state-of-the-art results with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN.\n\n3.1 Permutation-invariant MNIST\n\nMNIST is a benchmark image classification dataset [33]. It consists of a training set of 60000 and a test set of 10000 28 \u00d7 28 gray-scale images representing digits ranging from 0 to 9. Permutation-invariance means that the model must be unaware of the image (2-D) structure of the data (in other words, CNNs are forbidden). Besides, we do not use any data augmentation, preprocessing or unsupervised pretraining. The MLP we train on MNIST consists of 3 hidden layers of 1024 Rectifier Linear Units (ReLU) [34, 24, 3] and an L2-SVM output layer (L2-SVM has been shown to perform better than Softmax on several classification benchmarks [30, 32]). The square hinge loss is minimized with SGD without momentum. We use an exponentially decaying learning rate. We use Batch Normalization with a minibatch of size 200 to speed up the training.\n\nFigure 2: Histogram of the weights of the first layer of an MLP trained on MNIST depending on the regularizer. In both cases, it seems that the weights are trying to become deterministic to reduce the training error. It also seems that some of the weights of deterministic BinaryConnect are stuck around 0, hesitating between \u22121 and 1.\n\nFigure 3: Training curves of a CNN on CIFAR-10 depending on the regularizer. The dotted lines represent the training costs (square hinge losses) and the continuous lines the corresponding validation error rates. Both versions of BinaryConnect significantly increase the training cost, slow down the training and lower the validation error rate, which is what we would expect from a Dropout scheme. 
As typically done, we use the last 10000 samples of the training set as a validation set for early stopping and model selection. We report the test error rate associated with the best validation error rate after 1000 epochs (we do not retrain on the validation set). We repeat each experiment 6 times with different initializations. The results are in Table 2. They suggest that the stochastic version of BinaryConnect can be considered a regularizer, although a slightly less powerful one than Dropout, in this context.\n\n3.2 CIFAR-10\n\nCIFAR-10 is a benchmark image classification dataset. It consists of a training set of 50000 and a test set of 10000 32 \u00d7 32 color images representing airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks. We preprocess the data using global contrast normalization and ZCA whitening. We do not use any data augmentation (which can really be a game changer for this dataset [35]). The architecture of our CNN is:\n\n(2\u00d7128C3)\u2212MP2\u2212(2\u00d7256C3)\u2212MP2\u2212(2\u00d7512C3)\u2212MP2\u2212(2\u00d71024FC)\u221210SVM, (5)\n\nwhere C3 is a 3 \u00d7 3 ReLU convolution layer, MP2 is a 2 \u00d7 2 max-pooling layer, FC a fully connected layer, and SVM an L2-SVM output layer. This architecture is greatly inspired by VGG [36]. The square hinge loss is minimized with ADAM. We use an exponentially decaying learning rate. We use Batch Normalization with a minibatch of size 50 to speed up the training. We use the last 5000 samples of the training set as a validation set. We report the test error rate associated with the best validation error rate after 500 training epochs (we do not retrain on the validation set). The results are in Table 2 and Figure 3.\n\n3.3 SVHN\n\nSVHN is a benchmark image classification dataset. It consists of a training set of 604K and a test set of 26K 32 \u00d7 32 color images representing digits ranging from 0 to 9. 
We follow the same procedure that we used for CIFAR-10, with a few notable exceptions: we use half the number of hidden units and we train for 200 epochs instead of 500 (because SVHN is quite a big dataset). The results are in Table 2.\n\n4 Related works\n\nTraining DNNs with binary weights has been the subject of very recent works [37, 38, 39, 40]. Even though we share the same objective, our approaches are quite different. [37, 38] do not train their DNN with Backpropagation (BP) but with a variant called Expectation Backpropagation (EBP). EBP is based on Expectation Propagation (EP) [41], which is a variational Bayes method used to do inference in probabilistic graphical models. Let us compare their method to ours:\n\n\u2022 It optimizes the weights\u2019 posterior distribution (which is not binary). In this regard, our method is quite similar, as we keep a real-valued version of the weights.\n\u2022 It binarizes both the neurons\u2019 outputs and the weights, which is more hardware friendly than just binarizing the weights.\n\u2022 It yields a good classification accuracy for fully connected networks (on MNIST) but not (yet) for ConvNets.\n\n[39, 40] retrain neural networks with ternary weights during forward and backward propagations, i.e.:\n\n\u2022 They train a neural network with high precision,\n\u2022 After training, they ternarize the weights to three possible values \u2212H, 0 and +H and adjust H to minimize the output error,\n\u2022 And eventually, they retrain with ternary weights during propagations and high-precision weights during updates.\n\nBy comparison, we train all the way with binary weights during propagations, i.e., our training procedure could be implemented with efficient specialized hardware avoiding the forward and backward propagation multiplications, which amounts to about 2/3 of the multiplications (cf. 
Algorithm 1).\n\n5 Conclusion and future works\n\nWe have introduced a novel binarization scheme for weights during forward and backward propagations called BinaryConnect. We have shown that it is possible to train DNNs with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN datasets and achieve nearly state-of-the-art results. The impact of such a method on specialized hardware implementations of deep networks could be major, by removing the need for about 2/3 of the multiplications, and thus potentially allowing a speed-up by a factor of 3 at training time. With the deterministic version of BinaryConnect the impact at test time could be even more important, getting rid of the multiplications altogether and reducing the memory requirement of deep networks by a factor of at least 16 (from 16-bit floating-point precision to single-bit precision), which has an impact on the memory-to-computation bandwidth and on the size of the models that can be run. Future work should extend those results to other models and datasets, and explore getting rid of the multiplications altogether during training, by removing their need from the weight update computation.\n\n6 Acknowledgments\n\nWe thank the reviewers for their many constructive comments. We also thank Roland Memisevic for helpful discussions. We thank the developers of Theano [42, 43], a Python library which allowed us to easily develop fast and optimized code for GPU. We also thank the developers of Pylearn2 [44] and Lasagne, two Deep Learning libraries built on top of Theano. We are also grateful for funding from NSERC, the Canada Research Chairs, Compute Canada, and CIFAR.\n\nReferences\n\n[1] Geoffrey Hinton, Li Deng, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. 
IEEE Signal Processing Magazine, 29(6):82\u201397, Nov. 2012.\n\n[2] Tara Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep convolutional neural networks for LVCSR. In ICASSP 2013, 2013.\n\n[3] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS\u20192012, 2012.\n\n[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. Technical report, arXiv:1409.4842, 2014.\n\n[5] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. In Proc. ACL\u20192014, 2014.\n\n[6] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS\u20192014, 2014.\n\n[7] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR\u20192015, arXiv:1409.0473, 2015.\n\n[8] Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. In ICML\u20192009, 2009.\n\n[9] Yoshua Bengio, R\u00e9jean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137\u20131155, 2003.\n\n[10] J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M.A. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS\u20192012, 2012.\n\n[11] Sang Kyun Kim, Lawrence C. McAfee, Peter Leonard McMahon, and Kunle Olukotun. A highly scalable restricted Boltzmann machine FPGA implementation. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 367\u2013372. 
IEEE, 2009.\n\n[12] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269\u2013284. ACM, 2014.\n\n[13] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 609\u2013622. IEEE, 2014.\n\n[14] Lorenz K. Muller and Giacomo Indiveri. Rounding methods for neural networks with low resolution synaptic weights. arXiv preprint arXiv:1504.05767, 2015.\n\n[15] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML\u20192015, 2015.\n\n[16] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Low precision arithmetic for deep learning. In arXiv:1412.7024, ICLR\u20192015 Workshop, 2015.\n\n[17] Thomas M. Bartol, Cailey Bromer, Justin P. Kinney, Michael A. Chirillo, Jennifer N. Bourne, Kristen M. Harris, and Terrence J. Sejnowski. Hippocampal spine head sizes are highly precise. bioRxiv, 2015.\n\n[18] Alex Graves. Practical variational inference for neural networks. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348\u20132356. Curran Associates, Inc., 2011.\n\n[19] Nitish Srivastava. Improving neural networks with dropout. Master\u2019s thesis, U. Toronto, 2013.\n\n[20] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research, 15:1929\u20131958, 2014.\n\n[21] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In ICML\u20192013, 2013.\n\n[22] J.P. David, K. Kalach, and N. Tittley. Hardware complexity of modular multiplication and exponentiation. Computers, IEEE Transactions on, 56(10):1308\u20131319, Oct 2007.\n\n[23] R. Collobert. Large Scale Machine Learning. PhD thesis, Universit\u00e9 de Paris VI, LIP6, 2004.\n\n[24] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS\u20192011, 2011.\n\n[25] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS\u20192010, 2010.\n\n[26] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.\n\n[27] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[28] Yu. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k\u00b2). Doklady AN SSSR (translated as Soviet. Math. Docl.), 269:543\u2013547, 1983.\n\n[29] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. Technical Report arXiv:1302.4389, Universit\u00e9 de Montr\u00e9al, February 2013.\n\n[30] Yichuan Tang. Deep learning using linear support vector machines. Workshop on Challenges in Representation Learning, ICML, 2013.\n\n[31] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.\n\n[32] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.\n\n[33] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. 
Proceedings of the IEEE, 86(11):2278\u20132324, November 1998.\n\n[34] V. Nair and G.E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML\u20192010, 2010.\n\n[35] Benjamin Graham. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.\n\n[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n\n[37] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS\u20192014, 2014.\n\n[38] Zhiyong Cheng, Daniel Soudry, Zexi Mao, and Zhenzhong Lan. Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562, 2015.\n\n[39] Kyuyeon Hwang and Wonyong Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and \u22121. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pages 1\u20136. IEEE, 2014.\n\n[40] Jonghong Kim, Kyuyeon Hwang, and Wonyong Sung. X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 7510\u20137514. IEEE, 2014.\n\n[41] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In UAI\u20192001, 2001.\n\n[42] James Bergstra, Olivier Breuleux, Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.\n\n[43] Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. 
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.\n\n[44] Ian J. Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Fr\u00e9d\u00e9ric Bastien, and Yoshua Bengio. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013.\n", "award": [], "sourceid": 1747, "authors": [{"given_name": "Matthieu", "family_name": "Courbariaux", "institution": "\u00c9cole Polytechnique de Montr\u00e9al"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "U. Montreal"}, {"given_name": "Jean-Pierre", "family_name": "David", "institution": "Polytechnique Montr\u00e9al"}]}