{"title": "Binarized Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4107, "page_last": 4115, "abstract": "We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At train-time the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to  substantially improve power-efficiency. To validate the effectiveness of BNNs, we conducted two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. We also report our preliminary results on the challenging ImageNet dataset. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster  than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.", "full_text": "Binarized Neural Networks\n\nItay Hubara1*\n\nitayh@technion.ac.il\n\nMatthieu Courbariaux2*\n\nmatthieu.courbariaux@gmail.com\n\nDaniel Soudry3\n\ndaniel.soudry@gmail.com\n\nRan El-Yaniv1\n\nrani@cs.technion.ac.il\n\nYoshua Bengio2,4\n\nyoshua.umontreal@gmail.com\n\n(1) Technion, Israel Institute of Technology.\n(3) Columbia University.\n(*) Indicates equal contribution.\n\n(2) Universit\u00e9 de Montr\u00e9al.\n(4) CIFAR Senior Fellow.\n\nAbstract\n\nWe introduce a method to train Binarized Neural Networks (BNNs) - neural\nnetworks with binary weights and activations at run-time. At train-time the binary\nweights and activations are used for computing the parameter gradients. 
During the\nforward pass, BNNs drastically reduce memory size and accesses, and replace most\narithmetic operations with bit-wise operations, which is expected to substantially\nimprove power-ef\ufb01ciency. To validate the effectiveness of BNNs, we conducted\ntwo sets of experiments on the Torch7 and Theano frameworks. On both, BNNs\nachieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN\ndatasets. We also report our preliminary results on the challenging ImageNet\ndataset. Last but not least, we wrote a binary matrix multiplication GPU kernel\nwith which it is possible to run our MNIST BNN 7 times faster than with an\nunoptimized GPU kernel, without suffering any loss in classi\ufb01cation accuracy. The\ncode for training and running our BNNs is available on-line.\n\nIntroduction\n\nDeep Neural Networks (DNNs) have substantially pushed Arti\ufb01cial Intelligence (AI) limits in a wide\nrange of tasks (LeCun et al., 2015). Today, DNNs are almost exclusively trained on one or many very\nfast and power-hungry Graphic Processing Units (GPUs) (Coates et al., 2013). As a result, it is often\na challenge to run DNNs on target low-power devices, and substantial research efforts are invested in\nspeeding up DNNs at run-time on both general-purpose (Gong et al., 2014; Han et al., 2015b) and\nspecialized computer hardware (Chen et al., 2014; Esser et al., 2015).\nThis paper makes the following contributions:\n\u2022 We introduce a method to train Binarized-Neural-Networks (BNNs), neural networks with binary\nweights and activations, at run-time, and when computing the parameter gradients at train-time\n(see Section 1).\n\n\u2022 We conduct two sets of experiments, each implemented on a different framework, namely Torch7\nand Theano, which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN and\nachieve near state-of-the-art results (see Section 2). 
Moreover, we report preliminary results on the\nchallenging ImageNet dataset.\n\n\u2022 We show that during the forward pass (both at run-time and train-time), BNNs drastically reduce\nmemory consumption (size and number of accesses), and replace most arithmetic operations with\nbit-wise operations, which potentially leads to a substantial increase in power-efficiency (see Section\n3). Moreover, a binarized CNN can lead to binary convolution kernel repetitions; we argue that\ndedicated hardware could reduce the time complexity by 60%.\n\n\u2022 Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is\npossible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without\nsuffering any loss in classification accuracy (see Section 4).\n\nThe code for training and running our BNNs is available on-line (both Theano1 and Torch\nframework2).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Binarized Neural Networks\n\nIn this section, we detail our binarization function, show how we use it to compute the parameter\ngradients, and how we backpropagate through it.\n\nDeterministic vs Stochastic Binarization When training a BNN, we constrain both the weights\nand the activations to either +1 or \u22121. Those two values are very advantageous from a hardware\nperspective, as we explain in Section 4. In order to transform the real-valued variables into those\ntwo values, we use two different binarization functions, as in (Courbariaux et al., 2015). Our first\nbinarization function is deterministic:\n\n$$x^b = \\mathrm{Sign}(x) = \\begin{cases} +1 & \\text{if } x \\geq 0, \\\\ -1 & \\text{otherwise,} \\end{cases} \\qquad (1)$$\n\nwhere $x^b$ is the binarized variable (weight or activation) and $x$ the real-valued variable. It is very\nstraightforward to implement and works quite well in practice. Our second binarization function is\nstochastic:\n\n$$x^b = \\begin{cases} +1 & \\text{with probability } p = \\sigma(x), \\\\ -1 & \\text{with probability } 1 - p, \\end{cases} \\qquad (2)$$\n\nwhere $\\sigma$ is the \u201chard sigmoid\u201d function:\n\n$$\\sigma(x) = \\mathrm{clip}\\left(\\frac{x + 1}{2}, 0, 1\\right) = \\max\\left(0, \\min\\left(1, \\frac{x + 1}{2}\\right)\\right). \\qquad (3)$$\n\nThe stochastic binarization is more appealing than the sign function, but harder to implement as\nit requires the hardware to generate random bits when quantizing. As a result, we mostly use the\ndeterministic binarization function (i.e., the sign function), with the exception of activations at\ntrain-time in some of our experiments.\n\nGradient Computation and Accumulation Although our BNN training method uses binary\nweights and activations to compute the parameter gradients, the real-valued gradients of the weights\nare accumulated in real-valued variables, as per Algorithm 1. Real-valued weights are likely required\nfor Stochastic Gradient Descent (SGD) to work at all. SGD explores the space of parameters in small\nand noisy steps, and that noise is averaged out by the stochastic gradient contributions accumulated\nin each weight. 
Therefore, it is important to maintain sufficient resolution for these accumulators,\nwhich at first glance suggests that high precision is absolutely required.\nMoreover, adding noise to weights and activations when computing the parameter gradients provides\na form of regularization that can help to generalize better, as previously shown with variational\nweight noise (Graves, 2011), Dropout (Srivastava et al., 2014) and DropConnect (Wan et al., 2013).\nOur method of training BNNs can be seen as a variant of Dropout, in which instead of randomly\nsetting half of the activations to zero when computing the parameter gradients, we binarize both the\nactivations and the weights.\n\n1 https://github.com/MatthieuCourbariaux/BinaryNet\n2 https://github.com/itayhubara/BinaryNet\n\nPropagating Gradients Through Discretization The derivative of the sign function is zero almost\neverywhere, making it apparently incompatible with back-propagation, since the exact gradient of\nthe cost with respect to the quantities before the discretization (pre-activations or weights) would\nbe zero. Note that this remains true even if stochastic quantization is used. Bengio (2013) studied\nthe question of estimating or propagating gradients through stochastic discrete neurons. He found in\nhis experiments that the fastest training was obtained when using the \u201cstraight-through estimator,\u201d\npreviously introduced in Hinton\u2019s lectures (Hinton, 2012). We follow a similar approach but use the\nversion of the straight-through estimator that takes into account the saturation effect, and uses\ndeterministic rather than stochastic sampling of the bit. Consider the sign function quantization\n\n$$q = \\mathrm{Sign}(r),$$\n\nand assume that an estimator $g_q$ of the gradient $\\partial C / \\partial q$ has been obtained (with the straight-through\nestimator when needed). Then, our straight-through estimator of $\\partial C / \\partial r$ is simply\n\n$$g_r = g_q \\mathbf{1}_{|r| \\leq 1}. \\qquad (4)$$\n\nNote that this preserves the gradient\u2019s information and cancels the gradient when $r$ is too large.\nNot cancelling the gradient when $r$ is too large significantly worsens the performance. The use of\nthis straight-through estimator is illustrated in Algorithm 1. The derivative $\\mathbf{1}_{|r| \\leq 1}$ can also be seen\nas propagating the gradient through hard tanh, which is the following piece-wise linear activation\nfunction:\n\n$$\\mathrm{Htanh}(x) = \\mathrm{Clip}(x, -1, 1). \\qquad (5)$$\n\nFor hidden units, we use the sign function non-linearity to obtain binary activations, and for\nweights we combine two ingredients:\n\u2022 Constrain each real-valued weight between -1 and 1, by projecting $w^r$ to -1 or 1 when the\nweight update brings $w^r$ outside of $[-1, 1]$, i.e., clipping the weights during training, as\nper Algorithm 1. The real-valued weights would otherwise grow very large without any\nimpact on the binary weights.\n\u2022 When using a weight $w^r$, quantize it using $w^b = \\mathrm{Sign}(w^r)$.\n\nThis is consistent with the gradient canceling when $|w^r| > 1$, according to Eq. 4.\n\nAlgorithm 1: Training a BNN. $C$ is the cost function for minibatch, $\\lambda$ the learning rate decay factor and $L$\nthe number of layers. $\\circ$ indicates element-wise multiplication. The function Binarize() specifies how to\n(stochastically or deterministically) binarize the activations and weights, and Clip() specifies how to clip the\nweights. BatchNorm() specifies how to batch-normalize the activations, using either batch normalization (Ioffe &\nSzegedy, 2015) or its shift-based variant we describe in Algorithm 3. BackBatchNorm() specifies how to\nbackpropagate through the normalization. Update() specifies how to update the parameters when their gradients are\nknown, using either ADAM (Kingma & Ba, 2014) or the shift-based AdaMax we describe in Algorithm 2.\nRequire: a minibatch of inputs and targets $(a_0, a^*)$, previous weights $W$, previous BatchNorm parameters $\\theta$,\nweight initialization coefficients from (Glorot & Bengio, 2010) $\\gamma$, and previous learning rate $\\eta$.\nEnsure: updated weights $W^{t+1}$, updated BatchNorm parameters $\\theta^{t+1}$ and updated learning rate $\\eta^{t+1}$.\n{1. Computing the gradients:}\n{1.1. Forward propagation:}\nfor $k = 1$ to $L$ do\n    $W_k^b \\leftarrow \\mathrm{Binarize}(W_k)$\n    $s_k \\leftarrow a_{k-1}^b W_k^b$\n    $a_k \\leftarrow \\mathrm{BatchNorm}(s_k, \\theta_k)$\n    if $k < L$ then\n        $a_k^b \\leftarrow \\mathrm{Binarize}(a_k)$\n{1.2. Backward propagation:}\n{Please note that the gradients are not binary.}\nCompute $g_{a_L} = \\partial C / \\partial a_L$ knowing $a_L$ and $a^*$\nfor $k = L$ to $1$ do\n    if $k < L$ then\n        $g_{a_k} \\leftarrow g_{a_k^b} \\circ \\mathbf{1}_{|a_k| \\leq 1}$\n    $(g_{s_k}, g_{\\theta_k}) \\leftarrow \\mathrm{BackBatchNorm}(g_{a_k}, s_k, \\theta_k)$\n    $g_{a_{k-1}^b} \\leftarrow g_{s_k} W_k^b$\n    $g_{W_k^b} \\leftarrow g_{s_k}^\\top a_{k-1}^b$\n{2. Accumulating the gradients:}\nfor $k = 1$ to $L$ do\n    $\\theta_k^{t+1} \\leftarrow \\mathrm{Update}(\\theta_k, \\eta^t, g_{\\theta_k})$\n    $W_k^{t+1} \\leftarrow \\mathrm{Clip}(\\mathrm{Update}(W_k, \\gamma_k \\eta^t, g_{W_k^b}), -1, 1)$\n    $\\eta^{t+1} \\leftarrow \\lambda \\eta^t$\n\nAlgorithm 2: Shift based AdaMax learning rule (Kingma & Ba, 2014). $g_t^2$ indicates the element-wise square\n$g_t \\circ g_t$ and $\\ll\\gg$ stands for both left and right bit-shift. Good default settings are $\\alpha = 2^{-10}$,\n$1 - \\beta_1 = 2^{-3}$, $1 - \\beta_2 = 2^{-10}$. All operations on vectors are element-wise. With $\\beta_1^t$ and $\\beta_2^t$ we denote\n$\\beta_1$ and $\\beta_2$ to the power $t$.\nRequire: Previous parameters $\\theta_{t-1}$ and their gradient $g_t$, and learning rate $\\alpha$.\nEnsure: Updated parameters $\\theta_t$.\n{Biased 1st and 2nd moment estimates:}\n$m_t \\leftarrow \\beta_1 \\cdot m_{t-1} + (1 - \\beta_1) \\cdot g_t$\n$v_t \\leftarrow \\max(\\beta_2 \\cdot v_{t-1}, |g_t|)$\n{Updated parameters:}\n$\\theta_t \\leftarrow \\theta_{t-1} - (\\alpha \\ll\\gg (1 - \\beta_1)) \\cdot \\hat{m} \\ll\\gg v_t^{-1}$\n\nAlgorithm 3: Shift based Batch Normalizing Transform, applied to activation $x$ over a mini-batch. The approximate\npower-of-2 is $AP2(x) = \\mathrm{sign}(x) 2^{\\mathrm{round}(\\log_2 |x|)}$ (see footnote 3), and $\\ll\\gg$ stands for both left and right binary shift.\nRequire: Values of $x$ over a mini-batch: $B = \\{x_{1 \\ldots m}\\}$; parameters to learn: $\\gamma$, $\\beta$.\nEnsure: $\\{y_i = \\mathrm{BN}(x_i, \\gamma, \\beta)\\}$\n{1. Mini-batch mean:}\n$\\mu_B \\leftarrow \\frac{1}{m} \\sum_{i=1}^{m} x_i$\n{2. Centered input:}\n$C(x_i) \\leftarrow (x_i - \\mu_B)$\n{3. Approximate variance:}\n$\\sigma_B^2 \\leftarrow \\frac{1}{m} \\sum_{i=1}^{m} (C(x_i) \\ll\\gg AP2(C(x_i)))$\n{4. Normalize:}\n$\\hat{x}_i \\leftarrow C(x_i) \\ll\\gg AP2((\\sqrt{\\sigma_B^2 + \\epsilon})^{-1})$\n{5. Scale and shift:}\n$y_i \\leftarrow AP2(\\gamma) \\ll\\gg \\hat{x}_i$\n\nAlgorithm 4: Running a BNN. $L$ = layers.\nRequire: a vector of 8-bit inputs $a_0$, the binary weights $W^b$, and the BatchNorm parameters $\\theta$.\nEnsure: the MLP output $a_L$.\n{1. First layer:}\n$a_1 \\leftarrow 0$\nfor $n = 1$ to $8$ do\n    $a_1 \\leftarrow a_1 + 2^{n-1} \\cdot \\mathrm{XnorDotProduct}(a_0^n, W_1^b)$\n$a_1^b \\leftarrow \\mathrm{Sign}(\\mathrm{BatchNorm}(a_1, \\theta_1))$\n{2. Remaining hidden layers:}\nfor $k = 2$ to $L - 1$ do\n    $a_k \\leftarrow \\mathrm{XnorDotProduct}(a_{k-1}^b, W_k^b)$\n    $a_k^b \\leftarrow \\mathrm{Sign}(\\mathrm{BatchNorm}(a_k, \\theta_k))$\n{3. Output layer:}\n$a_L \\leftarrow \\mathrm{XnorDotProduct}(a_{L-1}^b, W_L^b)$\n$a_L \\leftarrow \\mathrm{BatchNorm}(a_L, \\theta_L)$\n\nShift-based Batch Normalization Batch Normalization (BN) (Ioffe & Szegedy, 2015) accelerates the training and also seems to reduce\nthe overall impact of the weight scale. The normalization noise may also help to regularize the\nmodel. 
However, at train-time, BN requires many multiplications (calculating the standard deviation\nand dividing by it), namely, dividing by the running variance (the weighted mean of the training\nset activation variance). Although the number of scaling calculations is the same as the number of\nneurons, in the case of ConvNets this number is quite large. For example, in the CIFAR-10 dataset\n(using our architecture), the first convolution layer, consisting of only 128 \u00d7 3 \u00d7 3 filter masks,\nconverts an image of size 3 \u00d7 32 \u00d7 32 to size 3 \u00d7 128 \u00d7 28 \u00d7 28, which is two orders of magnitude\nlarger than the number of weights. To achieve the results that BN would obtain, we use a shift-based\nbatch normalization (SBN) technique, detailed in Algorithm 3. SBN approximates BN almost\nwithout multiplications. In the experiment we conducted we did not observe accuracy loss when\nusing the shift-based BN algorithm instead of the vanilla BN algorithm.\n\nShift based AdaMax The ADAM learning rule (Kingma & Ba, 2014) also seems to reduce the\nimpact of the weight scale. Since ADAM requires many multiplications, we suggest using instead the\nshift-based AdaMax we detail in Algorithm 2. In the experiment we conducted we did not observe\naccuracy loss when using the shift-based AdaMax algorithm instead of the vanilla ADAM algorithm.\n\nFirst Layer In a BNN, only the binarized values of the weights and activations are used in all\ncalculations. As the output of one layer is the input of the next, all the layers\u2019 inputs are binary,\nwith the exception of the first layer. However, we do not believe this to be a major issue. First, in\ncomputer vision, the input representation typically has far fewer channels (e.g., red, green and blue)\nthan internal representations (e.g., 512). As a result, the first layer of a ConvNet is often the smallest\nconvolution layer, both in terms of parameters and computations (Szegedy et al., 2014). Second, it is\nrelatively easy to handle continuous-valued inputs as fixed point numbers, with $m$ bits of precision.\nFor example, in the common case of 8-bit fixed point inputs:\n\n$$s = x \\cdot w^b; \\qquad s = \\sum_{n=1}^{8} 2^{n-1} (x^n \\cdot w^b), \\qquad (6)$$\n\nwhere $x$ is a vector of 1024 8-bit inputs, $x_1^8$ is the most significant bit of the first input, $w^b$ is a vector\nof 1024 1-bit weights, and $s$ is the resulting weighted sum. This trick is used in Algorithm 4.\n\n2 Benchmark Results\n\nWe conduct two sets of experiments, each based on a different framework, namely Torch7 and Theano.\nImplementation details are reported in Appendix A and code for both frameworks is available online.\nResults are reported in Table 1.\n\n3 Hardware implementation of AP2 is as simple as extracting the index of the most significant bit from the\nnumber\u2019s binary representation.\n\nTable 1: Classification test error rates of DNNs trained on MNIST (fully connected architecture),\nCIFAR-10 and SVHN (convnet). 
No unsupervised pre-training or data augmentation was used.\n\nData set | MNIST | SVHN | CIFAR-10\nBinarized activations+weights, during training and test\nBNN (Torch7) | 1.40% | 2.53% | 10.15%\nBNN (Theano) | 0.96% | 2.80% | 11.40%\nCommittee Machines\u2019 Array (Baldassi et al., 2015) | 1.35% | - | -\nBinarized weights, during training and test\nBinaryConnect (Courbariaux et al., 2015) | 1.29 \u00b1 0.08% | 2.30% | 9.90%\nBinarized activations+weights, during test\nEBP (Cheng et al., 2015) | 2.2 \u00b1 0.1% | - | -\nBitwise DNNs (Kim & Smaragdis, 2016) | 1.33% | - | -\nTernary weights, binary activations, during test\n(Hwang & Sung, 2014) | 1.45% | - | -\nNo binarization (standard results)\nNo regularization | 1.3 \u00b1 0.2% | 2.44% | 10.94%\nGated pooling (Lee et al., 2015) | - | 1.69% | 7.62%\n\nFigure 1: Training curves for different methods on CIFAR-10 dataset. The dotted lines represent the training\ncosts (square hinge losses) and the continuous lines the corresponding validation error rates. Although\nBNNs are slower to train, they are nearly as accurate as 32-bit float DNNs.\n\nPreliminary Results on ImageNet To test the strength of our method, we applied it to the challenging\nImageNet classification task. Considerable research has been concerned with compressing ImageNet\narchitectures while preserving high accuracy performance (e.g., Han et al. (2015a)). Previous approaches\nthat have been tried include pruning near-zero weights using matrix factorization techniques, quantizing\nthe weights and applying Huffman codes, among others. To the best of our knowledge, so far there are no\nreports on successfully quantizing the network\u2019s activations.\nMoreover, recent work by Han et al. 
(2015a)\nshowed that accuracy significantly deteriorates when trying to quantize convolutional\nlayers\u2019 weights below 4 bits (FC layers are more robust to quantization and can operate\nquite well with only 2 bits). In the present\nwork we attempted to tackle the difficult task of binarizing both weights and activations. Employing\nthe well known AlexNet and GoogleNet architectures, we applied our techniques and achieved\n36.1% top-1 and 60.1% top-5 accuracies using AlexNet and 47.1% top-1 and 69.1% top-5 accuracies\nusing GoogleNet. While these results leave room for improvement (relative to full precision\nnets), they are by far better than all previous attempts to compress ImageNet architectures using less\nthan 4 bits of precision for the weights. Moreover, this advantage is achieved while also binarizing\nneuron activations. Detailed descriptions of these results as well as full implementation details\nof our experiments are reported in the supplementary material (Appendix B). In our latest work\n(Hubara et al., 2016) we relaxed the binary constraints and allowed more than 1 bit per weight and\nactivation. The resulting Quantized Neural Networks (QNNs) achieve prediction accuracy comparable to their 32-bit counterparts.\nFor example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves\n51% top-1 accuracy, and GoogleNet with 4-bit weights and activations achieves 66.6%. Moreover, we\nquantize the parameter gradients to 6 bits as well, which enables gradient computation using only\nbit-wise operations. 
Full details can be found in (Hubara et al., 2016).\n\nTable 2: Energy consumption of multiply-accumulations in pico-joules (Horowitz, 2014)\n\nOperation | MUL | ADD\n8bit Integer | 0.2pJ | 0.03pJ\n32bit Integer | 3.1pJ | 0.1pJ\n16bit Floating Point | 1.1pJ | 0.4pJ\n32bit Floating Point | 3.7pJ | 0.9pJ\n\nTable 3: Energy consumption of memory accesses in pico-joules (Horowitz, 2014)\n\nMemory size | 64-bit memory access\n8K | 10pJ\n32K | 20pJ\n1M | 100pJ\nDRAM | 1.3-2.6nJ\n\n3 High Power Efficiency during the Forward Pass\n\nComputer hardware, be it general-purpose or specialized, is composed of memories, arithmetic\noperators and control logic. During the forward pass (both at run-time and train-time), BNNs\ndrastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise\noperations, which might lead to a great increase in power-efficiency. Moreover, a binarized CNN can\nlead to binary convolution kernel repetitions, and we argue that dedicated hardware could reduce the\ntime complexity by 60%.\n\nMemory Size and Accesses Improving computing performance has always been and remains a\nchallenge. Over the last decade, power has been the main constraint on performance (Horowitz, 2014).\nThis is why much research effort has been devoted to reducing the energy consumption of neural\nnetworks. Horowitz (2014) provides rough numbers for the energy consumed by the computation (the\ngiven numbers are for 45nm technology), as summarized in Tables 2 and 3. Importantly, we can see\nthat memory accesses typically consume more energy than arithmetic operations, and memory access\ncost increases with memory size. In comparison with 32-bit DNNs, BNNs require 32 times smaller\nmemory size and 32 times fewer memory accesses. This is expected to reduce energy consumption\ndrastically (i.e., more than 32 times).\n\nXNOR-Count Applying a DNN mainly consists of convolutions and matrix multiplications. 
The\nkey arithmetic operation of deep learning is thus the multiply-accumulate operation. Artificial neurons\nare basically multiply-accumulators computing weighted sums of their inputs. In BNNs, both the\nactivations and the weights are constrained to either \u22121 or +1. As a result, most of the 32-bit floating\npoint multiply-accumulations are replaced by 1-bit XNOR-count operations. This could have a big\nimpact on dedicated deep learning hardware. For instance, a 32-bit floating point multiplier costs\nabout 200 Xilinx FPGA slices (Govindu et al., 2004; Beauchamp et al., 2006), whereas a 1-bit XNOR\ngate only costs a single slice.\n\nExploiting Filter Repetitions When using a ConvNet architecture with binary weights, the number\nof unique filters is bounded by the filter size. For example, in our implementation we use filters of\nsize 3 \u00d7 3, so the maximum number of unique 2D filters is $2^9 = 512$. Since we now have binary\nfilters, many 2D filters of size $k \\times k$ repeat themselves. By using dedicated hardware/software, we\ncan apply only the unique 2D filters on each feature map and sum the results to receive each 3D\nfilter\u2019s convolutional result. For example, in our ConvNet architecture trained on the CIFAR-10\nbenchmark, there are only 42% unique filters per layer on average. Hence we can reduce the number\nof the XNOR-popcount operations by 3.\n\n4 Seven Times Faster on GPU at Run-Time\n\nIt is possible to speed up GPU implementations of BNNs, by using a method sometimes called\nSIMD (single instruction, multiple data) within a register (SWAR). The basic idea of SWAR is to\nconcatenate groups of 32 binary variables into 32-bit registers, and thus obtain a 32-times speed-up\non bitwise operations (e.g., XNOR). 
Using SWAR, it is possible to evaluate 32 connections with only\n3 instructions:\n\n$$a_1 \\mathrel{+}= \\mathrm{popcount}(\\mathrm{xnor}(a_0^{32b}, w_1^{32b})), \\qquad (7)$$\n\nwhere $a_1$ is the resulting weighted sum, and $a_0^{32b}$ and $w_1^{32b}$ are the concatenated inputs and weights.\nThose 3 instructions (accumulation, popcount, xnor) take 1 + 4 + 1 = 6 clock cycles on recent\nNvidia GPUs (and if they were to become a fused instruction, it would only take a single clock cycle).\nConsequently, we obtain a theoretical Nvidia GPU speed-up by a factor of 32/6 \u2248 5.3. In practice, this\nspeed-up is quite easy to obtain as the memory bandwidth to computation ratio is also increased by 6\ntimes.\nIn order to validate those theoretical results, we programmed two GPU kernels:\n\u2022 The first kernel (baseline) is an unoptimized matrix multiplication kernel.\n\u2022 The second kernel (XNOR) is nearly identical to the baseline kernel, except that it uses the\nSWAR method, as in Equation (7).\n\nFigure 2: The first three columns represent the time it takes to perform a 8192 \u00d7 8192 \u00d7 8192 (binary)\nmatrix multiplication on a GTX750 Nvidia GPU, depending on which kernel is used. We\ncan see that our XNOR kernel is 23 times faster than our baseline kernel and 3.4 times faster than\ncuBLAS. The next three columns represent the time it takes to run the MLP from Section 2 on the\nfull MNIST test set. As MNIST\u2019s images are not binary, the first layer\u2019s computations are always\nperformed by the baseline kernel. The last three columns show that the MLP accuracy does not\ndepend on which kernel is used.\n\nThe two GPU kernels return identical outputs\nwhen their inputs are constrained to \u22121 or +1\n(but not otherwise). 
The XNOR kernel is about\n23 times faster than the baseline kernel and 3.4\ntimes faster than cuBLAS, as shown in Figure 2.\nLast but not least, the MLP from Section 2 runs\n7 times faster with the XNOR kernel than with\nthe baseline kernel, without suffering any loss\nin classification accuracy (see Figure 2).\n\n5 Discussion and Related Work\n\nUntil recently, the use of extremely low-precision networks (binary in the extreme case)\nwas believed to be highly destructive to the network performance (Courbariaux et al., 2014).\nSoudry et al. (2014) and Cheng et al. (2015) proved the contrary by showing that good\nperformance could be achieved even if all neurons and weights are binarized to \u00b11. This was done\nusing Expectation BackPropagation (EBP), a variational Bayesian approach, which infers\nnetworks with binary weights and neurons by updating the posterior distributions over the weights.\nThese distributions are updated by differentiating their parameters (e.g., mean values) via the back\npropagation (BP) algorithm. Esser et al. (2015) implemented a fully binary network at run time using\na very similar approach to EBP, showing significant improvement in energy efficiency. The drawback\nof EBP is that the binarized parameters are only used during inference.\nThe probabilistic idea behind EBP was extended in the BinaryConnect algorithm of Courbariaux et al.\n(2015). In BinaryConnect, the real-valued version of the weights is saved and used as a key reference\nfor the binarization process. The binarization noise is independent between different weights, either\nby construction (by using stochastic quantization) or by assumption (a common simplification; see\nSpang (1962)). The noise would have little effect on the next neuron\u2019s input because the input is\na summation over many weighted neurons. 
Thus, the real-valued version could be updated by the\nback propagated error by simply ignoring the binarization noise in the update. Using this method,\nCourbariaux et al. (2015) were the first to binarize weights in CNNs and achieved near state-of-the-art\nperformance on several datasets. They also argued that noisy weights provide a form of regularization,\nwhich could help to improve generalization, as previously shown in (Wan et al., 2013). This method\nbinarized weights while still maintaining full precision neurons.\nLin et al. (2015) carried over the work of Courbariaux et al. (2015) to the back-propagation process\nby quantizing the representations at each layer of the network, to convert some of the remaining\nmultiplications into bit-shifts by restricting the neurons\u2019 values to be power-of-two integers. Lin et al.\n(2015)\u2019s work and ours seem to share similar characteristics. However, their approach continues to\nuse full precision weights during the test phase. Moreover, Lin et al. (2015) quantize the neurons\nonly during the back propagation process, and not during forward propagation.\n\nIn other research, Baldassi et al. (2015) showed that full binary training and testing is possible in an\narray of committee machines with randomized input, where only one weight layer is being adjusted.\nGong et al. (2014) aimed to compress a fully trained high precision network by using quantization\nor matrix factorization methods. These methods required training the network with full precision\nweights and neurons, thus requiring numerous MAC operations that the proposed BNN algorithm avoids.\nHwang & Sung (2014) focused on a fixed-point neural network design and achieved performance\nalmost identical to that of the floating-point architecture. 
Kim & Smaragdis (2016) retrained neural\nnetworks with binary weights and activations.\nSo far, to the best of our knowledge, no work has succeeded in binarizing weights and neurons, at the\ninference phase and the entire training phase of a deep network. This was achieved in the present\nwork. We relied on the idea that binarization can be done stochastically, or be approximated as\nrandom noise. This was previously done for the weights by Courbariaux et al. (2015), but our BNNs\nextend this to the activations. Note that the binary activations are especially important for ConvNets,\nwhere there are typically many more neurons than free weights. This allows highly efficient operation\nof the binarized DNN at run time, and at the forward-propagation phase during training. Moreover,\nour training method has almost no multiplications, and therefore might be implemented efficiently\nin dedicated hardware. However, we have to save the value of the full precision weights. This is a\nremaining computational bottleneck during training, since it is an energy-consuming operation.\n\nConclusion\n\nWe have introduced BNNs, which binarize deep neural networks and can lead to dramatic improvements\nin both power consumption and computation speed. During the forward pass (both at run-time\nand train-time), BNNs drastically reduce memory size and accesses, and replace most arithmetic\noperations with bit-wise operations. Our estimates indicate that power efficiency can be improved by\nmore than one order of magnitude (see Section 3). In terms of speed, we programmed a binary matrix\nmultiplication GPU kernel that enabled running an MLP on the MNIST dataset 7 times faster (than\nwith an unoptimized GPU kernel) without suffering any accuracy degradation (see Section 4).\nWe have shown that BNNs can handle MNIST, CIFAR-10 and SVHN while achieving nearly state-\nof-the-art accuracy performance. 
While our preliminary results for the challenging ImageNet are\nnot on par with the best results achievable with full precision networks, they significantly improve on\nall previous attempts to compress ImageNet-capable architectures (see Section 2 and supplementary\nmaterial - Appendix B). Moreover, by relaxing the binary constraints and allowing more than 1 bit per\nweight and activation, we have been able to achieve prediction accuracy comparable to that of 32-bit\ncounterparts. Full details can be found in our latest work (Hubara et al., 2016). A major open question\nis how to further improve our results on ImageNet. Substantial progress in this direction might\nhave a huge impact on DNN usability in low power instruments such as mobile phones.\n\nAcknowledgments\n\nWe would like to express our appreciation to Elad Hoffer, for his technical assistance and constructive\ncomments. We thank our fellow MILA lab members who took the time to read the article and give us\nsome feedback. We thank the developers of Torch (Collobert et al., 2011), a Lua-based environment,\nand Theano (Bergstra et al., 2010; Bastien et al., 2012), a Python library which allowed us to easily\ndevelop fast and optimized code for GPU. We also thank the developers of Pylearn2 (Goodfellow\net al., 2013) and Lasagne (Dieleman et al., 2015), two Deep Learning libraries built on top of\nTheano. We thank Yuxin Wu for helping us compare our GPU kernels with cuBLAS. We are also\ngrateful for funding from NSERC, the Canada Research Chairs, Compute Canada, CIFAR, IBM, and Samsung.\nThis research was also supported\nby The Israel Science Foundation (grant No. 1890/14).\n\nReferences\nBaldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. Subdominant Dense Clusters Allow for\nSimple Learning and High Computational Performance in Neural Networks with Discrete Synapses. 
Physical Review Letters, 115(12):1\u20135, 2015.\nBastien, F., Lamblin, P., Pascanu, R., et al. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.\nBeauchamp, M. J., Hauck, S., Underwood, K. D., and Hemmert, K. S. Embedded floating-point units in FPGAs. In Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, pp. 12\u201320. ACM, 2006.\nBengio, Y. Estimating or propagating gradients through stochastic neurons. Technical Report arXiv:1305.2982, Universite de Montreal, 2013.\nBergstra, J., Breuleux, O., Bastien, F., et al. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.\nChen, T., Du, Z., Sun, N., et al. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, pp. 269\u2013284. ACM, 2014.\nCheng, Z., Soudry, D., Mao, Z., and Lan, Z. Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562, 2015.\nCoates, A., Huval, B., Wang, T., et al. Deep learning with COTS HPC systems. In Proceedings of the 30th international conference on machine learning, pp. 1337\u20131345, 2013.\nCollobert, R., Kavukcuoglu, K., and Farabet, C. Torch7: A matlab-like environment for machine learning. 
In BigLearn, NIPS Workshop, 2011.\nCourbariaux, M., Bengio, Y., and David, J.-P. Training deep neural networks with low precision multiplications. ArXiv e-prints, abs/1412.7024, December 2014.\nCourbariaux, M., Bengio, Y., and David, J.-P. Binaryconnect: Training deep neural networks with binary weights during propagations. ArXiv e-prints, abs/1511.00363, November 2015.\nDieleman, S., Schl\u00fcter, J., Raffel, C., et al. Lasagne: First release, August 2015.\nEsser, S. K., Appuswamy, R., Merolla, P., Arthur, J. V., and Modha, D. S. Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Processing Systems, pp. 1117\u20131125, 2015.\nGlorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In AISTATS\u20192010, 2010.\nGong, Y., Liu, L., Yang, M., and Bourdev, L. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.\nGoodfellow, I. J., Warde-Farley, D., Lamblin, P., et al. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013.\nGovindu, G., Zhuo, L., Choi, S., and Prasanna, V. Analysis of high-performance floating-point arithmetic on FPGAs. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, pp. 149. IEEE, 2004.\nGraves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348\u20132356, 2011.\nHan, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.\nHan, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135\u20131143, 2015b.\nHinton, G. Neural networks for machine learning. Coursera, video lectures, 2012.\nHorowitz, M. Computing\u2019s Energy Problem (and what we can do about it). IEEE International Solid State Circuits Conference, pp. 10\u201314, 2014.\nHubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. 
arXiv preprint arXiv:1609.07061, 2016.\nHwang, K. and Sung, W. Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1\u20136. IEEE, 2014.\nIoffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.\nKim, M. and Smaragdis, P. Bitwise Neural Networks. ArXiv e-prints, January 2016.\nKingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\nLeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436\u2013444, 2015.\nLee, C.-Y., Gallagher, P. W., and Tu, Z. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. arXiv preprint arXiv:1509.08985, 2015.\nLin, Z., Courbariaux, M., Memisevic, R., and Bengio, Y. Neural networks with few multiplications. ArXiv e-prints, abs/1510.03009, October 2015.\nSoudry, D., Hubara, I., and Meir, R. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS\u20192014, 2014.\nSrivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929\u20131958, 2014.\nSzegedy, C., Liu, W., Jia, Y., et al. Going deeper with convolutions. Technical report, arXiv:1409.4842, 2014.\nWan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. 
Regularization of neural networks using dropconnect. In ICML\u20192013, 2013.\n", "award": [], "sourceid": 2044, "authors": [{"given_name": "Itay", "family_name": "Hubara", "institution": "Technion"}, {"given_name": "Matthieu", "family_name": "Courbariaux", "institution": "Universit\u00e9 de Montr\u00e9al"}, {"given_name": "Daniel", "family_name": "Soudry", "institution": "Columbia University"}, {"given_name": "Ran", "family_name": "El-Yaniv", "institution": "Technion"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "Universit\u00e9 de Montr\u00e9al"}]}