{"title": "Towards Accurate Binary Convolutional Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 345, "page_last": 353, "abstract": "We introduce a novel scheme to train binary convolutional neural networks (CNNs) -- CNNs with weights and activations constrained to \\{-1,+1\\} at run-time. It has been known that using binary weights and activations drastically reduce memory size and accesses, and can replace arithmetic operations with more efficient bitwise operations, leading to much faster test-time inference and lower power consumption. However, previous works on binarizing CNNs usually result in severe prediction accuracy degradation. In this paper, we address this issue with two major innovations: (1) approximating full-precision weights with the linear combination of multiple binary weight bases; (2) employing multiple binary activations to alleviate information loss. The implementation of the resulting binary CNN, denoted as ABC-Net, is shown to achieve much closer performance to its full-precision counterpart, and even reach the comparable prediction accuracy on ImageNet and forest trail datasets, given adequate binary weight bases and activations.", "full_text": "Towards Accurate Binary Convolutional Neural\n\nNetwork\n\nXiaofan Lin\n\nCong Zhao\n\nDJI Innovations Inc, Shenzhen, China\n\n{xiaofan.lin, cong.zhao, wei.pan}@dji.com\n\nWei Pan*\n\nAbstract\n\nWe introduce a novel scheme to train binary convolutional neural networks (CNNs)\n\u2013 CNNs with weights and activations constrained to {-1,+1} at run-time. It has been\nknown that using binary weights and activations drastically reduce memory size\nand accesses, and can replace arithmetic operations with more ef\ufb01cient bitwise op-\nerations, leading to much faster test-time inference and lower power consumption.\nHowever, previous works on binarizing CNNs usually result in severe prediction\naccuracy degradation. 
In this paper, we address this issue with two major innovations: (1) approximating full-precision weights with a linear combination of multiple binary weight bases; (2) employing multiple binary activations to alleviate information loss. The implementation of the resulting binary CNN, denoted ABC-Net, is shown to achieve performance much closer to its full-precision counterpart, and even to reach comparable prediction accuracy on the ImageNet and forest trail datasets, given adequate binary weight bases and activations.

1 Introduction

Convolutional neural networks (CNNs) have achieved state-of-the-art results on real-world applications such as image classification [He et al., 2016] and object detection [Ren et al., 2015], with the best results obtained with large models and sufficient computation resources. Concurrently with this progress, the deployment of CNNs on mobile devices for consumer applications is gaining more and more attention, due to its widespread commercial value and exciting prospects.
In mobile applications, it is typically assumed that training is performed on the server and test or inference is executed on the mobile devices [Courbariaux et al., 2016, Esser et al., 2016]. In the training phase, GPUs enabled substantial breakthroughs because of their greater computational speed. In the test phase, however, GPUs are usually too expensive to deploy. Thus improving test-time performance and reducing hardware costs are likely to be crucial for further progress, as mobile applications usually require real-time operation, low power consumption, and full embeddability. As a result, there is much interest in research and development of dedicated hardware for deep neural networks (DNNs). 
Binary neural networks (BNNs) [Courbariaux et al., 2016, Rastegari et al., 2016], i.e., neural networks with weights and perhaps activations constrained to only two possible values (e.g., -1 or +1), would bring great benefits to specialized DNN hardware for three major reasons: (1) binary weights/activations reduce memory usage and model size 32 times compared to the single-precision version; (2) if weights are binary, most multiply-accumulate operations can be replaced by simple accumulations, which is beneficial because multipliers are the most space- and power-hungry components of digital implementations of neural networks; (3) furthermore, if both activations and weights are binary, the multiply-accumulations can be replaced by the bitwise operations xnor and bitcount [Courbariaux et al., 2016]. This could have a big impact on dedicated deep learning hardware. For instance, a 32-bit floating point multiplier costs about 200 Xilinx FPGA slices [Govindu et al., 2004], whereas a 1-bit xnor gate costs only a single slice. Semiconductor manufacturers like IBM [Esser et al., 2016] and Intel [Venkatesh et al., 2016] have been involved in the research and development of related chips.

* indicates corresponding author.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

However, binarization usually causes severe prediction accuracy degradation, especially on complex tasks such as classification on the ImageNet dataset. To take a closer look, Rastegari et al. [2016] show that binarizing weights causes the accuracy of Resnet-18 on ImageNet to drop from 69.3% to 60.8%. If activations are further binarized, the accuracy drops to 51.2%. A similar phenomenon can also be found in the literature, e.g., [Hubara et al., 2016]. 
Clearly there is a considerable gap between the accuracy of a full-precision model and that of a binary model.
This paper proposes a novel scheme for binarizing CNNs, which aims to alleviate, or even eliminate, the accuracy degradation while still significantly reducing inference time, resource requirements and power consumption. The paper makes the following major contributions.

• We approximate full-precision weights with a linear combination of multiple binary weight bases. The weight values of CNNs are constrained to {-1, +1}, which means convolutions can be implemented with only additions and subtractions (without multiplications), or with bitwise operations when activations are binary as well. We demonstrate that 3 to 5 binary weight bases are adequate to approximate the full-precision weights well.
• We introduce multiple binary activations. Previous works have shown that the quantization of activations, especially binarization, is more difficult than that of weights [Cai et al., 2017, Courbariaux et al., 2016]. By employing five binary activations, we have been able to reduce the Top-1 and Top-5 accuracy degradation caused by binarization to around 5% on ImageNet compared to the full-precision counterpart.

It is worth noting that the multiple binary weight bases/activations scheme is preferable to the fixed-point quantization used in previous works. In those fixed-point quantized networks one still needs to employ arithmetic operations, such as multiplication and addition, on fixed-point values. Even though faster than floating point, they still require relatively complex logic and can consume a lot of power. Detailed discussions can be found in Section 5.2.
Ideally, combining more binary weight bases and activations always leads to better accuracy and will eventually get very close to that of full-precision networks. We verify this on ImageNet using the Resnet network topology. 
This is the first time a binary neural network achieves prediction accuracy comparable to its full-precision counterpart on ImageNet.

2 Related work

Quantized Neural Networks: High-precision parameters are not strictly necessary to reach high performance in deep neural networks. Recent research efforts (e.g., [Hubara et al., 2016]) have considerably reduced memory requirements and computational complexity by using low-bitwidth weights and activations. Zhou et al. [2016] further generalized these schemes and proposed to train CNNs with low-bitwidth gradients. By performing the quantization after network training or by using the "straight-through estimator (STE)" [Bengio et al., 2013], these works avoided the issues of non-differentiable optimization. While some of these methods have produced good results on datasets such as CIFAR-10 and SVHN, none has produced low-precision networks competitive with full-precision models on large-scale classification tasks such as ImageNet. In fact, [Zhou et al., 2016] and [Hubara et al., 2016] experiment with different combinations of bitwidths for weights and activations, and show that the performance of their highly quantized networks deteriorates rapidly when the weights and activations are quantized to fewer than 4 bits. Cai et al. [2017] enhance the performance of a low-bitwidth model by addressing the gradient mismatch problem; nevertheless, there is still much room for improvement.
Binarized Neural Networks: The binary representation of deep models is not a new topic. At the emergence of artificial neural networks, the biologically inspired unit step function was used as the activation function [Toms, 1990]. It is known that a binary activation can use a spiking response for event-based computation and communication (consuming energy only when necessary) and is therefore energy-efficient [Esser et al., 2016]. Recently, Courbariaux et al. 
[2016] introduced Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time. Differing from that work, Rastegari et al. [2016] introduced simple, efficient, and accurate approximations to CNNs by binarizing the weights and even the intermediate representations in CNNs. All these works drastically reduce memory consumption, and replace most arithmetic operations with bitwise operations, which potentially leads to a substantial increase in power efficiency.

In all the above-mentioned works, binarization significantly reduces accuracy. Our experimental results on ImageNet show that we come close to closing the gap between the accuracy of a binary model and its full-precision counterpart. We rely on the idea of finding the best approximation of a full-precision convolution using multiple binary operations, and of employing multiple binary activations to allow more information to pass through.

3 Binarization methods

In this section, we detail our binarization method, which is termed ABC-Net (Accurate Binary Convolutional Network) for convenience. Bear in mind that during training the real-valued weights are retained and updated at every epoch, while at test-time only the binary weights are used in convolution.

3.1 Weight approximation

Consider an L-layer CNN architecture. Without loss of generality, we assume the weights of each convolutional layer are tensors of dimension (w, h, c_in, c_out), which represent the filter width, filter height, number of input channels and number of output channels respectively. 
We propose two variations of the binarization method for the weights at each layer: 1) approximate the weights as a whole and 2) approximate the weights channel-wise.

3.1.1 Approximate weights as a whole

At each layer, in order to constrain a CNN to have binary weights, we estimate the real-valued weight filter W ∈ R^{w×h×c_in×c_out} using a linear combination of M binary filters B_1, B_2, ..., B_M ∈ {-1,+1}^{w×h×c_in×c_out} such that W ≈ α_1 B_1 + α_2 B_2 + ... + α_M B_M. To find an optimal estimation, a straightforward way is to solve the following optimization problem:

min_{α,B} J(α, B) = ||w - Bα||²,  s.t. B_ij ∈ {-1, +1},   (1)

where B = [vec(B_1), vec(B_2), ..., vec(B_M)], w = vec(W) and α = [α_1, α_2, ..., α_M]^T. Here the notation vec(·) refers to vectorization.
Although a local minimum solution to (1) can be obtained by numerical methods, one cannot backpropagate through it to update the real-valued weight filter W. To address this issue, assuming the mean and standard deviation of W are mean(W) and std(W) respectively, we fix the B_i's as follows:

B_i = F_{u_i}(W) := sign(W̄ + u_i std(W)), i = 1, 2, ..., M,   (2)

where W̄ = W - mean(W), and u_i is a shift parameter. For example, one can choose the u_i's to be u_i = -1 + (i-1) · 2/(M-1), i = 1, 2, ..., M, to shift evenly over the range [-std(W), std(W)], or leave them to be trained by the network. This is based on the observation that the full-precision weights tend to have a symmetric, non-sparse distribution, which is close to Gaussian. 
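As a concrete sketch of this construction (our own illustration, not the authors' code): the bases B_i in (2) are fixed functions of W, and α can then be fitted by ordinary least squares over the vectorized bases, i.e., (1) with B held fixed. Assuming NumPy and Gaussian-like weights:

```python
import numpy as np

def approx_weights(W, M=3):
    """Approximate a real-valued filter W by sum_i alpha_i * B_i, with
    B_i = sign((W - mean(W)) + u_i * std(W)) and shifts u_i spread
    evenly over [-1, 1]; alpha is the least-squares solution."""
    Wbar = W - W.mean()
    us = [0.0] if M == 1 else [-1 + (i - 1) * 2.0 / (M - 1) for i in range(1, M + 1)]
    Bs = [np.where(Wbar + u * W.std() >= 0, 1.0, -1.0) for u in us]
    B = np.stack([b.ravel() for b in Bs], axis=1)   # design/dictionary matrix
    alpha, *_ = np.linalg.lstsq(B, W.ravel(), rcond=None)
    return alpha, sum(a * b for a, b in zip(alpha, Bs))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 3, 16, 32))      # (w, h, c_in, c_out)
errs = {M: np.linalg.norm(W - approx_weights(W, M)[1]) / np.linalg.norm(W)
        for M in (1, 3, 5)}                          # relative errors per M
```

For Gaussian-like weights the relative error drops markedly from M = 1 to M = 3 and M = 5, in line with the claim that 3 to 5 bases are adequate.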
To gain more intuition and to illustrate the effectiveness of the approximation, an example is visualized in Section S2 of the supplementary material.
With the B_i's chosen, (1) becomes a linear regression problem

min_α J(α) = ||w - Bα||²,   (3)

in which the B_i's serve as the bases in the design/dictionary matrix. We can then backpropagate through the B_i's using the "straight-through estimator" (STE) [Bengio et al., 2013]. Taking c as the cost function, and A and O as the input and output tensors of a convolution respectively, the forward and backward passes of an approximated convolution during training can be computed as follows:

Forward: B_1, B_2, ..., B_M = F_{u_1}(W), F_{u_2}(W), ..., F_{u_M}(W),   (4)
         Solve (3) for α,   (5)
         O = Σ_{m=1}^{M} α_m Conv(B_m, A).   (6)
Backward: ∂c/∂W = (∂c/∂O) Σ_{m=1}^{M} (α_m (∂O/∂B_m)(∂B_m/∂W)) =_STE (∂c/∂O) Σ_{m=1}^{M} α_m ∂O/∂B_m = Σ_{m=1}^{M} α_m ∂c/∂B_m.   (7)

At test-time, only (6) is required. The block structure of this approximated convolution layer is shown on the left side of Figure 1. With suitable hardware and appropriate implementations, the convolution can be efficiently computed. For example, since the weight values are binary, we can implement the convolution with additions and subtractions (thus without multiplications). Furthermore, if the input A is binary as well, we can implement the convolution with the bitwise operations xnor and bitcount [Rastegari et al., 2016]. Note that the convolution with each binary filter can be computed in parallel.

Figure 1: An example of the block structure of the convolution in ABC-Net, with M = N = 3. On the left is the structure of the approximated convolution (ApproxConv). 
ApproxConv is expected to approximate the conventional full-precision convolution with a linear combination of binary convolutions (BinConv), i.e., convolutions with binary weights. On the right is the overall block structure of the convolution in ABC-Net. The input is binarized using the different functions H_{v1}, H_{v2}, H_{v3}, passed into the corresponding ApproxConv's and then summed up after multiplying by the corresponding β_n's. With the input binarized, the BinConv's can be implemented with highly efficient bitwise operations. There are 9 BinConv's in this example and they can work in parallel.

3.1.2 Approximate weights channel-wise

Alternatively, we can estimate the real-valued weight filter W_i ∈ R^{w×h×c_in} of each output channel i ∈ {1, 2, ..., c_out} using a linear combination of M binary filters B_{i1}, B_{i2}, ..., B_{iM} ∈ {-1,+1}^{w×h×c_in} such that W_i ≈ α_{i1} B_{i1} + α_{i2} B_{i2} + ... + α_{iM} B_{iM}. Again, to find an optimal estimation, we solve a linear regression problem analogous to (3) for each output channel. After convolution, the results are concatenated along the output-channel dimension. If M = 1, this approach reduces to the Binary-Weight-Networks (BWN) proposed in [Rastegari et al., 2016].
Compared to approximating the weights as a whole, the channel-wise approach approximates the weights more elaborately, with no extra cost during inference. Since this approach requires more computational resources during training, we leave it as future work and focus on the former approximation approach in this paper.

3.2 Multiple binary activations and bitwise convolution

As mentioned above, a convolution can be implemented without multiplications when the weights are binarized. 
However, to utilize bitwise operations, the activations must be binarized as well, since they are the inputs of the convolutions.
Similar to the activation binarization procedure in [Zhou et al., 2016], we binarize activations after passing them through a bounded activation function h, which ensures h(x) ∈ [0, 1]. We choose the bounded rectifier as h. Formally, it is defined as:

h_v(x) = clip(x + v, 0, 1),   (8)

where v is a shift parameter. If v = 0, then h_v is the clip activation function used in [Zhou et al., 2016].
We constrain the binary activations to either 1 or -1. In order to transform a real-valued activation R into a binary activation, we use the following binarization function:

H_v(R) := 2 · 1_{h_v(R) ≥ 0.5} - 1,   (9)

where 1_{·} is the indicator function. The conventional forward and backward passes of the activation are given as follows:

Forward: A = H_v(R).
Backward: ∂c/∂R = (∂c/∂A) ∘ 1_{0 ≤ R+v ≤ 1}. (using the STE)   (10)

Here ∘ denotes the Hadamard product. As can be expected, binarizing activations as above is crude and leads to non-trivial losses in accuracy, as shown in Rastegari et al. [2016] and Hubara et al. [2016]. While it is also possible to approximate activations with linear regression, as done for the weights, another critical challenge arises: unlike the weights, the activations vary at test-time inference. Luckily, this difficulty can be avoided by exploiting the statistical structure of the activations of deep networks.
Our scheme can be described as follows. First of all, to keep the distribution of activations relatively stable, we resort to batch normalization [Ioffe and Szegedy, 2015]. This is a widely used normalization technique, which forces the responses of each network layer to have zero mean and unit variance. We apply this normalization before the activation. 
Secondly, we estimate the real-valued activation R using a linear combination of N binary activations A_1, A_2, ..., A_N such that R ≈ β_1 A_1 + β_2 A_2 + ... + β_N A_N, where

A_1, A_2, ..., A_N = H_{v_1}(R), H_{v_2}(R), ..., H_{v_N}(R).   (11)

Different from the case of the weights, the parameters β_n's and v_n's (n = 1, ..., N) here are both trainable, just like the scale and shift parameters in batch normalization. Without an explicit linear regression step, the β_n's and v_n's are tuned by the network itself during training and fixed at test-time. They are expected to learn and utilize the statistical features of the full-precision activations.
The resulting network architecture outputs multiple binary activations A_1, A_2, ..., A_N and their corresponding coefficients β_1, β_2, ..., β_N, which allows more information to pass through compared to the former scheme. Combined with the weight approximation, the whole convolution scheme is given by:

Conv(W, R) ≈ Conv(Σ_{m=1}^{M} α_m B_m, Σ_{n=1}^{N} β_n A_n) = Σ_{m=1}^{M} Σ_{n=1}^{N} α_m β_n Conv(B_m, A_n),   (12)

which suggests that it can be implemented by computing M × N bitwise convolutions in parallel. An example of the whole convolution scheme is shown in Figure 1.

3.3 Training algorithm

A typical block in a CNN contains several different layers, usually in the following order: (1) Convolution, (2) Batch Normalization, (3) Activation and (4) Pooling. The batch normalization layer [Ioffe and Szegedy, 2015] normalizes the input batch by its mean and variance. The activation is an element-wise non-linear function (e.g., Sigmoid, ReLU). The pooling layer applies some type of pooling (e.g., max, min or average) to the input batch. 
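Returning to (12): since convolution is bilinear, the equality between the combined convolution and the sum of M × N binary convolutions can be checked numerically. A minimal sketch (our own, with a dot product standing in for Conv — the identity holds for any linear operation — and purely illustrative coefficients and shifts):

```python
import numpy as np

def H(R, v):
    """Binarize activations: H_v(R) = 2 * 1[h_v(R) >= 0.5] - 1,
    with h_v(x) = clip(x + v, 0, 1), as in (8)-(9)."""
    hv = np.clip(R + v, 0.0, 1.0)
    return np.where(hv >= 0.5, 1.0, -1.0)

rng = np.random.default_rng(1)
R = rng.uniform(0, 1, size=64)                  # real-valued activations
w = rng.normal(size=64)                         # real-valued weights
conv = np.dot                                   # toy linear stand-in for Conv

alphas = [0.6, 0.3]                             # M = 2, illustrative
Bs = [np.where(w >= 0, 1.0, -1.0), np.where(w - 0.5 >= 0, 1.0, -1.0)]
betas = [0.7, 0.2]                              # N = 2, illustrative
As = [H(R, 0.0), H(R, -0.3)]

full = conv(sum(a * B for a, B in zip(alphas, Bs)),
            sum(b * A for b, A in zip(betas, As)))
decomposed = sum(a * b * conv(B, A)             # M x N binary "convolutions"
                 for a, B in zip(alphas, Bs)
                 for b, A in zip(betas, As))
```

The two quantities agree exactly (up to floating-point error), which is what allows the M × N binary convolutions to run in parallel on hardware.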
In our experiments, we observe that applying max-pooling to a binary input returns a tensor in which most elements are equal to +1, resulting in a noticeable drop in accuracy. A similar phenomenon has been reported in Rastegari et al. [2016] as well. Therefore, we put the max-pooling layer before the batch normalization and activation.
Since our binarization scheme approximates the full-precision weights, using a full-precision pre-trained model serves as a perfect initialization. However, fine-tuning is always required for the weights to adapt to the new network structure. The training procedure of ABC-Net is summarized in Section S1 of the supplementary material.
It is worth noting that as M increases, the shift parameters get closer to each other and the bases of the linear combination become more correlated, which sometimes leads to rank deficiency when solving (3). This can be tackled with ℓ2 regularization.

4 Experiment results

In this section, the proposed ABC-Net is evaluated on the ILSVRC12 ImageNet classification dataset [Deng et al., 2009], and on the visual perception of forest trails dataset for mobile robots [Giusti et al., 2016] in Section S6 of the supplementary material.

4.1 Experiment results on the ImageNet dataset

The ImageNet dataset contains about 1.2 million high-resolution natural images for training, spanning 1000 categories of objects. The validation set contains 50k images. We use Resnet [He et al., 2016] as the network topology. The images are resized to 224x224 before being fed into the network. We report classification performance using Top-1 and Top-5 accuracies.

4.1.1 Effect of weight approximation

We first evaluate the weight approximation technique by examining the accuracy improvement for a binary model. To eliminate other variables, we leave the activations at full precision in this experiment. Table 1 shows the prediction accuracy of ABC-Net on ImageNet for different choices of M. 
For comparison, we add the results of the Binary-Weights-Network (denoted 'BWN') reported in Rastegari et al. [2016] and of the full-precision network (denoted 'FP'). The BWN binarizes weights and leaves the activations at full precision, as we do here. All results in this experiment use Resnet-18 as the network topology. It can be observed that as M increases, the accuracy of ABC-Net converges to that of its full-precision counterpart. The Top-1 gap between them reduces to only 0.9 percentage points when M = 5, which suggests that this approach nearly eliminates the accuracy degradation caused by binarizing weights.

Table 1: Top-1 and Top-5 accuracy of ABC-Net on ImageNet, using full-precision activations and different choices of the number of binary weight bases M.

        BWN     M = 1   M = 2   M = 3   M = 5   FP
Top-1   60.8%   62.8%   63.7%   66.2%   68.3%   69.3%
Top-5   83.0%   84.4%   85.2%   86.7%   87.9%   89.2%

For interested readers, Figure S4 in Section S5 of the supplementary material shows that the relationship between accuracy and M appears to be linear. Also, in Section S2 of the supplementary material, a visualization of the approximated weights is provided.

4.1.2 Configuration space exploration

We explore the configuration space of combinations of the numbers of weight bases and activations. Table 2 presents the results of ABC-Net with different configurations. The parameter settings for these experiments are provided in Section S4 of the supplementary material.

Table 2: Prediction accuracy (Top-1/Top-5) on ImageNet for different choices of M and N in an ABC-Net (weights approximated as a whole). "res18", "res34" and "res50" are short for the Resnet-18, Resnet-34 and Resnet-50 network topologies respectively. 
M and N refer to the numbers of weight bases and activations respectively.

Network   M (weight bases)   N (activations)   Top-1    Top-5    Top-1 gap   Top-5 gap
res18     1                  1                 42.7%    67.6%    26.6%       21.6%
res18     3                  1                 49.1%    73.8%    20.2%       15.4%
res18     3                  3                 61.0%    83.2%     8.3%        6.0%
res18     3                  5                 63.1%    84.8%     6.2%        4.4%
res18     5                  1                 54.1%    78.1%    15.2%       11.1%
res18     5                  3                 62.5%    84.2%     6.8%        5.0%
res18     5                  5                 65.0%    85.9%     4.3%        3.3%
res18     Full Precision     Full Precision    69.3%    89.2%     -           -
res34     1                  1                 52.4%    76.5%    20.9%       14.8%
res34     3                  3                 66.7%    87.4%     6.6%        3.9%
res34     5                  5                 68.4%    88.2%     4.9%        3.1%
res34     Full Precision     Full Precision    73.3%    91.3%     -           -
res50     5                  5                 70.1%    89.7%     6.0%        3.1%
res50     Full Precision     Full Precision    76.1%    92.8%     -           -

As balancing multiple factors such as training and inference time, model size and accuracy is a matter of practical trade-off, there is no definite conclusion as to which combination of (M, N) one should choose. In general, Table 2 shows that (1) the prediction accuracy of ABC-Net improves greatly as the number of binary activations increases, analogous to the weight approximation approach; (2) larger M or N gives better accuracy; (3) when M = N = 5, the Top-1 gap between the accuracy of a full-precision model and a binary one reduces to around 5%. To gain a visual understanding and show the possibility of extensions to other tasks such as object detection, we show a sample of feature maps in Section S3 of the supplementary material.

4.1.3 Comparison with the state-of-the-art

Table 3: Classification test accuracy of CNNs trained on ImageNet with the Resnet-18 network topology. 'W' and 'A' refer to the weight and activation bitwidths respectively. Models compared: Full-Precision Resnet-18 [full-precision weights and activations]; BWN [full-precision activations] Rastegari et al. [2016]; DoReFa-Net [1-bit weight and 4-bit activation] Zhou et al. 
[2016]; XNOR-Net [binary weights and activations] Rastegari et al. [2016]; BNN [binary weights and activations] Courbariaux et al. [2016]; ABC-Net [5 binary weight bases, 5 binary activations]; ABC-Net [5 binary weight bases, full-precision activations]. The bitwidths and accuracies are:

Model                                       W    A    Top-1    Top-5
Full-Precision Resnet-18                    32   32   69.3%    89.2%
BWN                                         1    32   60.8%    83.0%
DoReFa-Net                                  1    4    59.2%    81.5%
XNOR-Net                                    1    1    51.2%    73.2%
BNN                                         1    1    42.2%    67.1%
ABC-Net (5 weight bases, 5 activations)     1    1    65.0%    85.9%
ABC-Net (5 weight bases, FP activations)    1    32   68.3%    87.9%

Table 3 presents a comparison between ABC-Net and several other state-of-the-art models, i.e., the full-precision Resnet-18, BWN and XNOR-Net from [Rastegari et al., 2016], DoReFa-Net from [Zhou et al., 2016] and BNN from [Courbariaux et al., 2016]. All comparison models use Resnet-18 as the network topology. The full-precision Resnet-18 achieves 69.3% Top-1 accuracy. Although Rastegari et al. [2016]'s BWN model and DoReFa-Net perform well, it should be noted that they use full-precision and 4-bit activations respectively. The models that use both binary weights and activations (XNOR-Net and BNN) achieve much less satisfactory accuracy, and are significantly outperformed by ABC-Net with multiple binary weight bases and activations. It can be seen that ABC-Net has achieved state-of-the-art performance as a binary model.
One might argue that a 5-bit-width quantization scheme could reach similar accuracy to ABC-Net with 5 weight bases and 5 binary activations. However, the former is less efficient and requires distinctly more hardware resources. More detailed discussions can be found in Section 5.2.

5 Discussion

5.1 Why does adding a shift parameter work?

Intuitively, the multiple binarized weight bases/activations scheme works because it allows more information to pass through. Consider the case where a real value, say 1.5, is passed to a binarization function f(x) = sign(x), where sign maps a positive x to 1 and otherwise to -1. In that case, the output f(1.5) is 1, which indicates that the input value is positive. 
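This intuition can be made concrete by counting distinct output patterns: a single sign function can only separate inputs into two groups, while adding a shifted copy refines the partition. A small self-contained check (the input values are our own illustrative choices):

```python
def codes(xs, shifts):
    """Return the set of sign-pattern code words produced by one
    comparator per shift: code(x) = (sign(x - s) for each shift s)."""
    sign = lambda t: 1 if t >= 0 else -1
    return {tuple(sign(x - s) for s in shifts) for x in xs}

xs = [-1.0, 0.5, 1.5, 3.0]
single = codes(xs, [0.0])        # f(x) = sign(x): two distinguishable groups
double = codes(xs, [0.0, 2.0])   # f1 = sign(x), f2 = sign(x - 2): three groups
```

With the shifted comparator added, 1.5 and 3.0 now fall into different code words, matching the f1/f2 discussion that follows.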
Now imagine that we have two binarization functions f1(x) = sign(x) and f2(x) = sign(x - 2). In that case f1 outputs 1 and f2 outputs -1, which indicates that the input value is not only positive, but also must be smaller than 2. Clearly, each function contributes differently to representing the input, and more information is gained from f2 compared to the former case.
From another point of view, both the coefficients (the α's and β's) and the shift parameters are expected to learn and utilize the statistical features of the full-precision tensors, just like the scale and shift parameters in batch normalization. If we have more binarized weight bases/activations, the network has the capacity to approximate the full-precision one more precisely. Therefore, it can be deduced that when M or N is large enough, the network learns to tune itself so that the combination of M weight bases or N binarized activations can act like the full-precision one.

(Footnote to Table 3: Courbariaux et al. [2016] did not report their result on ImageNet. We implemented their method and present the result.)

5.2 Advantage over the fixed-point quantization scheme

It should be noted that there are key differences between the multiple binarization scheme (M binarized weight bases or N binarized activations) proposed in this paper and the fixed-point quantization scheme of previous works such as [Zhou et al., 2016, Hubara et al., 2016], though at first
One or several bitwise\noperations is known to be more ef\ufb01cient than a \ufb01xed-point multiplication, which is a major reason\nthat BNN/XNOR-Net was proposed.\n(2) A K-bit width multiplier consumes more resources than K 1-bit multipliers in a digital chip: it\nrequires more than K bits to store and compute, otherwise it could easily over\ufb02ow/under\ufb02ow. For\nexample, if a real number is quantized to a 2-bit number, a possible choice is in range {0,1,2,4}.\nIn this 2-bit multiplication, when both numbers are 4, it outputs 4 \u21e5 4 = 16, which is not within\nthe range. In [Zhou et al., 2016], the range of activations is constrained within [0,1], which seems\nto avoid this situation. However, fractional numbers do not solve this problem, severe precision\ndeterioration will appear during the multiplication if there are no extra resources. The fact that the\ncomplexity of a multiplier is proportional to THE SQUARE of bit-widths can be found in literatures\n(e.g., sec 3.1.1. in [Grabbe et al., 2003]). In contrast, our K binarization scheme does not have this\nissue \u2013 it always outputs within the range {-1,1}. The saved hardware resources can be further used\nfor parallel computing.\n(3) A binary activation can use spiking response for event-based computation and communication\n(consuming energy only when necessary) and therefore is energy-ef\ufb01cient [Esser et al., 2016]. This\ncan be employed in our scheme, but not in the \ufb01xed K-bit width scheme. Also, we have mentioned the\nfact that K-bit width multiplier consumes more resources than K 1-bit multipliers. It is noteworthy\nthat these resources include power.\nTo sum up, K-bit multipliers are the most space and power-hungry components of the digital\nimplementation of DNNs. 
Our scheme could bring great benefits to specialized DNN hardware.

5.3 Further computation reduction at run-time

On specialized hardware, the following operations in our scheme can be integrated with other operations at run-time to further reduce the computation requirements.
(1) Shift operations. The existence of shift parameters seems to require extra additions/subtractions (see (2) and (8)). However, a binarization operation with a shift parameter can be implemented as a comparator, where the shift parameter determines the comparison constant:

H_v(R) = 1 if R ≥ 0.5 - v; -1 if R < 0.5 - v   (0.5 - v is a constant),

so no extra additions/subtractions are involved.
(2) Batch normalization. At run-time, a batch normalization is simply an affine function, say BN(R) = aR + b, whose scale and shift parameters a, b are fixed and can be integrated with the v_n's. More specifically, a batch normalization can be integrated into a binarization operation as follows (for a > 0):

H_v(BN(R)) = 1 if aR + b ≥ 0.5 - v; -1 if aR + b < 0.5 - v
           = 1 if R ≥ (0.5 - v - b)/a; -1 if R < (0.5 - v - b)/a.

Therefore, there is no extra cost for the batch normalization.

6 Conclusion and future work

We have introduced a novel binarization scheme for weights and activations during forward and backward propagation, called ABC-Net. We have shown that it is possible to train a binary CNN with ABC-Net on ImageNet and achieve accuracy close to its full-precision counterpart. The binarization scheme proposed in this work is parallelizable and hardware friendly, and the impact of such a method on specialized hardware implementations of CNNs could be major, by replacing most multiplications in convolution with bitwise operations. The potential to speed up test-time inference might be very useful for real-time embedded systems. Future work includes the extension of these results to other tasks such as object detection and to other models such as RNNs. 
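The two run-time simplifications in Section 5.3 can be verified directly: the shifted binarization is one comparison against the constant 0.5 - v, and folding batch normalization (with a > 0) only moves that constant to (0.5 - v - b)/a. A quick numerical check (pure Python, our own sketch):

```python
import random

def binarize_after_bn(R, a, b, v):
    """Reference path: batch norm (affine at run-time) followed by H_v."""
    x = a * R + b                         # BN(R) = a*R + b
    h = min(max(x + v, 0.0), 1.0)         # h_v(x) = clip(x + v, 0, 1)
    return 1 if h >= 0.5 else -1          # H_v = 2*1[h >= 0.5] - 1

def folded_comparator(R, a, b, v):
    """Folded path: a single comparison, threshold (0.5 - v - b)/a, a > 0."""
    return 1 if R >= (0.5 - v - b) / a else -1

random.seed(0)
samples = [(random.uniform(-3, 3), random.uniform(0.1, 2.0),
            random.uniform(-1, 1), random.uniform(-0.5, 0.5))
           for _ in range(1000)]          # (R, a, b, v) with a > 0
agree = all(binarize_after_bn(R, a, b, v) == folded_comparator(R, a, b, v)
            for R, a, b, v in samples)
```

The two paths agree on all samples, so the batch normalization indeed disappears into the comparator threshold at run-time.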
Also, it would be interesting to investigate using FPGAs/ASICs or other customized deep learning processors [Liu et al., 2016] to implement ABC-Net at run-time.

7 Acknowledgement

We acknowledge Mr. Jingyang Xu for helpful discussions.

References

Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. arXiv preprint arXiv:1702.00953, 2017.

M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, et al. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, page 201604850, 2016.

A. Giusti, J. Guzzi, D. Ciresan, F.-L. He, J. P. Rodriguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, D. Scaramuzza, and L. Gambardella. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 2016.

G. Govindu, L. Zhuo, S. Choi, and V. Prasanna. Analysis of high-performance floating-point arithmetic on FPGAs. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, page 149. IEEE, 2004.

C. Grabbe, M. Bednara, J. Teich, J. von zur Gathen, and J. Shokrollahi.
FPGA designs of parallel high performance GF(2^233) multipliers. In ISCAS (2), pages 268–271. Citeseer, 2003.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen. Cambricon: An instruction set architecture for neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 393–405. IEEE Press, 2016.

N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.

M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

D. Toms. Training binary node feedforward neural networks by back propagation of error. Electronics Letters, 26(21):1745–1746, 1990.

G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.

S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou.
DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.