{"title": "Training Deep Neural Networks with 8-bit Floating Point Numbers", "book": "Advances in Neural Information Processing Systems", "page_first": 7675, "page_last": 7684, "abstract": "The state-of-the-art hardware platforms for training deep neural networks are moving from traditional single precision (32-bit) computations towards 16 bits of precision - in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2-4 times improved throughput over today's systems.", "full_text": "Training Deep Neural Networks with 8-bit Floating\n\nPoint Numbers\n\nNaigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen and Kailash Gopalakrishnan\n\nIBM T. J. 
Watson Research Center\nYorktown Heights, NY 10598, USA\n\n{nwang, choij, danbrand, cchen, kailash}@us.ibm.com\n\nAbstract\n\nThe state-of-the-art hardware platforms for training Deep Neural Networks (DNNs)\nare moving from traditional single precision (32-bit) computations towards 16\nbits of precision \u2013 in large part due to the high energy ef\ufb01ciency and smaller\nbit storage associated with using reduced-precision representations. However,\nunlike inference, training with numbers represented with less than 16 bits has been\nchallenging due to the need to maintain \ufb01delity of the gradient computations during\nback-propagation. Here we demonstrate, for the \ufb01rst time, the successful training\nof DNNs using 8-bit \ufb02oating point numbers while fully maintaining the accuracy\non a spectrum of Deep Learning models and datasets. In addition to reducing the\ndata and computation precision to 8 bits, we also successfully reduce the arithmetic\nprecision for additions (used in partial product accumulation and weight updates)\nfrom 32 bits to 16 bits through the introduction of a number of key ideas including\nchunk-based accumulation and \ufb02oating point stochastic rounding. The use of these\nnovel techniques lays the foundation for a new generation of hardware training\nplatforms with the potential for 2 \u2212 4\u00d7 improved throughput over today\u2019s systems.\n\n1\n\nIntroduction\n\nOver the past decade, Deep Learning has emerged as the dominant Machine Learning algorithm\nshowing remarkable success in a wide spectrum of applications, including image processing [9],\nmachine translation [20], speech recognition [21] and many others.\nIn each of these domains, Deep Neural Networks (DNNs) achieve superior accuracy through the use\nof very large and deep models \u2013 necessitating up to 100s of ExaOps of computation during training\nand Gigabytes of storage. 
Approximate computing techniques have been widely studied to minimize\nthe computational complexity of these algorithms as well as to improve the throughput and energy\nef\ufb01ciency of hardware platforms executing Deep Learning kernels [2]. These techniques trade off the\ninherent resilience of Machine Learning algorithms for improved computational ef\ufb01ciency. Towards\nthis end, exploiting reduced numerical precision for data representation and computation has been\nextremely promising \u2013 since hardware energy ef\ufb01ciency improves quadratically with bit-precision.\nWhile reduced-precision methods have been studied extensively, recent work has mostly focused on\nexploiting them for DNN inference. It has shown that the bit-width for inference computations can\nbe successfully scaled down to just a few bits (i.e., 2-4 bits) while (mostly) preserving accuracy [3].\nHowever, reduced precision DNN training has been signi\ufb01cantly more challenging due to the need to\nmaintain \ufb01delity of the gradients during the back-propagation step. Recent studies have empirically\nshown that at least 16 bits of precision is necessary to train DNNs without impacting model accuracy\n[6, 16, 4]. As a result, state-of-the-art training platforms have started to offer 16-bit \ufb02oating point\ntraining hardware [8, 5] with \u2265 4\u00d7 performance over equivalent 32-bit systems.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe goal of this paper is to push the envelope further and enable DNN training using 8-bit \ufb02oating\npoint numbers. To exploit the full bene\ufb01ts of 8-bit platforms, 8-bit \ufb02oating point numbers are\nused for numerical representation of data as well as computations encountered in the forward and\nbackward passes of DNN training. There are three primary challenges to using super scaled precision\nwhile fully preserving model accuracy (as exempli\ufb01ed in Fig. 
1 for ResNet18 training on ImageNet\ndataset). Firstly, when all the operands (i.e., weights, activations, errors and gradients) for general\nmatrix multiplication (GEMM) and convolution computations are reduced to 8 bits, most DNNs\nsuffer noticeable accuracy degradation (e.g., Fig. 1(a)). Secondly, reducing the bit-precision of\naccumulations in GEMM from 32 bits (e.g., [16, 4]) to 16 bits signi\ufb01cantly impacts the convergence\nof DNN training (Fig. 1(b)). This reduction in accumulation bit-precision is critically important for\nreducing the area and power of 8-bit hardware. Finally, reducing the bit-precision of weight updates\nto 16-bit \ufb02oating point impacts accuracy (Fig. 1(c)) - while 32-bit weight updates require an extra\ncopy of the high precision weights and gradients to be kept in memory, which is expensive.\n\nFigure 1: The challenges of selectively reducing training precision with (a) 8-bit representations, (b)\n16-bit accumulations, and (c) 16-bit weight updates vs. 
FP32 baseline for ResNet18 (ImageNet).\n\nIn this paper, we introduce new techniques to fully overcome all of the above challenges:\n\n\u2022 Devised a new FP8 floating point format that, in combination with DNN training insights,\nallows GEMM computations for Deep Learning to work without loss in model accuracy.\n\u2022 Developed a new technique called chunk-based computations that, when applied hierarchically,\nallows all matrix and convolution operations to be computed using only 8-bit\nmultiplications and 16-bit additions (instead of 16 and 32 bits respectively).\n\u2022 Applied floating point stochastic rounding in the weight update process, allowing these\nupdates to happen with 16 bits of precision (instead of 32 bits).\n\u2022 Demonstrated the wide applicability of the combined effects of these techniques across a\nsuite of Deep Learning models and datasets \u2013 while fully preserving model accuracy.\n\nThe use of these novel techniques opens up new opportunities for hardware platforms with 2\u20134\u00d7\nimproved energy efficiency and throughput over state-of-the-art training systems.\n\n2 8-bit floating point training\n\n2.1 Related Work\n\nThere has been a tremendous body of research conducted towards DNN precision scaling over the\npast few years. However, a significant fraction of this quantization research has focused on the\nreduction of bit-width for the forward path for inference applications. Recently, the precision of weights\nand activations was scaled down to 1-2 bits ([11, 3]) with a small loss of accuracy, while keeping\nthe gradients and errors in the backward path as well as the weight updates in full precision. In\ncomparison to inference, much of the recent work on low precision training uses much higher\nprecision \u2013 specifically for the errors and gradients in the backward path. 
DoReFa-Net [22] reduces the\ngradient precision down to 6 bits while using 1-bit weights and 2-bit activations for training. WAGE\n[19] quantizes weights, activations, errors and gradients to 2, 8, 8 and 8 bits respectively. However, all\nof these techniques incur significant accuracy degradation (> 5%) relative to full-precision models.\nTo maintain model accuracy for reduced-precision training, much of the recent work keeps the data\nand computation precision at 16 bits or more. MPT [16] uses an IEEE half-precision floating point\nformat (16 bits), accumulating results into 32-bit arrays, and additionally proposes a loss-scaling\nmethod to preserve gradients with very small magnitudes. Flexpoint [13] and DFP [4] demonstrated a\nformat with a 16-bit mantissa and a shared exponent to train large neural networks with full-precision\naccuracy. The shared exponents can be adjusted dynamically to minimize overflow. However, even\nwith 16-bit data representations, these techniques require the partial products to be accumulated\nin 32 bits and subsequently rounded down to 16 bits for the following computation. In addition, in\nall cases, a 32-bit copy of the weights is maintained to preserve the fidelity of the weight update\nprocess.\nIn contrast, using the new ideas presented in this paper, we show that it is possible to train these\nnetworks using just 8-bit floating point representations for all of the arrays used in matrix and\ntensor computations \u2013 weights, activations, errors and gradients. In addition, we show that the\npartial products of these two 8-bit operands can be accumulated into 16-bit sums which can then be\nrounded down to 8 bits again. Furthermore, the master copy of the weights preserved after the weight\nupdate process can be scaled down from 32 to 16 bits. 
These advances dramatically improve the\ncomputational efficiency, energy efficiency and memory bandwidth needs of future deep learning\nhardware platforms without impacting model convergence and accuracy.\n\nFigure 2: A diagram showing the precision settings for (a) three GEMM functions during forward\nand backward passes, and (b) three AXPY operations during a standard SGD weight update process.\n\n2.2 New Reduced Precision Floating Point Formats: FP8 and FP16\n\nFig. 2 shows the precision settings for the three GEMM functions during forward and backward\npasses, i.e., Forward, Backward and Gradient GEMM, as well as the three vector addition (AXPY)\noperations during a standard stochastic gradient descent (SGD) weight update process, i.e.,\nL2-regularization (L2-Reg), momentum gradient accumulation (Momentum-Acc), and weight update\n(Weight-Upd). Note that the convolution computation is implemented by first lowering the input data,\nfollowed by GEMM operations. In other words, GEMM refers to computations corresponding to\nboth convolution (Conv) and fully-connected (FC) layers. Our 8-bit floating point number (FP8)\nhas a (sign, exponent, mantissa) format of (1, 5, 2) bits, where the format is chosen carefully to\nrepresent weights, activations, errors and gradients used in the three GEMMs. The 16-bit floating\npoint number (FP16) has a (1, 6, 9) format and is used for both GEMM accumulations as well\nas AXPY additions \u2013 where the larger (6-bit) exponent provides the larger dynamic range needed\nduring weight updates. Both FP8 and FP16 formats are selected after in-depth studies of the data\ndistribution in networks, focusing on balancing the representation accuracy and dynamic range. Due\nto limited available space, we only show results from the best formats that work reliably across a\nvariety of deep networks/datasets. We refer to IEEE single precision as FP32, i.e., (1, 8, 23). 
In\naddition, we explore two floating point rounding modes applied after FP16 additions \u2013 nearest and stochastic\nrounding.\n\n2.3 Floating Point Accumulation in Reduced Precision\n\nA GEMM function involves a dot-product that may accumulate a large number of element-wise\nproducts in floating point. Since floating point addition involves a right-shift of the smaller of the two\noperands (by the difference in exponents), it is possible that this smaller number may be truncated\nentirely after addition due to limited mantissa bits.\nThis issue of truncation in large-to-small number addition (also called \u201cswamping\u201d [10]) is known in\nthe area of high performance computing [17], which focuses on numerical accuracy of high precision\n32/64-bit floating point computations. However, in the context of deep neural networks, we find that\nswamping is particularly serious when the accumulation bit-precision is reduced aggressively.\nWhen we use our FP16 format for accumulations, this truncation happens when the magnitudes of the\ntwo operands differ by more than the swamping threshold 2^(mantissa+1). 
Furthermore, swamping is exacerbated\nunder the following conditions: 1) the accumulation is done over values with non-zero mean\n(and thus the magnitude of the sum can gradually increase beyond the swamping threshold) and/or\n2) some of the elements in the vector have a large magnitude (due to long tails in the distribution).\nThese two cases cause significant accumulation errors \u2013 and are the reason why current hardware\nplatforms are unable to reduce accumulation precision below 32 bits.\nIn this work, we demonstrate that swamping severely limits reduction in training precision and\npropose two novel schemes that completely overcome this limit and enable low-precision FP8 DNN\ntraining: chunk-based accumulation and floating point stochastic rounding.\n\nChunk-based Accumulation The novel insight behind our proposed idea of chunk-based\naccumulations is to divide a long dot-product into smaller chunks (defined by the chunk length CL).\nThe individual element-wise products are then added hierarchically \u2013 intra-chunk accumulations\nare first performed to produce partial sums, followed by inter-chunk accumulations of these partial\nsums to produce a final dot-product value. Since the number of additions is reduced to CL within each\nchunk and to N/CL across chunks, the probability of adding a large number to a small\nnumber decreases dramatically. Furthermore, chunk-based accumulation requires little additional\ncomputational overhead (unlike sorting-based summation techniques) and incurs relatively insignificant\nmemory overheads (unlike pairwise-summation) while reducing theoretical error bounds from\nO(N) to O(N/CL + CL), where N is the length of the dot product \u2013 similar to the analysis in [1].\nMotivated by this chunk-based accumulation concept, we propose a reduced-precision dot-product\nalgorithm for Deep Learning as described in Fig. 3(a). 
The inputs to the dot-product are two vectors\nin FP_mult precision, which are multiplied in FP_mult but have products accumulated in a higher\nprecision FP_acc in order to better capture information of the intermediate sum, e.g., FP_mult = FP8\nand FP_acc = FP16. Since FP16 is still significantly lower than the typical bit-precision used\nin GPUs today for GEMM accumulation (i.e., FP32), we employ chunk-based accumulation to\novercome swamping errors. Intra-chunk accumulation is carried out in the innermost loop of the\nalgorithm shown in Fig. 3(a), then the sum of the chunks is further accumulated into the final sum. It\nshould be noted that only a single additional variable is required to maintain the intra-chunk sum \u2013\nthereby minimizing cost and overheads. The net impact of this remarkably simple idea is to minimize\nswamping and to open up opportunities for using FP8 for representations (and multiplications) and\nFP16 for accumulations, while matching FP32 baselines for additions as shown in Fig. 3(b).\n\nStochastic Rounding Stochastic rounding is another extremely effective way of addressing the\nissue of swamping. Note that information loss occurs when the bit-width is reduced by rounding. As\ndiscussed before, floating point addition rounds off the intermediate sum of two aligned mantissas.\nNearest rounding is a common rounding mode, but it discards information conveyed in the least\nsignificant bits (LSBs) that are rounded off. This information loss can be significant when the\naccumulation bit-precision is reduced by half, i.e., to FP16, which has only 9 bits of mantissa.\nStochastic rounding is a method to capture this information loss from the discarded bits. Assume\na floating point value with the larger number of mantissa bits for the intermediate sum, x = s \u00b7 2^e \u00b7 (1 + m),\nwhere s, e, and m are sign, exponent, and mantissa for x, respectively. 
Also assume that m for this\nintermediate sum is represented in fixed-precision with k' bits, which needs to be rounded off into\nsmaller bits, k \u2264 k'. Then, the stochastic rounding works as follows:\n\nRound(x) = s \u00b7 2^e \u00b7 (1 + \u230am\u230b + \u03b5), with probability (m \u2212 \u230am\u230b)/\u03b5,\nRound(x) = s \u00b7 2^e \u00b7 (1 + \u230am\u230b), with probability 1 \u2212 (m \u2212 \u230am\u230b)/\u03b5,     (1)\n\nwhere \u230am\u230b is the truncation of k' \u2212 k LSBs of m, and \u03b5 = 2^(-k).\n\nTable 1: Training configuration and test error (model size) across a spectrum of networks and datasets.\n\nModel | Dataset | Minibatch Size | Epoch | FP32 Baseline | Our FP8 Training\nCIFAR10-CNN | CIFAR10 | 128 | 140 | 17.80% (0.45MB) | 18.15% (0.23MB)\nCIFAR10-ResNet | CIFAR10 | 128 | 160 | 7.23% (2.81MB) | 7.79% (1.41MB)\nBN50-DNN | BN50 | 256 | 20 | 59.33% (64.5MB) | 60.08% (34.5MB)\nAlexNet | ImageNet | 256 | 45 | 41.96% (432MB) | 42.45% (216MB)\nResNet18 | ImageNet | 256 | 85 | 32.57% (66.9MB) | 33.05% (32.3MB)\nResNet50 | ImageNet | 256 | 80 | 27.86% (147MB) | 28.28% (73.5MB)\n\nNote that this floating point stochastic rounding technique is mathematically different from the fixed\npoint stochastic rounding approach that is widely used in literature [6, 11]; since the magnitude of\nthe rounding error of the floating point stochastic rounding is proportional to the exponent value 2^e.\nIn spite of this difference, we show both numerically (in the next section) and empirically (in Sec. 
3\nand 4.3) that this technique works robustly for DNNs.\nTo the best of our knowledge, this work is the first to demonstrate the effectiveness of chunk-based\naccumulation and floating point stochastic rounding towards 8-bit DNN training of large models.\n\nComparison of Accumulation Techniques We perform numerical analysis to investigate the\neffectiveness of the proposed chunk-based accumulation and floating point stochastic rounding\nschemes. Fig. 3(b) compares the behavior of FP16 accumulation for different rounding modes and\nchunk sizes. A vector with varying length drawn from the uniform distribution (mean=1, stdev=1) is\naccumulated. As a baseline, accumulation in FP32 is shown where the accumulated values increase\nlinearly with vector length, as the addend has a non-zero mean. A typical FP16 accumulation with\nnearest rounding (i.e., ChunkSize=1) suffers significantly from swamping errors (the accumulation\nstops when length \u2265 4096, since the magnitudes differ by \u2265 2^11). Chunk-based accumulation\ndramatically helps compensate for this error, as the effective length of accumulation is reduced by the chunk\nsize to avoid swamping (ChunkSize=32 is already very robust, as shown in Fig. 3(b)). The figure\nalso shows the effectiveness of stochastic rounding; although there exists a slight deviation at large\naccumulation lengths due to the rounding error, stochastic rounding consistently follows the FP32\nresult.\nGiven these results on simple dot-products, we employ chunk-based accumulation for\nForward/Backward/Gradient GEMMs, using the reduced-precision dot-product algorithm described in\nFig. 3(a). For weight update AXPY computations, it is more natural to use stochastic rounding,\nsince the weight gradient is accumulated into the weight over mini-batches across epochs, unlike the\ndot-product of long vectors in GEMM. 
The following sections empirically demonstrate the effectiveness\nof these two techniques over a wide spectrum of DNN training models and datasets.\n\nFigure 3: (a) Reduced-precision dot-product based on accumulation in chunks. (b) Comparison of\naccumulation for different chunk sizes and rounding modes. A typical FP16 accumulation (i.e.,\nChunkSize = 1) with nearest rounding (NR) suffers significant error, whereas ChunkSize >= 32 helps\ncompensate for this error. Stochastic rounding schemes also follow the FP32 baseline.\n\n3 Experimental Results\n\nReduced-precision emulated experiments were performed using NVIDIA GPUs. The software\nplatform is an in-house distributed deep learning framework [7]. The three GEMM computations share\nthe same bit-precision and chunk-size: FP8 for input operands and multiplication and FP16 for\naccumulation with a chunk-size of 64. The three AXPY computations use the same bit-precision, FP16,\nusing floating point stochastic rounding. To preserve the dynamic range of the back-propagated error\nwith small magnitude, we adopt the loss-scaling method described in [16]. 
For all the models tested, a single scaling factor of 1000 was used without loss of accuracy. The GEMM computation for the\nlast layer of the model (typically a small FC layer followed by Softmax) is kept at FP16 for better\nnumerical stability. Finally, for the ImageNet dataset, the input image is represented using FP16 for\nthe ResNet18 and ResNet50 models. The technical reasons behind these choices are discussed in\nSec. 4 in more detail.\n\nFigure 3(a), reduced-precision dot-product with chunk-based accumulation:\nInput: {a_i}, {b_i} for i = 1:N (in FP_mult); Parameter: chunk size CL; Output: sum (in FP_acc)\nsum = 0.0; idx = 0; num_chunk = N/CL\nfor n = 1:num_chunk {\n  sum_chunk = 0.0\n  for i = 1:CL {\n    idx++\n    tmp = a_idx \u00d7 b_idx (in FP_mult)\n    sum_chunk += tmp (in FP_acc)\n  }\n  sum += sum_chunk (in FP_acc)\n}\nFigure 3(b), plot: Accumulation Values versus Accumulation Length; series: FP32, FP16-SR, and FP16-NR with ChunkSize = 1, 2, 4, 8, 16, 32, 64, 128, 256.\n\nTo demonstrate the robustness as well as the wide coverage of the proposed FP8 training scheme,\nwe tested it comprehensively on a spectrum of well-known Convolutional Neural Networks (CNNs)\nand Deep Neural Networks (DNNs) for both image and speech classification tasks across multiple\ndatasets; CIFAR10-CNN ([14]), CIFAR10-ResNet, ImageNet-ResNet18, ImageNet-ResNet50 ([9]),\nImageNet-AlexNet ([15]), BN50-DNN ([18]) (details on the network architectures can be found\nin the supplementary material). 
Note that, for large ImageNet networks, we skipped some\npre-processing steps, such as color and scale augmentations, in order to accelerate the emulation process\nfor reduced-precision DNN training, since it needs large computing resources.\nAll networks are trained using the SGD optimizer via the proposed FP8 training scheme without\nchanges to network architectures, data pre-processing, or hyper-parameters, and the results are then\ncompared with the FP32 baseline. The experimental results are summarized in Table 1, while\nthe detailed convergence curves are shown in Fig. 4. As can be seen, with the proposed FP8\ntraining technique, every single network tested achieved test errors within 0.8 percentage points of\nthe full-precision baseline, while the memory footprint of both the weights and the master copy\nis reduced by 2\u00d7 due to the FP8 weights and the FP16 master copy. As a proof of wide applicability,\nwe additionally trained the CIFAR10-CNN network with the ADAM optimizer [12] and achieved\nbaseline accuracies while using FP8 GEMMs and FP16 weight updates. Overall, our experimental\nresults indicate that training with FP8 representations, FP16 accumulations and FP16 weight\nupdates shows remarkable robustness across a wide spectrum of application domains, network types\nand optimizer choices.\nTable 2 shows a comparison of the reduced-precision training work for top-1 accuracy (%) of AlexNet\non ImageNet. The proposed FP8 training scheme achieved equivalent accuracies to the previous\nstate-of-the-art, while using only half of the bit-precision for both representations and accumulations.\n\nFigure 4: Reliable model convergence results across a spectrum of models and datasets using a chunk\nsize of 64. 
FP8 is used for representations and FP16 is used for accumulation and updates.\n\nTable 2: Comparison of reduced-precision training for top-1 accuracy (%) for AlexNet (ImageNet)\n\nReduced Precision Training Scheme | W bits | x bits | dW bits | dx bits | acc bits | FP32 (%) | Reduced Precision (%)\nDoReFa-Net [22] | 1 | 2 | 32 | 6 | 32 | 55.9 | 46.1\nWAGE [19] | 2 | 8 | 8 | 8 | 32 | N/A | 51.6\nDFP [4] | 16 | 16 | 16 | 16 | 32 | 57.4 | 56.9\nMPT [16] | 16 | 16 | 16 | 16 | 32 | 56.8 | 56.9\nProposed FP8 training | 8 | 8 | 8 | 8 | 16 | 58.0 | 57.5\n\n4 Discussion & Insight\n\n4.1 Bit-Precisions for First and Last Layer\n\nThe first and last layers in DNNs are often excluded from quantization due to their sensitivity [22, 3].\nHowever, there is very limited understanding on how the bit-precision needs to be set for the first/last\nlayers in order to reduce its impact on model accuracy. This section aims to precisely provide that\ninsight and specify how the bit-precision of the first and last layers affects FP8 training performance.\n\nTable 3: Comparison of the precision setting on the last layer of AlexNet\n\nForward GEMM | Backward GEMM | Gradient GEMM | Input to Softmax | Test Error (%) | Accuracy Degradation (%)\nFP16 | FP16 | FP16 | FP16 | 42.30 | 0.34\nFP8 | FP8 | FP8 | FP8 | 52.12 | 10.16\nFP8 | FP8 | FP8 | FP16 | 42.37 | 0.41\n\nFigure 5: (a) The importance of chunk-based accumulations for ResNet50. (b) Sensitivity of Forward,\nBackward and Gradient GEMMs to accumulation errors for ResNet18 without chunking - indicating\nthat Gradient GEMM accumulation errors harm DNN convergence.\n\nFor the first layer, we observe that the representation precision for input images is very critical for\nsuccessfully training FP8 models. Image data is typically represented by 256 color intensity levels\n(e.g., uint8 in CIFAR10). 
Since FP8 does not have enough mantissa bits to represent integer values\nfrom 0 to 255, we chose to use FP16 to adequately represent input images. This is particularly\ncritical for achieving high accuracy on ImageNet using ResNet18 and ResNet50; without it we\nobserve \u223c 2% accuracy degradation for these networks. All other data types, including weights,\noutput activations, and weight gradients, can still be represented in FP8 with no loss in accuracy.\nAdditionally, we note that the last layer is very sensitive to quantization. We conjecture\nthat this sensitivity is directly related to the fidelity of the Softmax function (since these errors\nget exponentially amplified). To verify this, we conducted experiments on AlexNet, with varying\nprecisions for the last layer. As summarized in Table 3, the last layer with all three GEMMs in FP16\nachieves baseline accuracy (degradation < 0.5%), but the FP8 case exhibits noticeable degradation.\nWe also observe that it is indeed possible to use FP8 for all three GEMMs in the last layer and\nachieve high accuracy \u2013 as long as the output of the last layer Forward GEMM is preserved in FP16.\nHowever, to ensure robust training across a diverse set of neural networks, we decided to use FP16\nfor all three GEMMs in the last layer. Given the limited computational complexity in the last layer of\na DNN (< 1% in FLOPS), we anticipate very little loss in performance from running this layer in\nFP16 while maintaining FP8 for the rest of the layers in DNNs.\n\nTable 4: Impact of the rounding mode used in FP16 weight updates. 
Top-1 accuracy for AlexNet and\nResNet18 on the ImageNet dataset is reported for nearest as well as stochastic rounding approaches.\n\nModel | FP32 Baseline | Nearest Rounding | Stochastic Rounding\nAlexNet | 58.04% | 54.10% | 57.94%\nResNet18 | 67.43% | 65.74% | 67.34%\n\nFigure 6: Effect of chunk sizes on Gradient\nGEMM computation errors (normalized L2-distance\nbetween FP8 and FP32 GEMMs) for\nCIFAR10-ResNet.\n\nFigure 7: Chip layout of a novel dataflow-based\ncore (14 nm) with FP16 chunk-based accumulation.\nFP8 engines are > 2\u20134\u00d7 more efficient\nthan FP16 implementations and require less\nmemory bandwidth and storage.\n\n4.2 Accumulation Error\n\nNext, we investigate the impact of chunk-based accumulations. Prior works (e.g., [16, 4]) claim that\n32 bits of precision is required for the accumulation in any GEMM to prevent loss of information.\nMotivated by the significant area/energy expense of FP32 adders, we counter this by claiming that\nchunk-based accumulations can effectively address this loss in long dot-products while maintaining\naccumulation bit-precision in FP16. As shown in Fig. 5(a), FP8 training for ResNet50 fails to\nconverge without chunking, but chunk-based computations bring model convergence back to baseline.\nInvestigating further, we identify Gradient GEMM to be the most sensitive to accumulation precision\nwhen chunking is not used. As shown in Fig. 5(b), FP8 training on ResNet18 converges to baseline\naccuracy levels when FP32 is used for Gradient GEMM. For other cases, interestingly, the training\nloss converges but the test error diverges to 99%, exhibiting significant over-fitting. This implies\nthat the failure in addressing information loss in low-precision Gradient GEMM results in poor\ngeneralization of the network during training. Gradient GEMM accumulates weight gradients across\nminibatch samples, where information from small gradients may be lost due to swamping (Sec. 
2.3),\nresulting in the SGD optimization being stuck at sharp local minima. Chunk-based accumulation\naddresses the issue of swamping to recover information loss and therefore helps generalization.\nTo understand the impact of chunk size on accumulation accuracy for Gradient GEMM in DNN\ntraining, we extracted data from Activation and Error matrices from two different Conv layers in\nthe CIFAR10-ResNet model to compute Gradient GEMM with varying chunk sizes. Fig. 6 shows the\nnormalized L2-distance of the results relative to the full-precision counterpart for varying chunk size.\nThe computation results are closest to the FP32 baseline with the chunk size between 64 and 256.\nBelow and above this range, the L2-distance is higher due to the dominant inter-chunk and intra-chunk\naccumulation error, respectively. Based on this insight, and for the ease of hardware implementation,\nwe use a chunk size of 64 for our experiments across all models.\n\n4.3 Nearest Rounding vs. Stochastic Rounding\n\nFinally, we investigate the impact of rounding mode on FP16 weight updates. Since weight gradients\nare typically several orders of magnitude smaller than weights, prior work (e.g., [16]) adopts FP32\nfor weight updates. In this work, we maintain FP16 for the entire weight update process in SGD\n(i.e., L2-Reg, Momentum-Acc, and Weight-Upd), as a part of our FP8 training scheme; stochastic\nrounding is applied to avoid accuracy loss. Table 4 shows the impact of rounding modes (nearest vs.\nstochastic) on the top-1 accuracy of the AlexNet and ResNet18 models. For this experiment, GEMM\nis done in FP32 to avoid its additional impact on accuracy. 
As can be seen from the table, nearest\nrounding suffers noticeable accuracy degradation (2 \u223c 4%) while stochastic rounding maintains the\nbaseline accuracies, demonstrating its effectiveness as a key enabler for low precision training.\n\n4.4 Hardware Benefits\n\nA subset of the new ideas discussed in this paper was implemented in hardware using a novel\ndataflow-based core design in 14 nm silicon technology \u2013 incorporating both chunk-based computations\nand scaled precisions for training (Fig. 7). Through these hardware implementations, we draw\nthe following conclusions: 1) The energy overheads of chunk-based computations are < 5% for\nchunk sizes > 64. 2) FP8 based multipliers accumulating results into FP16 are 2-4 times more\nefficient in hardware compared to pure FP16 computations because of smaller multipliers (i.e.,\nsmaller mantissa) as well as smaller accumulator bit-widths. FP8 hardware engines are roughly\nsimilar in area and power to 8-bit integer computation engines (that require larger multipliers and\n32-bit accumulators). These promising results lay the foundation for new hardware platforms that\nprovide significantly improved DNN training performance without accuracy loss.\n\n5 Conclusions\n\nWe have demonstrated DNN training with 8-bit floating point numbers (FP8) that achieves 2-4\u00d7\nspeedup without compromise in accuracy. The key insight is that reduced-precision additions (used\nin partial product accumulations and weight updates) can result in swamping errors causing accuracy\ndegradation during training. To minimize this error, we propose two new techniques, chunk-based\naccumulation and floating point stochastic rounding, that enable a reduction of bit-precision for\nadditions down to 16 bits, and we implement them in hardware. 
Across a wide spectrum of\npopular DNN benchmarks and datasets, this mixed precision FP8 training technique achieves the\nsame accuracy levels as the FP32 baseline. Future work aims to further optimize data formats and\ncomputations in order to increase margins as well as study additional benchmarks and datasets.\n\nAcknowledgments\n\nThe authors would like to thank I-Hsin Chung, Ming-Hung Chen, Ankur Agrawal, Silvia Melitta\nMueller, Vijayalakshmi Srinivasan, Dongsoo Lee and Jinseok Kim for helpful discussions and\nsupport. This research was supported by IBM Research, IBM SoftLayer, and IBM Cognitive\nComputing Cluster (CCC).\n\nReferences\n\n[1] Anthony M Castaldo, R Clint Whaley, and Anthony T Chronopoulos. Reducing floating point error in\ndot product using the superblock family of algorithms. SIAM Journal on Scientific Computing,\n31(2):1156\u20131174, 2008.\n\n[2] Chia-Yu Chen, Jungwook Choi, Kailash Gopalakrishnan, Viji Srinivasan, and Swagath Venkataramani.\nExploiting approximate computing for deep learning acceleration. In Design, Automation & Test in Europe\nConference & Exhibition (DATE), pages 821\u2013826, 2018.\n\n[3] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan,\nand Kailash Gopalakrishnan. PACT: Parameterized clipping activation for quantized neural networks. arXiv\npreprint arXiv:1805.06085, 2018.\n\n[4] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal\nBanerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, et al. Mixed\nprecision training of convolutional neural networks using integer operations. arXiv preprint arXiv:1802.00930,\n2018.\n\n[5] Bruce Fleischer, Sunil Shukla, Matthew Ziegler, Joel Silberman, Jinwook Oh, Vijayalakshmi Srinivasan,\nJungwook Choi, Silvia Mueller, Ankur Agrawal, Tina Babinsky, et al. A scalable multi-teraops deep\nlearning processor core for AI training and inference. 
In VLSI Circuits, 2018 Symposium on. IEEE, 2018.\n\n[6] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited\nnumerical precision. In International Conference on Machine Learning, pages 1737\u20131746, 2015.\n\n[7] Suyog Gupta, Wei Zhang, and Fei Wang. Model accuracy and runtime tradeoff in distributed deep learning:\nA systematic study. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 171\u2013180.\nIEEE, 2016.\n\n[8] Mark Harris. Mixed-precision programming with CUDA 8, 2016. URL\nhttps://devblogs.nvidia.com/mixed-precision-programming-cuda-8/.\n\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition.\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770\u2013778, 2016.\n\n[10] Nicholas J Higham. The accuracy of floating point summation. SIAM Journal on Scientific Computing,\n14(4):783\u2013799, 1993.\n\n[11] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural\nnetworks. In Advances in Neural Information Processing Systems, pages 4107\u20134115, 2016.\n\n[12] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International\nConference on Learning Representations (ICLR), 2015.\n\n[13] Urs K\u00f6ster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K Bansal, William Constable, Oguz Elibol,\nScott Gray, Stewart Hall, Luke Hornof, et al. Flexpoint: An adaptive numerical format for efficient training\nof deep neural networks. In Advances in Neural Information Processing Systems, pages 1742\u20131752, 2017.\n\n[14] Alex Krizhevsky and G Hinton. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript,\n40, 2010.\n\n[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional\nNeural Networks. 
In Advances in Neural Information Processing Systems 25 (NIPS), pages 1097\u20131105,\n2012.\n\n[16] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris\nGinsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv\npreprint arXiv:1710.03740, 2017.\n\n[17] Thomas G Robertazzi and Stuart C Schwartz. Best \u201cordering\u201d for floating-point addition. ACM Transactions\non Mathematical Software (TOMS), 14(1):101\u2013110, 1988.\n\n[18] Ewout van den Berg, Bhuvana Ramabhadran, and Michael Picheny. Training variance and performance\nevaluation of neural networks in speech. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE\nInternational Conference on, pages 2287\u20132291. IEEE, 2017.\n\n[19] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural\nnetworks. arXiv preprint arXiv:1802.04680, 2018.\n\n[20] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim\nKrikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google\u2019s neural machine translation system: Bridging\nthe gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.\n\n[21] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and\nGeoffrey Zweig. The Microsoft 2016 conversational speech recognition system. In Acoustics, Speech and\nSignal Processing (ICASSP), 2017 IEEE International Conference on, pages 5255\u20135259. IEEE, 2017.\n\n[22] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training Low\nBitwidth Convolutional Neural Networks with Low Bitwidth Gradients. CoRR, abs/1606.06160, 2016.\n", "award": [], "sourceid": 3789, "authors": [{"given_name": "Naigang", "family_name": "Wang", "institution": "IBM T. J. 
Watson Research Center"}, {"given_name": "Jungwook", "family_name": "Choi", "institution": "IBM Research"}, {"given_name": "Daniel", "family_name": "Brand", "institution": "IBM Research"}, {"given_name": "Chia-Yu", "family_name": "Chen", "institution": "IBM research"}, {"given_name": "Kailash", "family_name": "Gopalakrishnan", "institution": "IBM Research"}]}