{"title": "Online Normalization for Training Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8433, "page_last": 8443, "abstract": "Online Normalization is a new technique for normalizing the hidden activations of a neural network. Like Batch Normalization, it normalizes the sample dimension. While Online Normalization does not use batches, it is as accurate as Batch Normalization. We resolve a theoretical limitation of Batch Normalization by introducing an unbiased technique for computing the gradient of normalized activations. Online Normalization works with automatic differentiation by adding statistical normalization as a primitive. This technique can be used in cases not covered by some other normalizers, such as recurrent networks, fully connected networks, and networks with activation memory requirements prohibitive for batching. We show its applications to image classification, image segmentation, and language modeling. We present formal proofs and experimental results on ImageNet, CIFAR, and PTB datasets.", "full_text": "Online Normalization for Training Neural Networks\n\nVitaliy Chiley\u2217\n\nIlya Sharapov\u2217\n\nAtli Kosson\n\nUrs Koster\n\nRyan Reece\n\nSof\u00eda Samaniego de la Fuente\n\nVishal Subbiah\n\nMichael James\u2217 \u2020\n\nCerebras Systems\n\n175 S. San Antonio Road\nLos Altos, California 94022\n\nAbstract\n\nOnline Normalization is a new technique for normalizing the hidden activations\nof a neural network. Like Batch Normalization, it normalizes the sample dimen-\nsion. While Online Normalization does not use batches, it is as accurate as Batch\nNormalization. We resolve a theoretical limitation of Batch Normalization by intro-\nducing an unbiased technique for computing the gradient of normalized activations.\nOnline Normalization works with automatic differentiation by adding statistical\nnormalization as a primitive. 
This technique can be used in cases not covered by some other normalizers, such as recurrent networks, fully connected networks, and networks with activation memory requirements prohibitive for batching. We show its applications to image classification, image segmentation, and language modeling. We present formal proofs and experimental results on ImageNet, CIFAR, and PTB datasets.\n\n1 Introduction\n\nTraditionally, neural networks are functions that map inputs deterministically to outputs. Normalization makes this non-deterministic because each sample is affected not only by the network weights but also by the statistical distribution of samples. Therefore, normalization re-defines neural networks to be statistical operators. Normalized networks treat each neuron's output as a random variable that ultimately depends on the network's parameters and input distribution. No matter how it is stimulated, a normalized neuron produces an output distribution with zero mean and unit variance.\nWhile normalization has enjoyed widespread success, current normalization methods have theoretical and practical limitations. These limitations stem from an inability to compute the gradient of the ideal normalization operator.\nBatch methods are commonly used to approximate ideal normalization. These methods use the distribution of the current minibatch as a proxy for the distribution of the entire dataset. They produce biased estimates of the gradient that violate a fundamental tenet of stochastic gradient descent (SGD): It is not possible to recover the true gradient from any number of small batch evaluations. This bias becomes more pronounced as batch size is reduced.\nIncreasing the minibatch size provides more accurate approximations of normalization and its gradient at the cost of increased memory consumption. This is especially problematic for image processing and volumetric networks. 
Here neural activations outnumber network parameters, and even modest\nbatch sizes reduce the trainable network size by an order of magnitude.\n\n\u2217Equal contribution\n\u2020Corresponding author: michael@cerebras.net\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOnline Normalization is a new algorithm that resolves these limitations while matching or exceeding\nthe performance of current methods. It computes unbiased activations and unbiased gradients without\nany use of batching. Online Normalization differentiates through the normalization operator in a\nway that has theoretical justi\ufb01cation. We show the technique working at scale with the ImageNet [1]\nResNet-50 [2] classi\ufb01cation benchmark, as well as with smaller networks for image classi\ufb01cation,\nimage segmentation, and recurrent language modeling.\nInstead of using batches, Online Normalization uses running estimates of activation statistics in the\nforward pass with a corrective guard to prevent exponential behavior. The backward pass implements\na control process to ensure that back-propagated gradients stay within a bounded distance of true\ngradients. A geometrical analysis of normalization reveals necessary and suf\ufb01cient conditions that\ncharacterize the gradient of the normalization operator. We further analyze the effect of approximation\nerrors in the forward and backward passes on network dynamics. Based on our \ufb01ndings we present the\nOnline Normalization technique and experiments that compare it with other normalization methods.\nFormal proofs and all details necessary to reproduce results are in the appendix. Additionally we\nprovide reference code in PyTorch, TensorFlow, and C [3].\n\n2 Related work\n\nIoffe and Szegedy introduced normalization of hidden activations [4], de\ufb01ning it as a transformation\nthat uses full dataset statistics to eliminate internal covariate shift. 
They observed that the inability\nto differentiate through a running estimator of forward statistics produces a gradient that leads to\ndivergence [5]. They resolved this with the Batch Normalization method [4]. During training, each\nminibatch is used as a statistical proxy for the entire dataset. This allows use of gradient descent\nwithout a running estimator process. However, training still maintains running estimates for use\nduring validation and inference.\nThe success of Batch Normalization has inspired a number of related methods that address its\nlimitations. They can be classi\ufb01ed as functional or heuristic methods.\nFunctional methods replace the normalization operator with a normalization function. The func-\ntion is chosen to share certain properties of the normalization operator. Layer Normalization [6]\nnormalizes across features instead of across samples. Group Normalization [7] generalizes this by\npartitioning features into groups. Weight Normalization [8] and Normalization Propagation [9] apply\nnormalization to network weights instead of network activations.\nThe advantage of functional normalizers is that they \ufb01t within the SGD framework, and work in\nrecurrent networks and large networks. However, when compared directly to batch normalization\nthey generally perform worse [7].\nHeuristic methods use measurements from previous network iterations to augment the current\nforward and backward passes. These methods do not differentiate through the normalization operator.\nInstead, they combine terms from previous batch-based approximations. An advantage of heuristic\nnormalizers is that they use more data to generate better estimates of forward statistics; however, they\nlack correctness and stability guarantees.\nBatch Renormalization [5] is one example of a heuristic method. 
While it uses an online process to estimate dataset statistics, these estimates are based on batches and are only allowed to be within a fixed interval of the current batch's statistics. Batch Renormalization does not differentiate through its statistical estimation process, and like Instance Normalization [10], it cannot be used with fully connected layers at a batch size of one.\nStreaming Normalization [11] is also a heuristic method. It performs one weight update for every several minibatches. Instead of differentiating through the normalization operator, it averages point gradients at long and short time scales. It applies a different mixture in a saw-tooth pattern to each minibatch depending on its timing relative to the latest weight update.\nIn recurrent networks, circular dependencies between sample statistics and activations pose a challenge to normalization [12, 13, 14]. Recurrent Batch Normalization [12] offers the approach of maintaining distinct statistics for each time step. At inference this results in a different linear operation being applied at each time step, breaking the formalism of recurrent networks. Functional normalizers avoid circular dependencies and have been shown to perform better [6].\n\nFigure 1: Geometry of normalization. [Panel (a): the projection P_1⊥(x) onto the subspace 1⊥ and the manifold S^{N−2} = 1⊥ ∩ S^{N−1}; panel (b): the projections of the gradient y′ onto the tangent space.]\n\n3 Principles of normalization\n\nNormalization is an affine transformation fX that maps a scalar random variable x to an output y with zero mean and unit variance. 
It maps every sample in a way that depends on the distribution X,\n\nfX[x] ≡ (x − µ[x]) / σ[x] ,   x ∼ X ,    (1)\n\nresulting in normalized output y satisfying\n\nµ[y] = 0   and   µ[y²] = 1 .    (2)\n\nWhen we apply normalization to network activations, the input distribution X is itself functionally dependent on the state of the network, in particular on the weights of all prior layers. This poses a challenge for accurate computation of normalization because at no point in time can we observe the entire distribution corresponding to the current values of the weights.\nBackpropagation uses the chain rule to compute the derivative of the loss function L with respect to hidden activations. We express this using the convention (·)′ = ∂L/∂(·) as\n\nx′ = (∂fX[x] / ∂x) [y′] .    (3)\n\nIt is not obvious how to handle the derivative in the preceding equation, which is itself a statistical operator. The usual approaches do not work: Automatic differentiation cannot be applied to expectations. Exact computation over the entire dataset is prohibitive. Ignoring the derivative causes a feedback loop between gradient descent and the estimator process, leading to instability [4].\nBatch Normalization avoids these challenges by freezing the network while it measures the statistics of a batch. Increasing batch size improves accuracy of the gradients but also increases memory requirements and potentially impedes learning. We started our study with the question: Is freezing the network the only way to resolve interference between an estimator process and gradient descent? It is not. In the following sections we will show how to achieve the asymptotic accuracy of large batch normalization while inspecting only one sample at a time.\n\n3.1 Properties of normalized activations and gradients\n\nDifferential geometry provides key insights on normalization. 
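Equations (1) and (2) are easy to check numerically on a finite sample; a minimal sketch in plain Python (the sample values are illustrative, not from the paper):

```python
import math

def normalize(xs):
    # Equation (1): y = (x - mu[x]) / sigma[x], with mu and sigma taken
    # from the empirical distribution of the sample itself.
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return [(x - mu) / sigma for x in xs]

xs = [0.5, 2.0, -1.5, 3.0, 1.0]   # illustrative draws from X
ys = normalize(xs)

# Equation (2): zero mean and unit second moment.
assert abs(sum(ys) / len(ys)) < 1e-12
assert abs(sum(y * y for y in ys) / len(ys) - 1.0) < 1e-12
```

Note that the sketch uses the sample's own statistics; the difficulty discussed above is precisely that a training procedure never observes the full distribution at once.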
Let x ∈ R^N be a finite-dimensional vector whose components approximate the normalizer's input distribution. In the geometric setting, normalization is a function defined on R^N. Its output y satisfies both conditions of (2). The zero mean condition is satisfied on the subspace 1⊥ orthogonal to the ones vector, whereas the unit variance condition is satisfied on the sphere S^{N−1} with radius √N (Figure 1a). Therefore y lies on the manifold S^{N−2} = 1⊥ ∩ S^{N−1}.\nClearly, mapping R^N to a sphere is nonlinear. The forward pass (1) does this in two steps: It subtracts the same value from all components of x, which is the orthogonal projection P_1⊥; then it rescales the result to S^{N−1}.\n\nFigure 2: Two element normalization (N=2).\n\nFigure 3: Gradient bias (BN).\n\nIn contrast, the backward pass (3) is linear because the chain rule produces a product of Jacobians. The Jacobian J = [∂yj/∂xi] must suppress gradient components that would move y off the manifold's tangent space. S^{N−2} is a sphere embedded in a subspace, so its tangent space Ty at y is orthogonal to both the sphere's radius y and the subspace's complement 1:\n\nx′ = J y′  ⟹  P_1(x′) = P_y(x′) = 0 .    (4)\n\nBecause (1) is the composition of two steps, J is a product of two factors (Figure 1b). The unbiasing step P_1⊥ is linear and therefore is also its own Jacobian. The scaling step is isotropic in y⊥ and therefore its Jacobian acts equally on all components in y⊥, scaling them by 1/σ. 
The remaining y component must be suppressed (4), resulting in:\n\nJ = (1/σ) P_1⊥ P_y⊥  ⟹  x′ = (1/σ) (I − P_1) (I − P_y) y′ .    (5)\n\nThis is the exact expression for backpropagation through the normalization operator. It is also possible to reach the same conclusion algebraically [5] (Appendix B).\nThe input x is a continuous function of the neural network's weights and dataset distribution. During training, the incremental weight updates cause x to drift. Meanwhile, normalization is only presented with a single scalar component of x while the other components remain unknown. Online Normalization handles this with an online control process that examines a single sample per step while ensuring (5) is always approximately satisfied throughout training.\n\n3.2 Bias in gradient estimates\n\nAlthough normalization applies an affine transformation, it has a nonlinear dependence on the input distribution X. Therefore, sampling the gradient of a normalized network with mini-batches results in biased estimates. This effect becomes more pronounced for smaller mini-batch sizes. Consider the extreme case of normalizing a fully connected layer with batch size two (Figure 2). Each pair of samples is transformed to either (−1, +1) or (+1, −1), resulting in a piecewise constant surface. Since the output is discrete, the corresponding gradient is zero almost everywhere. Of course, the true gradient is nonzero almost everywhere and therefore cannot be recovered from any number of batch-two evaluations.\nThe same effect can be seen in more realistic cases. Figure 3 shows gradient bias as a function of batch size measured for a convolutional network with the CIFAR-10 dataset [15]. 
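Returning to (5): the exact backward pass can be exercised numerically. The sketch below (plain Python, illustrative vectors) applies (5) to an arbitrary upstream gradient and checks both conditions in (4): the result has zero mean and is orthogonal to y.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def project_out(v, d):
    # (I - P_d): remove from v its component along direction d.
    c = dot(v, d) / dot(d, d)
    return [vi - c * di for vi, di in zip(v, d)]

x = [0.5, 2.0, -1.5, 3.0, 1.0]               # illustrative activations
n = len(x)
mu = sum(x) / n
sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / n)
y = [(xi - mu) / sigma for xi in x]          # forward pass (1)

y_grad = [0.3, -1.2, 0.7, 0.1, 2.0]          # arbitrary upstream gradient y'
ones = [1.0] * n

# Equation (5): x' = (1/sigma) (I - P_1) (I - P_y) y'
x_grad = [g / sigma for g in project_out(project_out(y_grad, y), ones)]

# Conditions (4): P_1(x') = P_y(x') = 0.
assert abs(sum(x_grad)) < 1e-9               # mean-shift direction suppressed
assert abs(dot(x_grad, y)) < 1e-9            # radial direction suppressed
```

Because y itself lies in 1⊥, the two projections commute, so either order yields a gradient in the tangent space.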
Ground truth for\nthis plot used all 50,000 images in the dataset with weights randomly initialized and \ufb01xed. Even in\nthis simple scenario, moderate batch sizes exhibit bias exceeding an angle of 10 degrees.\n\n3.3 Exploding and vanishing activations\n\nAll normalizers are presented with the task of calculating speci\ufb01c values of the af\ufb01ne coef\ufb01cients\n\u00b5[x] and \u03c3[x] for the forward pass (1). Exact computation of these coef\ufb01cients is impossible without\nprocessing the entire dataset. Therefore, SGD-based optimizers must admit errors in normalization\nstatistics. These errors are problematic for networks that have unbounded activation functions, such\nas ReLU. It is possible for the errors to amplify through the depth of the network causing exponential\ngrowth of activation magnitudes.\nFigure 4 shows exponential behavior for a 100-layer fully connected network with a synthetic dataset.\nIn each layer we compute exact af\ufb01ne coef\ufb01cients using the entire dataset. 
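The compounding mechanism behind this experiment can be caricatured in a few lines (illustrative numbers, not the paper's setup): if every layer's σ estimate is systematically 5% too small, each normalization multiplies the signal by 1/0.95 in expectation, and depth turns that into exponential growth.

```python
depth = 100            # layers, as in the 100-layer network above
shrink = 0.95          # estimated sigma = 0.95 * true sigma (5% underestimate)
magnitude = 1.0
for _ in range(depth):
    magnitude /= shrink        # each layer divides by a too-small sigma
print(f"relative activation magnitude after {depth} layers: {magnitude:.0f}x")
```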
We randomly perturb the coefficients before applying inference to assess the sensitivity to errors. Exponential behavior is easy to observe even with mild noise. This effect is particularly pronounced when variances σ² are systematically underestimated, in which case each layer amplifies the signal in expectation.\n\nFigure 4: Activation growth.\n\nFigure 5: Weight equilibrium.\n\nBatch Normalization does not exhibit exponential behavior. Although its estimates contain error, exact normalization of a batch of inputs imposes (2) as strict constraints on normalized output. For each layer, the largest possible output component is bounded by the square root of the batch size. Exponential behavior is precluded because this bound does not depend on the depth of the network. This property is also enjoyed by Layer Normalization and Group Normalization.\nAny successful online procedure will also need a mechanism to avoid exponential growth of activations. With a bounded activation function, such as tanh, this is achieved automatically. Layer scaling (Figure 4), which enforces the second equality of (2) across all features in a layer, is another possible mechanism that prevents both growth and decay of activations.\n\n3.4 Invariance to gradient scale\n\nWhen a normalizer follows a linear layer, the normalized output is invariant to the scale of the weights |w| [5, 6]. Scaling the weights by any constant is immediately absorbed by the normalizer. 
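This absorption is easy to verify directly; a small sketch (plain Python, illustrative weights and samples) compares one neuron's normalized outputs before and after rescaling its weights:

```python
import math

def normalize(xs):
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

# One linear neuron evaluated on a few samples (illustrative values).
samples = [[0.2, 1.0], [-0.7, 0.4], [1.5, -0.3], [0.1, 0.9]]
w = [0.8, -1.3]

def neuron_outputs(weights):
    return [sum(wi * si for wi, si in zip(weights, s)) for s in samples]

y1 = normalize(neuron_outputs(w))
y2 = normalize(neuron_outputs([5.0 * wi for wi in w]))   # rescaled weights

# The normalizer absorbs the weight scale: outputs are identical.
assert all(abs(a - b) < 1e-9 for a, b in zip(y1, y2))
```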
Therefore, ∂y/∂|w| is zero and gradient descent makes steps orthogonal to the weight vector (Figure 5). With a fixed learning rate η, a sequence of steps of size O(η) leads to unbounded growth of |w|. Each successive step has a decreasing relative effect on the weights, reducing the effective learning rate.\nOthers have observed that the L2 weight decay [16] commonly used in normalized networks counteracts the growth of |w|. In particular, [17] analyzes this phenomenon, although under a faulty assumption that gradients are not backpropagated through the mean and variance calculations. Instead, we observe that weight growth and decay are balanced when weights reach an equilibrium scale (Figure 5). We denote the gradient with respect to weights w′ and the increment in weights ∆w ≡ ηw′. When η and decay factor λ are small, solving for equilibrium yields (Appendix C):\n\n|w| = √(η / (2λ)) · E|w′| .    (6)\n\nThe equilibrium weight magnitude depends on η. When the weights are away from their equilibrium magnitude, such as at initialization and after each learning rate drop, the weights tend to either grow or diminish network-wide. This tendency can create a biased error in statistical estimates that can lead to exponential behavior (Section 3.3).\nScale invariance with respect to the weights means that the learning trajectory depends only on the ratio ∆w/|w|, and the problem can be arbitrarily reparametrized as long as this ratio is kept constant. This shows that L2 weight decay does not have a regularizing effect; it only corrects for the radial growth artifact introduced by the finite step size of SGD.\nWhen weights are in the equilibrium described by (6), learning dynamics are invariant to the scale of the distribution of gradients E|w′|: 
∆w / |w| = √(2ηλ) · w′ / E|w′| .    (7)\n\nWe also observe that the effective learning rate is √(2ηλ). This correspondence was independently observed by Page [18]. Practitioners tend to use linear scaling of the learning rate with batch size [19] while keeping the L2 regularization constant λ fixed. Equation (7) shows that this amounts to the square root scaling suggested earlier by Krizhevsky [20].\n\nFigure 6: Online Normalization.\n\n4 Online Normalization\n\nTo define Online Normalization (Figure 6), we replace arithmetic averages over the full dataset in (2) with exponentially decaying averages of online samples. Similarly, projections in (4) and (5) are computed over online data using exponentially decaying inner products. The decay factors αf and αb for forward and backward passes respectively are hyperparameters for the technique.\nWe allow incoming samples xt, such as images, to have multiple scalar components and denote feature-wide mean and variance by µ(xt) and σ²(xt). The algorithm also applies to outputs of fully connected layers with only one scalar output per feature. In fact, this case simplifies to µ(xt) = xt and σ(xt) = 0. We use scalars µt and σt to denote running estimates of mean and variance across all samples. The subscript t denotes time steps corresponding to processing new incoming samples.\nOnline Normalization uses an ongoing process during the forward pass to estimate activation means and variances. It implements the standard online computation of mean and variance [21, 22] generalized to processing multi-value samples and exponential averaging of sample statistics. 
The resulting estimates directly lead to an affine normalization transform.\n\nyt = (xt − µt−1) / σt−1    (8a)\nµt = αf µt−1 + (1 − αf) µ(xt)    (8b)\nσ²t = αf σ²t−1 + (1 − αf) σ²(xt) + αf (1 − αf) (µ(xt) − µt−1)²    (8c)\n\nThis process removes two degrees of freedom for each feature, which may be restored by adding another affine transform with adaptive bias and gain. The corresponding equations are standard in the normalization literature [4] and are not reproduced here. The forward pass concludes with a layer-scaling stage that uses data from all features to prevent exponential growth (Section 3.3):\n\nzt = yt / ζt   with   ζt = √µ({y²t}) ,    (9)\n\nwhere {·} includes all features.\nThe backward pass proceeds in reverse order, starting with the exact gradient of layer scaling:\n\ny′t = (z′t − zt µ({zt z′t})) / ζt .    (10)\n\n[Figure 6 block diagram: a forward estimator process (affine norm (8a), mean tracker (8b), variance tracker (8c), layer scaling (9)) followed by a backward control process (y-projection (11a), y-error accumulator (11b), ⊥-projection (12a), ⊥-error accumulator (12b)), with an optional trainable affine stage.]
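A minimal sketch of the forward estimator (8a)-(8c) for a single scalar feature, in plain Python (the decay value and input stream are illustrative; layer scaling (9), the backward control process, and the multi-feature generalization from the reference code [3] are omitted):

```python
import math
import random

class OnlineNormScalar:
    """Forward estimator (8a)-(8c) for one feature; with scalar inputs,
    mu(x_t) = x_t and sigma^2(x_t) = 0."""

    def __init__(self, alpha_f=0.99):
        self.alpha_f = alpha_f
        self.mu = 0.0        # running mean estimate mu_t
        self.var = 1.0       # running variance estimate sigma_t^2

    def forward(self, x_t):
        y_t = (x_t - self.mu) / math.sqrt(self.var)            # (8a)
        a = self.alpha_f
        # (8c) uses mu_{t-1}, so update the variance before the mean.
        self.var = a * self.var + a * (1 - a) * (x_t - self.mu) ** 2
        self.mu = a * self.mu + (1 - a) * x_t                  # (8b)
        return y_t

random.seed(0)
norm = OnlineNormScalar(alpha_f=0.99)
ys = [norm.forward(random.gauss(5.0, 2.0)) for _ in range(6000)]

# After warm-up the outputs are approximately standardized.
tail = ys[2000:]
mean = sum(tail) / len(tail)
second = sum(y * y for y in tail) / len(tail)
assert abs(mean) < 0.15
assert abs(second - 1.0) < 0.25
```

Unlike a batch normalizer, each sample is normalized with statistics estimated from strictly earlier samples, which is what makes the unbiased single-sample gradient treatment above possible.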
[Figure annotation: the ε(y) control is required for unbounded activation functions, e.g. ReLU.]

Table 1: Memory for training (GB).

  Network                 Batch 32   Batch 128   Online Norm
  ResNet-50, ImageNet         2          4            1
  ResNet-50, PyTorch^a        5         15            2
  U-Net, 150^3 voxels        29        115            1
  U-Net, 250^3 voxels       195        785            6
  U-Net, 1024^2 pixels       31        123            2
  U-Net, 2048^2 pixels      137        546            5

  ^a PyTorch stores multiple copies of activations for improved performance.

Table 2: Best validation: loss (accuracy %).

  Normalizer    CIFAR-10      CIFAR-100     ImageNet
                ResNet-20     ResNet-20     ResNet-50
  Online        0.26 (92.3)   1.12 (68.6)   0.94 (76.3)
  Batch^a       0.26 (92.2)   1.14 (68.6)   0.97 (76.4)
  Group         0.32 (90.3)   1.35 (63.3)   (75.9)^b
  Instance      0.31 (90.4)   1.32 (63.1)   (71.6)^b
  Layer         0.39 (87.4)   1.47 (59.2)   (74.7)^b
  Weight        -             -             (67)^b
  Propagation   -             -             (71.9)^b

  ^a Batch size 128 for CIFAR and 32 for ImageNet.
  ^b Data from [7, 23, 24].

The backward pass continues through per-feature normalization (8) using a control mechanism to back out projections defined by (5).
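Before stating the updates formally, the control mechanism can be summarized with a minimal NumPy sketch; it follows the two-step correction derived in Eqs. (11)-(12) below. Class, variable names, and shapes are illustrative assumptions, not the reference implementation [3].

```python
import numpy as np

class OnlineNormBackward:
    """Sketch of the backward control mechanism (Eqs. 11-12).

    Illustrative only: names and shapes are assumptions. Processes one
    sample per step; mu(.) is the mean over a feature's spatial elements.
    """

    def __init__(self, num_features, alpha_b=0.99):
        self.alpha_b = alpha_b
        self.eps_y = np.zeros((num_features, 1))  # accumulated error eps^(y)
        self.eps_1 = np.zeros((num_features, 1))  # accumulated error eps^(1)

    def step(self, grad_out, y, sigma_prev):
        """grad_out, y: (num_features, spatial); sigma_prev: (num_features, 1)."""
        # (11a): back out the projection onto y
        grad_tilde = grad_out - (1.0 - self.alpha_b) * self.eps_y * y
        # (11b): track the residual deviation from orthogonality to y
        self.eps_y = self.eps_y + (grad_tilde * y).mean(axis=1, keepdims=True)
        # (12a): undo the forward scaling and back out the mean component
        grad_in = grad_tilde / sigma_prev - (1.0 - self.alpha_b) * self.eps_1
        # (12b): track the residual deviation from the mean-zero condition
        self.eps_1 = self.eps_1 + grad_in.mean(axis=1, keepdims=True)
        return grad_in
```

Note that the state carried between steps is only the two per-feature error accumulators, so no specific prior activations need to be stored.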
We do it in two steps, controlling for orthogonality to \vec{y} first

    \tilde{x}'_t = y'_t - (1 - \alpha_b) \varepsilon^{(y)}_{t-1} y_t          (11a)
    \varepsilon^{(y)}_t = \varepsilon^{(y)}_{t-1} + \mu(\tilde{x}'_t y_t)     (11b)

and then for the mean-zero condition

    x'_t = \tilde{x}'_t / \sigma_{t-1} - (1 - \alpha_b) \varepsilon^{(1)}_{t-1}   (12a)
    \varepsilon^{(1)}_t = \varepsilon^{(1)}_{t-1} + \mu(x'_t) .                   (12b)

Gradient scale invariance (Section 3.4) shows that scaling with the running estimate of input variance \sigma_{t-1} in (12a) is optional and can be replaced by rescaling the output x'_t with a running average to force it to the unit norm in expectation.

Formal Properties  Online Normalization provides arbitrarily good approximations of ideal normalization and its gradient. The quality of approximation is controlled by the hyperparameters α_f, α_b, and the learning rate η. Parameters α_f and α_b determine the extent of temporal averaging and η controls the rate of change of the input distribution. Online Normalization also satisfies the gradient's orthogonality requirements. In the course of training, the accumulated errors ε^(y)_t and ε^(1)_t that track deviation from orthogonality (5) remain bounded. Formal derivations are in Appendix D.

Memory Requirements  Networks that use Batch Normalization tend to train poorly with small batches. Larger batches are required for accurate estimates of parameter gradients, but activation memory usage increases linearly with batch size. This limits the size of models that can be trained on a given system.
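The linear growth of activation memory with batch size can be made concrete with a rough accounting sketch; the layer shapes below are invented for illustration and do not correspond to the configurations behind Table 1.

```python
def activation_bytes(feature_shapes, batch_size, bytes_per_element=4):
    """Rough estimate: one fp32 activation tensor per layer, each of shape
    (batch_size, *feature_shape), all retained for the backward pass."""
    total = 0
    for shape in feature_shapes:
        elems = batch_size
        for dim in shape:
            elems *= dim
        total += elems * bytes_per_element
    return total

# Hypothetical 3-D feature maps: (channels, depth, height, width)
shapes = [(32, 128, 128, 128), (64, 64, 64, 64), (128, 32, 32, 32)]

batched = activation_bytes(shapes, batch_size=32)  # batch training
online = activation_bytes(shapes, batch_size=1)    # batch-free training
assert batched == 32 * online  # memory grows linearly with batch size
```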
Online Normalization achieves the same accuracy without requiring batches (Section 5). Table 1 shows that using batches for classification of 2D images leads to a considerable increase in the memory footprint; for 3D volumes, batching becomes prohibitive even with modestly sized images.

5 Experiments

We demonstrate Online Normalization in a variety of settings. In our experience it has ported easily to new networks and tasks. Details for replicating experiments, as well as statistical characterization of experiment reproducibility, are in Appendix A. Scripts to reproduce our results are in the companion repository [3].

CIFAR image classification (Figures 7-8, Table 2). Our experiments start with the best-published hyperparameter settings for ResNet-20 [2] for use with Batch Normalization on a single GPU. We accept these hyperparameters as fixed values for use with Online Normalization. Online Normalization introduces two hyperparameters, the decay rates α_f and α_b. We used a logarithmic grid sweep to determine good settings. Then we ran five independent trials for each normalizer. Online Normalization had the best validation performance of all compared methods.

Figure 7: CIFAR-10 / ResNet-20.
Figure 8: CIFAR-100 / ResNet-20.
Figure 9: ImageNet / ResNet-50.
Figure 10: Image Segmentation with U-Net.

ImageNet image classification (Figure 9, Table 2). For the ResNet-50 [2] experiment, we report the single experimental run that we conducted. This trial used decay factors chosen based on the CIFAR experiments. Even better results should be possible with a sweep. Our training procedure is based on a protocol tuned for Batch Normalization [25]. Even without tuning, Online Normalization achieves the best validation loss of all methods. At validation time it is nearly as accurate as Batch Normalization, and both methods are better than the other compared methods.

U-Net image segmentation (Figure 10).
The U-Net [26] architecture has applications in segmenting 2D and 3D images. It has been applied to volumetric segmentation in 3D scans [27]. Volumetric convolutions require large amounts of activation memory (Table 1), making Batch Normalization impractical. Our small-scale experiment performs image segmentation on a synthetic shape dataset [28]. Online Normalization achieves the best Jaccard similarity coefficient among compared methods.

Figure 11: FMNIST with MLP.
Figure 12: RNN (dashed) and LSTM (solid).

Fully-connected network (Figure 11). Online Normalization also works when normalizer inputs are single scalars. We used a three-layer fully connected network, 500+300 HU [29], for the Fashion MNIST [30] classification task. Fashion MNIST is a harder task than MNIST digit recognition, and therefore provides more discriminative power in our comparison. The initial learning trajectory shows that Online Normalization outperforms the other normalizers.

Recurrent language modeling (Figure 12). Online Normalization works without modification in recurrent networks. It maintains statistics using information from all previous samples and time steps. This information is representative of the distribution of all recurrent activations, allowing Online Normalization to work in the presence of circular dependencies (Section 2). We train word-based language models of PTB [31] using a single-layer RNN and LSTM. The LSTM network uses normalization on the four gate activation functions, but not the memory cell. This allows the memory cell to encode a persistent state for unbounded time without normalization forcing it to zero mean. In both the RNN and LSTM, Online Normalization performs better than the other methods.
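The gate-only placement just described can be sketched as follows. The simple exponential-running-statistics normalizer here is a stand-in for the full method (which also corrects the backward pass), and all names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def online_norm(v, stats, alpha=0.99, eps=1e-5):
    """Stand-in sample-dimension normalizer using running mean/variance.
    Illustrative only; the full method also corrects the gradient."""
    mu, var = stats
    out = (v - mu) / np.sqrt(var + eps)
    stats[0] = alpha * mu + (1.0 - alpha) * v
    stats[1] = alpha * var + (1.0 - alpha) * (v - mu) ** 2
    return out

def lstm_step(x, h, c, W, U, b, gate_stats):
    """One LSTM step normalizing the four gate pre-activations but not the
    memory cell c, which keeps its raw scale to hold a persistent state."""
    z = W @ x + U @ h + b              # stacked gate pre-activations, (4*H,)
    i, f, g, o = np.split(z, 4)
    i = sigmoid(online_norm(i, gate_stats[0]))  # input gate
    f = sigmoid(online_norm(f, gate_stats[1]))  # forget gate
    g = np.tanh(online_norm(g, gate_stats[2]))  # candidate update
    o = sigmoid(online_norm(o, gate_stats[3]))  # output gate
    c = f * c + i * g                  # cell update: no normalization here
    h = o * np.tanh(c)
    return h, c
```

Keeping the cell update unnormalized is what lets c drift away from zero mean and persist across arbitrarily many time steps.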
Remarkably, the RNN using Online Normalization performs nearly as well as the unnormalized LSTM.

6 Conclusion

Online Normalization is a robust normalizer that performs competitively with the best normalizers for large-scale networks and works for cases where other normalizers do not apply. The technique is formally derived and straightforward to implement. The gradient of normalization is remarkably simple: it is only a linear projection and scaling.

There have been concerns in the field that normalization violates the paradigm of SGD [5, 8, 9]. A main tenet of SGD is that noisy measurements can be averaged to the true value of the gradient. Batch Normalization has a fundamental gradient bias, dependent on the batch size, that cannot be eliminated by additional averaging or reduction in the learning rate. Because Batch Normalization requires batches, it leaves the value of the gradient for any individual input undefined. This within-batch computation has been seen as biologically implausible [11].

In contrast, we have shown that the normalization operator and its gradient can be implemented locally within individual neurons. The computation does not require keeping track of specific prior activations. Additionally, normalization allows neurons to locally maintain input weights at any scale of choice, without coordinating with other neurons. Finally, any gradient signal generated by the neuron is also scale-free and independent of the gradient scale employed by other neurons.
In aggregate, ideal normalization (1) provides stability and localized computation for all three phases of gradient descent: forward propagation, backward propagation, and weight update. Other methods do not have this property. For instance, Layer Normalization requires layer-wide communication, and Batch Normalization is implemented by computing within-batch dependencies.

We expect normalization to remain important as the community continues to explore larger and deeper networks. Memory will become even more precious in this scenario. Online Normalization enables batch-free training, resulting in over an order of magnitude reduction of activation memory.

Acknowledgments

We thank Rob Schreiber, Gary Lauterbach, Natalia Vassilieva, Andy Hock, Scott James and Xin Wang for their help and comments that greatly improved the manuscript. We thank Devansh Arpit for insightful discussions. We also thank Natalia Vassilieva for modeling memory requirements for U-Net and Michael Kural for work on this project during his internship.

References

[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770-778, 2016.

[3] Vitaliy Chiley, Michael James, and Ilya Sharapov. Online Normalization reference implementation. https://github.com/cerebras/online-normalization, 2019.

[4] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[5] Sergey Ioffe.
Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. CoRR, abs/1702.03275, 2017.

[6] Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.

[7] Yuxin Wu and Kaiming He. Group normalization. CoRR, abs/1803.08494, 2018.

[8] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR, abs/1602.07868, 2016.

[9] Devansh Arpit, Yingbo Zhou, Bhargava Urala Kota, and Venu Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1168-1176, 2016.

[10] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[11] Qianli Liao, Kenji Kawaguchi, and Tomaso A. Poggio. Streaming normalization: Towards simpler and more biologically-plausible normalizations for online and recurrent learning. CoRR, abs/1610.06160, 2016.

[12] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron C. Courville. Recurrent batch normalization. CoRR, abs/1603.09025, 2016.

[13] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio. Batch normalized recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2657-2661, March 2016.

[14] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y.
Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595, 2015.

[15] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). http://www.cs.toronto.edu/~kriz/cifar.html.

[16] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Proceedings of the 4th International Conference on Neural Information Processing Systems, NIPS'91, pages 950-957, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc.

[17] Twan van Laarhoven. L2 regularization versus batch and weight normalization. CoRR, abs/1706.05350, 2017.

[18] David Page. How to train your ResNet. https://www.myrtle.ai/2018/09/24/how_to_train_your_resnet/, 2018.

[19] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.

[20] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.

[21] Tony Finch. Incremental calculation of weighted mean and variance. http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf, 2009.

[22] Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37:242-247, 1983.

[23] Igor Gitman and Boris Ginsburg. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. CoRR, abs/1709.08145, 2017.

[24] Wenling Shang, Justin Chiu, and Kihyuk Sohn.
Exploring normalization in deep residual networks with concatenated rectified linear units. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 1509-1516. AAAI Press, 2017.

[25] ResNet in TensorFlow. https://github.com/tensorflow/models/tree/r1.9.0/official/resnet, 2018.

[26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.

[27] Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-net: Learning dense volumetric segmentation from sparse annotation. CoRR, abs/1606.06650, 2016.

[28] Naoto Usuyama. Simple PyTorch implementations of U-Net/FullyConvNet for image segmentation. https://github.com/usuyama/pytorch-unet, 2018.

[29] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[30] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

[31] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT '94, pages 114-119, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics.

[32] The MNIST Database. http://yann.lecun.com/exdb/mnist/.

[33] Ofir Press and Lior Wolf. Using the output embedding to improve language models. CoRR, abs/1608.05859, 2016.

[34] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages III-1139-III-1147.
JMLR.org, 2013.