{"title": "Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1945, "page_last": 1953, "abstract": "Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d. minibatches. At the same time, Batch Renormalization retains the benefits of batchnorm such as insensitivity to initialization and training efficiency.", "full_text": "Batch Renormalization: Towards Reducing\n\nMinibatch Dependence in Batch-Normalized Models\n\nSergey Ioffe\n\nGoogle\n\nsioffe@google.com\n\nAbstract\n\nBatch Normalization is quite effective at accelerating and improving the training\nof deep models. However, its effectiveness diminishes when the training mini-\nbatches are small, or do not consist of independent samples. We hypothesize that\nthis is due to the dependence of model layer inputs on all the examples in the\nminibatch, and different activations being produced between training and infer-\nence. We propose Batch Renormalization, a simple and effective extension to\nensure that the training and inference models generate the same outputs that de-\npend on individual examples rather than the entire minibatch. 
Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d. minibatches. At the same time, Batch Renormalization retains the benefits of batchnorm such as insensitivity to initialization and training efficiency.\n\n1\n\nIntroduction\n\nBatch Normalization (\u201cbatchnorm\u201d [6]) has recently become a part of the standard toolkit for training deep networks. By normalizing activations, batch normalization helps stabilize the distributions of internal activations as the model trains. Batch normalization also makes it possible to use significantly higher learning rates, and reduces the sensitivity to initialization. These effects help accelerate the training, sometimes dramatically so. Batchnorm has been successfully used to enable state-of-the-art architectures such as residual networks [5].\nBatchnorm works on minibatches in stochastic gradient training, and uses the mean and variance of the minibatch to normalize the activations. Specifically, consider a particular node in the deep network, producing a scalar value for each input example. Given a minibatch B of m examples, consider the values of this node, x1 . . . xm. Then batchnorm takes the form:\n\nx\u0302i \u2190 (xi \u2212 \u00b5B) / \u03c3B\n\nwhere \u00b5B is the sample mean of x1 . . . xm, and \u03c3\u00b2B is the sample variance (in practice, a small \u03b5 is added to it for numerical stability). It is clear that the normalized activations corresponding to an input example will depend on the other examples in the minibatch. 
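To make this minibatch dependence concrete, here is a minimal NumPy sketch (our own illustration, not from the paper): perturbing one example changes the normalized value of a *different* example, because both share \u00b5B and \u03c3B.

```python
import numpy as np

def bn_normalize(x, eps=1e-5):
    """Normalize a minibatch of scalar activations with its own statistics."""
    mu_b = x.mean()
    sigma_b = np.sqrt(x.var() + eps)
    return (x - mu_b) / sigma_b

rng = np.random.default_rng(0)
batch = rng.normal(size=8)
altered = batch.copy()
altered[0] += 10.0          # perturb example 0 only...

y = bn_normalize(batch)
y_alt = bn_normalize(altered)
# ...and the normalized value of example 1 changes, even though x1 itself did not.
assert not np.isclose(y[1], y_alt[1])
```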
This is undesirable during\ninference, and therefore the mean and variance computed over all training data can be used instead.\nIn practice, the model usually maintains moving averages of minibatch means and variances, and\nduring inference uses those in place of the minibatch statistics.\nWhile it appears to make sense to replace the minibatch statistics with whole-data ones during\ninference, this changes the activations in the network. In particular, this means that the upper layers\n(whose inputs are normalized using the minibatch) are trained on representations different from\nthose computed in inference (when the inputs are normalized using the population statistics). When\nthe minibatch size is large and its elements are i.i.d. samples from the training distribution, this\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fdifference is small, and can in fact aid generalization. However, minibatch-wise normalization may\nhave signi\ufb01cant drawbacks:\nFor small minibatches, the estimates of the mean and variance become less accurate. These inaccu-\nracies are compounded with depth, and reduce the quality of resulting models. Moreover, as each\nexample is used to compute the variance used in its own normalization, the normalization operation\nis less well approximated by an af\ufb01ne transform, which is what is used in inference.\nNon-i.i.d. minibatches can have a detrimental effect on models with batchnorm. For example, in a\nmetric learning scenario (e.g. [4]), it is common to bias the minibatch sampling to include sets of\nexamples that are known to be related. For instance, for a minibatch of size 32, we may randomly\nselect 16 labels, then choose 2 examples for each of those labels. Without batchnorm, the loss\ncomputed for the minibatch decouples over the examples, and the intra-batch dependence introduced\nby our sampling mechanism may, at worst, increase the variance of the minibatch gradient. 
With\nbatchnorm, however, the examples interact at every layer, which may cause the model to over\ufb01t to\nthe speci\ufb01c distribution of minibatches and suffer when used on individual examples.\nThe dependence of the batch-normalized activations on the entire minibatch makes batchnorm pow-\nerful, but it is also the source of its drawbacks. Several approaches have been proposed to alleviate\nthis. However, unlike batchnorm which can be easily applied to an existing model, these methods\nmay require careful analysis of nonlinearities [1] and may change the class of functions representable\nby the model [2]. Weight normalization [11] presents an alternative, but does not offer guarantees\nabout the activations and gradients when the model contains arbitrary nonlinearities, or contains lay-\ners without such normalization. Furthermore, weight normalization has been shown to bene\ufb01t from\nmean-only batch normalization, which, like batchnorm, results in different outputs during training\nand inference. Another alternative [10] is to use a separate and \ufb01xed minibatch to compute the nor-\nmalization parameters, but this makes the training more expensive, and does not guarantee that the\nactivations outside the \ufb01xed minibatch are normalized.\nIn this paper we propose Batch Renormalization, a new extension to batchnorm. Our method ensures\nthat the activations computed in the forward pass of the training step depend only on a single example\nand are identical to the activations computed in inference. This signi\ufb01cantly improves the training\non non-i.i.d. or small minibatches, compared to batchnorm, without incurring extra cost.\n\n2 Prior Work: Batch Normalization\n\nWe are interested in stochastic gradient optimization of deep networks. 
The task is to minimize the loss, which decomposes over training examples:\n\n\u0398 = arg min_\u0398 (1/N) \u2211_{i=1}^{N} \u2113i(\u0398)\n\nwhere \u2113i is the loss incurred on the ith training example, and \u0398 is the vector of model weights. At each training step, a minibatch of m examples is used to compute the gradient\n\n(1/m) \u2211_{i=1}^{m} \u2202\u2113i(\u0398)/\u2202\u0398\n\nwhich the optimizer uses to adjust \u0398.\nConsider a particular node x in a deep network. We observe that x depends on all the model parameters that are used for its computation, and when those change, the distribution of x also changes. Since x itself affects the loss through all the layers above it, this change in distribution complicates the training of the layers above. This has been referred to as internal covariate shift. Batch Normalization [6] addresses it by considering the values of x in a minibatch B = {x1...m}. It then\n\n2\n\n\fnormalizes them as follows:\n\n\u00b5B \u2190 (1/m) \u2211_{i=1}^{m} xi\n\u03c3B \u2190 \u221a((1/m) \u2211_{i=1}^{m} (xi \u2212 \u00b5B)\u00b2 + \u03b5)\nx\u0302i \u2190 (xi \u2212 \u00b5B) / \u03c3B\nyi \u2190 \u03b3x\u0302i + \u03b2 \u2261 BN(xi)\n\nHere \u03b3 and \u03b2 are trainable parameters (learned using the same procedure, such as stochastic gradient descent, as all the other model weights), and \u03b5 is a small constant. Crucially, the computation of the sample mean \u00b5B and sample standard deviation \u03c3B are part of the model architecture, are themselves functions of the model parameters, and as such participate in backpropagation. The backpropagation formulas for batchnorm are easy to derive by chain rule and are given in [6].\nWhen applying batchnorm to a layer of activations x, the normalization takes place independently for each dimension (or, in the convolutional case, for each channel or feature map). 
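The four batchnorm equations above can be sketched in NumPy as follows (a minimal illustration under our own naming; a real implementation would also track moving averages and backpropagate through \u00b5B and \u03c3B):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-3):
    """Training-mode batchnorm: x has shape (m, d); gamma, beta have shape (d,).
    Normalization is independent per dimension, as in the text."""
    mu_b = x.mean(axis=0)                                    # sample mean
    sigma_b = np.sqrt(((x - mu_b) ** 2).mean(axis=0) + eps)  # sample std, with eps
    x_hat = (x - mu_b) / sigma_b                             # normalized activations
    y = gamma * x_hat + beta                                 # scale and shift
    return y, x_hat

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 5))
y, x_hat = batchnorm_forward(x, gamma=np.ones(5), beta=np.zeros(5))
```

After the transform, each dimension of `x_hat` has zero mean and (up to the \u03b5 term) unit variance over the minibatch.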
When x is itself a result of applying a linear transform W to the previous layer, batchnorm makes the model invariant to the scale of W (ignoring the small \u03b5). This invariance makes it possible to not be picky about weight initialization, and to use larger learning rates.\nBesides the reduction of internal covariate shift, an intuition for another effect of batchnorm can be obtained by considering the gradients with respect to different layers. Consider the normalized layer x\u0302, whose elements all have zero mean and unit variance. For a thought experiment, let us assume that the dimensions of x\u0302 are independent. Further, let us approximate the loss \u2113(x\u0302) as its first-order Taylor expansion: \u2113 \u2248 \u21130 + g\u1d40x\u0302, where g = \u2202\u2113/\u2202x\u0302. It then follows that Var[\u2113] \u2248 \u2016g\u2016\u00b2, in which the left-hand side does not depend on the layer we picked. This means that the norm of the gradient w.r.t. a normalized layer, \u2016\u2202\u2113/\u2202x\u0302\u2016, is approximately the same for different normalized layers. Therefore the gradients, as they flow through the network, do not explode nor vanish, thus facilitating the training. While the assumptions of independence and linearity do not hold in practice, the gradient flow is in fact significantly improved in batch-normalized models.\nDuring inference, the standard practice is to normalize the activations using the moving averages \u00b5, \u03c3\u00b2 instead of minibatch mean \u00b5B and variance \u03c3\u00b2B:\n\nyinference = (x \u2212 \u00b5)/\u03c3 \u00b7 \u03b3 + \u03b2\n\nwhich depends only on a single input example rather than requiring a whole minibatch.\nIt is natural to ask whether we could simply use the moving averages \u00b5, \u03c3 to perform the normalization during training, since this would remove the dependence of the normalized activations on the other examples in the minibatch. This, however, has been observed to lead to the model blowing up. As argued in [6], such use of moving averages would cause the gradient optimization and the normalization to counteract each other. For example, the gradient step may increase a bias or scale the convolutional weights, in spite of the fact that the normalization would cancel the effect of these changes on the loss. This would result in unbounded growth of model parameters without actually improving the loss. It is thus crucial to use the minibatch moments, and to backpropagate through them.\n\n3 Batch Renormalization\n\nWith batchnorm, the activities in the network differ between training and inference, since the normalization is done differently between the two models. 
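Since inference-time batchnorm is a fixed per-dimension affine transform, it can be folded into a single scale and shift. A small sketch (our own helper name `fold_bn`, not from the paper):

```python
import numpy as np

# y = gamma * (x - mu) / sigma + beta  ==  scale * x + shift
def fold_bn(gamma, beta, mu, sigma):
    """Fold inference-mode batchnorm into one per-dimension affine transform."""
    scale = gamma / sigma
    shift = beta - gamma * mu / sigma
    return scale, shift

gamma, beta = np.array([2.0, 0.5]), np.array([0.1, -0.3])
mu, sigma = np.array([1.0, -2.0]), np.array([3.0, 0.7])  # moving averages

scale, shift = fold_bn(gamma, beta, mu, sigma)
x = np.array([0.5, 4.0])
y_direct = gamma * (x - mu) / sigma + beta
y_folded = scale * x + shift
assert np.allclose(y_direct, y_folded)
```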
Here, we aim to rectify this, while retaining the benefits of batchnorm.\n\n3\n\n\fInput: Values of x over a training mini-batch B = {x1...m}; parameters \u03b3, \u03b2; current moving mean \u00b5 and standard deviation \u03c3; moving average update rate \u03b1; maximum allowed corrections rmax, dmax.\nOutput: {yi = BatchRenorm(xi)}; updated \u00b5, \u03c3.\n\n\u00b5B \u2190 (1/m) \u2211_{i=1}^{m} xi\n\u03c3B \u2190 \u221a(\u03b5 + (1/m) \u2211_{i=1}^{m} (xi \u2212 \u00b5B)\u00b2)\nr \u2190 stop_gradient(clip_[1/rmax, rmax](\u03c3B/\u03c3))\nd \u2190 stop_gradient(clip_[\u2212dmax, dmax]((\u00b5B \u2212 \u00b5)/\u03c3))\nx\u0302i \u2190 (xi \u2212 \u00b5B)/\u03c3B \u00b7 r + d\nyi \u2190 \u03b3x\u0302i + \u03b2\n\n\u00b5 := \u00b5 + \u03b1(\u00b5B \u2212 \u00b5)   // Update moving averages\n\u03c3 := \u03c3 + \u03b1(\u03c3B \u2212 \u03c3)\n\nInference: y \u2190 \u03b3 \u00b7 (x \u2212 \u00b5)/\u03c3 + \u03b2\n\nAlgorithm 1: Training (top) and inference (bottom) with Batch Renormalization, applied to activation x over a mini-batch. During backpropagation, standard chain rule is used. The values marked with stop_gradient are treated as constant for a given training step, and the gradient is not propagated through them.\n\nLet us observe that if we have a minibatch and normalize a particular node x using either the minibatch statistics or their moving averages, then the results of these two normalizations are related by an affine transform. Specifically, let \u00b5 be an estimate of the mean of x, and \u03c3 be an estimate of its standard deviation, computed perhaps as a moving average over the last several minibatches. 
Then, we have:\n\n(xi \u2212 \u00b5)/\u03c3 = (xi \u2212 \u00b5B)/\u03c3B \u00b7 r + d,  where r = \u03c3B/\u03c3, d = (\u00b5B \u2212 \u00b5)/\u03c3\n\nIf \u03c3 = E[\u03c3B] and \u00b5 = E[\u00b5B], then E[r] = 1 and E[d] = 0 (the expectations are w.r.t. a minibatch B). Batch Normalization, in fact, simply sets r = 1, d = 0.\nWe propose to retain r and d, but treat them as constants for the purposes of gradient computation. In other words, we augment a network, which contains batch normalization layers, with a per-dimension affine transformation applied to the normalized activations. We treat the parameters r and d of this affine transform as fixed, even though they were computed from the minibatch itself. It is important to note that this transform is identity in expectation, as long as \u03c3 = E[\u03c3B] and \u00b5 = E[\u00b5B]. We refer to batch normalization augmented with this affine transform as Batch Renormalization: the fixed (for the given minibatch) r and d correct for the fact that the minibatch statistics differ from the population ones. This allows the above layers to observe the \u201ccorrect\u201d activations \u2013 namely, the ones that would be generated by the inference model. We emphasize that, unlike the trainable parameters \u03b3, \u03b2 of batchnorm, the corrections r and d are not trained by gradient descent, and vary across minibatches since they depend on the statistics of the current minibatch.\nIn practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. 
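A minimal NumPy sketch of the resulting training-step transform, following Algorithm 1 (our own code; NumPy has no gradients, so the comments mark where stop_gradient would apply in an autodiff framework):

```python
import numpy as np

def batch_renorm_step(x, gamma, beta, mu, sigma, r_max, d_max, alpha=0.01, eps=1e-3):
    """One Batch Renormalization training-mode forward pass.
    x: (m, d) minibatch; mu, sigma: current moving statistics (updated copies
    are returned rather than mutated in place)."""
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(eps + ((x - mu_b) ** 2).mean(axis=0))
    # r and d are clipped and treated as constants (stop_gradient) during backprop
    r = np.clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = np.clip((mu_b - mu) / sigma, -d_max, d_max)
    x_hat = (x - mu_b) / sigma_b * r + d
    y = gamma * x_hat + beta
    mu_new = mu + alpha * (mu_b - mu)          # update moving averages
    sigma_new = sigma + alpha * (sigma_b - sigma)
    return y, mu_new, sigma_new

# With r_max = 1 and d_max = 0, r == 1 and d == 0: this reduces to plain batchnorm.
rng = np.random.default_rng(2)
x = rng.normal(size=(32, 4))
mu, sigma = np.zeros(4), np.ones(4)
y_bn, _, _ = batch_renorm_step(x, 1.0, 0.0, mu, sigma, r_max=1.0, d_max=0.0)
```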
We do this by imposing bounds on r and d, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed.\n\n4\n\n\fAlgorithm 1 presents Batch Renormalization. Unlike batchnorm, where the moving averages are computed during training but used only for inference, Batch Renorm does use \u00b5 and \u03c3 during training to perform the correction. We use a fairly high rate of update \u03b1 for these averages, to ensure that they benefit from averaging multiple batches but do not become stale relative to the model parameters. We explicitly update the exponentially-decayed moving averages \u00b5 and \u03c3, and optimize the rest of the model using gradient optimization, with the gradients calculated via backpropagation:\n\n\u2202\u2113/\u2202x\u0302i = \u2202\u2113/\u2202yi \u00b7 \u03b3\n\u2202\u2113/\u2202\u03c3B = \u2211_{i=1}^{m} \u2202\u2113/\u2202x\u0302i \u00b7 (xi \u2212 \u00b5B) \u00b7 (\u2212r/\u03c3\u00b2B)\n\u2202\u2113/\u2202\u00b5B = \u2211_{i=1}^{m} \u2202\u2113/\u2202x\u0302i \u00b7 (\u2212r/\u03c3B)\n\u2202\u2113/\u2202xi = \u2202\u2113/\u2202x\u0302i \u00b7 r/\u03c3B + \u2202\u2113/\u2202\u03c3B \u00b7 (xi \u2212 \u00b5B)/(m\u03c3B) + \u2202\u2113/\u2202\u00b5B \u00b7 1/m\n\u2202\u2113/\u2202\u03b3 = \u2211_{i=1}^{m} \u2202\u2113/\u2202yi \u00b7 x\u0302i\n\u2202\u2113/\u2202\u03b2 = \u2211_{i=1}^{m} \u2202\u2113/\u2202yi\n\nThese gradient equations reveal another interpretation of Batch Renormalization. Because the loss \u2113 is unaffected when all xi are shifted or scaled by the same amount, the functions \u2113({xi + t}) and \u2113({xi \u00b7 (1 + t)}) are constant in t, and computing their derivatives at t = 0 gives \u2211_{i=1}^{m} \u2202\u2113/\u2202xi = 0 and \u2211_{i=1}^{m} xi \u00b7 \u2202\u2113/\u2202xi = 0. Therefore, if we consider the m-dimensional vector {\u2202\u2113/\u2202xi} (with one element per example in the minibatch), and further consider two vectors p0 = (1, . . . , 1) and p1 = (x1, . . . , xm), then {\u2202\u2113/\u2202xi} lies in the null-space of p0 and p1. In fact, it is easy to see from the Batch Renorm backprop formulas that to compute the gradient {\u2202\u2113/\u2202xi} from {\u2202\u2113/\u2202x\u0302i}, we need to first scale the latter by r/\u03c3B, then project it onto the null-space of p0 and p1. For r = \u03c3B/\u03c3, this is equivalent to the backprop for the transformation (x \u2212 \u00b5)/\u03c3, but combined with the null-space projection. In other words, Batch Renormalization allows us to normalize using moving averages \u00b5, \u03c3 in training, and makes it work using the extra projection step in backprop.\nBatch Renormalization shares many of the beneficial properties of batchnorm, such as insensitivity to initialization and the ability to train efficiently with large learning rates. Unlike batchnorm, our method ensures that all layers are trained on internal representations that will be actually used during inference.\n\n4 Results\n\nTo evaluate Batch Renormalization, we applied it to the problem of image classification. Our baseline model is Inception v3 [13], trained on 1000 classes from the ImageNet training set [9], and evaluated on the ImageNet validation data. In the baseline model, batchnorm was used after convolution and before the ReLU [8]. To apply Batch Renorm, we simply swapped it into the model in place of batchnorm. Both methods normalize each feature map over examples as well as over spatial locations. We fix the scale \u03b3 = 1, since it could be propagated through the ReLU and absorbed into the next layer.\nThe training used 50 synchronized workers [3]. Each worker processed a minibatch of 32 examples per training step. 
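The synchronized-worker setup can be sketched as follows (a toy illustration of the aggregation structure only; `worker_gradient` is our own stand-in for per-worker backpropagation, and the RMSProp constants are hypothetical):

```python
import numpy as np

def worker_gradient(theta, seed):
    """Stand-in for the gradient one worker computes on its own 32-example
    minibatch (here just the parameters plus noise, for illustration)."""
    rng = np.random.default_rng(seed)
    return theta + rng.normal(scale=0.1, size=theta.shape)

theta = np.ones(10)
# Synchronous data parallelism: 50 workers each contribute one minibatch gradient.
grads = [worker_gradient(theta, seed=s) for s in range(50)]
agg_grad = np.mean(grads, axis=0)   # aggregated over 50 x 32 = 1600 examples

# RMSProp-style update on the aggregated gradient (accumulator starts at zero).
ms = 0.1 * agg_grad ** 2
theta = theta - 0.001 * agg_grad / np.sqrt(ms + 1e-8)
```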
The gradients computed for all 50 minibatches were aggregated and then used\nby the RMSProp optimizer [14]. As is common practice, the inference model used exponentially-\ndecayed moving averages of all model parameters, including the \u00b5 and \u03c3 computed by both batch-\nnorm and Batch Renorm.\nFor Batch Renorm, we used rmax = 1, dmax = 0 (i.e. simply batchnorm) for the \ufb01rst 5000 training\nsteps, after which these were gradually relaxed to reach rmax = 3 at 40k steps, and dmax = 5 at 25k\n\n5\n\n\f(a)\n\n(b)\n\nFigure 1: (a) Validation top-1 accuracy of Inception-v3 model with batchnorm and its Batch Renorm\nversion, trained on 50 synchronized workers, each processing minibatches of size 32. The Batch\nRenorm model achieves a marginally higher validation accuracy.\n(b) Validation accuracy for\nmodels trained with either batchnorm or Batch Renorm, where normalization is performed for sets\nof 4 examples (but with the gradients aggregated over all 50 \u00d7 32 examples processed by the 50\nworkers). Batch Renorm allows the model to train faster and achieve a higher accuracy, although\nnormalizing sets of 32 examples performs better.\n\nsteps. These \ufb01nal values resulted in clipping a small fraction of rs, and none of ds. However, at the\nbeginning of training, when the learning rate was larger, it proved important to increase rmax slowly:\notherwise, occasional large gradients were observed to suddenly and severely increase the loss. To\naccount for the fact that the means and variances change as the model trains, we used relatively fast\nupdates to the moving statistics \u00b5 and \u03c3, with \u03b1 = 0.01. 
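The relaxation schedule described above can be sketched as follows (our own code; the paper only says the limits were "gradually relaxed", so the linear ramps are an assumption, while the endpoints rmax = 1, dmax = 0 for 5k steps, rmax = 3 at 40k, and dmax = 5 at 25k are from the text):

```python
def correction_limits(step, warmup=5000, r_end=40000, d_end=25000,
                      r_final=3.0, d_final=5.0):
    """rmax/dmax schedule: batchnorm-equivalent limits during warm-up,
    then ramped (linearly, by assumption) to their final values."""
    def ramp(s, start, end, v0, v1):
        if s <= start:
            return v0
        if s >= end:
            return v1
        return v0 + (v1 - v0) * (s - start) / (end - start)
    r_max = ramp(step, warmup, r_end, 1.0, r_final)
    d_max = ramp(step, warmup, d_end, 0.0, d_final)
    return r_max, d_max
```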
Because of this, and because rmax was kept at 1 for a relatively large number of steps, we did not need to apply initialization bias correction [7].\nAll the hyperparameters other than those related to normalization were fixed between the models and across experiments.\n\n4.1 Baseline\n\nAs a baseline, we trained the batchnorm model using the minibatch size of 32. More specifically, batchnorm was applied to each of the 50 minibatches; each example was normalized using 32 examples, but the resulting gradients were aggregated over 50 minibatches. This model achieved the top-1 validation accuracy of 78.3% after 130k training steps.\nTo verify that Batch Renorm does not diminish performance on such minibatches, we also trained the model with Batch Renorm, see Figure 1(a). The test accuracy of this model closely tracked the baseline, achieving a slightly higher test accuracy (78.5%) after the same number of steps.\n\n4.2 Small minibatches\n\nTo investigate the effectiveness of Batch Renorm when training on small minibatches, we reduced the number of examples used for normalization to 4. Each minibatch of size 32 was thus broken into \u201cmicrobatches\u201d each having 4 examples; each microbatch was normalized independently, but the loss for each minibatch was computed as before. In other words, the gradient was still aggregated over 1600 examples per step, but the normalization involved groups of 4 examples rather than 32 as in the baseline. Figure 1(b) shows the results.\nThe validation accuracy of the batchnorm model is significantly lower than the baseline that normalized over minibatches of size 32, and training is slow, achieving 74.2% at 210k steps. We obtain a substantial improvement much faster (76.5% at 130k steps) by replacing batchnorm with Batch Renorm. However, the resulting test accuracy is still below what we get when applying either batchnorm or Batch Renorm to size 32 minibatches. 
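The microbatch normalization used in this experiment can be sketched as follows (our own code; `group_size = 4` as in the text, and \u03b3, \u03b2 are omitted for brevity):

```python
import numpy as np

def microbatch_normalize(x, group_size=4, eps=1e-3):
    """Normalize each 'microbatch' of group_size examples independently:
    a (32, d) minibatch becomes 8 groups of 4, each with its own statistics."""
    m, d = x.shape
    assert m % group_size == 0
    groups = x.reshape(m // group_size, group_size, d)
    mu = groups.mean(axis=1, keepdims=True)
    sigma = np.sqrt(eps + groups.var(axis=1, keepdims=True))
    return ((groups - mu) / sigma).reshape(m, d)

rng = np.random.default_rng(3)
x = rng.normal(size=(32, 6))
y = microbatch_normalize(x)
```

The loss (and hence the gradient) is still computed over the full minibatch; only the normalization statistics come from groups of 4.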
Although Batch Renorm improves the training with small minibatches, it does not eliminate the benefit of having larger ones.\n\n6\n\n\fFigure 2: Validation accuracy when training on non-i.i.d. minibatches, obtained by sampling 2 images for each of 16 (out of total 1000) random labels. This distribution bias results not only in a low test accuracy, but also low accuracy on the training set, with an eventual drop. This indicates overfitting to the particular minibatch distribution, which is confirmed by the improvement when the test minibatches also contain 2 images per label, and batchnorm uses minibatch statistics \u00b5B, \u03c3B during inference. It improves further if batchnorm is applied separately to 2 halves of a training minibatch, making each of them more i.i.d. Finally, by using Batch Renorm, we are able to just train and evaluate normally, and achieve the same validation accuracy as we get for i.i.d. minibatches in Fig. 1(a).\n\n4.3 Non-i.i.d. minibatches\n\nWhen examples in a minibatch are not sampled independently, batchnorm can perform rather poorly. However, sampling with dependencies may be necessary for tasks such as metric learning [4, 12]. We may want to ensure that images with the same label have more similar representations than otherwise, and to learn this we require that a reasonable number of same-label image pairs can be found within the same minibatch.\nIn this experiment (Figure 2), we selected each minibatch of size 32 by randomly sampling 16 labels (out of the total 1000) with replacement, then randomly selecting 2 images for each of those labels. When training with batchnorm, the test accuracy is much lower than for i.i.d. minibatches, achieving only 67%. Surprisingly, even the training accuracy is much lower (72.8%) than the test accuracy in the i.i.d. case, and in fact exhibits a drop that is consistent with overfitting. 
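The biased sampling procedure above can be sketched as follows (our own code; the image indices and the per-label pool size of 1300 are hypothetical stand-ins for dataset lookups):

```python
import numpy as np

def sample_non_iid_minibatch(labels_per_batch=16, images_per_label=2,
                             num_labels=1000, images_per_class=1300, rng=None):
    """Pick 16 of 1000 labels with replacement, then 2 image indices per chosen
    label, yielding a 32-example minibatch of (label, image) pairs."""
    rng = rng or np.random.default_rng()
    chosen = rng.integers(0, num_labels, size=labels_per_batch)
    return [(int(lbl), int(img))
            for lbl in chosen
            for img in rng.integers(0, images_per_class, size=images_per_label)]

batch = sample_non_iid_minibatch(rng=np.random.default_rng(4))
```

Every label that appears in such a minibatch appears at least twice, which is exactly the intra-batch dependence that hurts batchnorm here.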
We suspect that this is\nin fact what happens: the model learns to predict labels for images that come in a set, where each\nimage has a counterpart with the same label. This does not directly translate to classifying images\nindividually, thus producing a drop in the accuracy computed on the training data. To verify this,\nwe also evaluated the model in the \u201ctraining mode\u201d, i.e. using minibatch statistics \u00b5B, \u03c3B instead\nof moving averages \u00b5, \u03c3, where each test minibatch had size 50 and was obtained using the same\nprocedure as the training minibatches \u2013 25 labels, with 2 images per label. As expected, this does\nmuch better, achieving 76.5%, though still below the baseline accuracy. Of course, this evaluation\nscenario is usually infeasible, as we want the image representation to be a deterministic function of\nthat image alone.\nWe can improve the accuracy for this problem by splitting each minibatch into two halves of size\n16 each, so that for every pair of images belonging to the same class, one image is assigned to the\n\ufb01rst half-minibatch, and the other to the second. Each half is then more i.i.d., and this achieves a\nmuch better test accuracy (77.4% at 140k steps), but still below the baseline. This method is only\n\n7\n\n\fapplicable when the number of examples per label is small (since this determines the number of\nmicrobatches that a minibatch needs to be split into).\nWith Batch Renorm, we simply trained the model with minibatch size of 32. The model achieved\nthe same test accuracy (78.5% at 120k steps) as the equivalent model on i.i.d. minibatches, vs.\n67% obtained with batchnorm. By replacing batchnorm with Batch Renorm, we ensured that the\ninference model can effectively classify individual images. 
This has completely eliminated the effect\nof over\ufb01tting the model to image sets with a biased label distribution.\n\n5 Conclusions\n\nWe have demonstrated that Batch Normalization, while effective, is not well suited to small or\nnon-i.i.d. training minibatches. We hypothesized that these drawbacks are due to the fact that the\nactivations in the model, which are in turn used by other layers as inputs, are computed differently\nduring training than during inference. We address this with Batch Renormalization, which replaces\nbatchnorm and ensures that the outputs computed by the model are dependent only on the individual\nexamples and not the entire minibatch, during both training and inference.\nBatch Renormalization extends batchnorm with a per-dimension correction to ensure that the activa-\ntions match between the training and inference networks. This correction is identity in expectation;\nits parameters are computed from the minibatch but are treated as constant by the optimizer. Unlike\nbatchnorm, where the means and variances used during inference do not need to be computed until\nthe training has completed, Batch Renormalization bene\ufb01ts from having these statistics directly par-\nticipate in the training. Batch Renormalization is as easy to implement as batchnorm itself, runs at\nthe same speed during both training and inference, and signi\ufb01cantly improves training on small or\nnon-i.i.d. minibatches. Our method does have extra hyperparameters: the update rate \u03b1 for the mov-\ning averages, and the schedules for correction limits dmax, rmax. We have observed, however, that\nstable training can be achieved even without this clipping, by using a saturating nonlinearity such as\nmin(ReLU(\u00b7), 6), and simply turning on renormalization after an initial warm-up using batchnorm\nalone. 
A more extensive investigation of the effect of these parameters is a part of future work.\nBatch Renormalization offers a promise of improving the performance of any model that would\nnormally use batchnorm. This includes Residual Networks [5]. Another application is Generative\nAdversarial Networks [10], where the non-determinism introduced by batchnorm has been found to\nbe an issue, and Batch Renorm may provide a solution.\nFinally, Batch Renormalization may bene\ufb01t applications where applying batch normalization has\nbeen dif\ufb01cult \u2013 such as recurrent networks. There, batchnorm would require each timestep to be\nnormalized independently, but Batch Renormalization may make it possible to use the same running\naverages to normalize all timesteps, and then update those averages using all timesteps. This remains\none of the areas that warrants further exploration.\n\nReferences\n[1] Devansh Arpit, Yingbo Zhou, Bhargava U Kota, and Venu Govindaraju. Normalization prop-\nagation: A parametric technique for removing internal covariate shift in deep networks. arXiv\npreprint arXiv:1603.01431, 2016.\n\n[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint\n\narXiv:1607.06450, 2016.\n\n[3] Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed syn-\n\nchronous sgd. arXiv preprint arXiv:1604.00981, 2016.\n\n[4] Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighbourhood\n\ncomponents analysis. In Advances in Neural Information Processing Systems 17, 2004.\n\n[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-\nnition, pages 770\u2013778, 2016.\n\n8\n\n\f[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training\nby reducing internal covariate shift. 
In Proceedings of the 32nd International Conference on\nMachine Learning (ICML-15), pages 448\u2013456, 2015.\n\n[7] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,\n\nabs/1412.6980, 2014.\n\n[8] Vinod Nair and Geoffrey E. Hinton. Recti\ufb01ed linear units improve restricted boltzmann ma-\n\nchines. In ICML, pages 807\u2013814. Omnipress, 2010.\n\n[9] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng\nHuang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-\nFei. ImageNet Large Scale Visual Recognition Challenge, 2014.\n\n[10] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.\nImproved techniques for training gans. In Advances in Neural Information Processing Systems,\npages 2226\u20132234, 2016.\n\n[11] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization\nto accelerate training of deep neural networks. In Advances in Neural Information Processing\nSystems, pages 901\u2013901, 2016.\n\n[12] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A uni\ufb01ed embedding for\n\nface recognition and clustering. CoRR, abs/1503.03832, 2015.\n\n[13] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Re-\nthinking the inception architecture for computer vision. In Proceedings of the IEEE Conference\non Computer Vision and Pattern Recognition, pages 2818\u20132826, 2016.\n\n[14] T. Tieleman and G. Hinton. Lecture 6.5 - rmsprop. COURSERA: Neural Networks for Ma-\n\nchine Learning, 2012.\n\n9\n\n\f", "award": [], "sourceid": 1198, "authors": [{"given_name": "Sergey", "family_name": "Ioffe", "institution": "Google"}]}