{"title": "Train longer, generalize better: closing the generalization gap in large batch training of neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1731, "page_last": 1741, "abstract": "Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the \"generalization gap\" phenomenon. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a \"random walk on a random landscape\" statistical model which is known to exhibit similar \"ultra-slow\" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the \"generalization gap\" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named \"Ghost Batch Normalization\" which enables significant decrease in the generalization gap without increasing the number of updates. To validate our findings we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. 
Finally, we reassess common practices and beliefs concerning training of deep models and suggest they may not be optimal to achieve good generalization.", "full_text": "Train longer, generalize better: closing the\n\ngeneralization gap in large batch training of neural\n\nnetworks\n\nElad Hoffer\u2217,\n\nItay Hubara\u2217,\n\nTechnion - Israel Institute of Technology, Haifa, Israel\n\n{elad.hoffer, itayhubara, daniel.soudry}@gmail.com\n\nDaniel Soudry\n\nAbstract\n\nBackground: Deep learning models are typically trained using stochastic gradient\ndescent or one of its variants. These methods update the weights using their\ngradient, estimated from a small fraction of the training data. It has been observed\nthat when using large batch sizes there is a persistent degradation in generalization\nperformance - known as the \"generalization gap\" phenomenon. Identifying the\norigin of this gap and closing it had remained an open problem.\nContributions: We examine the initial high learning rate training phase. We\n\ufb01nd that the weight distance from its initialization grows logarithmically with the\nnumber of weight updates. We therefore propose a \"random walk on a random\nlandscape\" statistical model which is known to exhibit similar \"ultra-slow\" diffusion\nbehavior. Following this hypothesis we conducted experiments to show empirically\nthat the \"generalization gap\" stems from the relatively small number of updates\nrather than the batch size, and can be completely eliminated by adapting the\ntraining regime used. We further investigate different techniques to train models\nin the large-batch regime and present a novel algorithm named \"Ghost Batch\nNormalization\" which enables signi\ufb01cant decrease in the generalization gap without\nincreasing the number of updates. To validate our \ufb01ndings we conduct several\nadditional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. 
Finally,\nwe reassess common practices and beliefs concerning training of deep models and\nsuggest they may not be optimal to achieve good generalization.\n\n1 Introduction\n\nFor quite a few years, deep neural networks (DNNs) have persistently enabled significant improvements\nin many application domains, such as object recognition from images (He et al., 2016); speech\nrecognition (Amodei et al., 2015); natural language processing (Luong et al., 2015) and computer\ngames control using reinforcement learning (Silver et al., 2016; Mnih et al., 2015).\nThe optimization method of choice for training highly complex and non-convex DNNs is typically\nstochastic gradient descent (SGD) or some variant of it. Since SGD, at best, finds a local minimum of\nthe non-convex objective function, substantial research efforts are invested to explain DNNs' groundbreaking\nresults. It has been argued that saddle-points can be avoided (Ge et al., 2015) and that\n\"bad\" local minima in the training error vanish exponentially (Dauphin et al., 2014; Choromanska\net al., 2015; Soudry & Hoffer, 2017). However, it is still unclear why these complex models tend to\ngeneralize well to unseen data despite being heavily over-parameterized (Zhang et al., 2017).\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\u2217Equal contribution\n\nFigure 1: Impact of batch size on classification error. (a) Training error. (b) Validation error.\n\nA specific aspect of generalization has recently attracted much interest. Keskar et al. (2017) focused\non a long observed phenomenon (LeCun et al., 1998a) \u2013 that when a large batch size is used while\ntraining DNNs, the trained models appear to generalize less well. This remained true even when the\nmodels were trained \"without any budget or limits, until the loss function ceased to improve\" (Keskar\net al., 2017). 
This decrease in performance has been named the \"generalization gap\".\nUnderstanding the origin of the generalization gap, and moreover, \ufb01nding ways to decrease it,\nmay have a signi\ufb01cant practical importance. Training with large batch size immediately increases\nparallelization, thus has the potential to decrease learning time. Many efforts have been made to\nparallelize SGD for Deep Learning (Dean et al., 2012; Das et al., 2016; Zhang et al., 2015), yet the\nspeed-ups and scale-out are still limited by the batch size.\nIn this study we suggest a \ufb01rst attempt to tackle this issue.\nFirst,\n\n\u2022 We propose that the initial learning phase can be described using a high-dimensional\n\"random walk on a random potential\" process, with an \"ultra-slow\" logarithmic increase in\nthe distance of the weights from their initialization, as we observe empirically.\n\nInspired by this hypothesis, we \ufb01nd that\n\n\u2022 By simply adjusting the learning rate and batch normalization the generalization gap can be\n\nsigni\ufb01cantly decreased (for example, from 5% to 1% \u2212 2%).\n\n\u2022 In contrast to common practices (Montavon et al., 2012) and theoretical recommendations\n(Hardt et al., 2016), generalization keeps improving for a long time at the initial high learning\nrate, even without any observable changes in training or validation errors. However, this\nimprovement seems to be related to the distance of the weights from their initialization.\n\n\u2022 There is no inherent \"generalization gap\": large-batch training can generalize as well as\n\nsmall batch training by adapting the number of iterations.\n\n2 Training with a large batch\n\nTraining method. A common practice of training deep neural networks is to follow an optimization\n\"regime\" in which the objective is minimized using gradient steps with a \ufb01xed learning rate and a\nmomentum term (Sutskever et al., 2013). 
The learning rate is annealed over time, usually with an\nexponential decrease every few epochs of training data. An alternative to this regime is to use an\nadaptive per-parameter learning method such as Adam (Kingma & Ba, 2014), Rmsprop (Dauphin\net al.) or Adagrad (Duchi et al., 2011). These methods are known to bene\ufb01t the convergence rate of\nSGD based optimization. Yet, many current studies still use simple variants of SGD (Ruder, 2016)\nfor all or part of the optimization process (Wu et al., 2016), due to the tendency of these methods to\nconverge to a lower test error and better generalization.\nThus, we focused on momentum SGD, with a \ufb01xed learning rate that decreases exponentially every\nfew epochs, similarly to the regime employed by He et al. (2016). The convergence of SGD is\nalso known to be affected by the batch size (Li et al., 2014), but in this work we will focus on\ngeneralization. Most of our results were conducted on the Resnet44 topology, introduced by He et al.\n(2016). We strengthen our \ufb01ndings with additional empirical results in section 6.\n\n2\n\n\fEmpirical observations of previous work. Previous work by Keskar et al. 
(2017) studied the\nperformance and properties of models which were trained with relatively large batches and reported\nthe following observations:\n\n\u2022 Training models with a large batch size increases the generalization error (see Figure 1).\n\u2022 This \"generalization gap\" seemed to remain even when the models were trained without\nlimits, until the loss function ceased to improve.\n\u2022 Low generalization was correlated with \"sharp\" minima2 (strong positive curvature), while\ngood generalization was correlated with \"flat\" minima (weak positive curvature).\n\u2022 Small-batch regimes were briefly noted to produce weights that are farther away from the\ninitial point, in comparison with the weights produced in a large-batch regime.\n\nTheir hypothesis was that a large estimation noise (originating from the use of a mini-batch rather than the full\nbatch) in small mini-batches encourages the weights to exit the basins of attraction of sharp\nminima, and move towards flatter minima which have better generalization. In the next section we provide\nan analysis that suggests a somewhat different explanation.\n\n3 Theoretical analysis\n\nNotation.\nIn this paper we examine Stochastic Gradient Descent (SGD) based training of a Deep\nNeural Network (DNN). The DNN is trained on a finite training set of N samples. We define w as\nthe vector of the neural network parameters, and $L_n(w)$ as the loss function on sample n. We find w by\nminimizing the training loss\n\n$$L(w) \\triangleq \\frac{1}{N} \\sum_{n=1}^{N} L_n(w) ,$$\n\nusing SGD. Minimizing $L(w)$ requires an estimate of the gradient of the negative loss,\n\n$$g \\triangleq \\frac{1}{N} \\sum_{n=1}^{N} g_n \\triangleq -\\frac{1}{N} \\sum_{n=1}^{N} \\nabla L_n(w) ,$$\n\nwhere g is the true gradient, and $g_n$ is the per-sample gradient. 
During training we increment the\nparameter vector w using only the mean gradient \u02c6g computed on some mini-batch B \u2013 a set of M\nrandomly selected sample indices.\n\n(cid:88)\n\nn\u2208B\n\n\u02c6g (cid:44) 1\nM\n\ngn .\n\nIn order to gain a better insight into the optimization process and the empirical results, we \ufb01rst\nexamine simple SGD training, in which the weights at update step t are incremented according to the\nmini-batch gradient \u2206wt = \u03b7\u02c6gt. With respect to the randomness of SGD,\n\nE\u02c6gt = g = \u2212\u2207L (wt) ,\n\nand the increments are uncorrelated between different mini-batches3. For physical intuition, one\ncan think of the weight vector wt as a particle performing a random walk on the loss (\u201cpotential\u201d)\nlandscape L (wt). Thus, for example, adding momentum term to the increment is similar to adding\ninertia to the particle.\n\nMotivation.\nIn complex systems (such as DNNs) where we do not know the exact shape of the\nloss, statistical physics models commonly assume a simpler description of the potential as a random\nprocess. For example, Dauphin et al. (2014) explained the observation that local minima tend to have\n\n2It was later pointed out (Dinh et al., 2017) that certain \"degenerate\" directions, in which the parameters can\nbe changed without affecting the loss, must be excluded from this explanation. 
For example, for any c > 0 and\nany neuron, we can multiply all input weights by c and divide the output weights by c: this does not affect the\nloss, but can generate arbitrarily strong positive curvature.\n\n3Either exactly (with replacement) or approximately (without replacement): see appendix section A.\n\nlow error using an analogy between $L(w)$, the DNN loss surface, and the high-dimensional Gaussian\nrandom field analyzed in Bray & Dean (2007), which has zero mean and auto-covariance\n\n$$\\mathbb{E}\\left( L(w_1) L(w_2) \\right) = f\\left( \\|w_1 - w_2\\|^2 \\right) \\tag{1}$$\n\nfor some function f, where the expectation now is over the randomness of the loss. This analogy\nled to the hypothesis that in DNNs, local minima with high loss are indeed exponentially\nvanishing, as in Bray & Dean (2007). Only recently have similar results started to be proved for\nrealistic neural network models (Soudry & Hoffer, 2017). Thus, a similar statistical model of the loss\nmight also give useful insights for our empirical observations.\n\nModel: Random walk on a random potential. Fortunately, the high dimensional case of a particle\ndoing a \u201crandom walk on a random potential\u201d was extensively investigated already decades ago\n(Bouchaud & Georges, 1990). The main result of that investigation was that the asymptotic behavior\nof the auto-covariance of a random potential4,\n\n$$\\mathbb{E}\\left( L(w_1) L(w_2) \\right) \\sim \\|w_1 - w_2\\|^{\\alpha} , \\quad \\alpha > 0 \\tag{2}$$\n\nin a certain range, determines the asymptotic behavior of the random walker in that range:\n\n$$\\mathbb{E} \\|w_t - w_0\\|^2 \\sim (\\log t)^{4/\\alpha} . \\tag{3}$$\n\nThis is called an \u201cultra-slow diffusion\u201d in which, typically $\\|w_t - w_0\\| \\sim (\\log t)^{2/\\alpha}$, in contrast\nto standard diffusion (on a flat potential), in which we have $\\|w_t - w_0\\| \\sim \\sqrt{t}$. 
The informal\nreason for this behavior (for any $\\alpha > 0$) is that for a particle to move a distance d, it has to pass\npotential barriers of height $\\sim d^{\\alpha/2}$, from eq. (2). Then, climbing (or going around) each barrier takes\na time that is exponentially long in the height of the barrier: $t \\sim \\exp(d^{\\alpha/2})$. Inverting this relation, we get\n$d \\sim (\\log(t))^{2/\\alpha}$, consistent with eq. (3). In the high-dimensional case, this type of behavior was first shown numerically and\nexplained heuristically by Marinari et al. (1983), then rigorously proven for the case of a discrete\nlattice by Durrett (1986), and explained in the continuous case by Bouchaud & Comtet (1987).\n\n3.1 Comparison with empirical results and implications\n\nTo examine this prediction of ultra slow diffusion and find the value of $\\alpha$, in Figure 2a, we examine\n$\\|w_t - w_0\\|$ during the initial training phase over the experiment shown in Figure 1. We found that\nthe weight distance from the initialization point increases logarithmically with the number of training\niterations (weight updates), which matches our model with $\\alpha = 2$:\n\n$$\\|w_t - w_0\\| \\sim \\log t . \\tag{4}$$\n\nInterestingly, the value of $\\alpha = 2$ matches the statistics of the loss estimated in appendix section B.\nMoreover, in Figure 2a, we find that a very similar logarithmic graph is observed for all batch sizes.\nYet, there are two main differences. First, each graph seems to have a somewhat different slope\n(i.e., it is multiplied by a different positive constant), which peaks at M = 128 and then decreases\nwith the mini-batch size. This indicates a somewhat different diffusion rate for different batch sizes.\nSecond, since we trained all models for a constant number of epochs, smaller batch sizes entail more\ntraining iterations in total. 
Thus, there is a significant difference in the number of iterations and the\ncorresponding weight distance reached at the end of the initial learning phase.\nThis leads to the following informal argument (which assumes flat minima are indeed important for\ngeneralization). During the initial training phase, to reach a minimum of \"width\" d the weight vector\n$w_t$ has to travel at least a distance d, and this takes a long time \u2013 about exp(d) iterations. Thus,\nto reach wide (\"flat\") minima we need to have the highest possible diffusion rates (which do not\nresult in numerical instability) and a large number of training iterations. In the next sections we will\nimplement these conclusions in practice.\n\n4 Matching weight increment statistics for different mini-batch sizes\n\nFirst, to correct the different diffusion rates observed for different batch sizes, we will aim to match\nthe statistics of the weight increments to those of a small batch size.\n\n4Note that this form is consistent with eq. (1), if $f(x) = x^{\\alpha/2}$.\n\nFigure 2: Euclidean distance of weight vector from initialization. (a) Before learning rate adjustment and GBN. (b) After learning rate adjustment and GBN.\n\nLearning rate. Recall that in this paper we investigate SGD, possibly with momentum, where the\nweight updates are proportional to the estimated gradient,\n\n$$\\Delta w \\propto \\eta \\hat{g} , \\tag{5}$$\n\nwhere $\\eta$ is the learning rate, and we ignore for now the effect of batch normalization.\nIn appendix section A, we show that the covariance matrix of the parameters update step $\\Delta w$ is\n\n$$\\operatorname{cov}(\\Delta w, \\Delta w) \\approx \\frac{\\eta^2}{M} \\left( \\frac{1}{N} \\sum_{n=1}^{N} g_n g_n^{\\top} \\right) \\tag{6}$$\n\nin the case of uniform sampling of the mini-batch indices (with or without replacement), when\n$M \\ll N$. 
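As a numerical illustration of eq. (6), the following sketch draws mini-batches from a toy set of scalar per-sample gradients and compares the empirical variance of the update with the predicted value. The toy gradients, the function names and all constants are assumptions made for this example only and do not come from the experiments; the gradients are centered so that the second moment in eq. (6) coincides with the variance.

```python
import random
import statistics


def make_toy_gradients(n, seed=0):
    # Hypothetical stand-in for the per-sample gradients g_n: n Gaussian scalars,
    # centered so that their mean (the full-batch gradient g) is zero.
    rng = random.Random(seed)
    grads = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(grads) / n
    return [g - mean for g in grads]


def predicted_variance(grads, batch_size, lr):
    # Scalar form of eq. (6): (lr^2 / M) * (1 / N) * sum_n g_n^2.
    second_moment = sum(g * g for g in grads) / len(grads)
    return lr ** 2 / batch_size * second_moment


def empirical_variance(grads, batch_size, lr, trials=20000, seed=0):
    # Variance of the update lr * g_hat over many mini-batches, each drawn
    # uniformly without replacement from the toy gradients.
    rng = random.Random(seed)
    updates = []
    for _ in range(trials):
        batch = rng.sample(grads, batch_size)
        updates.append(lr * sum(batch) / batch_size)
    return statistics.pvariance(updates)


if __name__ == '__main__':
    grads = make_toy_gradients(10000, seed=1)
    # Two settings with the same ratio lr^2 / M.
    for batch_size, lr in [(16, 0.1), (64, 0.2)]:
        print(batch_size, lr,
              predicted_variance(grads, batch_size, lr),
              empirical_variance(grads, batch_size, lr))
```

In this toy setting the two configurations share the same ratio of squared learning rate to batch size, so eq. (6) predicts the same update variance for both, and the sampled variances should land close to that value.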
Therefore, a simple way to make sure that the covariance matrix stays the same for all\nmini-batch sizes is to choose\n\n$$\\eta \\propto \\sqrt{M} , \\tag{7}$$\n\ni.e., we should increase the learning rate by the square root of the mini-batch size.\nWe note that Krizhevsky (2014) suggested a similar learning rate scaling in order to keep the variance\nin the gradient expectation constant, but chose to use a linear scaling heuristic as it reached better\nempirical results in his setting. Later on, Li (2017) suggested the same.\nNaturally, such an increase in the learning rate also increases the mean steps $\\mathbb{E}[\\Delta w]$. However,\nwe found that this effect is negligible since $\\mathbb{E}[\\Delta w]$ is typically orders of magnitude lower than the\nstandard deviation.\nFurthermore, we can match both the first and second order statistics by adding multiplicative noise to\nthe gradient estimate as follows:\n\n$$\\hat{g} = \\frac{1}{M} \\sum_{n \\in B} g_n z_n ,$$\n\nwhere $z_n \\sim \\mathcal{N}(1, \\sigma^2)$ are independent random Gaussian variables for which $\\sigma^2 \\propto M$. This can\nbe verified by using a similar calculation as in appendix section A. This method keeps the covariance\nconstant when we change the batch size, yet does not change the mean steps $\\mathbb{E}[\\Delta w]$.\nIn both cases, for the first few iterations, we had to clip or normalize the gradients to prevent\ndivergence. Since both methods yielded similar performance 5 (due to the negligible effect of the first\norder statistics), we preferred to use the simpler learning rate method.\nIt is important to note that other types of noise (e.g., dropout (Srivastava et al., 2014), dropconnect\n(Wan et al., 2013), label noise (Szegedy et al., 2016)) change the structure of the covariance matrix\nand not just its scale, thus the second order statistics of the small batch increment cannot be accurately\nmatched. 
Accordingly, we did not find that these types of noise helped to reduce the generalization\ngap for large batch sizes.\nLastly, note that in our discussion above (and the derivations provided in appendix section A) we\nassumed each per-sample gradient $g_n$ does not depend on the selected mini-batch. However, this\nignores the influence of batch normalization. We take this into consideration in the next subsection.\n\n5a simple comparison can be seen in appendix (figure 3)\n\nGhost Batch Normalization. Batch Normalization (BN) (Ioffe & Szegedy, 2015) is known to\naccelerate the training, increase the robustness of neural networks to different initialization schemes\nand improve generalization. Nonetheless, since BN uses the batch statistics it is bound to depend\non the chosen batch size. We study this dependency and observe that by acquiring the statistics\non small virtual (\"ghost\") batches instead of the real large batch we can reduce the generalization\nerror. In our experiments we found that it is important to use the full batch statistic as suggested\nby Ioffe & Szegedy (2015) for the inference phase. Full details are given in Algorithm 1. This\nmodification by itself reduces the generalization error substantially.\n\nAlgorithm 1: Ghost Batch Normalization (GBN), applied to activation x over a large batch $B_L$ with\nvirtual mini-batch $B_S$, where $|B_S| < |B_L|$.\nRequire: Values of x over a large batch: $B_L = \\{x_{1 \\dots m}\\}$; size of virtual batch $|B_S|$; parameters to\nbe learned: $\\gamma$, $\\beta$; momentum $\\eta$\nTraining Phase:\nScatter $B_L$ to $\\{X^1, X^2, \\dots, X^{|B_L|/|B_S|}\\} = \\{x_{1 \\dots |B_S|}, x_{|B_S|+1 \\dots 2|B_S|}, \\dots, x_{|B_L|-|B_S| \\dots m}\\}$\n$\\mu_B^l \\leftarrow \\frac{1}{|B_S|} \\sum_{i=1}^{|B_S|} X_i^l$ for $l = 1, 2, 3, \\dots$ {calculate ghost mini-batches means}\n$\\sigma_B^l \\leftarrow \\sqrt{\\frac{1}{|B_S|} \\sum_{i=1}^{|B_S|} (X_i^l - \\mu_B^l)^2 + \\epsilon}$ for $l = 1, 2, 3, \\dots$ {calculate ghost mini-batches std}\n$\\mu_{run} = (1 - \\eta)^{|B_S|} \\mu_{run} + \\sum_{i=1}^{|B_L|/|B_S|} (1 - \\eta)^i \\cdot \\eta \\cdot \\mu_B^l$\n$\\sigma_{run} = (1 - \\eta)^{|B_S|} \\sigma_{run} + \\sum_{i=1}^{|B_L|/|B_S|} (1 - \\eta)^i \\cdot \\eta \\cdot \\sigma_B^l$\nreturn $\\gamma \\frac{X^l - \\mu_B^l}{\\sigma_B^l} + \\beta$\nTest Phase:\nreturn $\\gamma \\frac{X - \\mu_{run}^l}{\\sigma_{run}} + \\beta$ {scale and shift}\n\nWe note that in a multi-device distributed setting, some of the benefits of \"Ghost BN\" may already\noccur, since batch-normalization is often performed on each device separately to avoid additional\ncommunication cost. Thus, each device computes the batch norm statistics using only its samples\n(i.e., part of the whole mini-batch). It is a known fact, yet unpublished, to the best of the authors'\nknowledge, that this form of batch norm update helps generalization and yields better results than\ncomputing the batch-norm statistics over the entire batch. Note that GBN enables flexibility in the\nsmall (virtual) batch size which is not provided by the commercial frameworks (e.g., TensorFlow,\nPyTorch) in which the batch statistics are calculated on the entire, per-device, batch. Moreover, in those\ncommercial frameworks, the running statistics are usually computed differently from \"Ghost BN\",\nby weighting each update part equally. In our experiments we found it to worsen the generalization\nperformance.\nImplementing both the learning rate and GBN adjustments seems to improve generalization performance,\nas we shall see in section 6. Additionally, as can be seen in Figure 2b, the slopes of the\nlogarithmic weight distance graphs seem to be better matched, indicating similar diffusion rates. We\nalso observe some constant shift, which we believe is related to the gradient clipping. 
Since this shift\nonly increased the weight distances, we assume it does not harm the performance.\n\n5 Adapting number of weight updates eliminates generalization gap\n\nAccording to our conclusions in section 3, the initial high-learning rate training phase enables the\nmodel to reach farther locations in the parameter space, which may be necessary to \ufb01nd wider local\nminima and better generalization. Examining \ufb01gure 2b, the next obvious step to match the graphs for\ndifferent batch sizes is to increase the number of training iterations in the initial high learning rate\nregime. And indeed we noticed that the distance between the current weight and the initialization\npoint can be a good measure to decide upon when to decrease the learning rate.\nNote that this is different from common practices. Usually, practitioners decrease the learning\nrate after validation error appears to reach a plateau. This practice is due to the long-held belief\nthat the optimization process should not be allowed to decrease the training error when validation\nerror \"\ufb02atlines\", for fear of over\ufb01tting (Girosi et al., 1995). However, we observed that substantial\nimprovement to the \ufb01nal accuracy can be obtained by continuing the optimization using the same\n\n6\n\n\f(a) Validation error\n\n(b) Validation error - zoomed\n\nFigure 3: Comparing generalization of large-batch regimes, adapted to match performance of small-\nbatch training.\n\nlearning rate even if the training error decreases while the validation plateaus. Subsequent learning\nrate drops resulted with a sharp validation error decrease, and better generalization for the \ufb01nal model.\nThese observations led us to believe that \"generalization gap\" phenomenon stems from the relatively\nsmall number of updates rather than the batch size. Speci\ufb01cally, using the insights from Figure 2 and\nour model, we adapted the training regime to better suit the usage of large mini-batch. 
We \"stretched\"\nthe time-frame of the optimization process, where each time period of e epochs in the original regime\nis transformed to $\\frac{|B_L|}{|B_S|} e$ epochs according to the mini-batch size used. This modification ensures\nthat the number of optimization steps taken is identical to the number performed in the small batch regime.\nAs can be seen in Figure 3, combining this modification with learning rate adjustment completely\neliminates the generalization gap observed earlier 6.\n\n6 Experiments\n\nExperimental setting. We experimented with a set of popular image classification tasks:\n\n\u2022 MNIST (LeCun et al., 1998b) - Consists of a training set of 60K and a test set of 10K\n28 \u00d7 28 gray-scale images representing digits ranging from 0 to 9.\n\u2022 CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) - Each consists of a training set of size 50K\nand a test set of size 10K. Instances are 32 \u00d7 32 color images representing 10 or 100 classes.\n\u2022 ImageNet classification task (Deng et al., 2009) - Consists of a training set of size 1.2M\nsamples and test set of size 50K. Each instance is labeled with one of 1000 categories.\n\nTo validate our findings, we used a representative choice of neural network models. We used the\nfully-connected model, F1, as well as shallow convolutional models C1 and C3 suggested by Keskar\net al. (2017). As a demonstration of more current architectures, we used the models: VGG (Simonyan,\n2014) and Resnet44 (He et al., 2016) for the CIFAR10 dataset, Wide-Resnet16-4 (Zagoruyko, 2016) for\nthe CIFAR100 dataset and Alexnet (Krizhevsky, 2014) for the ImageNet dataset.\nIn each of the experiments, we used the training regime suggested by the original work, together with\na momentum SGD optimizer. We use a batch of 4096 samples as \"large batch\" (LB) and a small\nbatch (SB) of either 128 (F1,C1,VGG,Resnet44,C3,Alexnet) or 256 (WResnet). 
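The square-root learning rate rule of section 4 and the stretching of the training regime from section 5 can be sketched for the batch sizes above as follows. This is a minimal illustration, not the released code: the function names, the base learning rate of 0.1 and the epoch milestones are hypothetical, and the update count ignores partial batches.

```python
import math


def sqrt_scaled_lr(small_lr, small_batch, large_batch):
    # Square-root rule: scale the small-batch learning rate by sqrt(|B_L| / |B_S|).
    return small_lr * math.sqrt(large_batch / small_batch)


def adapt_regime(epoch_milestones, small_batch, large_batch):
    # Stretch every epoch milestone of the small-batch regime by |B_L| / |B_S|.
    factor = large_batch / small_batch
    return [int(round(e * factor)) for e in epoch_milestones]


def num_updates(epochs, dataset_size, batch_size):
    # Number of weight updates, ignoring the partial batch at the end of an epoch.
    return epochs * dataset_size / batch_size


if __name__ == '__main__':
    small_batch, large_batch = 128, 4096
    milestones = [80, 120, 160]  # hypothetical epochs at which the learning rate drops
    stretched = adapt_regime(milestones, small_batch, large_batch)
    print(sqrt_scaled_lr(0.1, small_batch, large_batch))
    print(stretched)  # [2560, 3840, 5120]
    print(num_updates(milestones[-1], 50000, small_batch),
          num_updates(stretched[-1], 50000, large_batch))
```

With a batch 32 times larger, each milestone is pushed 32 times later in epochs, so both regimes perform the same number of weight updates before every learning rate drop.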
We compare the\noriginal training baseline for small and large batch, as well as the following methods7:\n\n\u2022 Learning rate tuning (LB+LR): Using a large batch, while adapting the learning rate to be\nlarger so that $\\eta_L = \\sqrt{\\frac{|B_L|}{|B_S|}} \\eta_S$ where $\\eta_S$ is the original learning rate used for small batch, $\\eta_L$\nis the adapted learning rate and $|B_L|$, $|B_S|$ are the large and small batch sizes, respectively.\n\u2022 Ghost batch norm (LB+LR+GBN): Additionally using the \"Ghost batch normalization\"\nmethod in our training procedure. The \"ghost batch size\" used is 128.\n\u2022 Regime adaptation: Using the tuned learning rate as well as ghost batch-norm, but with\nan adapted training regime. The training regime is modified to have the same number of\niterations for each batch size used - effectively multiplying the number of epochs by the\nrelative size of the large batch.\n\n6Additional graphs, including comparison to non-adapted regime, are available in appendix (figure 2).\n7Code is available at https://github.com/eladhoffer/bigBatch.\n\nResults. Following our experiments, we can establish an empirical basis for our claims. Observing\nthe final validation accuracy displayed in Table 1, we can see that in accordance with previous works\nthe move from a small-batch (SB) to a large-batch (LB) indeed incurs a substantial generalization gap.\nHowever, modifying the learning-rate used for large-batch (+LR) causes much of this gap to diminish,\nfollowed by an additional improvement from using the Ghost-BN method (+GBN). 
Finally, we can\nsee that the generalization gap completely disappears when the training regime is adapted (+RA),\nyielding validation accuracy that is as good as or better than the one obtained using a small batch.\nWe additionally display results obtained on the more challenging ImageNet dataset in Table 2 which\nshows similar impact for our methods.\n\nTable 1: Validation accuracy results, SB/LB represent small and large batch respectively. GBN stands\nfor Ghost-BN, and RA stands for regime adaptation\nNetwork | Dataset | SB | LB | +LR | +GBN | +RA\nF1 (Keskar et al., 2017) | MNIST | 98.27% | 97.05% | 97.55% | 97.60% | 98.53%\nC1 (Keskar et al., 2017) | Cifar10 | 87.80% | 83.95% | 86.15% | 86.4% | 88.20%\nResnet44 (He et al., 2016) | Cifar10 | 92.83% | 86.10% | 89.30% | 90.50% | 93.07%\nVGG (Simonyan, 2014) | Cifar10 | 92.30% | 84.1% | 88.6% | 91.50% | 93.03%\nC3 (Keskar et al., 2017) | Cifar100 | 61.25% | 51.50% | 57.38% | 57.5% | 63.20%\nWResnet16-4 (Zagoruyko, 2016) | Cifar100 | 73.70% | 68.15% | 69.05% | 71.20% | 73.57%\n\nTable 2: ImageNet top-1 results using Alexnet topology (Krizhevsky, 2014), notation as in Table 1.\nNetwork | LB size | Dataset | SB | LB8 | +LR8 | +GBN | +RA\nAlexnet | 4096 | ImageNet | 57.10% | 41.23% | 53.25% | 54.92% | 59.5%\nAlexnet | 8192 | ImageNet | 57.10% | 41.23% | 53.25% | 53.93% | 59.5%\n\n7 Discussion\n\nThere are two important issues regarding the use of large batch sizes. First, why do we get worse\ngeneralization with a larger batch, and how do we avoid this behaviour? Second, can we decrease the\ntraining wall clock time by using a larger batch (exploiting parallelization), while retaining the same\ngeneralization performance as in small batch?\nThis work tackles the first issue by investigating the random walk behaviour of SGD and the\nrelationship of its diffusion rate to the size of a batch. 
Based on this and empirical observations, we\npropose simple set of remedies to close down the generalization gap between the small and large\nbatch training strategies: (1) Use SGD with momentum, gradient clipping, and a decreasing learning\nrate schedule; (2) adapt the learning rate with batch size (we used a square root scaling); (3) compute\nbatch-norm statistics over several partitions (\"ghost batch-norm\"); and (4) use a suf\ufb01cient number of\nhigh learning rate training iterations.\nThus, the main point arising from our results is that, in contrast to previous conception, there is no\ninherent generalization problem with training using large mini batches. That is, model training using\nlarge mini-batches can generalize as well as models trained using small mini-batches. Though this\nanswers the \ufb01rst issues, the second issue remained open: can we speed up training by using large\nbatch sizes?\nNot long after our paper \ufb01rst appeared, this issue was also answered. Using a Resnet model on\nImagenet Goyal et al. (2017) showed that, indeed, signi\ufb01cant speedups in training could be achieved\nusing a large batch size. This further highlights the ideas brought in this work and their importance\nto future scale-up, especially since Goyal et al. (2017) used similar training practices to those we\n\n8 Due to memory limitation those experiments were conducted with batch of 2048.\n\n8\n\n\fdescribed above. The main difference between our works is the use of a linear scaling of the learning\nrate9, similarly to Krizhevsky (2014), and as suggested by Bottou (2010). However, we found that\nlinear scaling works less well on CIFAR10, and later work found that linear scaling rules work less\nwell for other architectures on ImageNet (You et al., 2017).\nWe also note that current \"rules of thumb\" regarding optimization regime and explicitly learning rate\nannealing schedule may be misguided. 
We showed that good generalization can result from a large\nnumber of gradient updates during which there is no apparent change in validation error while the training error\ncontinues to drop, in contrast to common practice. After our work appeared, Soudry et al. (2017)\nsuggested an explanation for this, and for the logarithmic increase in the weight distance observed in\nFigure 2. We show there that this behavior happens even in simple logistic regression problems with separable\ndata. In this case, we exactly solve the asymptotic dynamics and prove that w(t) = \u0175 log(t) + O(1),\nwhere \u0175 is the L2 maximum margin separator. Therefore, the margin (affecting generalization)\nimproves slowly (as O(1/log(t))), even while the training error is very low. Future work, based on\nthis, may focus on finding when and how the learning rate should be decreased while training.\n\nConclusion.\nIn this work we make a first attempt to tackle the \"generalization gap\" phenomenon.\nWe argue that the initial learning phase can be described using a high-dimensional \"random walk\non a random potential\" process, with an \"ultra-slow\" logarithmic increase in the distance of the\nweights from their initialization, as we observe empirically. Following this observation we suggest\nseveral techniques which enable training with large batches without suffering from performance\ndegradation. This implies that the problem is not related to the batch size but rather to the number of\nupdates. Moreover, we introduce a simple yet efficient algorithm, \"Ghost-BN\", which improves the\ngeneralization performance significantly while keeping the training time intact.\n\nAcknowledgments\n\nWe wish to thank Nir Ailon, Dar Gilboa, Kfir Levy and Igor Berman for their feedback on the initial\nmanuscript. 
The research was partially supported by the Taub Foundation, and the Intelligence\nAdvanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center\n(DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and\ndistribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.\nDisclaimer: The views and conclusions contained herein are those of the authors and should not\nbe interpreted as necessarily representing the official policies or endorsements, either expressed or\nimplied, of IARPA, DoI/IBC, or the U.S. Government.\n\nReferences\n\nAmodei, D., Anubhai, R., Battenberg, E., et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.\nBottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT\u20192010, pp. 177\u2013186. Springer, 2010.\nBouchaud, J. P. and Georges, A. Anomalous diffusion in disordered media: statistical mechanisms, models and physical applications. Physics Reports, 195:127\u2013293, 1990.\nBouchaud, J. P. and Comtet, A. Anomalous diffusion in random media of any dimensionality. J. Physique, 48:1445\u20131450, 1987.\nBray, A. J. and Dean, D. S. Statistics of critical points of Gaussian fields on large-dimensional spaces. Physical Review Letters, 98(15):1\u20135, 2007.\nChoromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The Loss Surfaces of Multilayer Networks. AISTATS15, 38, 2015.\nDas, D., Avancha, S., Mudigere, D., et al. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.\nDauphin, Y., de Vries, H., Chung, J., and Bengio, Y. Rmsprop and equilibrated adaptive learning rates for non-convex optimization. 
CoRR, abs/1502.04390, 2015.\nDauphin, Y., Pascanu, R., and Gulcehre, C. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, pp. 1\u20139, 2014.\nDean, J., Corrado, G., Monga, R., et al. Large scale distributed deep networks. In NIPS, pp. 1223\u20131231, 2012.\nDeng, J., Dong, W., Socher, R., et al. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.\nDinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.\nDuchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121\u20132159, 2011.\nDurrett, R. Multidimensional random walks in random environments with subclassical limiting behavior. Communications in Mathematical Physics, 104(1):87\u2013102, 1986.\nGe, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points-online stochastic gradient for tensor decomposition. In COLT, pp. 797\u2013842, 2015.\nGirosi, F., Jones, M., and Poggio, T. Regularization theory and neural networks architectures. Neural computation, 7(2):219\u2013269, 1995.\nGoyal, P., Doll\u00e1r, P., Girshick, R., et al. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.\nHardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. ICML, pp. 1\u201324, 2016.\nHe, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770\u2013778, 2016.\nIoffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\nKeskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.\nKingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\nKrizhevsky, A. Learning multiple layers of features from tiny images. 2009.\nKrizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.\nLeCun, Y., Bottou, L., and Orr, G. Efficient backprop in neural networks: Tricks of the trade (Orr, G. and M\u00fcller, K., eds.). Lecture Notes in Computer Science, 1524, 1998a.\nLeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998b.\nLi, M. Scaling Distributed Machine Learning with System and Algorithm Co-design. PhD thesis, Intel, 2017.\nLi, M., Zhang, T., Chen, Y., and Smola, A. J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 661\u2013670. ACM, 2014.\nLuong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.\nMarinari, E., Parisi, G., Ruelle, D., and Windey, P. Random Walk in a Random Environment and 1/f Noise. Physical Review Letters, 50(1):1223\u20131225, 1983.\nMnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533, 2015.\nMontavon, G., Orr, G., and M\u00fcller, K.-R. Neural Networks: Tricks of the Trade. 2 edition, 2012. ISBN 978-3-642-35288-1.\nRuder, S. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.\nSilver, D., Huang, A., Maddison, C. J., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484\u2013489, 2016.\nSimonyan, K., et al. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\nSoudry, D., Hoffer, E., and Srebro, N. The Implicit Bias of Gradient Descent on Separable Data. ArXiv e-prints, October 2017.\nSoudry, D. and Hoffer, E. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.\nSrivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.\nSutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139\u20131147, 2013.\nSzegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818\u20132826, 2016.\nWan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using dropconnect. ICML\u201913, pp. III\u20131058\u2013III\u20131066. JMLR.org, 2013.\nWu, Y., Schuster, M., Chen, Z., et al. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.\nYou, Y., Gitman, I., and Ginsburg, B. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017.\nZagoruyko, K. Wide residual networks. In BMVC, 2016.\nZhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.\nZhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging sgd. In NIPS, pp. 685\u2013693, 2015.\n\n9 e.g., Goyal et al. (2017) also used an initial warm-phase for the learning rate, however, this has a similar\neffect to the gradient clipping we used, since this clipping was mostly active during the initial steps of training.\n", "award": [], "sourceid": 1098, "authors": [{"given_name": "Elad", "family_name": "Hoffer", "institution": "Technion"}, {"given_name": "Itay", "family_name": "Hubara", "institution": "Technion"}, {"given_name": "Daniel", "family_name": "Soudry", "institution": "Technion"}]}