{"title": "Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence", "book": "Advances in Neural Information Processing Systems", "page_first": 10678, "page_last": 10688, "abstract": "Regularization is typically understood as improving generalization by altering the landscape of local extrema to which the model eventually converges. Deep neural networks (DNNs), however, challenge this view: We show that removing regularization after an initial transient period has little effect on generalization, even if the final loss landscape is the same as if there had been no regularization. In some cases, generalization even improves after interrupting regularization. Conversely, if regularization is applied only after the initial transient, it has no effect on the final solution, whose generalization gap is as bad as if regularization never happened. This suggests that what matters for training deep networks is not just whether or how, but when to regularize. The phenomena we observe are manifest in different datasets (CIFAR-10, CIFAR-100, SVHN, ImageNet), different architectures (ResNet-18, All-CNN), different regularization methods (weight decay, data augmentation, mixup), different learning rate schedules (exponential, piece-wise constant). They collectively suggest that there is a \"critical period'' for regularizing deep networks that is decisive of the final performance. 
More analysis should, therefore, focus on the transient rather than asymptotic behavior of learning.", "full_text": "Time Matters in Regularizing Deep Networks:\n\nWeight Decay and Data Augmentation Affect Early Learning\n\nDynamics, Matter Little Near Convergence\n\nAditya Golatkar, Alessandro Achille, Stefano Soatto\n\n{aditya29,achille,soatto}@cs.ucla.edu\n\nDepartment of Computer Science\n\nUniversity of California, Los Angeles\n\nAbstract\n\nRegularization is typically understood as improving generalization by altering\nthe landscape of local extrema to which the model eventually converges. Deep\nneural networks (DNNs), however, challenge this view: We show that removing\nregularization after an initial transient period has little effect on generalization,\neven if the \ufb01nal loss landscape is the same as if there had been no regularization.\nIn some cases, generalization even improves after interrupting regularization. Con-\nversely, if regularization is applied only after the initial transient, it has no effect\non the \ufb01nal solution, whose generalization gap is as bad as if regularization never\nhappened. This suggests that what matters for training deep networks is not just\nwhether or how, but when to regularize. The phenomena we observe are manifest\nin different datasets (CIFAR-10, CIFAR-100, SVHN, ImageNet), different architec-\ntures (ResNet-18, All-CNN), different regularization methods (weight decay, data\naugmentation, mixup), different learning rate schedules (exponential, piece-wise\nconstant). They collectively suggest that there is a \u201ccritical period\u201d for regularizing\ndeep networks that is decisive of the \ufb01nal performance. 
More analysis should,\ntherefore, focus on the transient rather than asymptotic behavior of learning.\n\n1\n\nIntroduction\n\nThere is no shortage of literature on what regularizers to use when training deep neural networks and\nhow they affect the loss landscape but, to the best of our knowledge, no work has addressed when\nto apply regularization. We test the hypothesis that applying regularization at different epochs of\ntraining can yield different outcomes. Our curiosity stems from recent observations suggesting that\nthe early epochs of training are decisive of the outcome of learning with a deep neural network [1].\nWe \ufb01nd that regularization via weight decay or data augmentation has the same effect on generalization\nwhen applied only during the initial epochs of training. Conversely, if regularization is applied only in\nthe latter phase of convergence, it has little effect on the \ufb01nal solution, whose generalization is as bad\nas if regularization never happened. This suggests that, contrary to classical models, the mechanism\nby which regularization affects generalization in deep networks is not by changing the landscape of\ncritical points at convergence, but by in\ufb02uencing the early transient of learning. This is unlike convex\noptimization (linear regression, support vector machines) where the transient is irrelevant.\nIn short, what matters for training deep networks is not just whether or how, but when to regularize.\nIn particular, the effect of temporary regularization on the \ufb01nal performance is maximal during an\ninitial \u201ccritical period.\u201d This mimics other phenomena affecting the learning process which, albeit\ntemporary, can permanently affect the \ufb01nal outcome if applied at the right time, as observed in a\nvariety of learning systems, from arti\ufb01cial deep neural networks to biological ones. 
We use the methodology of [1] to regress the most critical epochs for various architectures and datasets.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSpecifically, our findings are:\n\n(i) Applying weight decay or data augmentation beyond the initial transient of training does not improve generalization (Figure 1, Left). The transient is decisive of asymptotic performance.\n\n(ii) Applying regularization only during the final phases of convergence does not improve, and in some cases degrades, generalization. Hence, regularization in deep networks does not work by re-shaping the loss function at convergence (Figure 1, Center).\n\n(iii) Applying regularization only during a short sliding window shows that its effect is most pronounced during a critical period of a few epochs (Figure 1, Right). Hence, the analysis of regularization in Deep Learning should focus on the transient, rather than asymptotics.\n\nThe explanation for these phenomena is not as simple as the solution being stuck in some local minimum: When turning regularization on or off after the critical period, the value of the weights changes, so the solution moves in the loss landscape. However, test accuracy, hence generalization, does not change. Adding regularization after the critical period does change the loss function, and also changes the final solution, but not for the better. Thus, the role of regularization is not to bias the final solution towards critical points with better generalization. Instead, it is to bias the initial transient towards regions of the loss landscape that contain multiple equivalent solutions with good generalization properties.\nIn the next section we place our observations in the context of prior related work, then introduce some of the nomenclature and notation (Sect. 3) before describing our experiments in Sect. 4. We discuss the results in Sect. 
5.\n\n2 Related Work\n\nThere is a considerable volume of work addressing regularization in deep networks, too vast to review here. Most of the efforts are towards analyzing the geometry and topology of the loss landscape at convergence. Work relating the local curvature of the loss around the point of convergence to regularization (\u201cflat minima\u201d [14, 21, 8, 4]) has been especially influential [5, 24]. Other work addresses the topological characteristics of the point of convergence (minima vs. saddles [6]). [20, 16] discuss the effects of the learning rate and batch size on stochastic gradient descent (SGD) dynamics and generalization. At the other end of the spectrum, there is complementary work addressing the initialization of deep networks [10, 13]. There is limited work addressing the timing of regularization, other than for the scheduling of learning rates [32, 25].\nChanging the regularizer during training is common practice in many fields, and can be done in a variety of ways, either pre-scheduled \u2013 as in homotopy continuation methods [28] \u2013 or in a manner that depends on the state of learning \u2013 as in adaptive regularization [18]. For example, in variational stereo-view reconstruction, regularization of the reconstruction loss is typically varied during the optimization, starting with high regularization and, ideally, ending with no regularization. This is quite unlike the case of Deep Learning: Stereo is ill-posed, as the object of inference (the disparity field) is infinite-dimensional and not smooth due to occluding boundaries. So, ideally one would not want to impose regularization, except for wading through the myriad of local minima due to local self-similarity in images. Imposing regularization all along, however, causes over-smoothing, whereas the ground-truth disparity field is typically discontinuous. So, regularization is introduced initially and then removed to capture fine details. 
In other words, the ideal loss is not regularized, and regularization is introduced artificially to improve transient performance. In the case of machine learning, regularization is often interpreted as a prior on the solution. Thus, regularization is part of the problem formulation, rather than the mechanics of its solution.\nAlso related to our work, there have been attempts to interpret the mechanisms of action of certain regularization methods, such as weight decay [38, 35, 26, 15, 23, 3], data augmentation [36], and dropout [34]. It has been pointed out in [38] that the Gauss-Newton norm correlates with generalization, and with the Fisher Information Matrix [9, 2], a measure of the flatness of the minimum, to conclude that the Fisher Information at convergence correlates with generalization. However, no causal link has been proven. In fact, we suggest this correlation may be an epiphenomenon: Weight decay causes an increase in Fisher Information during the transient, which is responsible for generalization (Figure 5), whereas the asymptotic value of the Fisher norm (i.e., the sharpness of the minimum) is not causative. In particular, we show that increasing the Fisher Information can actually improve generalization.\n\nFigure 1: Critical periods for regularization in DNNs: (Left) Final test accuracy as a function of the epoch in which the regularizer is removed during training. Applying regularization beyond the initial transient of training (around 100 epochs) produces no appreciable increase in the test accuracy. In some cases, early removal of regularization, e.g., at epoch 75 for All-CNN, actually improves generalization. Despite the loss landscape at convergence being un-regularized, the network achieves accuracy comparable to a regularized one. (Center) Final test accuracy as a function of the onset of regularization. Applying regularization after the initial transient changes the convergence point (Fig. 
2, B), but does not improve generalization. Thus, regularization does not influence generalization by re-shaping the loss landscape near the eventual solution. Instead, regularization biases the solution towards regions with good generalization properties during the initial transient. Weight decay (blue) shows a more marked time dependency than data augmentation (orange). The dashed line (green) in (Left) and (Center) corresponds to the final accuracy when we regularize throughout the training. (Right) Sensitivity (change in the final accuracy relative to un-regularized training) as a function of the onset of a 50-epoch regularization window. The initial learning epochs are the most sensitive to weight decay, whereas data augmentation matters most during the intermediate training epochs. The shape of the sensitivity curve depends on the regularization scheme as well as the network architecture. For experiments with weight decay (or data augmentation), we apply data augmentation (or weight decay) throughout the training. The critical period for regularization occurs during the initial rapidly decreasing phase of the training loss (red dotted line), which in this case is from epoch 0 to 75. The error bars indicate thrice the standard deviation across 5 independent trials.\n\n3 Preliminaries and notation\n\nGiven an observed input x (e.g., an image) and a random variable y we are trying to infer (e.g., a discrete label), we denote with p_w(y|x) the output distribution of a deep network parameterized by weights w. For discrete y, we usually have p_w(y|x) = softmax(f_w(x)) for some parametric function f_w(x). Given a dataset D = {(x_i, y_i)}_{i=1}^N, the cross-entropy loss of the network p_w(y|x) on the dataset D is defined as L_D(w) := (1/N) Σ_{i=1}^N ℓ(y_i, f_w(x_i)) = E_{(x_i, y_i)∼D}[−log p_w(y_i|x_i)].\nWhen minimizing L_D(w) with stochastic gradient descent (SGD), we update the weights w with an estimate of the gradient computed from a small number of samples (mini-batch). 
That is, w_{t+1} ← w_t − η E_{i∈ξ_t}[∇ℓ(y_i, f_w(x_i))], where ξ_t ⊂ {1, . . . , N} is a random subset of indices of size |ξ_t| = B (the mini-batch size). In our implementation, weight decay (WD) is equivalent to imposing a penalty on the L2 norm of the weights, so that we minimize the regularized loss L = L_D(w) + (λ/2)||w||^2.\nData augmentation (DA) expands the training set by choosing a set of random transformations of the data, x′ = g(x) (e.g., random translations, rotations, reflections of the domain and affine transformations of the range of the images), sampled from a known distribution P_g, to yield D′(g) = {(g_j(x_i), y_i)}_{g_j∼P_g}.\nIn our experiments, we choose g to be random cropping and horizontal flipping (reflections) of the images; D are the CIFAR-10 and CIFAR-100 datasets [22], and the class of functions f_w are ResNet-18 [12] and All-CNN [33]. For all experiments, unless otherwise noted, we train with SGD\n\nFigure 2: Intermediate application or removal of regularization affects the final solution: (A-C) L2 norm of the weights as a function of the training epoch (corresponding to Figure 1 (Top)). The weights of the network move after application or removal of regularization, which can be seen by the change in their norm. The correlation between the norm of the weights and generalization properties is not as straightforward as lower norm implying better generalization. For instance, (C) applying weight decay only at the beginning (curve 0) reduces the norm only during the critical period, and yields a higher norm asymptotically than, for example, curve 25. Yet it has better generalization. This suggests that a lower norm mostly helps during the critical period. We plot the norm of the weights for 200 training epochs to confirm that the weights stabilize and would not improve further with additional training. 
(D) PCA-projection of the training paths obtained by removing weight decay at different times (see Appendix A.1). Removing WD before the end of the critical period (curves 25, 50) makes the network converge to different regions of the parameter space. Removing WD after the critical period (curves 75 to 200) still appreciably changes the final point (in particular, critical periods are not due to the optimization being stuck in a local minimum), but all points lie in a similar area, supporting the Critical Period interpretation of [1]. (E) Same plots, but for DA, which unlike WD does not have a sharp critical period: all training paths converge to a similar area.\n\nwith momentum 0.9 and an exponentially decaying learning rate with decay factor 0.97 per epoch, starting from learning rate η = 0.1 (see also Appendix A).\n\n4 Experiments\n\nTo test the hypothesis that regularization can have different effects when applied at different epochs of training, we perform three kinds of experiments. In the first, we apply regularization up to a certain point, and then switch off the regularizer. In the second, we initially forgo regularization, and switch it on only after a certain number of epochs. In the third, we apply regularization for a short window during the training process. We describe these three experiments in order, before discussing the effect of batch normalization, and analyzing changes in the loss landscape during training using local curvature (Fisher Information).\n\nRegularization interrupted. We train standard DNN architectures (ResNet-18/All-CNN on CIFAR-10) using weight decay (WD) during the first t0 epochs, then continue without WD. Similarly, we augment the dataset (DA) up to t0 epochs, past which we revert to the original training set. We train both architectures for 200 epochs. In all cases, the training loss converges to essentially zero for all values of t0. 
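The interrupted-regularization schedule can be sketched on a toy model. This is a minimal NumPy sketch: the quadratic objective, the dimensions, and the weight-decay coefficient λ = 5e-4 are illustrative assumptions, not the paper's ResNet/CIFAR setup (which uses SGD with momentum 0.9 and an exponentially decaying learning rate, mirrored below).

```python
import numpy as np

def train(t_interrupt, epochs=200, lr=0.1, lr_decay=0.97, momentum=0.9, lam=5e-4):
    """Heavy-ball gradient descent on a toy quadratic 'data-fit' loss.
    Weight decay (an L2 penalty with coefficient lam) is applied only for
    epochs < t_interrupt, mimicking the interrupted-regularization protocol."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=10)
    v = np.zeros_like(w)
    target = np.ones(10)  # minimizer of the un-regularized data-fit loss
    for epoch in range(epochs):
        wd = lam if epoch < t_interrupt else 0.0  # switch regularizer off at t_interrupt
        grad = (w - target) + wd * w              # data-fit gradient + L2 term
        v = momentum * v - lr * (lr_decay ** epoch) * grad
        w = w + v
    return w

w_full = train(t_interrupt=200)   # regularized throughout
w_cut  = train(t_interrupt=100)   # regularizer removed after the transient
w_none = train(t_interrupt=0)     # never regularized
```

In this convex toy problem the final iterate is essentially determined by the final loss landscape, so interrupting the penalty matters little asymptotically; the point of the experiments above is precisely that DNNs behave differently, with the transient rather than the asymptote carrying the effect.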
We then examine the final test accuracy as a function of t0 (Figure 1, Left). We observe that applying regularization beyond the initial transient (around 100 epochs) produces no measurable improvement in generalization (test accuracy). In Figure 3 (Left), we observe similar results for a different data distribution (CIFAR-100). Surprisingly, limiting regularization to the initial learning epochs yields final test accuracy that is as good as that achieved by regularizing to the end, even if the final loss landscapes, and hence the minima encountered at convergence, are different.\n\nFigure 3: (Top) Critical periods for regularization are independent of the data distribution: We repeat the same experiment as in Figure 1 on CIFAR-100. We observe that the results are consistent with Figure 1. The dashed line (green) in (Left) and (Right) denotes the final accuracy when regularization is applied throughout the training. The dashed line on top corresponds to ResNet-18, while the one below it corresponds to All-CNN. (Center) In the middle row (Left and Center), we show critical regularization periods for models trained on SVHN [30] and ImageNet [7]. Critical periods for regularization also exist for regularization methods apart from weight decay and data augmentation, for example, Mixup [39] (Center Right). In fact, we observe that applying Mixup only during the critical period (first 75-100 epochs) results in better generalization compared to applying it throughout the training. (Bottom) Critical regularization periods with a piecewise constant learning rate schedule: We repeat the experiment of Figure 1, but change the learning rate scheduling. Networks trained with a piecewise constant learning rate exhibit behavior that is qualitatively similar to that observed with the exponentially decaying learning rate. 
The same experiment with a constant learning rate is inconclusive, since the network does not converge (see Appendix, Figure 11).\n\nIt is tempting to ascribe the imperviousness to regularization in the latter epochs of training (Figure 1, Left) to the optimization being stuck in a local minimum. After all, the decreased learning rate, or the shape of the loss around the minimum, could prevent the solution from moving. However, Figure 2 (A, curves 75/100) shows that the norm of the weights changes significantly after switching off the regularizer: the optimization is not stuck. The point of convergence does change, just not in a way that improves test accuracy.\nThe fact that applying regularization only at the very beginning yields comparable results suggests that regularization matters not because it alters the shape of the loss function at convergence, reducing convergence to spurious minimizers, but rather because it \u201cdirects\u201d the initial phase of training towards regions with multiple extrema with similar generalization properties. Once the network enters such a region, removing regularization causes the solution to move to different extrema, with no appreciable change in test accuracy.\n\nRegularization delayed. In this experiment, we switch on regularization starting at some epoch t0, and continue training to convergence. We train the DNNs for 200 epochs, except when regularization is applied late (from epoch 150/175), in which case we allow the training to continue for an additional 50 epochs to ensure the network\u2019s convergence. 
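The three protocols (interrupted, delayed, sliding window) amount to making the weight-decay coefficient a function of the epoch. A minimal sketch; the value λ = 5e-4 and the function name are our own illustrative choices:

```python
def wd_coefficient(epoch, mode, t0, lam=5e-4, window=50):
    """Weight-decay coefficient at a given epoch under the three protocols:
    'interrupted': regularize for epochs < t0, then switch off;
    'delayed':     no regularization before t0, regularize from t0 onwards;
    'window':      regularize only inside [t0, t0 + window)."""
    if mode == "interrupted":
        return lam if epoch < t0 else 0.0
    if mode == "delayed":
        return lam if epoch >= t0 else 0.0
    if mode == "window":
        return lam if t0 <= epoch < t0 + window else 0.0
    raise ValueError(f"unknown mode: {mode}")
```

The same epoch-dependent switch applies verbatim to data augmentation, by toggling the augmented dataset instead of λ.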
Figure 1 (Center) displays the final accuracy as a function of the onset t0, which shows that there is a \u201ccritical period\u201d to perform regularization (around epoch 50), beyond which adding a regularizer yields no benefit.\n\nFigure 4: Critical periods for regularization are independent of Batch-Normalization: We repeat the same experiment as in Figure 1, but without Batch-Normalization. The results are largely compatible with previous experiments, suggesting that the effects are not caused by the interaction between batch normalization and regularization. (Left) Notice that, surprisingly, removal of weight decay right after the initial critical period actually improves generalization. (Center) Data augmentation in this setting shows a more marked dependency on timing. (Right) Unlike weight decay, which mainly affects the initial epochs, data augmentation is critical for the intermediate epochs.\n\nAbsence of regularization can be thought of as a form of learning deficit. The permanent effect of temporary deficits during the early phases of learning has been documented across different tasks and systems, both biological and artificial [1]. Critical periods thus appear to be fundamental phenomena, not just quirks of biology or the choice of the dataset, architecture, learning rate, or other hyperparameters in deep networks.\nIn Figure 1 (Top Center), we see that delaying WD by 50 epochs causes a 40% increase in test error, from 5% regularizing all along, to 7% with onset t0 = 50 epochs. This is despite the two optimization problems sharing the same loss landscape at convergence. 
This reinforces the intuition that WD does not improve generalization by modifying the loss function; otherwise, Figure 1 (Center) would show an increase in test accuracy after the onset of regularization.\nHere, too, we see that the optimization is not stuck in a local minimum: Figure 2 (B) shows the weights changing even after a late onset of regularization. Unlike the previous case, in the absence of regularization, the network enters prematurely into regions with multiple sub-optimal local extrema, seen in the flat part of the curve in Figure 1 (Center).\nNote that the magnitude of critical period effects depends on the kind of regularization. Figure 1 (Center) shows that WD exhibits more significant critical period behavior than DA. At convergence, data augmentation is more effective than weight decay. In Figure 3 (Center), we observe critical periods for DNNs trained on CIFAR-100, suggesting that they are independent of the data distribution.\n\nSliding Window Regularization. In an effort to regress which phase of learning is most impacted by regularization, we compute the sensitivity to a sliding window of 50 epochs during which WD or DA is applied (Figure 1 Right). The early epochs are the most sensitive, and regularizing for a short window of 50 epochs yields generalization that is almost as good as if we had regularized all along. This captures the critical period for regularization. Note that the shape of the sensitivity curve depends on the type of regularization: Data augmentation has essentially the same effect throughout training, whereas weight decay critically impacts only the initial epochs.\nSimilar to the previous experiments, we train the networks for 200 epochs, except when the window onsets late (epoch 125/150/175), in which case we train for 50 additional epochs after the regularization window terminates, to ensure that the network converges.\n\nReshaping the loss landscape. 
L2 regularization is classically understood as trading classification loss against the norm of the parameters (weights), which is a simple proxy for model complexity. The effects of such a tradeoff on generalization are established in classical models such as linear regression or support-vector machines. However, DNNs need not trade classification accuracy for the L2 norm of the weights, as evident from the fact that the training error can always reach zero regardless of the amount of L2 regularization. Current explanations [11] are based on asymptotic convergence properties, that is, on the effect of regularization on the loss landscape and the minima to which the optimization converges. In fact, for learning algorithms that reduce to a convex problem, this is the only possible effect. However, Figure 1 shows that for DNNs, the critical role of regularization is to change the dynamics of the initial transient, which biases the model towards regions with good\n\nFigure 5: Fisher Information and generalization: (Left) Trace of the Fisher Information Matrix (FIM) as a function of the training epoch. Weight decay increases the peak of the FIM during the transient, with negligible effect on the final value (see the left plot when regularization is terminated beyond 100 epochs). The FIM trace is proportional to the norm of the gradients of the cross-entropy loss. FIM trace plots for delayed application/sliding window can be found in the Appendix (Figure 7). (Center) & (Right): Peak vs. final Fisher Information correlate differently with test accuracy: Each point in the plot is a ResNet-18 trained on CIFAR-10 achieving 100% training accuracy. Surprisingly, the maximum value of the FIM trace correlates far better with generalization than its final value, which is instead related to the local curvature of the loss landscape (\u201cflat minima\u201d). 
The Pearson correlation coefficient for the peak FIM trace is 0.92 (p-value < 0.001) compared to 0.29 (p-value > 0.05) for the final FIM trace.\n\ngeneralization. This can be seen in Figure 1 (Left), where despite halting regularization after 100 epochs, thus letting the model converge in the un-regularized loss landscape, the network achieves around 5% test error. Also, in Figure 1 (Top Center), despite applying regularization after 50 epochs, thus converging in the regularized loss landscape, the DNN generalizes poorly (around 7% error). Thus, while there is reshaping of the loss landscape at convergence, this is not the mechanism by which deep networks achieve generalization. It is commonly believed that a smaller L2 norm of the weights at convergence implies better generalization [37, 31]. Our experiments show no such causation: Slight changes of the training algorithm can yield solutions with larger norm that generalize better (Figure 2, (C) & Figure 1, Top right: onset epoch 0 vs 25/50).\n\nEffect of Batch-Normalization. One would expect L2 regularization to be ineffective when used in conjunction with Batch-Normalization (BN) [19], since BN makes the network\u2019s output invariant to changes in the norm of its weights. However, it has been observed that, in practice, WD improves generalization even, or especially, when used with BN. Several authors [38, 15, 35] have observed that WD increases the effective learning rate η_eff,t = η_t/||w_t||_2^2, where η_t is the learning rate at epoch t and ||w_t||_2^2 is the squared norm of the weights at epoch t: by decreasing the weight norm, WD increases the effective gradient noise, which promotes generalization [29, 20, 17]. 
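The effective learning rate above is straightforward to track during training; a small sketch (the helper name is ours, the definition is the one given in the text):

```python
import numpy as np

def effective_lr(eta_t, w_t):
    """Effective learning rate of a scale-invariant (batch-normalized) layer:
    eta_eff = eta / ||w||_2^2."""
    w_t = np.asarray(w_t, dtype=float)
    return eta_t / float(w_t @ w_t)

# Shrinking the weight norm (as weight decay does) raises the effective learning rate:
lo = effective_lr(0.1, np.full(10, 2.0))  # ||w||^2 = 40
hi = effective_lr(0.1, np.ones(10))       # ||w||^2 = 10
```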
However, in the sliding window experiment for L2 regularization, we observe that networks with regularization applied around epoch 50, despite having a smaller weight norm (Figure 2 (C), compare onset epoch 50 to onset epoch 0) and thus a higher effective learning rate, generalize poorly (Figure 1 Top Right: onset epoch 50 has a mean test accuracy increase of 0.24% compared to 1.92% for onset epoch 0). We interpret onset epoch 0 as benefiting from a higher effective learning rate during the critical period, whereas for onset epoch 50 the increase arrived only after the critical period had passed. Thus, previous observations in the literature should be considered with more nuance: we contend that an increased effective learning rate induces generalization only insofar as it modifies the dynamics during the critical period, reinforcing the importance of studying when to regularize, in addition to how. In Figure 9 in the Appendix, we show that the initial effective learning rate correlates better with generalization (Pearson coefficient 0.96, p-value < 0.001) than the final effective learning rate (Pearson coefficient 0.85, p-value < 0.001).\nWe repeat the experiments in Figure 1 without Batch-Normalization (Figure 4). We observe a similar result, suggesting that the positive effect of weight decay during the transient cannot be due solely to the use of batch normalization and an increased effective learning rate.\n\nWeight decay, Fisher and flatness. Generalization for DNNs is often correlated with the flatness of the minima to which the network converges during training [14, 24, 21, 4], where solutions corresponding to flatter minima seem to generalize better. 
In order to understand if the effect of regularization is to increase the flatness at convergence, we use the Fisher Information Matrix (FIM), which is a positive semi-definite approximation of the Hessian of the loss function [27] and thus a measure of the curvature of the loss landscape. We recall that the Fisher Information Matrix is defined as:\n\nF := E_{x∼D′(x)} E_{y∼p_w(y|x)}[∇_w log p_w(y|x) ∇_w log p_w(y|x)^T].\n\nIn Figure 5 (Left) we plot the trace of the FIM as a function of the training epoch. Notice that, contrary to our expectations, weight decay increases the FIM norm, and hence the curvature at the convergence point, but this still leads to better generalization. Moreover, the effect of weight decay on the curvature is more marked during the transient (Figure 5). This suggests that the peak curvature reached during the transient, rather than its final value, may correlate with the effectiveness of regularization. To test this hypothesis, we consider the DNNs trained in Figure 1 (Top) and plot the relationship between the peak/final FIM value and test accuracy in Figure 5 (Center, Right): Indeed, while the peak value of the FIM strongly correlates with the final test performance (Pearson coefficient 0.92, p-value < 0.001), the final value of the FIM norm does not (Pearson 0.29, p-value > 0.05). We report plots of the Fisher norm for delayed/sliding window application of WD in the Appendix (Figure 7).\nThe FIM was also used to study critical periods for changes in the data distribution in [1], where, however, an anti-correlation between Fisher and generalization is observed. 
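The trace of F can be estimated by Monte Carlo, sampling labels from the model's own predictive distribution. Below is a minimal sketch for a linear softmax model; the model choice and function name are illustrative assumptions (the paper's plots are computed for ResNet-18, not this toy model).

```python
import numpy as np

def fim_trace(W, X, n_samples=100, seed=0):
    """Monte-Carlo estimate of tr(F) for p_w(y|x) = softmax(Wx):
    tr(F) = E_x E_{y ~ p_w(y|x)} ||grad_W log p_w(y|x)||^2,
    with y drawn from the model's predictions, not from the labels."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for x in X:
        z = W @ x
        p = np.exp(z - z.max())
        p /= p.sum()
        for _ in range(n_samples):
            y = rng.choice(len(p), p=p)
            e = np.zeros(len(p)); e[y] = 1.0
            # grad_W log p_w(y|x) = (e_y - p) x^T, so its squared Frobenius
            # norm factorizes as ||e_y - p||^2 * ||x||^2
            total += np.sum((e - p) ** 2) * np.sum(x ** 2)
    return total / (len(X) * n_samples)
```

As a sanity check, for W = 0 and k classes every sampled label contributes the same amount, and the estimate reduces exactly to (1 − 1/k)·||x||^2.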
Indeed, the relationship between the flatness of the convergence point and generalization established in the literature emerges as rather complex, and we may hypothesize a bias-variance-like trade-off between the two, where either too low or too high curvature can be detrimental.\n\nJacobian norm. [38] relates the effect of regularization to the norm of the Gauss-Newton (GN) matrix G = E[J_w^T J_w], where J_w is the Jacobian of f_w(x) w.r.t. w, which in turn relates to the norm of the network\u2019s input-output Jacobian. The Fisher Information Matrix is indeed related to the GN matrix (more precisely, it coincides with the generalized Gauss-Newton matrix G = E[J_w^T H J_w], where H is the Hessian of ℓ(y, f_w(x)) w.r.t. f_w(x)). However, while the GN norm remains approximately constant during training, we found the changes of the Fisher norm during training (and in particular its peak) to be informative of the critical period for regularization, allowing for a more detailed analysis.\n\n5 Discussion and Conclusions\n\nWe have tested the hypothesis that there exists a \u201ccritical period\u201d for regularization in training deep neural networks. Unlike classical machine learning, where regularization trades off the training error in the loss being minimized, DNNs are not subject to this trade-off: One can train a model with sufficient capacity to zero training error regardless of the norm constraint imposed on the weights. Yet, weight decay works, even in cases where it seems it should not, for instance when the network is invariant to the scale of the weights, e.g., in the presence of batch normalization. We believe the reason is that regularization affects the early epochs of training by biasing the solution towards regions that have good generalization properties. Once there, there are many local extrema to which the optimization can converge. 
Which one is unimportant: turning the regularizer on or off changes the loss function, and the optimizer moves accordingly, but the test error is unaffected, at least for the variety of architectures, training sets, and learning rates we tested.
We believe that there are universal phenomena at play, and that what we observe is not the byproduct of accidental choices of training set, architecture, and hyperparameters: one can see the absence of regularization as a learning deficit, and it has been known for decades that deficits that interfere with the early phases of learning, or critical periods, have irreversible effects, from humans to songbirds and, as recently shown by [1], deep neural networks. Critical periods depend on the type of deficit, the task, and the species or architecture. We have shown results for two datasets, two architectures, and two learning rate schedules.
While our exploration is by no means exhaustive, it supports the point that considerably more effort should be devoted to the analysis of the transient dynamics of Deep Learning. To date, most of the theoretical work in Deep Learning focuses on the asymptotics and the properties of the minimum at convergence.
Our hypothesis also stands when considering the interaction with other forms of generalized regularization, such as batch normalization, and explains why weight decay still works even though batch normalization makes the activations invariant to the norm of the weights, which challenges previous explanations of the mechanism of action of weight decay.
We note that there is no trade-off between regularization and loss in DNNs, and the effects of regularization cannot (solely) be to change the shape of the loss landscape (WD), or to change the variety of gradient noise (DA) preventing the network from converging to some local minimizers, since without regularization, in the end, everything still works.
The main effect of regularization ought to be on the transient dynamics before convergence.
At present, there is no viable theory of transient regularization. The empirical results we present should be a call to arms for theoreticians interested in understanding Deep Learning. A possible interpretation advanced by [1] is to view critical periods as the (irreversible) crossing of narrow bottlenecks in the loss landscape. Increasing the noise – either by increasing the effective learning rate (WD) or by adding variety to the samples (DA) – may help the network cross the right bottlenecks while avoiding those leading to irreversibly sub-optimal solutions. If this is the case, can better regularizers be designed for this task?

Acknowledgments
We would like to thank the anonymous reviewers for their feedback and suggestions. This work is supported by ARO W911NF-15-1-0564 and ONR N00014-19-1-2066.

References
[1] Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep networks. In International Conference on Learning Representations, 2019.
[2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[3] Siegfried Bos and E Chug. Using weight decay to optimize the generalization ability of a perceptron. In Proceedings of International Conference on Neural Networks (ICNN'96), volume 1, pages 241–246. IEEE, 1996.
[4] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
[5] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks.
In Artificial Intelligence and Statistics, pages 192–204, 2015.
[6] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[8] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1019–1028. JMLR.org, 2017.
[9] Ronald Aylmer Fisher. Theory of statistical estimation. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 22, pages 700–725. Cambridge University Press, 1925.
[10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. 2016.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2034–2042, 2016.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima.
Neural Computation, 9(1):1–42, 1997.
[15] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160–2170, 2018.
[16] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1731–1741. Curran Associates, Inc., 2017.
[17] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
[18] Byung-Woo Hong, Ja-Keoung Koo, Martin Burger, and Stefano Soatto. Adaptive regularization of some inverse problems in image analysis. arXiv preprint arXiv:1705.03350, 2017.
[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456. JMLR.org, 2015.
[20] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
[21] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[23] Anders Krogh and John A Hertz.
A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950–957, 1992.
[24] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.
[25] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[27] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
[28] Hossein Mobahi and John W Fisher III. A theoretical analysis of optimization by Gaussian continuation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[29] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
[30] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[31] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
[32] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
[33] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net.
arXiv preprint arXiv:1412.6806, 2014.
[34] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[35] Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
[36] Vladimir N Vapnik. The vicinal risk minimization principle and the SVMs. In The Nature of Statistical Learning Theory, pages 267–290. Springer, 2000.
[37] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
[38] Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. In International Conference on Learning Representations, 2019.
[39] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.