{"title": "Exploring Generalization in Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5947, "page_last": 5956, "abstract": "With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.", "full_text": "Exploring Generalization in Deep Learning\n\nBehnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro\n\nToyota Technological Institute at Chicago\n\n{bneyshabur, srinadh, mcallester, nati}@ttic.edu\n\nAbstract\n\nWith a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.\n\n1 Introduction\n\nLearning with deep neural networks has enjoyed huge empirical success in recent years across a wide variety of tasks. Despite being a complex, non-convex optimization problem, simple methods such as stochastic gradient descent (SGD) are able to recover good solutions that minimize the training error. More surprisingly, the networks learned this way exhibit good generalization behavior, even when the number of parameters is significantly larger than the amount of training data [20, 30].\nIn such an over-parametrized setting, the objective has multiple global minima: all of them minimize the training error, but many of them do not generalize well. 
Hence, just minimizing the training error is not sufficient for learning: picking the wrong global minimum can lead to bad generalization behavior. In such situations, generalization behavior depends implicitly on the algorithm used to minimize the training error. Different algorithmic choices for optimization, such as the initialization, update rules, learning rate, and stopping condition, will lead to different global minima with different generalization behavior [7, 12, 18]. For example, Neyshabur et al. [18] introduced Path-SGD, an optimization algorithm that is invariant to rescaling of the weights, and showed better generalization behavior than SGD for both feedforward and recurrent neural networks [18, 22]. Keskar et al. [12] noticed that the solutions found by stochastic gradient descent with large batch sizes generalize worse than those found with smaller batch sizes, and Hardt et al. [10] discuss how stochastic gradient descent ensures uniform stability, thereby helping generalization for convex objectives.\nWhat is the bias introduced by these algorithmic choices for neural networks? What ensures generalization in neural networks? What is the relevant notion of complexity or capacity control?\nAs mentioned above, simply accounting for complexity in terms of the number of parameters, or any measure which is uniform across all functions representable by a given architecture, is not sufficient to explain the generalization ability of neural networks trained in practice. For linear models, norms and margin-based measures, and not the number of parameters, are commonly used for capacity control [5, 9, 25]. Similarly, norms such as the trace norm and max norm are considered sensible inductive biases in matrix factorization, and are often more appropriate than parameter-counting measures such as the rank [27, 28]. In a similar spirit, Bartlett [3], Neyshabur et al. [20] and, in parallel to this work, Bartlett et al. 
[2] suggested different norms of network parameters to measure the capacity of neural networks. In a different line of work, Keskar et al. [12] suggested "sharpness" (robustness of the training error to perturbations in the parameters) as a complexity measure for neural networks. Others, including Langford and Caruana [13] and more recently Dziugaite and Roy [8], propose a PAC-Bayes analysis.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nWhat makes a complexity measure appropriate for explaining generalization in deep learning? First, an appropriate complexity measure must be sufficient to ensure generalization. Second, networks learned in practice should be of low complexity under this measure. This can happen if our optimization algorithms bias us toward lower-complexity models under this measure and it is possible to capture real data using networks of low complexity. In particular, the complexity measure should help explain several recently observed empirical phenomena that are not explained by a uniform notion of complexity:\n\n• It is possible to obtain zero training error on random labels using the same architecture for which training with real labels leads to good generalization [30]. 
We would expect the networks learned using real labels (which generalize well) to have much lower complexity, under the suggested measure, than those learned using random labels (which obviously do not generalize well).\n• Increasing the number of hidden units, thereby increasing the number of parameters, can lead to a decrease in generalization error even when the training error does not decrease [20]. We would expect to see the complexity measure decrease as we increase the number of hidden units.\n• When training the same architecture, with the same training set, using two different optimization methods (or different algorithmic or parameter choices), one method results in better generalization even though both lead to zero training error [18, 12]. We would expect to see a correlation between the complexity measure and generalization ability among zero-training-error models.\n\nIn this paper we examine complexity measures that have recently been suggested, or could be considered, in explaining generalization in deep learning. We evaluate the measures based on their ability to theoretically guarantee generalization, and their empirical ability to explain the above phenomena. Studying how each measure can guarantee generalization also lets us better understand how it should be computed and compared in order to explain the empirical phenomena.\nWe investigate complexity measures including norms, robustness and sharpness of the network. We emphasize in our theoretical and empirical study the importance of relating the scale of the parameters and the scale of the output of the network, e.g. by relating norm and margin. In this light, we discuss how sharpness by itself is not sufficient for ensuring generalization, but can be combined, through PAC-Bayes analysis, with the norm of the weights to obtain an appropriate complexity measure. 
The role of sharpness in PAC-Bayesian analysis of neural networks was also recently noted by Dziugaite and Roy [8], who used numerical techniques to optimize the overall PAC-Bayes bound; here we emphasize the distinct role of sharpness as a balance for the norm.\n\nNotation\n\nLet fw(x) be the function computed by a d-layer feed-forward network with parameters w and Rectified Linear Unit (ReLU) activations, fw(x) = Wd φ(Wd−1 φ(· · · φ(W1 x))), where φ(z) = max{0, z}. Let hi be the number of nodes in layer i, with h0 = n. Therefore, for any layer i, we have Wi ∈ R^(hi×hi−1). Given any input x, the loss of the prediction by the function fw is then given by ℓ(w, x). We also denote by L(w) the expected loss and by L̂(w) the empirical loss over the training set. For any integer k, [k] denotes the set {1, 2, · · · , k}. Finally, ‖·‖F, ‖·‖2, ‖·‖1, ‖·‖∞ denote the Frobenius norm, the spectral norm, the element-wise ℓ1-norm and the element-wise ℓ∞-norm respectively.\n\n2 Generalization and Capacity Control in Deep Learning\n\nIn this section, we discuss complexity measures that have been suggested, or could be used, for capacity control in neural networks. We discuss advantages and weaknesses of each of these complexity measures and examine their abilities to explain the observed generalization phenomena in deep learning.\nWe consider the statistical capacity of a model class in terms of the number of examples required to ensure generalization, i.e. that the population (or test) error is close to the training error, even when minimizing the training error. 
This also roughly corresponds to the maximum number of examples on which one can obtain small training error even with random labels.\n\nGiven a model class H, such as all the functions representable by some feedforward or convolutional network, one can consider the capacity of the entire class H; this corresponds to learning with a uniform "prior" or notion of complexity over all models in the class. Alternatively, we can consider some complexity measure, which we take as a mapping that assigns a non-negative number to every hypothesis in the class, M : (H, S) → R+, where S is the training set. It is then sufficient to consider the capacity of the restricted class HM,α = {h : h ∈ H, M(h) ≤ α} for a given α ≥ 0. One can then ensure generalization of a learned hypothesis h in terms of the capacity of HM,M(h). Having a good hypothesis with low complexity, and being biased toward low complexity (in terms of M), can then be sufficient for learning, even if the capacity of the entire H is high. And if we are indeed relying on M for ensuring generalization (and in particular, biasing toward models with lower complexity under M), we would expect a learned h with lower value of M(h) to generalize better. For some of the measures discussed, we allow M to depend also on the training set. If this is done carefully, we can still ensure generalization for the restricted class HM,α.\nWe will consider several possible complexity measures. For each candidate measure, we first investigate whether it is sufficient for generalization, and analyze the capacity of HM,α. 
Understanding the capacity corresponding to different complexity measures also allows us to relate different measures and provides guidance as to what and how we should measure: from the above discussion, it is clear that any monotone transformation of a complexity measure leads to an equivalent notion of complexity. Furthermore, complexity is meaningful only in the context of a specific hypothesis class H, e.g. a specific architecture or network size. The capacity, as we consider it (in units of sample complexity), provides a yardstick by which to measure complexity (we should be clear, though, that we are vague regarding the scaling of the generalization error itself, and only consider the scaling in terms of complexity and model class; thus we obtain only a very crude yardstick sufficient for investigating trends and relative phenomena, not a quantitative yardstick).\n\n2.1 Network Size\n\nFor any model, if its parameters have finite precision, its capacity is linear in the total number of parameters. Even without making an assumption on the precision of the parameters, the VC dimension of feedforward networks can be bounded in terms of the number of parameters dim(w) [1, 3, 6, 23]. In particular, Bartlett [4] and Harvey et al. [11], following Bartlett et al. [6], give the following tight (up to logarithmic factors) bound on the VC dimension, and hence capacity, of feedforward networks with ReLU activations:\n\nVC-dim = Õ(d · dim(w))   (1)\n\nIn the over-parametrized setting, where the number of parameters exceeds the number of samples, complexity measures that depend on the total number of parameters are too weak and cannot explain the generalization behavior. Neural networks used in practice often have significantly more parameters than samples, and indeed can perfectly fit even random labels, obviously without generalizing [30]. Moreover, measuring complexity in terms of the number of parameters cannot explain the reduction in generalization error as the number of hidden units increases [20] (see also Figure 4).\n\n2.2 Norms and Margins\n\nCapacity of linear predictors can be controlled independently of the number of parameters, e.g. through regularization of the ℓ2 norm of the weights. Similar norm-based complexity measures have also been established for feedforward neural networks with ReLU activations. For example, capacity can be bounded based on the ℓ1 norm of the incoming weights of hidden units in each layer, and is proportional to ∏i=1..d ‖Wi‖²₁,∞, where ‖Wi‖₁,∞ is the maximum over hidden units in layer i of the ℓ1 norm of the incoming weights to the hidden unit [5]. More generally, Neyshabur et al. [19] considered group norms ℓp,q, corresponding to the ℓq norm over hidden units of the ℓp norm of the incoming weights to each hidden unit. This includes ℓ2,2, which is equivalent to the Frobenius norm, where the capacity of the network is proportional to ∏i=1..d ‖Wi‖²F. They further motivated a complexity measure that is invariant to node-wise rescaling reparametrization¹, suggesting ℓp path norms, which is the minimum over all node-wise rescalings of ∏i=1..d ‖Wi‖p,∞ and is equal to the ℓp norm of a vector whose coordinates are each the product of the weights along a path from an input node to an output node in the network.\n\n¹Node-rescaling can be defined as a sequence of reparametrizations, each of which corresponds to multiplying the incoming weights and dividing the outgoing weights of a hidden unit by a positive scalar α. The resulting network computes the same function as the network before the reparametrization.\n\nWhile preparing this manuscript, we became aware of parallel work by Bartlett et al. 
[2] that proves generalization bounds with capacity proportional to ∏i=1..d ‖Wi‖²₂ · (∑j=1..d (‖Wj‖₁/‖Wj‖₂)^(2/3))³.\n\nCapacity control in terms of norms, when using a zero/one loss (i.e. counting errors), additionally requires us to account for the scaling of the output of the neural network, as the loss is insensitive to this scaling while the norm is meaningful only relative to it. For example, dividing all the weights by the same number scales down the output of the network but does not change the 0/1 loss; hence it is possible to get a network with arbitrarily small norm and the same 0/1 loss. Using a scale-sensitive loss, such as the cross-entropy loss, does address this issue (if the outputs are scaled down toward zero, the loss becomes trivially bad), and one can obtain generalization guarantees in terms of the norm and the cross-entropy loss.\nHowever, we should be careful when comparing the norms of different models learned by minimizing the cross-entropy loss, in particular when the training error goes to zero. When the training error goes to zero, in order to push the cross-entropy loss (or any other positive loss that diminishes at infinity) to zero, the outputs of the network must go to infinity, and thus the norm of the weights (under any norm) must also go to infinity. This means that minimizing the cross-entropy loss will drive the norm toward infinity. In practice, the search is terminated at some finite time, resulting in a large but finite norm. But the value of this norm is mostly an indication of how far the optimization was allowed to progress; using a stricter stopping criterion (or a higher allowed number of iterations) would yield a higher norm. 
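The scale insensitivity of the 0/1 loss discussed above is easy to check numerically. Below is a toy sketch of our own (not from the paper), assuming a small randomly initialized two-layer ReLU network: dividing every weight by c shrinks the output by c^d, leaving the argmax prediction (and hence the 0/1 loss) unchanged while every norm shrinks.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f_w(weights, x):
    """f_w(x) = W_d phi(... phi(W_1 x)), a ReLU feedforward network."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(1)
weights = [rng.standard_normal((6, 4)), rng.standard_normal((3, 6))]  # d = 2 layers
x = rng.standard_normal(4)

c = 10.0
scaled = [W / c for W in weights]               # divide every weight by c

out, out_scaled = f_w(weights, x), f_w(scaled, x)

# ReLU networks are positively homogeneous: the output shrinks by c**d ...
assert np.allclose(out_scaled, out / c**2)
# ... so the 0/1 prediction (argmax) is unchanged, while all norms shrink.
assert np.argmax(out) == np.argmax(out_scaled)
```

This is why a norm by itself, without relating it to the output scale (e.g. through the margin), cannot control capacity under the 0/1 loss.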
In particular, comparing the norms of models found using different optimization approaches is meaningless, as they would all go toward infinity.\nInstead, to meaningfully compare norms of the network, we should explicitly take into account the scaling of the outputs of the network. One way this can be done, when the training error is indeed zero, is to consider the "margin" of the predictions in addition to the norms of the parameters. We refer to the margin for a single data point x as the difference between the score of the correct label and the maximum score of the other labels, i.e.\n\nfw(x)[ytrue] − max_{y≠ytrue} fw(x)[y]   (2)\n\nIn order to measure scale over an entire training set, one simple approach is to consider the "hard margin", which is the minimum margin among all training points. However, this definition is very sensitive to extreme points as well as to the size of the training set. We consider instead a more robust notion that allows a small portion of data points to violate the margin. For a given training set and a small value ε > 0, we define the margin γmargin as the lowest value of γ such that ⌈εm⌉ data points have margin lower than γ, where m is the size of the training set. We found empirically that the qualitative and relative nature of our empirical results is almost unaffected by reasonable choices of ε (e.g. 
between 0.001 and 0.1).\nThe measures we investigate in this work and their corresponding capacity bounds are as follows²:\n\n• ℓ2 norm, with capacity proportional to (1/γ²margin) ∏i=1..d 4‖Wi‖²F [19].\n• ℓ1-path norm, with capacity proportional to (1/γ²margin) (∑_{j∈∏k=0..d [hk]} |∏i=1..d 2Wi[ji, ji−1]|)² [5, 19].\n• ℓ2-path norm, with capacity proportional to (1/γ²margin) ∑_{j∈∏k=0..d [hk]} ∏i=1..d 4 hi Wi²[ji, ji−1].\n• spectral norm, with capacity proportional to (1/γ²margin) ∏i=1..d hi ‖Wi‖²₂.\n\nHere ∏k=0..d [hk] denotes the Cartesian product over the sets [hk]. The above bounds indicate that capacity can be bounded in terms of either the ℓ2 norm or the ℓ1-path norm independently of the number of parameters.\n\n²We have dropped the term that only depends on the norm of the input. The bounds based on the ℓ2-path norm and the spectral norm can be derived directly from those based on the ℓ1-path norm and the ℓ2 norm respectively. Without further conditions on the weights, the exponential dependence on depth is tight, but the 4^d dependence might be loose [19]. As we discussed at the beginning of this subsection, in parallel work, Bartlett et al. [2] have improved the spectral bound.\n\nFigure 1 (panels: ℓ2 norm, ℓ1-path norm, ℓ2-path norm, spectral norm): Comparing different complexity measures on a VGG network trained on subsets of the CIFAR10 dataset with true (blue line) or random (red line) labels. 
We plot the norm divided by the margin to avoid scaling issues (see Section 2), where for each complexity measure we drop the terms that depend only on depth or the number of hidden units; e.g. for the ℓ2-path norm we plot γ⁻²margin ∑_{j∈∏k=0..d [hk]} ∏i=1..d Wi²[ji, ji−1]. We also set the margin over a training set S to be the 5th percentile of the margins of the data points in S, i.e. Prc5 {fw(xi)[yi] − max_{y≠yi} fw(xi)[y] | (xi, yi) ∈ S}. In all experiments, the training error of the learned network is zero. The plots indicate that these measures can explain generalization, as the complexity of the model learned with random labels is always higher than that of the model learned with true labels. Moreover, the gap between the complexity of models learned with true and random labels increases as we increase the size of the training set.\n\nThe ℓ2-path norm dependence on the number of hidden units in each layer is unavoidable. However, it is not clear if a bound that only depends on the product of spectral norms is possible.\nAs an initial empirical investigation of the appropriateness of the different complexity measures, we compared the complexity (under each of the above measures) of models trained on true versus random labels. We would expect to see two phenomena: first, the complexity of models trained on true labels should be substantially lower than that of those trained on random labels, corresponding to their better generalization ability. Second, when training on random labels, we expect capacity to increase almost linearly with the number of training examples, since every extra example requires new capacity in order to fit its random label. However, when training on true labels, we expect the model to capture the true functional dependence between input and output, and thus fitting more training examples should require only small increases in the capacity of the network. 
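The quantities plotted in Figure 1 can be sketched roughly as follows (our own simplified NumPy code; the names and shapes are assumptions, and, as in the figure, constant factors and terms depending only on depth or width are dropped):

```python
import numpy as np

def margin_percentile(scores, labels, pct=5):
    """gamma_margin proxy: pct-th percentile of f_w(x)[y_true] - max_{y != y_true} f_w(x)[y]."""
    margins = []
    for s, y in zip(scores, labels):
        margins.append(s[y] - np.delete(s, y).max())
    return np.percentile(margins, pct)

def l2_norm_measure(weights):
    """prod_i ||W_i||_F^2 (the l2-norm based measure, constants dropped)."""
    return np.prod([np.linalg.norm(W, "fro") ** 2 for W in weights])

def path_norm(weights, p):
    """l_p path norm: l_p norm of the vector of per-path products of weights,
    computed by a forward pass through the entrywise |W|**p matrices."""
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = (np.abs(W) ** p) @ v
    return v.sum() ** (1.0 / p)

def spectral_measure(weights):
    """prod_i ||W_i||_2^2 (product of squared spectral norms, width factors dropped)."""
    return np.prod([np.linalg.norm(W, 2) ** 2 for W in weights])
```

Each measure would then be divided by the squared margin, e.g. `l2_norm_measure(weights) / margin_percentile(scores, labels) ** 2`, mirroring the 1/γ²margin factor in the bounds.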
The results are reported in Figure 1. We indeed observe a gap between the complexity of models learned on real and random labels for all four norms, with the difference in the increase in capacity between true and random labels being most pronounced for the ℓ2 norm and the ℓ2-path norm.\n\nLipschitz Continuity and Robustness The measures/norms we discussed so far also control the Lipschitz constant of the network with respect to its input. Is the capacity control achieved through the bound on the Lipschitz constant? Is bounding the Lipschitz constant alone enough for generalization? In Appendix A, we show that the current bounds based on the Lipschitz constant have an exponential dependence on the input dimension, and therefore the capacity bounds discussed above are not merely a consequence of bounding the Lipschitz constant.\nIn Section 3 we present further empirical investigations of the appropriateness of these complexity measures to explain other phenomena.\n\n2.3 Sharpness\n\nThe notion of sharpness as a generalization measure was recently suggested by Keskar et al. [12] and corresponds to robustness to adversarial perturbations in parameter space:\n\nζα(w) = [max_{|νi|≤α(|wi|+1)} L̂(fw+ν) − L̂(fw)] / [1 + L̂(fw)] ≈ max_{|νi|≤α(|wi|+1)} L̂(fw+ν) − L̂(fw),   (3)\n\nwhere the training error L̂(fw) is generally very small in the case of neural networks in practice, so we can simply drop it from the denominator without a significant change in the sharpness value.\nAs we will explain below, sharpness defined this way does not capture the generalization behavior. To see this, we first examine whether sharpness can predict the generalization behavior for networks trained on true vs. random labels. In the left plot of Figure 2, we plot the sharpness for networks trained on true vs. random labels. While sharpness correctly predicts the generalization behavior for bigger networks, for networks of smaller size, those trained on random labels have less sharpness than the ones trained on true labels. Furthermore, sharpness defined above depends on the scale of w and can be artificially increased or decreased by changing the scale of the parameters. Therefore, sharpness alone is not sufficient to control the capacity of the network.\n\nFigure 2 (panels: true labels, random labels): Sharpness and PAC-Bayes measures on a VGG network trained on subsets of the CIFAR10 dataset with true or random labels. In the left panel, we plot max sharpness, calculated as suggested by Keskar et al. [12], where the perturbation for parameter wi has magnitude 5·10⁻⁴(|wi| + 1). The middle and right plots show the relationship between expected sharpness and KL divergence in the PAC-Bayes bound for true and random labels respectively. For the PAC-Bayes plots, each point corresponds to a choice of α, where the standard deviation of the perturbation for parameter wi is α(10|wi| + 1). The KL corresponding to each α is a weighted ℓ2 norm, where the weight for each parameter is the inverse of the standard deviation of the perturbation.\n\nInstead, we advocate viewing a related notion of expected sharpness in the context of the PAC-Bayesian framework. Viewed this way, it becomes clear that sharpness controls only one of two relevant terms, and must be balanced with some other measure such as a norm. Together, sharpness and 
Together, sharpness and\nnorm do provide capacity control and can explain many of the observed phenomena. This connection\nbetween sharpness and the PAC-Bayes framework was also recently noted by Dziugaite and Roy [8].\nThe PAC-Bayesian framework [16, 17] provides guarantees on the expected error of a randomized\npredictor (hypothesis), drawn form a distribution denoted Q and sometimes referred to as a \u201cposterior\u201d\n(although it need not be the Bayesian posterior), that depends on the training data. Let fw be any\npredictor (not necessarily a neural network) learned from training data. We consider a distribution\nQ over predictors with weights of the form w + \u03bd, where w is a single predictor learned from the\ntraining set, and \u03bd is a random variable. Then, given a \u201cprior\u201d distribution P over the hypothesis that\nis independent of the training data, with probability at least 1 \u2212 \u03b4 over the draw of the training data,\nthe expected error of fw+\u03bd can be bounded as follows [15]:\n\n(cid:115)(cid:0)KL (w + \u03bd(cid:107)P ) + ln 2m\n\n(cid:1)\n\n\u03b4\n\n(4)\n\nE\u03bd [L(fw+\u03bd )] \u2264 E\u03bd [(cid:98)L(fw+\u03bd )] + 4\n\nSubstituting E\u03bd [(cid:98)L(fw+\u03bd )] with(cid:98)L(fw) +\n\n(cid:16)E\u03bd [(cid:98)L(fw+\u03bd )] \u2212(cid:98)L(fw)\n\nm\n\n(cid:17)\n\nwe can see that the PAC-Bayes\nbound depends on two quantities - i) the expected sharpness and ii) the Kullback Leibler (KL)\ndivergence to the \u201cprior\u201d P . The bound is valid for any distribution measure P , any perturbation\ndistribution \u03bd and any method of choosing w dependent on the training set. A simple way to\ninstantiate the bound is to set P to be a zero mean, \u03c32 variance Gaussian distribution. Choosing the\nperturbation \u03bd to also be a zero mean spherical Gaussian with variance \u03c32 in every direction, yields\nthe following guarantee (w.p. 
1 − δ over the training set):\n\nEν∼N(0,σ)ⁿ[L(fw+ν)] ≤ L̂(fw) + (Eν∼N(0,σ)ⁿ[L̂(fw+ν)] − L̂(fw)) + 4√((1/m)(‖w‖²₂/(2σ²) + ln(2m/δ))),   (5)\n\nwhere the middle term is the expected sharpness and ‖w‖²₂/2σ² is the KL term. Another interesting approach is to set the variance of the perturbation of each parameter with respect to the magnitude of the parameter. For example, if σi = α|wi| + β, then the KL term in the above expression changes to ∑i wi²/2σi². The above generalization guarantees give a clear way to think about capacity control jointly in terms of both the expected sharpness and the norm, and, as discussed earlier, indicate that sharpness by itself cannot control the capacity without considering the scaling. In the above generalization bound, norms and sharpness interact in a direct way depending on σ,\n\nFigure 3: Experiments on global minima with poor generalization. For each experiment, a VGG network is trained on the union of a subset of CIFAR10 of size 10000 containing samples with true labels and another subset of CIFAR10 of varying size containing random labels. The learned networks are all global minima for the objective function on the subset with true labels. The left plot indicates the training and test errors as a function of the size of the set with random labels. The plot in the middle shows the change in different measures as a function of the size of the set with random labels. 
The plot on the right indicates the relationship between expected sharpness and KL in the PAC-Bayes bound for each of the experiments. Measures are calculated as explained in Figures 1 and 2.\n\nAs noted above, increasing the norm by decreasing σ causes a decrease in sharpness, and vice versa. It is therefore important to find the right balance between the norm and sharpness, by choosing σ appropriately, in order to get a reasonable bound on the capacity.\nIn our experiments we observe that looking at both these measures jointly indeed makes a better predictor of the generalization error. As discussed earlier, Dziugaite and Roy [8] numerically optimize the overall PAC-Bayes generalization bound over a family of multivariate Gaussian distributions (different choices of perturbations and priors). Since the precise way the sharpness and the KL divergence are combined is not tight, certainly not in (5), nor in the more refined bound used by Dziugaite and Roy [8], we prefer to shy away from numerically optimizing the balance between sharpness and the KL divergence. Instead, we propose using bi-criteria plots, in which sharpness and KL divergence are plotted against each other as we vary the perturbation variance. For example, in the center and right panels of Figure 2 we show such plots for networks trained on true and random labels respectively. We see that although sharpness by itself is not sufficient for explaining generalization in this setting (as we saw in the left panel), the bi-criteria plots are significantly lower for the true labels. Even more so, the change in the bi-criteria plot as we increase the number of samples is significantly larger with random labels, correctly capturing the required increase in capacity. For example, to reach a fixed value of expected sharpness such as ε = 0.05, networks trained with random labels require a higher norm compared to those trained with true labels. 
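Such a bi-criteria curve can be sketched with a Monte-Carlo estimate of the expected sharpness together with the weighted KL term. The code below is our own toy illustration: the quadratic `toy_loss` is a stand-in for the empirical loss of a real network, and the σi = α(10|wi| + 1) choice follows the Figure 2 caption.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_sharpness(loss_fn, w, sigma, n_samples=200):
    """Monte-Carlo estimate of E_nu[L_hat(w + nu)] - L_hat(w), with nu_i ~ N(0, sigma_i^2)."""
    base = loss_fn(w)
    perturbed = [loss_fn(w + sigma * rng.standard_normal(w.shape)) for _ in range(n_samples)]
    return float(np.mean(perturbed) - base)

def kl_term(w, sigma):
    """The weighted l2 norm sum_i w_i^2 / (2 sigma_i^2) appearing in the KL of (5)."""
    return float(np.sum(w ** 2 / (2 * sigma ** 2)))

w = rng.standard_normal(50)
toy_loss = lambda v: float(np.mean((v - w) ** 2))  # toy stand-in, minimized at w

# One (sharpness, KL) point per perturbation scale alpha: the bi-criteria curve.
curve = []
for alpha in [0.01, 0.05, 0.1, 0.5]:
    sigma = alpha * (10 * np.abs(w) + 1)
    curve.append((expected_sharpness(toy_loss, w, sigma), kl_term(w, sigma)))
```

Sweeping α trades the two terms off against each other: larger perturbations increase the expected sharpness but shrink the KL term, which is exactly the balance discussed above.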
This behavior is in agreement with our earlier discussion: sharpness is sensitive to the scaling of the parameters, and is not a capacity control measure on its own, as it can be artificially changed by scaling the network. Combined with the norm, however, sharpness does seem to provide a capacity measure.\n\n3 Empirical Investigation\n\nIn this section we investigate the ability of the discussed measures to explain the generalization phenomena discussed in the Introduction. We already saw in Figures 1 and 2 that these measures capture the difference in generalization behavior of models trained on true or random labels, including the increase in capacity as the sample size increases, and the difference in this increase between true and random labels.\n\nDifferent Global Minima Given different global minima of the training loss on the same training set and with the same model class, can these measures indicate which model is going to generalize better? In order to verify this property, we can calculate each measure on several different global minima and see if lower values of the measure imply lower generalization error. In order to find different global minima of the training loss, we design an experiment in which we force the optimization method to converge to different global minima with varying generalization abilities by forming a confusion set that includes samples with random labels. The optimization is done on the loss that includes examples from both the confusion set and the training set. 
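This construction of the training objective, a true-label training set plus a random-label confusion set, can be sketched as follows (our own toy stand-in; the random pool below replaces CIFAR10, and all shapes and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_confusion_experiment(x_pool, y_pool, train_size, confusion_size, num_classes=10):
    """Return the union of a fixed true-label training set and a disjoint
    confusion set whose labels are drawn uniformly at random."""
    idx = rng.permutation(len(x_pool))
    train_idx = idx[:train_size]
    conf_idx = idx[train_size:train_size + confusion_size]
    x = np.concatenate([x_pool[train_idx], x_pool[conf_idx]])
    y = np.concatenate([y_pool[train_idx],
                        rng.integers(0, num_classes, size=confusion_size)])
    return x, y, train_idx

x_pool = rng.standard_normal((200, 8))     # toy stand-in for CIFAR10 images
y_pool = rng.integers(0, 10, size=200)     # toy stand-in for true labels
x, y, train_idx = make_confusion_experiment(x_pool, y_pool, train_size=100, confusion_size=50)
```

Training to zero error on `(x, y)` then yields a global minimum of the true-label training loss whose generalization degrades as `confusion_size` grows, as in Figure 3.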
Since deep learning models have very high capacity, optimization over the union of the confusion set and the training set generally leads to a point with zero error over both sets, which is thus a global minimum for the training set.

Figure 4: The generalization of a two-layer perceptron trained on MNIST with varying number of hidden units. The left plot indicates the training and test errors. The test error decreases as the size increases. The middle plot shows measures for each of the trained networks. The plot on the right indicates the relationship between sharpness and KL in PAC-Bayes for each experiment. Measures are calculated as explained in Figures 1 and 2.

We randomly select a subset of the CIFAR10 dataset with 10000 data points as the training set, and our goal is to find networks that have zero error on this set but different generalization abilities on the test set. To do so, we train networks on the union of the training set of fixed size 10000 and confusion sets of varying sizes consisting of CIFAR10 samples with random labels, and we evaluate the learned model on an independent test set. The trained network achieves zero training error, but as shown in Figure 3, the test error of the model increases with the size of the confusion set. The middle panel of this figure suggests that the norm of the learned networks can indeed be predictive of their generalization behavior. However, we again observe that sharpness behaves poorly in these experiments.
The right panel of this figure also suggests that the PAC-Bayes measure, which combines sharpness and KL-divergence jointly, has better behavior: for a fixed expected sharpness, networks with higher generalization error have higher norms.

Increasing Network Size We also repeat the experiments conducted by Neyshabur et al. [20], where a fully connected feedforward network is trained on the MNIST dataset with varying numbers of hidden units, and we check the values of different complexity measures on each of the learned networks. The left panel in Figure 4 shows the training and test error for this experiment. While 32 hidden units are enough to fit the training data, we observe that networks with more hidden units generalize better. Since the optimization is done without any explicit regularization, the only possible explanation for this phenomenon is the implicit regularization by the optimization algorithm. Therefore, we expect a sensible complexity measure to decrease beyond 32 hidden units and behave similarly to the test error. Different measures are reported for the learned networks. The middle panel suggests that all margin/norm-based complexity measures decrease for larger networks up to 128 hidden units. For networks with more hidden units, the ℓ2 norm and ℓ1-path norm increase with the size of the network. The middle panel suggests that the ℓ2-path norm and spectral norm can provide some explanation for this phenomenon. However, as we discussed in Section 2, the actual complexity measures based on the ℓ2-path norm and spectral norm also depend on the number of hidden units, and taking this into account indicates that these measures cannot explain this phenomenon.
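The norm-based quantities compared in these panels can all be computed from the weight matrices alone. The sketch below (illustrative only, for a fully connected network with weight matrices listed input-to-output; one plausible reading of the measures, not the authors' exact normalization) computes the overall ℓ2 norm, the product of per-layer spectral norms, and the ℓ1- and ℓ2-path norms:

```python
import numpy as np

def l2_norm(weights):
    """Euclidean norm of all parameters taken together."""
    return float(np.sqrt(sum(np.sum(W ** 2) for W in weights)))

def spectral_complexity(weights):
    """Product of per-layer spectral norms (largest singular values)."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

def path_l1(weights):
    """Path-l1 norm: sum over input-output paths of the product of absolute
    weights along the path; for fully connected layers this is the entry
    sum of |W_L| @ ... @ |W_1|."""
    M = np.abs(weights[0])
    for W in weights[1:]:
        M = np.abs(W) @ M
    return float(M.sum())

def path_l2(weights):
    """Path-l2 norm: square root of the sum over paths of squared
    weight products, computed with elementwise-squared matrices."""
    M = weights[0] ** 2
    for W in weights[1:]:
        M = (W ** 2) @ M
    return float(np.sqrt(M.sum()))
```

Note that the path norms are computed layer by layer via matrix products of the (absolute or squared) weights, so no explicit enumeration of the exponentially many paths is needed.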
In Appendix A, we discuss another complexity measure that also depends on the spectral norm, through a Lipschitz continuity or robustness argument. Even though this bound is very loose (exponential in the input dimension), it is monotonic with respect to the spectral norm reported in the plots. The right panel shows that the joint PAC-Bayes measure decreases for larger networks up to size 128, but fails to explain this generalization behavior for larger networks. This suggests that the measures considered so far are not sufficient to explain all the generalization phenomena observed in neural networks.

4 Conclusion

Learning with deep neural networks displays good generalization behavior in practice, a phenomenon that remains largely unexplained. In this paper we discussed different candidate complexity measures that might explain generalization in neural networks. We outline a concrete methodology for investigating such measures, and report on experiments studying how well the measures explain different phenomena. While there is no clear choice yet, some combination of expected sharpness and norms does seem to capture much of the generalization behavior of neural networks. A major issue still left unresolved is how the choice of optimization algorithm biases such complexity to be low, and what the precise relationship is between optimization and implicit regularization.

References

[1] M. Anthony and P. L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

[2] P. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

[3] P. L. Bartlett.
The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

[4] P. L. Bartlett. The impact of the nonlinearity on the VC-dimension of a deep network. Preprint, 2017.

[5] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[6] P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159–2173, 1998.

[7] P. Chaudhari, A. Choromanska, S. Soatto, and Y. LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

[8] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

[9] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.

[10] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, 2016.

[11] N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.

[12] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[13] J. Langford and R. Caruana. (Not) bounding the true error. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pages 809–816. MIT Press, 2001.

[14] U. v. Luxburg and O. Bousquet.
Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669–695, 2004.

[15] D. McAllester. Simplified PAC-Bayesian margin bounds. Lecture Notes in Computer Science, pages 203–215, 2003.

[16] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234. ACM, 1998.

[17] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170. ACM, 1999.

[18] B. Neyshabur, R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.

[19] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.

[20] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In Proceedings of the International Conference on Learning Representations, workshop track, 2015.

[21] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Data-dependent path normalization in neural networks. In International Conference on Learning Representations, 2016.

[22] B. Neyshabur, Y. Wu, R. Salakhutdinov, and N. Srebro. Path-normalized optimization of recurrent neural networks with ReLU activations. In Advances in Neural Information Processing Systems, 2016.

[23] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[25] A. J. Smola, B. Schölkopf, and K.-R. Müller.
The connection between regularization operators and support vector kernels. Neural Networks, 11(4):637–649, 1998.

[26] J. Sokolic, R. Giryes, G. Sapiro, and M. R. Rodrigues. Generalization error of invariant classifiers. arXiv preprint arXiv:1610.04574, 2016.

[27] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545–560. Springer Berlin Heidelberg, 2005.

[28] N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2005.

[29] H. Xu and S. Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.

[30] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.