{"title": "Asymmetric Valleys: Beyond Sharp and Flat Local Minima", "book": "Advances in Neural Information Processing Systems", "page_first": 2553, "page_last": 2564, "abstract": "Despite the non-convex nature of their loss functions, deep neural networks are known to generalize well when optimized with stochastic gradient descent (SGD). Recent work conjectures that SGD with proper con\ufb01guration is able to \ufb01nd wide and \ufb02at local minima, which are correlated with good generalization performance. In this paper, we observe that local minima of modern deep networks are more than being \ufb02at or sharp. Instead, at a local minimum there exist many asymmetric directions such that the loss increases abruptly along one side, and slowly along the opposite side \u2013 we formally de\ufb01ne such minima as asymmetric valleys. Under mild assumptions, we \ufb01rst prove that for asymmetric valleys, a solution biased towards the \ufb02at side generalizes better than the exact empirical minimizer. Then, we show that performing weight averaging along the SGD trajectory implicitly induces such biased solutions. This provides theoretical explanations for a series of intriguing phenomena observed in recent work [25, 5, 51]. 
Finally, extensive empirical experiments on both modern deep networks and simple two-layer networks are conducted to validate our assumptions and analyze the intriguing properties of asymmetric valleys.", "full_text": "Asymmetric Valleys: Beyond Sharp and Flat Local Minima\n\nHaowei He1\u2217\n\nhhw19@mails.tsinghua.edu.cn\n\nGao Huang2,3\n\ngaohuang@tsinghua.edu.cn\n\nYang Yuan1\n\nyuanyang@tsinghua.edu.cn\n\n1Institute for Interdisciplinary Information Sciences, Tsinghua University\n\n2Department of Automation, Tsinghua University\n\n3Beijing National Research Center for Information Science and Technology (BNRist)\n\nAbstract\n\nDespite the non-convex nature of their loss functions, deep neural networks are known to generalize well when optimized with stochastic gradient descent (SGD). Recent work conjectures that SGD with proper configuration is able to find wide and flat local minima, which are correlated with good generalization performance. In this paper, we observe that local minima of modern deep networks are more than being flat or sharp. Instead, at a local minimum there exist many asymmetric directions such that the loss increases abruptly along one side, and slowly along the opposite side \u2013 we formally define such minima as asymmetric valleys. Under mild assumptions, we first prove that for asymmetric valleys, a solution biased towards the flat side generalizes better than the exact empirical minimizer. Then, we show that performing weight averaging along the SGD trajectory implicitly induces such biased solutions. This provides theoretical explanations for a series of intriguing phenomena observed in recent work [25, 5, 51].
Finally, extensive empirical experiments on both modern deep networks and simple two-layer networks are conducted to validate our assumptions and analyze the intriguing properties of asymmetric valleys.\n\n1 Introduction\n\nThe loss landscape of neural networks has attracted great research interest in the deep learning community [9, 10, 32, 12, 15, 43, 36]. A deeper understanding of the loss landscape is important for designing better optimization algorithms, and helps to answer the question of when and how a deep network can achieve good generalization performance. One hypothesis that has drawn attention recently is that the local minima of neural networks can be characterized by their flatness, and it is conjectured that sharp minima tend to generalize worse than flat ones [32]. A plausible explanation is that a flat minimizer of the training loss can achieve lower generalization error if the test loss is shifted from the training loss due to random perturbations. Figure 1(a) gives an illustration of this argument. Although supported by plenty of empirical observations [32, 25, 34], the definition of flatness was recently challenged in [11], which shows that one can construct arbitrarily sharp minima through weight re-parameterization without affecting the generalization performance. Moreover, recent evidence suggests that the minima of modern deep networks are connected by simple paths with low generalization error [12, 13]. It is empirically found that the minima found by large batch training\n\n\u2217Code available at https://github.com/962086838/code-for-Asymmetric-Valley\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: (a) An illustration of sharp, flat and asymmetric minima. If there exists a shift from the empirical loss to the population loss, a flat minimum is more robust than a sharp minimum.
(b) For asymmetric valleys, if there exists a random shift, the solution \u02dcw biased towards the flat side is more robust than the minimizer \u02c6w\u2217. (c) SGD tends to stay longer on the flat side of asymmetric valleys, therefore SGD averaging automatically produces a bias towards the flat side.\n\nand small batch training are shown to be connected by a path without any \u201cbumps\u201d [43]. In other words, a \u201csharp minimum\u201d and a \u201cflat minimum\u201d may in fact belong to the same minimum in high dimensional space. Therefore, the notion of flat and sharp minima seems to be an oversimplification of the empirical loss surface.\n\nIn this paper, we expand the notion of flat and sharp minima by introducing the concept of asymmetric valleys. We observe that the loss surfaces of many neural networks are locally asymmetric. Specifically, there exist many directions such that the loss increases abruptly along one side, and grows rather slowly along the opposite side (see Figure 1(b) for an illustration). We formally define this kind of local minima as asymmetric valleys. As we will show in Section 6, asymmetric valleys generate interesting illusions in high dimensional space. For example, located in the same valley shown in Figure 1(b), \u02dcw may appear to be a wider and flatter minimum than \u02c6w, as the former is farther away from the sharp side.\n\nAsymmetric valleys also introduce novel insights into generalization. Folklore says that when the exact minimizer is flat, it tends to generalize better, as it is more stable with respect to loss surface perturbations [32]. Instead of following this argument, we show that in asymmetric valleys, the solution biased towards the flat side of the valley generalizes better than the exact minimizer, under mild assumptions.
This result has at least two interesting implications: (1) converging to a particular local minimum (if there are many) may not be critical for modern deep networks; however, it matters a lot where the solution locates within that minimum; and (2) the solution with the lowest a priori generalization error is not necessarily the minimizer of the training loss.\n\nGiven that a biased solution is preferred for asymmetric valleys, an immediate question is how we can find such solutions in practice. It turns out that simply averaging the weights along the SGD trajectory naturally leads to the desired solutions. We give a theoretical analysis to support this argument; see Figure 1(c) for an illustration. Our result nicely complements a series of recent empirical observations, which demonstrated that averaged SGD has better performance than plain SGD in various scenarios, including supervised, unsupervised and low-precision training [25, 5, 51].\n\nIn addition, we provide empirical analysis to verify our theoretical results and support our claims. For example, we show that asymmetric valleys are indeed prevalent in modern deep networks, and that solutions with lower generalization error have a bias towards the flat side of the valley.\n\n2 Related Work\n\nNeural network landscape. Neural network landscape analysis is an active and exciting area [16, 34, 15, 40, 49, 10, 43]. For example, [12, 13] observed that essentially all local minima are connected together by simple paths. In [22], a cyclic learning rate was used to explore multiple local optima along the training trajectory for model ensembling. There are also appealing visualizations of the neural network landscape [34].\n\nSharp and flat minima. The discussion of sharp and flat local minima dates back to [20], and has recently regained popularity. For example, Keskar et al.
[32] proposed that large batch SGD finds sharp minima, which leads to poor generalization. In [8], an entropy-regularized SGD was introduced to explicitly search for flat minima. It was later pointed out that large batch SGD can yield comparable performance when the learning rate or the number of training iterations is properly set [21, 17, 47, 35, 46, 26]. Moreover, [11] showed that from a given flat minimum, one can construct another minimum with arbitrarily sharp directions but equally good performance. In this paper, we argue that the description of sharp or flat minima is an oversimplification: there may simultaneously exist steep directions, flat directions, and asymmetric directions at the same minimum.\n\nSGD optimization and generalization. As the de facto optimization tool for deep networks, SGD and its variants are extensively studied in the literature. For example, it is shown that they can escape saddle points or sharp local minima under reasonable assumptions [14, 28\u201330, 50, 1\u20133, 33]. For convex functions [41] or strongly convex but non-smooth functions [42], SGD averaging is shown to give a better convergence rate. In addition, it can also achieve higher generalization performance for Lipschitz functions in theory [44, 7], and for deep networks in practice [22, 25, 5, 51]. Discussions on the generalization bounds of neural networks can be found in [6, 39, 37, 31, 38, 4, 52]. We show that SGD averaging has an implicit bias towards the flat sides of minima.
Previously, it was shown that SGD has other kinds of implicit bias as well [48, 27, 18].\n\n3 Asymmetric Valleys\n\nIn this section, we give a formal definition of asymmetric valleys, and empirically show that they are prevalent in the loss landscape of modern deep neural networks.\n\nPreliminaries. In supervised learning, we seek to optimize w\u2217 := arg min_{w \u2208 Rd} L(w), where L(w) := E_{x\u223cD}[f(x; w)] is the population loss, L: Rd \u2192 R, x \u2208 Rm is the input sampled from distribution D, w \u2208 Rd denotes the model parameter, and f: Rm \u00d7 Rd \u2192 R is the loss function. Since the data distribution D is usually unknown, instead of optimizing L directly, we often use SGD to find the empirical risk minimizer \u02c6w\u2217 for a set of random samples {xi}, i = 1, ..., n, from D (a.k.a. the training set): \u02c6w\u2217 := arg min_{w \u2208 Rd} \u02c6L(w), where \u02c6L(w) := (1/n) \u2211_{i=1}^n f(xi; w).\n\nIn practice, it is numerically infeasible to find or test the exact local minimizer \u02c6w\u2217. Fortunately, our theoretical results only depend on a good enough solution rather than an exact local minimum, as we will formally define in Section 4. For simplicity, we still refer to such solutions as \u201clocal minima\u201d, although our analysis generalizes to \u201csolutions found by SGD\u201d.\n\n3.1 Definition of asymmetric valley\n\nBefore formally introducing asymmetric valleys, we first define asymmetric directions.\n\nDefinition 1 (Asymmetric direction). Given constants p > 0, r > r0 > 0, c > 1, a direction u is (r, r0, p, c)-asymmetric with respect to a point w \u2208 Rd and loss function \u02c6L, if \u2207l \u02c6L(w + lu) < p and \u2207l \u02c6L(w \u2212 lu) > cp for l \u2208 (r0, r).\n\nIn the above definition, u \u2208 Rd is a unit vector representing a direction, such that the points on this direction passing through w \u2208 Rd can be written as w + lu for l \u2208 (\u2212\u221e, \u221e). Intuitively, the loss landscape in the interval (\u2212r, \u2212r0) is \u201csharp\u201d, while it is \u201cflat\u201d in the region (r0, r). Note that we purposely leave out the region (\u2212r0, r0) without making further assumptions on it, to comply with the fact that the second order derivatives of the loss function are usually continuous; it is impractical to assume that the slope of the loss function changes abruptly at the point l = 0.\n\nAs a concrete example, Figure 2 shows an asymmetric direction for a local minimum of ResNet-110 trained on the CIFAR-10 dataset. We verified that it is a (2.0, 0.6, 0.03, 15)-asymmetric direction, which means that in the region (\u22122.0, \u22120.6) \u222a (0.6, 2.0) the gradients are asymmetric with a relative ratio of c = 15.\n\nFigure 2: An asymmetric direction of a solution on the loss landscape of ResNet-110 trained on CIFAR-10.\n\nWith Definition 1, we now formally define the asymmetric valley2.\n\nDefinition 2 (Asymmetric valley). Given constants p > 0, r > r0 > 0, c > 1, a solution \u02c6w\u2217 of \u02c6L: Rd \u2192 R is an (r, r0, p, c)-asymmetric valley, if there exists at least one direction u such that u is (r, r0, p, c)-asymmetric with respect to \u02c6w\u2217 and \u02c6L.\n\n3.2 Asymmetric valleys in neural networks\n\nEmpirically, by taking random directions with values in (0, 1) in each dimension, we can find an asymmetric direction for a given solution w\u2217 with decent probability.
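Definition 1 lends itself to a direct numerical check on one-dimensional slices of the loss. The sketch below is our own illustration, not the authors' code: `loss_along` is a hypothetical callable returning the training loss at w + l*u for a fixed unit direction u, and the directional derivative is estimated with central differences.

```python
import numpy as np

def is_asymmetric(loss_along, r, r0, p, c, n_points=50, eps=1e-3):
    # Numerically test whether a direction is (r, r0, p, c)-asymmetric:
    # slope < p on the flat side (r0, r), slope < -c*p on the sharp side.
    grad = lambda l: (loss_along(l + eps) - loss_along(l - eps)) / (2 * eps)
    ls = np.linspace(r0, r, n_points)
    flat_ok = all(grad(l) < p for l in ls)
    sharp_ok = all(grad(-l) < -c * p for l in ls)
    return flat_ok and sharp_ok

# Toy 1D valley mimicking Figure 2: gentle slope 0.02 on the flat side,
# steep slope -0.6 on the sharp side.
toy = lambda l: 0.02 * l if l >= 0 else -0.6 * l
print(is_asymmetric(toy, r=2.0, r0=0.6, p=0.03, c=15))  # True
```

For a real network, `loss_along(l)` would evaluate the training loss at the parameters w + l*u for a sampled random unit direction u.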
We perform experiments with widely used deep networks, i.e., ResNet-56, ResNet-110, ResNet-164 [19], VGG-16 [45] and DenseNet-100 [23], on the CIFAR-10, CIFAR-100, SVHN and STL-10 image classification datasets. For each model on each dataset, we conduct 5 independent runs. The results show that we can always find asymmetric directions with certain specifications (r, r0, p, c) with c > 2, which means all the solutions that SGD found are located in asymmetric valleys. Asymmetric valleys widely exist in both simple and complex models; see Appendix A, Appendix E and Appendix F. For example, in Appendix A we show that asymmetric valleys exist in a simple two-layer network with only 2 parameters.\n\n4 Bias and Generalization\n\nAs we showed in the previous section, in the context of deep learning most local minima in practice are asymmetric, i.e., they might be sharp in one direction, but flat in the opposite direction. Therefore, it is interesting to investigate the generalization ability of a solution w in this scenario, which may lead to different results from those obtained under the common symmetric assumption. In this section, we prove that a biased solution on the flat side of an asymmetric valley yields lower generalization error than the exact empirical minimizer \u02c6w\u2217 in that valley.\n\n4.1 Theoretical analysis\n\nBefore presenting our theorem, we first introduce two mild assumptions. We will show that they empirically hold on modern deep networks in Section 4.2.\n\nThe first assumption (Assumption 1) states that there exists a shift between the empirical loss and the true population loss. This is a common assumption in previous works, e.g., [32], but was usually presented in an informal way. Here we define the \u201cshift\u201d formally.
Without loss of generality, we will compare the empirical loss \u02c6L with L' := L \u2212 min_w L(w) + min_w \u02c6L(w) to remove the \u201cvertical difference\u201d between \u02c6L and L. Notice that min_w L(w) and min_w \u02c6L(w) are constants and do not affect our generalization guarantee.\n\nDefinition 3 ((\u03b4, R)-shift gap). For \u03be \u2265 0, \u03b4 \u2208 Rd, and fixed functions L and \u02c6L, we define the (\u03b4, R)-shift gap between L and \u02c6L with respect to a point w as\n\n\u03be\u03b4(w) = max_{v \u2208 B(R)} |L'(w + v + \u03b4) \u2212 \u02c6L(w + v)|,\n\nwhere L'(w) := L(w) \u2212 min_w L(w) + min_w \u02c6L(w), and B(R) is the d-dimensional ball with radius R centered at 0.\n\nFrom the above definition, we know that the two functions match well after the shift \u03b4 if \u03be\u03b4(w) is very small. For example, \u03be\u03b4(w) = 0 means L is locally identical to \u02c6L after the shift \u03b4. Since \u02c6L is computed on a set of random samples from D, the actual shift \u03b4 between \u02c6L and L is a random variable, ideally with zero expectation3.\n\n2Here we abuse the name \u201cvalley\u201d, since \u02c6w\u2217 is essentially a point at the center of a valley.\n3It may not be zero, as we are talking about the shift between two loss functions, rather than the difference between empirical/population loss values.\n\nAssumption 1 (Random shift assumption).
For a given population loss L and a random empirical loss \u02c6L, constants R > 0, r \u2265 r0 > 0, \u03be \u2265 0, a vector \u00af\u03b4 \u2208 Rd with r \u2265 \u00af\u03b4i \u2265 r0 for all i \u2208 [d], and a minimizer \u02c6w\u2217, we assume that there exists a random variable \u03b4 \u2208 Rd correlated with \u02c6L such that Pr(\u03b4i = \u00af\u03b4i) = Pr(\u03b4i = \u2212\u00af\u03b4i) = 1/2 for all i \u2208 [d], and the (\u03b4, R)-shift gap between L and \u02c6L with respect to \u02c6w\u2217 is bounded by \u03be.\n\nClearly, \u03b4 has 2^d possible values for a given shift vector \u00af\u03b4, each with probability 2^{\u2212d}. Notice that Assumption 1 does not say that the difference between L and \u02c6L can only be one of the 2^d possible values of \u03b4. Instead, it says that after applying the shift \u03b4, the two functions have bounded L\u221e distance, which is a much milder assumption. It is also worth noting that our Definition 1 can mask out the central interval (\u2212r0, r0) because we have r \u2265 \u00af\u03b4i \u2265 r0 in Assumption 1. Therefore, r0 cannot be arbitrarily large, otherwise Assumption 1 does not hold. Our second assumption, stated below, can be seen as an extension of Definition 2.\n\nAssumption 2 (Locally asymmetric). For a given empirical loss \u02c6L and a minimizer \u02c6w\u2217, there exist orthogonal directions u1, ..., uk \u2208 Rd s.t. ui is (r, r0, pi, ci)-asymmetric with respect to \u02c6w\u2217 + v \u2212 \u27e8v, ui\u27e9ui for all v \u2208 B(R') and i \u2208 [k].\n\nAssumption 2 states that if ui is an asymmetric direction at \u02c6w\u2217, then the point \u02c6w\u2217 + v \u2212 \u27e8v, ui\u27e9ui, which deviates from \u02c6w\u2217 along a direction perpendicular to ui, is also asymmetric along the direction of ui.
In other words, the neighborhood around \u02c6w\u2217 is an asymmetric valley.\n\nUnder the above assumptions, we are ready to state our theorem, which says that the empirical minimizer is not necessarily the optimal solution, and a biased solution leads to better generalization. We defer the proof to Appendix B.\n\nTheorem 1 (Bias leads to better generalization). For any l \u2208 Rk, if Assumption 1 holds for R = \u2016l\u20162, Assumption 2 holds for R' = \u2016\u00af\u03b4\u20162 + \u2016l\u20162, and 4\u03be / ((ci \u2212 1)pi) < li \u2264 max{r \u2212 \u00af\u03b4i, \u00af\u03b4i \u2212 r0} for all i \u2208 [k], then we have\n\nE\u03b4 L(\u02c6w\u2217) \u2212 E\u03b4 L(\u02c6w\u2217 + \u2211_{i=1}^k li ui) \u2265 \u2211_{i=1}^k (ci \u2212 1) li pi / 2 \u2212 2k\u03be > 0.\n\nRemark on Theorem 1. It is widely known that the empirical minimizer is usually different from the true optimum. However, in practice it is difficult to know how the training loss shifts from the population loss. Therefore, the best we can do is to minimize the empirical loss function (with some regularizers). However, Theorem 1 states that in the asymmetric case, we should pick a biased solution even if the shift is unknown. This insight can be distilled into practical algorithms to achieve better generalization, as we will discuss in Section 5.\n\n4.2 Validating assumptions\n\nWe conducted a series of experiments with modern deep networks to show that the two assumptions introduced above are generally valid.\n\nFigure 3: Shift exists between empirical loss and population loss for ResNet-110 on CIFAR-10. (a) Shift between L' and \u02c6L. (b) \u03be\u03b4/\u03be0 vs. different shifts \u03b4.\n\nVerification of Assumption 1. We show that a shift between L and \u02c6L is quite common in practice, by taking a ResNet-110 trained on CIFAR-10 as an example.
Notice that we use the test loss to represent L in practice. Since we cannot visualize a shift in a high dimensional space, we randomly sample an asymmetric direction u (more results are shown in Appendix C) at the SGD solution \u02c6w\u2217. The blue and red curves shown in Figure 3(a) are obtained by calculating \u02c6L(\u02c6w\u2217 + lu) and L'(\u02c6w\u2217 + lu) for l \u2208 [\u22123, 3], which correspond to the training and test loss, respectively.\n\nWe then try different shift values \u03b4 to \u201cmatch\u201d the two curves. As shown in Figure 3(a), after applying a horizontal shift \u03b4 = 0.4 to the test loss, the two curves overlap almost perfectly. Quantitatively, we can use the shift gap defined in Definition 3 to evaluate how well the two curves match each other after shifting. It turns out that \u03be\u03b4=0.4 = 0.03, which is much lower than \u03be\u03b4=0 = 0.22 before shifting (\u03b4 has only one dimension here). In Figure 3(b), we plot \u03be\u03b4/\u03be0 as a function of \u03b4. Clearly, there exists a \u03b4 that minimizes this ratio, indicating a good match.\n\nWe conducted the same experiments for different directions, models and datasets, and similar observations were made. Please refer to Appendix C for more results.\n\nVerification of Assumption 2. This is a mild assumption that can be verified empirically. For example, we take an SGD solution of ResNet-110 on CIFAR-10 as \u02c6w\u2217, and specify an asymmetric direction u for \u02c6w\u2217. We then randomly sample 100 different local adjustments v \u2208 B(25). Based on these adjustments, we present the mean loss curves and the standard variance zone along the asymmetric direction u for all the points \u02c6w\u2217 + v \u2212 \u27e8v, u\u27e9u in Figure 4.
As we can see in Figure 4, the variance of these curves is very small, which means all of them are similar to each other. Moreover, we verified that u is (4, 2, 0.1, 5.22)-asymmetric with respect to all neighboring points.\n\n5 Averaging Generates Good Bias\n\nIn the previous section, we showed that when the loss landscape of a local minimum is asymmetric, a solution biased towards the flat side of the valley has better generalization performance. One immediate question is how we can obtain such a solution via practical algorithms. Below we show that it can be achieved by simply taking the average of the SGD iterates during the course of training. We first analyze the one dimensional case in Section 5.1, and then extend the analysis to the high dimensional case in Section 5.2.\n\nNote that weight averaging is a classical algorithm in optimization [41], and has recently regained its popularity in the context of deep learning [25, 5, 51]. Our following analysis can be viewed as a theoretical justification of recent algorithms based on averaging SGD iterates.\n\n5.1 One dimensional case\n\nFor asymmetric functions, as long as the learning rate is not too small, SGD will oscillate between the flat side and the sharp side. Below we focus on one round of oscillation, and show that the average of the iterates in each round has a bias towards the flat side. Consequently, by aggregating all rounds of oscillation, averaging SGD iterates leads to a bias as well.\n\nFor each individual round i, we assume that it starts from the iteration when SGD goes from the sharp side to the flat side (denoted as wi_0), and ends exactly before the iteration at which SGD goes from the sharp side to the flat side again (denoted as wi_Ti). Here Ti denotes the number of iterations in the i-th round. The average iterate in the i-th round can be written as \u00afw := (1/Ti) \u2211_{j=0}^{Ti} wi_j. For notational simplicity, we will omit the superscript i on wi_j.\n\nThe following theorem shows that the expectation of the average has a bias towards the flat side. To get a formal lower bound on \u00afw, we consider the asymmetric case where r0 = 0, and also assume lower bounds for the gradients of the function. We defer the proof to Appendix D.\n\nTheorem 2 (SGD averaging generates a bias). Assume that a local minimizer w\u2217 = 0 is an (r, 0, a+, c)-asymmetric valley, where b\u2212 \u2264 \u2207L(w) \u2264 a\u2212 < 0 for w < 0, and 0 < b+ \u2264 \u2207L(w) \u2264 a+ for w \u2265 0. Assume \u2212a\u2212 = ca+ for a large constant c, and \u2212(b\u2212 \u2212 \u03bd)/b+ = c' < e^{c/3}/6. The SGD updating rule is wt+1 = wt \u2212 \u03b7(\u2207L(wt) + \u03c9t), where \u03c9t is the noise with |\u03c9t| < \u03bd, and assume \u03bd \u2264 a+. Then we have\n\nE[\u00afw] > c0 > 0,\n\nwhere c0 is a constant that only depends on \u03b7, a+, a\u2212, b+, b\u2212 and \u03bd.\n\nTheorem 2 can be intuitively explained by Figure 5. If we run SGD on such a one dimensional function, it will stay on the flat side for more iterations, as the magnitude of the gradient on this side is much smaller. Therefore, the average of the locations is biased towards the flat side.\n\nFigure 4: Training loss mean and variance for the neighborhood of \u02c6w\u2217 along the direction u. Figure 5: SGD iterates and their average on an asymmetric function.\n\n5.2 High dimensional case\n\nFor high dimensional functions, the analysis of averaging SGD iterates is more complicated than that given in the previous subsection. However, if we only care about the bias along a specific direction u, we can directly apply Theorem 2 with one additional assumption.
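The one-dimensional mechanism behind Theorem 2 is easy to reproduce in simulation. The snippet below is our own sketch with arbitrarily chosen constants, not the construction used in the proof: on a piecewise-linear asymmetric valley, SGD takes many small steps on the flat side and one large corrective step whenever it crosses to the sharp side, so the average iterate lands on the flat side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Asymmetric valley at w* = 0 with r0 = 0: gradient a+ = 0.05 on the flat
# side (w >= 0) and -c * a+ on the sharp side, with c = 15.
a_plus, c, eta = 0.05, 15.0, 0.2
grad = lambda w: a_plus if w >= 0 else -c * a_plus

w, iterates = 0.5, []
for _ in range(20000):
    omega = rng.uniform(-0.04, 0.04)   # bounded SGD noise, |omega| < nu <= a+
    w -= eta * (grad(w) + omega)
    iterates.append(w)

print(np.mean(iterates) > 0)  # True: the average is biased towards the flat side
```

The iterates form a sawtooth that repeatedly drifts down the flat side and is thrown back from the sharp side, so their time average sits strictly on the flat side, as Theorem 2 predicts.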
Specifically, if the projections of the loss function onto u along the SGD trajectory satisfy the assumptions of Theorem 2, i.e., they are asymmetric and the gradients on both sides have upper and lower bounds, then the claim of Theorem 2 directly applies. This is because only the gradient along the direction u affects the SGD trajectory projected onto u, so we can safely omit all other directions.\n\nWe find that this assumption holds empirically. For a given SGD solution, we fix a random asymmetric direction u \u2208 Rd, and sample the loss surface along direction u passing through the t-th epoch of the SGD trajectory (denoted as wt), i.e., we evaluate \u02c6L(wt + lu) for 0 \u2264 t \u2264 200 and l \u2208 [\u221215, 15]. As shown in Figure 6, after the first 40 epochs, the projected loss surfaces become relatively stable. Therefore, we can directly apply Theorem 2 to the direction u.\n\nAs we will see in Section 6.1, compared with SGD solutions, SGD averaging indeed creates a bias along different asymmetric directions, as predicted by our theory.\n\n6 Experimental Observations\n\nIn this section, we empirically show that asymmetric valleys create interesting illusions when visualizing a high dimensional loss landscape in a low dimensional space. In addition, as a refinement of judging generalization performance by the sharpness/flatness of a local minimum, we show that where the solution locates within a local minimum basin is important. We also find that batch normalization [24] seems to be a major cause of asymmetric valleys in deep networks, but the results are deferred to Appendix H due to space limits.\n\n6.1 Experiments with weight averaging\n\nFigure 6: Projection of the training loss surface onto an asymmetric direction u.\n\nRecently, Izmailov et al.
[25] proposed the stochastic weight averaging (SWA) algorithm, which explicitly takes the average of SGD iterates to achieve better generalization. Inspired by their observation that \u201cSWA leads to solutions corresponding to wider optima than SGD\u201d, we provide a more refined explanation in this subsection. That is, averaging weights leads to \u201cbiased\u201d solutions in an asymmetric valley, which correspond to better generalization.\n\nSpecifically, we run the SWA algorithm (with decreasing learning rate) with popular deep networks, including ResNet-56, ResNet-110, ResNet-164, VGG-16, and DenseNet-100, on various datasets including CIFAR-10, CIFAR-100, SVHN and STL-10, following the configurations in [25]. Then we run SGD with a small learning rate from the SWA solutions to find a solution located in the same basin (denoted as SGD).\n\nFigure 7: SWA solution and SGD solution interpolation (ResNet-164 on CIFAR-100). Figure 8: The average of SGD has a bias towards the flat side (ResNet-110 on CIFAR-100).\n\nTable 1: Training and test accuracy on CIFAR-100.\nNetwork | train | test\nResNet-110-SWA | 94.98% | 78.94%\nResNet-110-SGD | 97.52% | 78.29%\nResNet-164-SWA | 97.48% | 80.69%\nResNet-164-SGD | 99.12% | 76.56%\n\nIn Figure 7, we draw an interpolation between the solutions obtained by SWA and SGD4. One can observe that there is no \u201cbump\u201d between these two solutions, meaning they are located in the same basin. Clearly, the SWA solution is biased towards the flat side, which verifies our theoretical analysis in Section 5. Further, we notice that although the biased SWA solution has higher training loss than the solution found by SGD, it indeed yields lower test loss.
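The averaging step itself is simple. Below is a minimal sketch of SWA-style weight averaging as a running mean over checkpoints; this is our own illustration and omits the learning-rate schedule and batch normalization statistics update of the full SWA procedure in [25].

```python
import numpy as np

def running_average(checkpoints):
    # Running mean of weight vectors: equivalent to averaging all iterates
    # at the end, but without storing the whole trajectory.
    w_avg, n = None, 0
    for w in checkpoints:
        n += 1
        w_avg = w.copy() if w_avg is None else w_avg + (w - w_avg) / n
    return w_avg

# Iterates straddling an asymmetric minimum at 0: the average sits on the
# positive (flat) side even though some iterates cross to the sharp side.
traj = [np.array([0.9]), np.array([-0.1]), np.array([0.7]), np.array([-0.05])]
print(running_average(traj))  # mean of the four iterates
```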
This verifies our analysis in Section 4. Similar observations are made on other networks and other datasets, which we present in Appendix E.\n\nTo further support our claim, we list our results in Table 1, from which we can observe that SGD solutions always have higher training accuracy, but worse test accuracy, compared to SWA solutions. This supports our claim in Theorem 1, which states that a bias towards the flat side of an asymmetric valley can help improve generalization, although it yields higher training error.\n\nVerifying Theorem 2. We further verify that averaging SGD solutions creates a bias towards the flat side in expectation for many other asymmetric directions, not just for the specific direction we discussed above.\n\nWe take a ResNet-110 trained on CIFAR-100 as an example. Denote by uinter the unit vector pointing from the SGD solution to the SWA solution, and by urand another random unit direction; the direction uinter + urand is used to explore the asymmetric landscape.\n\nThe results are shown in Figure 8, from which we can observe that SWA has a bias towards the flat side compared with the SGD solution. We create 10 different random vectors for each network and each dataset, and similar observations can be made (see more examples in Appendix F).\n\nBatch size effect. In addition to the SWA algorithm, we also observe a similar trend when training with different batch sizes. The results are deferred to Appendix G.\n\n6.2 Illusions created by asymmetric valleys\n\nWe further point out that visualizing the \u201cwidth\u201d of a given solution w in a low-dimensional space may lead to illusive results.
For example, one visualization technique used in [25] is to show how the loss changes along many random directions v_i drawn from the d-dimensional Gaussian distribution. We take the large batch and small batch solutions from the previous subsection as an example. Figure 9 visualizes the "width" of the two solutions using the method described above. From the figure, one may draw the conclusion that small batch training leads to a wider minimum than large batch training. However, these two solutions are in fact from the same basin (see the discussion in Appendix G). In other words, the loss curvature near the two solutions looks different because they are located at different positions in the same asymmetric valley, not at different local minima. A similar observation holds for the SWA and SGD solutions; see Figure 10 (footnote 5).

4 Izmailov et al. [25] have done a similar experiment.

Figure 9: Random rays of the large batch and small batch solutions.

Figure 10: Random rays of the SGD and SWA solutions.

7 Conclusion

In this paper, we introduced the notion of an asymmetric valley to characterize the loss landscape of deep networks, expanding the current research that simply categorizes local minima by sharpness or flatness. This notion allowed us to analyze and understand the geometry of the loss landscape from a new perspective. For example, based on a formal definition of asymmetric valleys, we showed that a biased solution lying on the flat side of the valley generalizes better than the exact empirical minimizer. Further, we proved that averaging the weights obtained along the SGD trajectory naturally leads to such a biased solution.
We also conducted extensive experiments with state-of-the-art deep models to analyze the properties of asymmetric valleys. We showed that, due to the existence of asymmetric valleys, intriguing illusions can arise when visualizing a high-dimensional loss surface in 1D. We hope this work will deepen our understanding of the loss landscape of deep neural networks, and inspire new theories and algorithms that further improve generalization.

Acknowledgment

This work has been supported in part by the Zhongguancun Haihua Institute for Frontier Information Technology. Gao Huang is supported in part by Beijing Academy of Artificial Intelligence (BAAI) under grant BAAI2019QN0106.

References

[1] Allen-Zhu, Z. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In NeurIPS, pp. 1165-1175, 2018.

[2] Allen-Zhu, Z. Natasha 2: Faster non-convex optimization than SGD. In NeurIPS, pp. 2680-2691, 2018.

[3] Allen-Zhu, Z. and Li, Y. NEON2: finding local minima via first-order oracles. In NeurIPS, pp. 3720-3730, 2018.

[4] Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 254-263. PMLR, 2018.

[5] Athiwaratkun, B., Finzi, M., Izmailov, P., and Wilson, A. G. There are many consistent explanations of unlabeled data: Why you should average. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rkgKBhA5Y7.

5 Similar observations were made by Izmailov et al.
[25] as well.

[6] Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In NIPS, pp. 6241-6250, 2017.

[7] Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. In NIPS, pp. 359-366. MIT Press, 2001.

[8] Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J. T., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR. OpenReview.net, 2017.

[9] Choromanska, A., LeCun, Y., and Arous, G. B. Open problem: The landscape of the loss surfaces of multilayer networks. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pp. 1756-1760. JMLR.org, 2015.

[10] Cooper, Y. The loss landscape of overparameterized neural networks. CoRR, abs/1804.10200, 2018.

[11] Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 1019-1028. PMLR, 2017.

[12] Dräxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. A. Essentially no barriers in neural network energy landscape. In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 1308-1317. PMLR, 2018.

[13] Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In NeurIPS, pp. 8803-8812, 2018.

[14] Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points - online stochastic gradient for tensor decomposition.
In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pp. 797-842. JMLR.org, 2015.

[15] Ge, R., Lee, J. D., and Ma, T. Learning one-hidden-layer neural networks with landscape design. In ICLR. OpenReview.net, 2018.

[16] Goodfellow, I. J. and Vinyals, O. Qualitatively characterizing neural network optimization problems. In ICLR, 2015.

[17] Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.

[18] Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 1827-1836. PMLR, 2018.

[19] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770-778. IEEE Computer Society, 2016.

[20] Hochreiter, S. and Schmidhuber, J. Simplifying neural nets by discovering flat minima. In NIPS, pp. 529-536. MIT Press, 1994.

[21] Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, pp. 1729-1739, 2017.

[22] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot ensembles: Train 1, get M for free. In ICLR. OpenReview.net, 2017.

[23] Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, pp. 2261-2269. IEEE Computer Society, 2017.

[24] Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pp. 448-456. JMLR.org, 2015.

[25] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G.
Averaging weights leads to wider optima and better generalization. In UAI, pp. 876-885. AUAI Press, 2018.

[26] Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. J. Three factors influencing minima in SGD. CoRR, abs/1711.04623, 2017.

[27] Ji, Z. and Telgarsky, M. Risk and parameter convergence of logistic regression. CoRR, abs/1803.07300, 2018.

[28] Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 1724-1732. PMLR, 2017.

[29] Jin, C., Liu, L. T., Ge, R., and Jordan, M. I. On the local minima of the empirical risk. In NeurIPS, pp. 4901-4910, 2018.

[30] Jin, C., Netrapalli, P., and Jordan, M. I. Accelerated gradient descent escapes saddle points faster than gradient descent. In COLT, volume 75 of Proceedings of Machine Learning Research, pp. 1042-1085. PMLR, 2018.

[31] Kawaguchi, K., Kaelbling, L. P., and Bengio, Y. Generalization in deep learning. CoRR, abs/1710.05468, 2017.

[32] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR. OpenReview.net, 2017.

[33] Kleinberg, R., Li, Y., and Yuan, Y. An alternative view: When does SGD escape local minima? In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 2703-2712. PMLR, 2018.

[34] Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In NeurIPS, pp. 6391-6401, 2018.

[35] Masters, D. and Luschi, C. Revisiting small batch training for deep neural networks. CoRR, abs/1804.07612, 2018.

[36] Mehta, D., Chen, T., Tang, T., and Hauenstein, J. D. The loss surface of deep linear networks viewed through the algebraic geometry lens.
CoRR, abs/1810.07716, 2018.

[37] Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. In NIPS, pp. 5949-5958, 2017.

[38] Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR. OpenReview.net, 2018.

[39] Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. Towards understanding the role of over-parametrization in generalization of neural networks. CoRR, abs/1805.12076, 2018.

[40] Pennington, J. and Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 2798-2806. PMLR, 2017.

[41] Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838-855, July 1992. ISSN 0363-0129.

[42] Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In ICML. icml.cc / Omnipress, 2012.

[43] Sagun, L., Evci, U., Güney, V. U., Dauphin, Y., and Bottou, L. Empirical analysis of the Hessian of over-parametrized neural networks. In ICLR (Workshop). OpenReview.net, 2018.

[44] Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In COLT, 2009.

[45] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[46] Smith, S. L. and Le, Q. V. A Bayesian perspective on generalization and stochastic gradient descent. In ICLR. OpenReview.net, 2018.

[47] Smith, S. L., Kindermans, P., Ying, C., and Le, Q. V. Don't decay the learning rate, increase the batch size. In ICLR. OpenReview.net, 2018.

[48] Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data.
Journal of Machine Learning Research, 19:70:1-70:57, 2018.

[49] Wu, L., Zhu, Z., and E, W. Towards understanding generalization of deep learning: Perspective of loss landscapes. CoRR, abs/1706.10239, 2017.

[50] Xu, Y., Rong, J., and Yang, T. First-order stochastic algorithms for escaping from saddle points in almost linear time. In NeurIPS, pp. 5535-5545, 2018.

[51] Yang, G., Zhang, T., Kirichenko, P., Bai, J., Wilson, A. G., and Sa, C. D. SWALP: Stochastic weight averaging in low precision training. In ICML, 2019.

[52] Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In International Conference on Learning Representations, 2019.