{"title": "The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6406, "page_last": 6416, "abstract": "Normalization methods play an important role in enhancing the performance of deep learning, while their theoretical understanding has been limited. To theoretically elucidate the effectiveness of normalization, we quantify the geometry of the parameter space determined by the Fisher information matrix (FIM), which also corresponds to the local shape of the loss landscape under certain conditions. We analyze deep neural networks with random initialization, which are known to suffer from a pathologically sharp landscape when the network becomes sufficiently wide. We reveal that batch normalization in the last layer contributes to drastically decreasing such pathological sharpness if the width and sample number satisfy a specific condition. In contrast, it is hard for batch normalization in the middle hidden layers to alleviate pathological sharpness in many settings. We also found that layer normalization cannot alleviate pathological sharpness either. Thus, we can conclude that batch normalization in the last layer significantly contributes to decreasing the sharpness induced by the FIM.", "full_text": "The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks

Ryo Karakida
AIST
Tokyo, Japan
karakida.ryo@aist.go.jp

Shotaro Akaho
AIST
Ibaraki, Japan
s.akaho@aist.go.jp

Shun-ichi Amari
RIKEN CBS
Saitama, Japan
amari@brain.riken.jp

Abstract

Normalization methods play an important role in enhancing the performance of deep learning, while their theoretical understanding has been limited. 
To theoretically elucidate the effectiveness of normalization, we quantify the geometry of the parameter space determined by the Fisher information matrix (FIM), which also corresponds to the local shape of the loss landscape under certain conditions. We analyze deep neural networks with random initialization, which are known to suffer from a pathologically sharp landscape when the network becomes sufficiently wide. We reveal that batch normalization in the last layer contributes to drastically decreasing such pathological sharpness if the width and sample number satisfy a specific condition. In contrast, it is hard for batch normalization in the middle hidden layers to alleviate pathological sharpness in many settings. We also found that layer normalization cannot alleviate pathological sharpness either. Thus, we can conclude that batch normalization in the last layer significantly contributes to decreasing the sharpness induced by the FIM.

1 Introduction

Deep neural networks (DNNs) have performed excellently in various practical applications [1], but there are still many heuristics and arbitrary choices in their settings and learning algorithms. To proceed further, it would be beneficial to theoretically elucidate how and under what conditions deep learning works well in practice.

Normalization methods are widely used to enhance the trainability and generalization ability of DNNs. In particular, batch normalization makes optimization faster with a large learning rate and achieves better generalization in experiments [2]. Recently, some studies have reported that batch normalization changes the shape of the loss function, which leads to better performance [3, 4]. Batch normalization alleviates a sharp change of the loss function and makes the loss landscape smoother [3], and it prevents an explosion of the loss function and its gradient [4]. 
The flatness of the loss landscape and its geometric characterization have been explored in various contexts, such as the improvement of generalization ability [5, 6], the advantage of skip connections [7], and robustness against adversarial attacks [8]. Thus, investigating normalization methods from the viewpoint of geometric characterization seems to be an important direction of research. Nevertheless, its theoretical elucidation has been limited to linear networks [4] and simplified models neglecting the hierarchical structure of DNNs [3, 9].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One promising approach to analyzing normalization methods is to consider DNNs with random weights and sufficiently wide hidden layers. While theoretical analysis of DNNs often becomes intractable because of hierarchical nonlinear transformations, wide DNNs with random weights can overcome such difficulties and have attracted much attention, especially within the last few years: mean field theory of DNNs [10-14], random matrix theory [15], and kernel methods [16, 17]. These approaches have succeeded in predicting hyperparameters with which learning algorithms work well, and they have even been used to define kernel functions for Gaussian processes. In addition, recent studies on the neural tangent kernel (NTK) have revealed that the Gaussian process with the NTK at random initialization determines even the performance of trained neural networks [18, 19]. Thus, the theory of wide DNNs is becoming a foundation for a comprehensive understanding of DNNs. Regarding geometric characterization, there have been studies on the Fisher information matrix (FIM) of wide DNNs [20, 21]. The FIM appears widely in the context of deep learning [6, 22] because it determines the Riemannian geometry of the parameter space and the local shape of the loss landscape around a certain global minimum. In particular, Karakida et al. 
[20] have reported that the eigenvalue spectrum of the FIM is strongly distorted in wide DNNs; that is, the largest eigenvalue takes a pathologically large value (Theorem 2.2). This causes pathological sharpness of the landscape, and such sharpness seems to be harmful from the perspective of optimization [23] and generalization [5].

In this study, we focus on the FIM of DNNs and uncover how normalization methods affect it. First, to clarify a condition for alleviating the pathologically large eigenvalues, we identify the eigenspace of the largest eigenvalues (Theorem 3.1). Then, we reveal that batch normalization in the last layer drastically decreases the size of the largest eigenvalues and successfully alleviates the pathological sharpness. This alleviation requires a certain condition on the width and sample size (Theorem 3.3), which is determined by a convergence rate of order parameters. In contrast, we find that batch normalization in the middle layers cannot alleviate pathological sharpness in many settings (Theorem 3.4), and neither can layer normalization (Theorem 4.1). Thus, we can conclude that batch normalization in the last layer has a vital role in decreasing pathological sharpness. Our experiments suggest that such alleviation of the sharpness is helpful in making gradient descent converge even with a larger learning rate. These results give novel quantitative insight into normalization methods, wide DNNs, and the geometric characterization of DNNs, and they are expected to be helpful in developing a further theory of deep learning.

2 Preliminaries

2.1 Model architecture

We investigate a fully-connected feedforward neural network with random weight and bias parameters. The network consists of one input layer with M_0 units, L - 1 hidden layers with M_l units per layer (l = 1, 2, ..., L - 1), and one output layer:

u_i^l = \sum_{j=1}^{M_{l-1}} W_{ij}^l h_j^{l-1} + b_i^l, \quad h_i^l = \phi(u_i^l),   (1)

where h_j^0 = x_j are the inputs. This includes a shallow network (L = 2) and deep ones (L >= 3). We set the last layer to a linear readout, i.e., h_i^L = u_i^L. The dimensionality of each variable is given by W^l \in R^{M_l \times M_{l-1}} and h^l, b^l \in R^{M_l}. Suppose that the activation function \phi(x) has a bounded weak derivative. A wide class of activation functions, including the sigmoid-like and (leaky-) rectified linear unit (ReLU) functions, satisfies this condition. Different layers may have different activation functions. Regarding the network width, we set M_l = \alpha_l M (l <= L-1) and consider the limiting case of large M with constant coefficients \alpha_l. The number of readout units is given by a constant M_L = C, which is independent of M, as is usual in practice. Suppose that the parameter set \theta = {W_{ij}^l, b_i^l} is an ensemble generated by

W_{ij}^l \overset{i.i.d.}{\sim} N(0, \sigma_w^2 / M_{l-1}), \quad b_i^l \overset{i.i.d.}{\sim} N(0, \sigma_b^2),   (2)

and then fixed, where N(0, \sigma^2) denotes a Gaussian distribution with zero mean and variance \sigma^2. We assume that there are T input samples x(t) \in R^{M_0} (t = 1, ..., T) generated independently from an input distribution, which is given by a standard normal distribution, i.e.,

x_j(t) \overset{i.i.d.}{\sim} N(0, 1).   (3)

The FIM of a DNN is computed by the chain rule in a manner similar to the backpropagation algorithm:

\frac{\partial f_k}{\partial W_{ij}^l} = \delta_{k,i}^l h_j^{l-1}, \quad \delta_{k,i}^l = \phi'(u_i^l) \sum_j \delta_{k,j}^{l+1} W_{ji}^{l+1},   (4)

where we denote f_k = u_k^L and \delta_{k,i}^l := \partial f_k / \partial u_i^l. To avoid complicated notation, we omit the index k of the output unit, i.e., \delta_i^l = \delta_{k,i}^l, for k = 1, ..., C.

2.2 Understanding DNNs through order parameters

We use the following four types of order parameters, (\hat{q}_t^l, \hat{q}_{st}^l, \tilde{q}_t^l, \tilde{q}_{st}^l), which have been commonly used in various studies of wide DNNs [10-13, 17-20, 24]. First, we use the following order parameters for feedforward signal propagation: \hat{q}_t^l := \sum_i h_i^l(t)^2 / M_l and \hat{q}_{st}^l := \sum_i h_i^l(s) h_i^l(t) / M_l, where h_i^l(t) are the outputs of the l-th layer when the input is x(t) (t = 1, ..., T). The variable \hat{q}_t^l is the total activity of the outputs in the l-th layer, and the variable \hat{q}_{st}^l is the overlap between the activations for different input samples x(s) and x(t). These variables have been utilized to explain the depth to which signals can be sufficiently propagated, from the perspective of an order-to-chaos phase transition [10]. In the large M limit, these variables are recursively computed by integration over Gaussian distributions [10, 24]:

\hat{q}_t^{l+1} = \int Du\, \phi^2(\sqrt{q_t^{l+1}}\, u), \quad \hat{q}_{st}^{l+1} = I_\phi[q_t^{l+1}, q_{st}^{l+1}],   (5)

q_t^{l+1} := \sigma_w^2 \hat{q}_t^l + \sigma_b^2, \quad q_{st}^{l+1} := \sigma_w^2 \hat{q}_{st}^l + \sigma_b^2,   (6)

for l = 0, ..., L - 1. Because input samples generated by Eq. (3) yield \hat{q}_t^0 = 1 and \hat{q}_{st}^0 = 0 for all s and t, \hat{q}_{st}^l in each layer takes the same value for all s != t, and so does \hat{q}_t^l for all t. The notation Du = du \exp(-u^2/2)/\sqrt{2\pi} means integration over the standard Gaussian density. We use a two-dimensional Gaussian integral given by I_\phi[a, b] := \int Dy\, Dx\, \phi(\sqrt{a}\, x)\, \phi(\sqrt{a}(cx + \sqrt{1 - c^2}\, y)) with c = b/a.

We also use the corresponding variables for backward signals: \tilde{q}_t^l := \sum_i \delta_i^l(t)^2 and \tilde{q}_{st}^l := \sum_i \delta_i^l(s) \delta_i^l(t). The variable \tilde{q}_t^l is the magnitude of the backward signals, and \tilde{q}_{st}^l is their overlap. Previous studies found that \tilde{q}_t^l and \tilde{q}_{st}^l in the large M limit are easily computed using the following recurrence relations [11, 25]:

\tilde{q}_t^l = \sigma_w^2 \tilde{q}_t^{l+1} \int Du\, [\phi'(\sqrt{q_t^l}\, u)]^2, \quad \tilde{q}_{st}^l = \sigma_w^2 \tilde{q}_{st}^{l+1} I_{\phi'}[q_t^l, q_{st}^l],   (7)

for l = 0, ..., L - 1, with \tilde{q}_t^L = \tilde{q}_{st}^L = 1. Previous studies confirmed excellent agreement between these backward order parameters and experimental results [11-13]. Although these studies required the so-called gradient independence assumption to derive the recurrences (details are given in Assumption 3.2), Yang [25] has recently proved that this assumption is unnecessary when \phi(x) has a polynomially bounded weak derivative.

The order parameters depend only on \sigma_w^2, \sigma_b^2, the types of activation functions, and the depth. The recurrence relations require L iterations of one- and two-dimensional numerical integrals. They are analytically tractable for certain activation functions, including the ReLUs [20].

2.3 Pathological sharpness of local landscapes

The FIM plays an essential role in the geometry of the parameter space and is a fundamental quantity in both statistics and machine learning. 
It defines a Riemannian metric of the parameter space, where the infinitesimal difference between statistical models is measured by the Kullback-Leibler divergence, as in information geometry [26]. We analyze the eigenvalue statistics of the following FIM of DNNs [20, 21, 27, 28]:

F = \sum_{k=1}^{C} E[\nabla_\theta f_k(t) \nabla_\theta f_k(t)^\top],   (8)

where \theta is a vector composed of all parameters {W_{ij}^l, b_i^l} and \nabla_\theta is the derivative with respect to it. The average over the input distribution is denoted by E[.]. As usual, when T input samples x(t) (t = 1, ..., T) are available for training, we replace the expectation E[.] in the FIM with the empirical average over the T samples, i.e., E[.] = \frac{1}{T}\sum_{t=1}^{T}. This study investigates such an empirical FIM for arbitrary T; it converges to the expected FIM as T \to \infty. This empirical FIM is widely used in machine learning and corresponds to the statistical model for the squared-error loss [21, 27, 28] (see Karakida et al. [20] for more details on this FIM). Recently, Kunstner et al. [29] emphasized that, in the context of natural gradient algorithms, the FIM (8) leads to better optimization than an FIM approximated by using training labels.

The FIM is known to determine not only the local distortion of the parameter space but also the loss landscape around a certain global minimum. Suppose the squared loss function E(\theta) = \frac{1}{2T} \sum_{k=1}^{C} \sum_{t=1}^{T} (y_k(t) - f_k(t))^2, where y_k(t) represents the training label corresponding to the input sample x(t). The FIM is related to the Hessian of the loss function, H := \nabla_\theta \nabla_\theta E(\theta), in the following manner [20, 21]:

H = F - \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{C} (y_k(t) - f_k(t)) \nabla_\theta \nabla_\theta f_k(t).   (9)

The Hessian coincides with the empirical FIM when the parameters converge to a global minimum with zero training error. In that sense, the FIM determines the local shape of the loss landscape around the minimum. This FIM is also known as the Gauss-Newton approximation of the Hessian.

Karakida et al. [20] elucidated hidden relations between the order parameters and basic statistics of the FIM's eigenvalues. We investigate DNNs satisfying the following condition.

Definition 2.1. Suppose a DNN with bias terms (\sigma_b != 0) or with activation functions having a non-zero Gaussian mean. We refer to this as a non-centered network.

The definition of the non-zero Gaussian mean is \int Dz\, \phi(z) != 0. Non-centered networks include various realistic settings, because usual networks include bias terms, and widely used activation functions, such as the sigmoid function and the (leaky-) ReLUs, have a non-zero Gaussian mean.

Denote the FIM's eigenvalues by \lambda_i (i = 1, ..., P), where P is the number of all parameters. The eigenvalues are non-negative by definition. Their mean is m_\lambda := \sum_{i=1}^{P} \lambda_i / P, and the maximum is \lambda_{max} := \max_i \lambda_i. The following theorem holds:

Theorem 2.2 ([20]). Suppose a non-centered network and i.i.d. input samples generated by Eq. (3). When M is sufficiently large, the eigenvalue statistics of F are asymptotically evaluated as

m_\lambda \sim \kappa_1 C / M, \quad \lambda_{max} \sim \alpha \left( \frac{T - 1}{T} \kappa_2 + \frac{\kappa_1}{T} \right) M,   (10)

where \alpha := \sum_{l=1}^{L-1} \alpha_l \alpha_{l-1}, and the positive constants \kappa_1 and \kappa_2 are obtained from the order parameters:

\kappa_1 := \sum_{l=1}^{L} \frac{\alpha_{l-1}}{\alpha} \tilde{q}_t^l \hat{q}_t^{l-1}, \quad \kappa_2 := \sum_{l=1}^{L} \frac{\alpha_{l-1}}{\alpha} \tilde{q}_{st}^l \hat{q}_{st}^{l-1}.   (11)

The mean is asymptotically close to zero, which implies that most of the eigenvalues are very small. In contrast, \lambda_{max} becomes pathologically large in proportion to the width. We refer to this \lambda_{max} as pathological sharpness, since the FIM's eigenvalues determine the local shape of the parameter space and of the loss landscape. Empirical studies reported that both close-to-zero eigenvalues and pathologically large ones appear in trained networks as well [23, 30].

Pathological sharpness universally appears in various DNNs. Technically speaking, if the network is not non-centered (i.e., a network with no bias terms and a zero Gaussian mean; we call it a centered network), \kappa_2 = 0 holds, lower-order terms of the eigenvalue statistics become non-negligible [20], and the pathological sharpness may disappear. For instance, \lambda_{max} is of O(1) when T is properly scaled with M in a centered shallow network [21]. Except for such special centered networks, we cannot avoid pathological sharpness. In practice, it would be better to alleviate the pathologically large \lambda_{max}, because it causes a sharp loss landscape, requires very small learning rates (see Section 3.4), and will lead to worse generalization [4, 5]. 
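The width-proportional growth of \lambda_{max} in Theorem 2.2 can be checked numerically at small cost. The following minimal NumPy sketch (illustrative code, not the implementation used for the paper's experiments; the widths, sample number, and random seed are arbitrary choices) computes the largest eigenvalue of the empirical FIM of a random ReLU network. It exploits the fact that F = J^T J / T and the small (TC) x (TC) Gram matrix J J^T / T share the same non-zero eigenvalues, where J stacks the per-sample output gradients of Eq. (4):

```python
import numpy as np

def lambda_max_fim(M, L=3, T=8, C=1, sw2=2.0, sb2=0.0, seed=0):
    """Largest eigenvalue of the empirical FIM of a random ReLU network.

    Uses the duality between F = J^T J / T and the Gram matrix J J^T / T
    (J: (T*C) x P Jacobian of the outputs w.r.t. the parameters), which
    share the same non-zero eigenvalues.
    """
    rng = np.random.default_rng(seed)
    widths = [M] * L + [C]                      # M_0 = ... = M_{L-1} = M, M_L = C
    Ws = [rng.normal(0.0, np.sqrt(sw2 / widths[l]), size=(widths[l + 1], widths[l]))
          for l in range(L)]
    bs = [rng.normal(0.0, np.sqrt(sb2), size=widths[l + 1]) for l in range(L)]
    X = rng.standard_normal((T, M))             # i.i.d. input samples, Eq. (3)

    rows = []                                   # per-(sample, output) rows of J
    for t in range(T):
        h, hs, us = X[t], [X[t]], []
        for l in range(L):                      # forward pass; linear readout
            u = Ws[l] @ h + bs[l]
            h = u if l == L - 1 else np.maximum(u, 0.0)
            us.append(u)
            hs.append(h)
        for k in range(C):                      # backprop of output k, Eq. (4)
            delta = np.zeros(C)
            delta[k] = 1.0
            grads = []
            for l in reversed(range(L)):
                # gradients w.r.t. W^l (outer product) and b^l (delta itself)
                grads.append(np.concatenate([np.outer(delta, hs[l]).ravel(), delta]))
                if l > 0:                       # ReLU derivative: 1[u > 0]
                    delta = (Ws[l].T @ delta) * (us[l - 1] > 0)
            rows.append(np.concatenate(grads))
    J = np.array(rows)
    return np.linalg.eigvalsh(J @ J.T / T).max()

# lambda_max grows roughly linearly with the width M (Theorem 2.2):
print(lambda_max_fim(200), lambda_max_fim(600))
```

Even though the network has on the order of a million parameters at these widths, the non-zero spectrum is obtained from an 8 x 8 matrix; the same FIM-NTK duality is the one mentioned in Section 5.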
In the following section, we reveal that a\nspeci\ufb01c normalization method plays an important role in alleviating pathological sharpness.\n\n3 Alleviation of pathological sharpness in batch normalization\n\n3.1 Eigenspace of largest eigenvalues\n\nBefore analyzing the effects of normalization methods on the FIM, it will be helpful to characterize\nthe cause of pathological sharpness. We \ufb01nd the following eigenspace of \u03bbmax\u2019s:\n\n4\n\n\fTheorem 3.1. Suppose a non-centered network and i.i.d. input samples generated by Eq. (3). When\nM is suf\ufb01ciently large, the eigenvectors corresponding to \u03bbmax\u2019s are asymptotically equivalent to\n(12)\n\nE[\u2207\u03b8fk] (k = 1, ..., C).\n\nThe derivation is shown in Supplementary Material A.2. This theorem gives us an idea of the\neffect of normalization on the FIM. Assume that we could shift the model output as \u00affk(t) =\nfk(t) \u2212 E[fk]. In this shifted model, the eigenvectors become E[\u2207\u03b8\n\u00affk] = 0 and vanish. Naively\nthinking, pathologically large eigenvalues may disappear under this shift of outputs. The following\nanalysis shows that this naive speculation is correct in a certain condition.\n\n3.2 Batch normalization in last layer\n\nIn this section, we analyze batch normalization in the last layer (L-th layer):\n\nfk(t) :=\n\nk (t) \u2212 \u00b5k(\u03b8)\nuL\n\n\u03c3k(\u03b8)\n\n\u03b3k + \u03b2k,\n\n(cid:113)\n\nE[uL\n\nk (t)2] \u2212 \u00b5k(\u03b8)2,\n\n\u00b5k(\u03b8) := E[uL\n\nk (t)], \u03c3k(\u03b8) :=\n\n(14)\nfor k = 1, ..., C. The average operator E[\u00b7] is taken over all input samples. In practical use of\nbatch normalization in stochastic gradient descent, the training samples are often divided into many\nsmall mini-batches, but we do not consider such division since our current interest is to evaluate\nthe FIM averaged over all samples. 
We set the hyperparameter \u03b3k = 1 for simplicity because \u03b3k\nonly changes the scale of the FIM up to a constant. The constant \u03b2k works as a new bias term in the\nnormalized network. We do not normalize middle layers (1 \u2264 l \u2264 L \u2212 1) to observe only the effect\nof normalization in the last layer.\nIn the following analysis, we use a widely used assumption for DNNs with random weights:\nAssumption 3.2 (the gradient independence assumption [11\u201314, 20]). When one evaluates backward\norder parameters, one can replace weight matrices W l+1 in the chain rule (4) with a fresh i.i.d. copy,\ni.e., \u02dcW l+1\n\ni.i.d.\u223c N (0, \u03c32\n\nw/Ml).\n\nij\n\nSupposing this assumption has been a central technique of the mean \ufb01eld theory of DNNs [11\u201314, 20]\nto make the derivation of backward order parameters relatively easy. These studies con\ufb01rmed that\nthis assumption leads to excellent agreements with experimental results. Moreover, recent studies\n[25, 31] have succeeded in theoretically justifying that various statistical quantities obtained under\nthis assumption coincide with exact solutions. Thus, Assumption 3.2 is considered to be effective as\nthe \ufb01rst step of the analysis.\nFirst, let us set \u03c3k(\u03b8) as a constant and only consider mean subtraction in the last layer:\n\nk (t) \u2212 \u00b5k(\u03b8))\u03b3k + \u03b2k.\n\n\u00affk(t) := (uL\n\n(15)\nSince the \u03c3k(\u03b8) controls the scale of the network output, one may suspect that the contribution of the\nmean subtraction would only be restrictive for alleviating sharpness. Contrary to this expectation, we\n\ufb01nd an interesting fact that the mean subtraction is essential to alleviate pathological sharpness:\nTheorem 3.3. Suppose a non-centered network with the mean subtraction in the last layer (Eq. (15))\nand i.i.d. input samples generated by Eq. (3). 
In the large M limit, the mean of the FIM\u2019s eigenvalues\nis asymptotically evaluated by\n\n(13)\n\n(16)\n\n(17)\n\nThe largest eigenvalue is asymptotically evaluated as follows: (i) when T \u2265 2 and T = O(1),\n\nm\u03bb \u223c (1 \u2212 1/T )(\u03ba1 \u2212 \u03ba2)C/M.\n\n\u03bbmax \u223c \u03b1\n\n\u03ba1 \u2212 \u03ba2\n\nT\n\nM,\n\nand (ii) when T = O(M ) with a constant \u03c1 := M/T , under the gradient independence assumption,\nwe have\n\n\u03c1\u03b1(\u03ba1 \u2212 \u03ba2) + c1 \u2264 \u03bbmax \u2264(cid:112)(C\u03b12\u03c1(\u03ba1 \u2212 \u03ba2)2 + c2)M ,\n\n(18)\n\nfor non-negative constants c1 and c2.\n\n5\n\n\fFigure 1: Effect of mean subtraction in last layer on \u03bbmax. Largest eigenvalues with T = M are\nshown. Black points show experimental values without mean subtraction, and red dashed lines show\ntheoretical values of \u03bbmax in Theorem 2.2. In contrast, blue points show experimental values with\nmean subtraction and red solid lines show theoretical values of lower bound in Theorem 3.3.\n\n\u221a\n\nw, \u03c32\n\nThe derivation is shown in Supplementary Material B.1. The mean subtraction does not change\nthe order of \u03bbmax when T = O(1). In contrast, it is interesting that it decreases the order when\nT = O(M ). The decrease in m\u03bb only appears in the coef\ufb01cient because \u03ba1 > \u03ba1 \u2212 \u03ba2 > 0 hold in\nthe non-centered networks. Thus, we can conclude that the mean subtraction in the last layer plays an\nessential role in decreasing \u03bbmax when T is appropriately scaled to M.\nAs shown in Fig.1, we empirically con\ufb01rmed that \u03bbmax became of O(1) in numerical experiments\nand pathological sharpness disappeared when T = M. Theorem 3.3 is consistent with the numerical\nexperimental results. We numerically computed \u03bbmax in DNNs with random Gaussian weights,\nbiases, and input samples generated by Eq. (3). We set \u03b1l = C = 1 and L = 3. 
Variances of\nparameters were given by (\u03c32\nb ) = (2, 0) in the ReLU case, and (3, 0.64) in the tanh case. Each\npoints and error bars show the experimental results over 100 different ensembles. We show the value\nof \u03c1\u03b1(\u03ba1 \u2212 \u03ba2) as the lower bound of \u03bbmax (red line). Although this lower bound and the theoretical\nupper bound of order\nM are relatively loose compared to the experimental results, recall that our\npurpose is not to obtain the tight bounds but to show the alleviation of \u03bbmax. The experimental\nresults with the mean subtraction were much lower than those without it as our theory predicts.\nFrom a theoretical perspective, one may be interested in how Assumption 3.2 works in the evaluation\nof \u03bbmax. As shown in Theorem B.1 of Supplementary Material B.1, we can evaluate \u03bbmax even\nwithout using this assumption. In general, the mean subtraction in the last layer makes \u03bbmax depend\ni(t) =\n) \u2264 \u03bbmax \u2264 O(M 1\u2212q\u2217\nst + O(1/M q) with the convergence rate q > 0, it leads to O(M 1\u22122q\u2217\n\u02dcql\n)\nwith q\u2217 = min{1/2, q}. This means that the alleviation appears for any q. In particular, Assumption\n3.2 yields q = q\u2217 = 1/2 and we obtain the lower bound of order 1. We have also con\ufb01rmed that\nbackward order parameters in numerical experiments on DNNs with random weights achieved the\nconvergence rate of q = q\u2217 = 1/2 (in Fig. S.1). Thus, Theorem 3.3 under this assumption becomes\nconsistent with the experimental results. The batch normalization essentially requires the evaluation\nof the convergence rate and this is an important difference from the previous study on DNNs without\nnormalization methods [20].\nWe can also add \u03c3L\n(Eq. (13)). When T = O(M ), the eigenvalue statistics slightly change to\n\non a convergence rate of backward order parameters. 
That is, when we have(cid:80)\n\nk (\u03b8) to Theorem 3.3 and obtain the eigenvalue statistics under the normalization\n\ni \u03b4l\n\ni(s)\u03b4l\n\nm\u03bb \u223c Q1(\u03ba1 \u2212 \u03ba2)/M, \u03c1\u03b1\n\n(\u03ba1 \u2212 \u03ba2) + c(cid:48)\n\n(Q2\u03b12\u03c1(\u03ba1 \u2212 \u03ba2)2 + c(cid:48)\n\n2)M (19)\n\nQ2\nQ1\n\nwhere Q1 := (cid:80)C\n\nk 1/\u03c3k(\u03b8)2, Q2 := (cid:80)C\n\n2 are non-negative constants. The\nderivation is shown in Supplementary Material B.2. This clari\ufb01es that the variance normalization\nworks only as a constant factor and the mean subtraction is essential to reduce pathological sharpness.\n\nk 1/\u03c3k(\u03b8)4, c(cid:48)\n\n1 and c(cid:48)\n\n1 \u2264 \u03bbmax \u2264(cid:113)\n\n6\n\nReLUTanhMM\u03bbmax\u03bbmax\f3.3 Batch normalization in middle layers\n\nTo distinguish the effectiveness of normalization in the last layer from those in other layers, we apply\nbatch normalization in all layers except for the last layer :\n\nul\ni(t) =\n\nW l\n\nijhl\u22121\n\nj\n\n(t) + bl\n\ni, \u00aful\n\ni(t) :=\n\ni(t) \u2212 \u00b5l\nul\n\ni\n\n\u03c3l\ni\n\n\u03b3l\ni + \u03b2l\n\ni, hl\n\ni(t) = \u03c6(\u00aful\n\ni(t)),\n\n(20)\n\nMl\u22121(cid:88)\n\nj=1\n\n(cid:113)\n\ni(t)2] \u2212 (\u00b5l\n\ni)2,\n\nE[ul\n\ni :=\n\ni(t)], \u03c3l\n\nkjhL\u22121\n\n\u00b5l\ni := E[ul\n\nfk(t) =(cid:80)\n\n(21)\nfor all middle layers (l = 1, ..., L \u2212 1) while the last layer is kept in an un-normalized manner, i.e.,\ni depend on weight and bias parameters. For\n\n(t) + bL\nj\nsimplicity, we set \u03b3l\ni = 1 and \u03b2l\nTheorem 3.4. Suppose non-negative activation functions and i.i.d. input samples generated by Eq.\n(3). The largest eigenvalue of the FIM under the normalization (Eq. (20)) is asymptotically lower\nbounded by\n\nk . The variables \u00b5l\ni = 0. 
We \ufb01nd a lower bound of \u03bbmax with order of M:\n\ni and \u03c3l\n\nj W L\n\n(cid:32)\n\n(cid:33)\n\n\u03bbmax \u2265 \u03b1L\u22121\n\nT \u2212 1\nT\n\n\u02c6qL\u22121\nst,BN +\n\n\u02c6qL\u22121\nt,BN\nT\n\nM,\n\n(22)\n\nwhere \u02c6qL\u22121\n\nt,BN and \u02c6qL\u22121\n\nst,BN are positive constants independent of M.\n\nBecause the last layer is unnormalized, we can construct a lower bound composed of the activations\nin the (L \u2212 1)-th layer. Note that the set of non-negative activation functions (i.e., \u03c6(x) \u2265 0) is a\nsubclass of the non-centered networks. It includes sigmoid and ReLU functions which are widely\nused. The bias term, i.e., \u03c32\nb , does not affect the theorem because they are canceled out in the mean\nsubtraction of each middle layer. After this batch normalization, \u03bbmax is still of O(M ) at lowest and\nthe pathological sharpness is unavoidable in that sense. Thus, one can conclude that the normalization\nin the middle layers cannot alleviate pathological sharpness in many settings.\nThe constants \u02c6qL\u22121\nst,BN correspond to feedforward order parameters in batch normalization.\nThe details are shown in Supplementary Material C.1. Although the purpose of our study was to\nevaluate the order of the eigenvalues, some approaches analytically compute the speci\ufb01c values of\nthe order parameters under certain conditions [14, 32] (see Supplementary Material C.2 for more\ndetails). In particular, they are analytically tractable in ReLU networks as follows; \u02c6qL\u22121\nt,BN = 1/2 and\n\u02c6qL\u22121\nst,BN = 1\n\n2 J(\u22121/(T \u2212 1)) where J(x) is the arccosine kernel [14].\n\nt,BN and \u02c6qL\u22121\n\n3.4 Effect on the gradient descent method\n\n\u03b7 < 2/\u03bbmax.\n\nIts update rule is given by \u03b8t+1 \u2190\nConsider the gradient descent method in a batch regime.\n\u03b8t \u2212 \u03b7\u2207\u03b8E(\u03b8t) where \u03b7 is a constant learning rate. 
Under some natural assumptions, there exists a\nnecessary condition of the learning rate for the gradient dynamics to converge to a global minimum\n[20, 23];\n\n(23)\nBecause our theory shows that batch normalization in the last layer decreased \u03bbmax, the appropriate\nlearning rate for convergence becomes larger. To con\ufb01rm this effect on the learning rate, we did\nexperiments on training with the gradient descent as shown in Fig. 2. we trained DNNs with various\nwidths by using various \ufb01xed learning rates, providing i.i.d. Gaussian input samples and labels\ngenerated by corresponding teacher networks. It was the same setting as the experiment shown in\n[20]. Fig. 2 (left) shows the color map of training losses without any normalization method and\nis just a reproduction of [20]. Losses exploded in the gray area (i.e., were larger than 103) and\nthe red line shows the theoretical value of 2/\u03bbmax, which was calculated with the FIM at random\ninitialization. Training above the red line exploded in suf\ufb01ciently widen DNNs, just as the necessary\ncondition (23) predicts. In contrast, Fig. 2 (right) shows the result of the batch normalization (mean\nsubtraction) in the last layer. We con\ufb01rmed that it allows larger learning rates for convergence and\nthey are independent of width. We calculated the theoretical line by using the lower bound of \u03bbmax,\ni.e., \u03b7 = 2/(\u03c1\u03b1(\u03ba1 \u2212 \u03ba2)). Note that Fig. 2 shows the results on the single trial of training with \ufb01xed\ninitialization. It caused the stripe pattern of color map depending on the random seed of each width,\nespecially in the case of normalized networks. As shown in Fig. S.2 of Supplementary Material\n\n7\n\n\fFigure 2: Exhaustively searched training losses depending on M (width) and \u03b7 (learning rate). The\ncolor bar shows the value of training loss after 1000 steps of training. 
We trained deep ReLU networks\nwith \u03b1l = C = 1, L = 3, T = 1000 and (\u03c32\n\nb ) = (4, 1).\n\nw, \u03c32\n\nD, accumulation of multiple trials achieves lower losses regardless of the width. Thus, the batch\nnormalization is helpful to set larger learning rates, which could be expected to speed-up the training\nof neural networks [2].\n\n4 Pathological sharpness in layer normalization\n\nIt is an interesting question to investigate the effect of other normalization methods on pathological\nsharpness. Let us consider layer normalization [9]:\n\nul\ni(t) =\n\nW l\n\nijhl\u22121\n\nj\n\n(t) + bl\n\ni, \u00aful\n\ni(t) \u2212 \u00b5l(t)\nul\n\n\u03c3l(t)\n\n\u03b3l\ni + \u03b2l\n\ni, hl\n\ni = \u03c6(\u00aful\n\ni(t)),\n\n(24)\n\n(25)\n\nMl\u22121(cid:88)\n(cid:88)\n\nj=1\n\ni(t) :=\n\n(cid:115)(cid:88)\n\n\u00b5l(t) :=\n\nul\ni(t)/Ml, \u03c3l(t) :=\n\ni\n\ni\n\ni(t)2/Ml \u2212 \u00b5l(t)2,\nul\n\nfor all layers (l = 1, ..., L). The network output is normalized as fk(t) = \u00afuL\nk (t). While batch nor-\nmalization (20) normalizes the pre-activation of each unit across batch samples, layer normalization\n(24) normalizes that of each sample across the units in the same layer. Although layer normalization\nis the method typically used in recurrent neural networks, we show its effectiveness in feedforward\nnetworks to contrast the effect of batch normalization on the FIM. For simplicity, we set \u03b3l\ni = 1 and\ni = 0. Then, we \ufb01nd\n\u03b2l\nTheorem 4.1. Suppose a non-centered network, i.i.d. input samples generated by Eq. (3), and the\ngradient independence assumption. When M is suf\ufb01ciently large and C > 2, the eigenvalue statistics\nof the FIM under the normalization (Eq. 
(24)) are asymptotically evaluated as

m_\lambda \sim (C-2)\eta_1\kappa'_1/M, \quad \alpha s \sqrt{M} \le \lambda_{max} \le \alpha s M,   (26)

with

s = \frac{\kappa'_1\left[(\eta_3^2 - \eta_1^2)T + (C-2)\eta_2^2\right] + \kappa'_2 (C-2)\eta_2^2}{T},   (27)

where κ′1, κ′2 and ηi (i = 1, 2, 3) are constants independent of M.

Layer normalization does not alleviate the pathological sharpness in the sense that λmax is of order M. Intuitively, this is because E[f_k] is not equal to Σ_k f_k/C, so the mean subtraction in the last layer does not cancel out the eigenvectors in Theorem 3.1. We can compute κ′1 and κ′2 by using order parameters, and ηi (i = 1, 2, 3) by the variance σ^L(t)². The definition of each variable and the proof of the theorem are given in Supplementary Material E. The independence assumption is used in the derivation of backward order parameters as usual [11, 14]. When C = 2, the FIM becomes a zero matrix because of a special symmetry in the last layer. Therefore, the non-trivial case is C > 2.

5 Related work

Normalization and geometric characterization. Batch normalization is believed to perform well because it suppresses the internal covariate shift [2]. Recent extensive studies, however, have reported
alternative explanations of how batch normalization works [3, 4]. Santurkar et al. [3] empirically found that batch normalization decreases sharp changes of the loss function and makes the loss landscape smoother. Bjorck et al. [4] reported that batch normalization works to prevent an explosion of the loss and gradients. While some theoretical studies analyzed FIMs in un-normalized DNNs [6, 20, 21], the analysis of normalized DNNs has been limited. Santurkar et al. [3] analyzed gradients and the Hessian under batch normalization in a single layer and theoretically evaluated their worst-case bounds, but their inequality is too general to quantify the decrease of sharpness. In particular, it misses the special effect of the last layer that we found in this study. The original paper on layer normalization [9] analyzed the FIM in generalized linear models (GLMs) and argued that the normalization could decrease the curvature of the parameter space. While a GLM corresponds to a single-layer model, shallow and deep networks have hidden layers. As the hidden layers become wide, pathological sharpness appears, and layer normalization suffers from it.

Gradient descent method. There are other related works in addition to those mentioned in Section 3.4. Bjorck et al. [4] speculated that the larger learning rates enabled by batch normalization may help stochastic gradient descent avoid sharp minima, leading to better generalization. Wei et al. [32] estimated λmax and η under a special type of batch-wise normalization. Because their normalization method approximates the chain rule of backpropagation by neglecting the contribution of mean subtraction, it suffers from pathological sharpness and requires smaller learning rates.

Neural tangent kernel. The FIM and NTK satisfy a kind of duality and share the same non-zero eigenvalues. Our proofs of the eigenvalue statistics use the NTK with standard parameterization, i.e., F* in Supplementary Material A.1. The NTK at random initialization is known to determine the gradient dynamics of a sufficiently wide DNN in function space. A sufficiently wide network can achieve zero training error, which means that there is always a global minimum sufficiently close to the random initialization. In the parameter space, Lee et al. 
[19] proved that the NTK dynamics is well approximated by the gradient descent of a linearized model expanded around the random initialization θ0: f(x; θ_t) = f(x; θ0) + ∇_θ f(x; θ0)^⊤ ω_t, where ω_t := θ_t − θ0 and t denotes the step of gradient descent. Naively speaking, this suggests that the optimization of a wide DNN approximately becomes convex and the loss landscape is dominated by a quadratic form with the FIM, i.e., ω_t^⊤ F ω_t.

6 Discussion

There remain a number of directions for extending our theoretical framework. Recent studies on wide DNNs have revealed that the NTK at random initialization dominates the training dynamics and even the performance of trained networks [18, 19]. Since the NTK is defined as a right-to-left reversed Gram matrix of the FIM under a special parameterization, the convergence speed of the training dynamics is essentially governed by the eigenvalues of the FIM at random initialization. Analyzing these dynamics under normalization remains an open problem. For further analysis, random matrix theory will also be helpful in obtaining the whole eigenvalue spectrum or deriving tighter bounds on the largest eigenvalue. Although random matrix theory has so far been limited to single-layer or shallow networks [21], extending it to deeper and normalized networks will be an important direction.

There may be potential properties of normalization methods that are not detected in our framework. Kohler et al. [33] analyzed the decoupling of the weight vector into its direction and length, as in batch normalization and weight normalization. They revealed that such decoupling can contribute to accelerating the optimization. Bjorck et al. [4] argued that deep linear networks without bias terms suffer from an explosion of the feature vectors and speculated that batch normalization is helpful in reducing this explosion. 
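As a minimal numerical sketch of this feature explosion (our illustration with arbitrarily chosen depth L = 50, width M = 100, and weight variance σ_w² = 4, not the exact setting of [4]): in a deep linear network with i.i.d. Gaussian weights, the feature norm grows roughly like σ_w^L, whereas a batch-normalization-style standardization of each unit keeps it of order √M.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, batch = 100, 50, 32   # width, depth, batch size (illustrative values)
sigma_w = 2.0               # sigma_w^2 = 4, the weight variance used in our experiments

X = rng.standard_normal((batch, M))

def mean_feature_norm(X, normalize):
    """Forward a deep linear network (no bias) and return the average ||h^L||."""
    h = X
    for _ in range(L):
        # W_ij ~ N(0, sigma_w^2 / M), the standard mean-field scaling
        W = rng.standard_normal((M, M)) * sigma_w / np.sqrt(M)
        h = h @ W.T
        if normalize:
            # batch-normalization-style standardization of each unit across the batch
            h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-8)
    return np.linalg.norm(h, axis=1).mean()

plain = mean_feature_norm(X, normalize=False)
normed = mean_feature_norm(X, normalize=True)
print(f"without normalization: {plain:.3e}")  # grows like sigma_w^L
print(f"with standardization:  {normed:.3e}")  # stays of order sqrt(M)
```

Since σ_w = 2, the un-normalized feature norm roughly doubles at every layer, while standardization resets each unit to unit variance and so pins the per-sample norm near √M = 10 regardless of depth.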
This explosion argument implies that batch normalization may be helpful for improving optimization performance even in a centered network. Yang et al. [14] developed an excellent mean-field framework for batch normalization through all layers and found that a gradient explosion is induced by batch normalization in networks of extreme depth. Even if batch normalization alleviates the pathological sharpness with respect to the width, the coefficients of the order evaluation can become very large when the network is extremely deep, which may cause another type of sharpness. It is also interesting to explore SGD training under normalization and to quantify how the alleviation of sharpness affects the appropriate sizes of the learning rate and mini-batch, which have mainly been investigated in SGD training without normalization [34]. Further studies on such phenomena in wide DNNs would be helpful for further understanding and development of normalization methods.

Acknowledgments

This work was partially supported by a Grant-in-Aid for Young Scientists (19K20366) from the Japan Society for the Promotion of Science (JSPS).

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[2] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of International Conference on Machine Learning (ICML), pages 448–456, 2015.

[3] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.

[4] Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. 
In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 7694–7705, 2018.

[5] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR'2017 arXiv:1609.04836, 2017.

[6] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 888–896, 2019.

[7] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 6389–6399, 2018.

[8] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 4949–4959, 2018.

[9] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016.

[10] Ben Poole, Subhaneil Lahiri, Maithreyi Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Proceedings of Advances In Neural Information Processing Systems (NIPS), pages 3360–3368, 2016.

[11] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. ICLR'2017 arXiv:1611.01232, 2017.

[12] Greg Yang and Samuel S Schoenholz. Mean field residual networks: On the edge of chaos. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 2865–2873, 2017.

[13] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Pennington. 
Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In Proceedings of International Conference on Machine Learning (ICML), pages 5393–5402, 2018.

[14] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S Schoenholz. A mean field theory of batch normalization. ICLR'2019 arXiv:1902.08129, 2019.

[15] Jeffrey Pennington, Samuel S Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1924–1932, 2018.

[16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Proceedings of Advances In Neural Information Processing Systems (NIPS), pages 2253–2261, 2016.

[17] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. ICLR'2018 arXiv:1711.00165, 2018.

[18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 8580–8589, 2018.

[19] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv:1902.06720, 2019.

[20] Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1032–1041, 2019.

[21] Jeffrey Pennington and Pratik Worah. 
The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 5410–5419, 2018.

[22] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of International Conference on Machine Learning (ICML), pages 2408–2417, 2015.

[23] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50. Springer, 1998.

[24] Shun-ichi Amari. A method of statistical neurodynamics. Kybernetik, 14(4):201–215, 1974.

[25] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv:1902.04760, 2019.

[26] Shun-ichi Amari. Information geometry and its applications. Springer, 2016.

[27] Hyeyoung Park, Shun-ichi Amari, and Kenji Fukumizu. Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13(7):755–764, 2000.

[28] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. ICLR'2014 arXiv:1301.3584, 2014.

[29] Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical Fisher approximation. arXiv:1905.12558, 2019.

[30] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv:1706.04454, 2017.

[31] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv:1904.11955, 2019.

[32] Mingwei Wei, James Stokes, and David J Schwab. 
Mean-\ufb01eld analysis of batch normalization.\n\narXiv:1903.02606, 2019.\n\n[33] Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann, Ming Zhou, and Klaus\nNeymeyr. Exponential convergence rates for batch normalization: The power of length-direction\ndecoupling in non-convex optimization. In Proceedings of International Conference on Arti\ufb01cial\nIntelligence and Statistics (AISTATS), pages 806\u2013815, 2019.\n\n[34] Daniel Park, Jascha Sohl-Dickstein, Quoc Le, and Samuel Smith. The effect of network width on\nstochastic gradient descent and generalization: an empirical study. In International Conference\non Machine Learning, pages 5042\u20135051, 2019.\n\n11\n\n\f", "award": [], "sourceid": 3462, "authors": [{"given_name": "Ryo", "family_name": "Karakida", "institution": "National Institute of Advanced Industrial Science and Technology"}, {"given_name": "Shotaro", "family_name": "Akaho", "institution": "AIST"}, {"given_name": "Shun-ichi", "family_name": "Amari", "institution": "RIKEN"}]}