{"title": "Data-Dependence of Plateau Phenomenon in Learning with Neural Network --- Statistical Mechanical Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1722, "page_last": 1730, "abstract": "The plateau phenomenon, wherein the loss value stops decreasing during the process of learning, has been reported by various researchers. The phenomenon is actively inspected in the 1990s and found to be due to the fundamental hierarchical structure of neural network models. Then the phenomenon has been thought as inevitable. However, the phenomenon seldom occurs in the context of recent deep learning. There is a gap between theory and reality. In this paper, using statistical mechanical formulation, we clarified the relationship between the plateau phenomenon and the statistical property of the data learned. It is shown that the data whose covariance has small and dispersed eigenvalues tend to make the plateau phenomenon inconspicuous.", "full_text": "Data-Dependence of Plateau Phenomenon in\nLearning with Neural Network \u2014 Statistical\n\nMechanical Analysis\n\nDepartment of Complexity Science and Engineering, Graduate School of Frontier Sciences,\n\nYuki Yoshida\n\nMasato Okada\n\nThe University of Tokyo\n\n5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan\n{yoshida@mns, okada@edu}.k.u-tokyo.ac.jp\n\nAbstract\n\nThe plateau phenomenon, wherein the loss value stops decreasing during the\nprocess of learning, has been reported by various researchers. The phenomenon is\nactively inspected in the 1990s and found to be due to the fundamental hierarchical\nstructure of neural network models. Then the phenomenon has been thought as\ninevitable. However, the phenomenon seldom occurs in the context of recent\ndeep learning. There is a gap between theory and reality. In this paper, using\nstatistical mechanical formulation, we clari\ufb01ed the relationship between the plateau\nphenomenon and the statistical property of the data learned. 
We show that data whose covariance has small and dispersed eigenvalues tend to make the plateau phenomenon inconspicuous.

1 Introduction

1.1 Plateau Phenomenon

Deep learning, with neural networks as its essential component, has come to be applied in various fields. Theoretically, however, many aspects of it remain unclear; the plateau phenomenon is one of them. In the learning process of a neural network, the weight parameters are updated iteratively so that the loss decreases. In some settings, however, the loss does not decrease in a simple way: its decreasing speed slows down significantly partway through learning, and then speeds up again after a long period of time. This is called the "plateau phenomenon". Since the 1990s, this phenomenon has been reported to occur in various practical learning situations (see Figure 1(a) and Park et al. [2000], Fukumizu and Amari [2000]). As a fundamental cause of this phenomenon, a number of researchers have pointed out that the intrinsic symmetry of neural network models brings singularities to the metric in the parameter space, which then give rise to special attractors whose regions of attraction have nonzero measure, called Milnor attractors (defined by Milnor [1985]; see also Figure 5 in Fukumizu and Amari [2000] for a schematic diagram of the attractor).

1.2 Who moved the plateau phenomenon?

However, the plateau phenomenon seldom occurs in the recent practical use of neural networks (see Figure 1(b) for example).
In this research, we rethink the plateau phenomenon and discuss which situations are likely to cause it. First we introduce the student-teacher model of two-layered networks as an idealized system. 
Next, we reduce the learning dynamics of the student-teacher model to a small-dimensional order parameter system by using a statistical mechanical formulation, under the assumption that the input dimension is sufficiently large.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

By analyzing the order parameter system, we can discuss how the macroscopic learning dynamics depends on the statistics of the input data. Our main contributions are the following:

• Under the statistical mechanical formulation of learning in the two-layered perceptron, we showed that macroscopic equations can be derived even when the statistical properties of the input are generalized. In other words, we extended the results of Saad and Solla [1995] and Riegler and Biehl [1995].

• By analyzing the macroscopic system we derived, we showed that the dynamics of learning depends only on the eigenvalue distribution of the covariance matrix of the input data.

• We clarified the relationship between the input data statistics and the plateau phenomenon. In particular, by numerically analyzing the macroscopic system, we show that data whose covariance matrix has small and dispersed eigenvalues tend to make the phenomenon inconspicuous.

1.3 Related works

The statistical mechanical approach used in this research was first developed by Saad and Solla [1995]. The method reduces the high-dimensional learning dynamics of nonlinear neural networks to a low-dimensional system of order parameters. They derived the macroscopic behavior of the learning dynamics of a two-layered soft-committee machine, and by analyzing it they pointed out the existence of the plateau phenomenon. Nowadays the statistical mechanical method is applied to analyze recent techniques (Hara et al. [2016], Yoshida et al. [2017], Takagi et al. 
[2019] and Straat and Biehl [2019]), generalization performance in the over-parameterized setting (Goldt et al. [2019]), and environments with concept drift (Straat et al. [2018]). However, it has remained unknown how properties of the input dataset itself affect the learning dynamics, including plateaus.
The plateau phenomenon, and singularities in the loss landscape as its main cause, have been studied by Fukumizu and Amari [2000], Wei et al. [2008], Cousseau et al. [2008] and Guo et al. [2018]. On the other hand, several recent works suggest that plateaus and singularities can be mitigated in some settings. Orhan and Pitkow [2017] show that skip connections eliminate the singularities. Another work, by Yoshida et al. [2019], points out that output dimensionality affects the plateau phenomenon, in that multiple output units alleviate it. However, neither the number of output units nor the use of skip connections fully determines the presence or absence of plateaus. The statistical properties of the data alone can affect the learning dynamics dramatically; for example, see Figure 2 for learning curves obtained with different datasets and the same network architecture. We focus on what kind of statistical property of the data brings about the plateau phenomenon.

Figure 1: (a) Training loss curves when a two-layer perceptron with 4-4-3 units and ReLU activation learns the IRIS dataset. (b) Training loss curves when a two-layer perceptron with 784-20-10 units and ReLU activation learns the MNIST dataset. For both (a) and (b), the results of 100 trials with random initialization are overlaid. 
Minibatch size of 10 and vanilla SGD (learning rate: 0.01) are used.

Figure 2: Loss curves yielded by student-teacher learning with a two-layer perceptron which has 2 hidden units, 1 output unit and sigmoid activation, with (a) the IRIS dataset, (b) the MNIST dataset, and (c) a dataset in R^{60000×784} drawn from the standard normal distribution, as the input distribution p(ξ). In every subfigure, the results of 20 trials with random initialization are overlaid. Vanilla SGD (learning rate: (a)(b) 0.005, (c) 0.001) and minibatch size of 1 are used for all three settings.

2 Formulation

2.1 Student-Teacher Model

We consider a two-layer perceptron which has N input units, K hidden units and 1 output unit. We denote the input to the network by ξ ∈ R^N. Then the output can be written as

\[ s = \sum_{i=1}^{K} w_i\, g(J_i \cdot \xi) \in \mathbb{R}, \]

where g is an activation function.
We consider the situation in which this network learns data generated by another network with fixed weights, called the "teacher network". Specifically, as the teacher network we consider a two-layer perceptron that outputs

\[ t = \sum_{n=1}^{M} v_n\, g(B_n \cdot \xi) \in \mathbb{R} \]

for input ξ. The generated data (ξ, t) are fed to the student network stated above and learned by it in an online manner (see Figure 3). We assume that the input ξ is drawn independently from some distribution p(ξ) at every step. We adopt the vanilla stochastic gradient descent (SGD) algorithm for learning. We assume the squared loss function ε = ½(s − t)², which is most commonly used for regression.

2.2 Statistical Mechanical Formulation

In order to capture the learning dynamics of the nonlinear neural networks described in the previous subsection macroscopically, we introduce the statistical mechanical formulation in this subsection. Let x_i := J_i · ξ (1 ≤ i ≤ K) and y_n := B_n · ξ (1 ≤ n ≤ M). Then

\[ (x_1, \ldots, x_K, y_1, \ldots, y_M) \sim \mathcal{N}\bigl(0,\ [J_1, \ldots, J_K, B_1, \ldots, B_M]^T \Sigma\, [J_1, \ldots, J_K, B_1, \ldots, B_M]\bigr) \]

holds as N → ∞ by the generalized central limit theorem, provided that the input distribution p(ξ) has zero mean and a finite covariance matrix Σ.
Next, let us introduce order parameters as follows: Q_{ij} := J_i^T Σ J_j = ⟨x_i x_j⟩, R_{in} := J_i^T Σ B_n = ⟨x_i y_n⟩, T_{nm} := B_n^T Σ B_m = ⟨y_n y_m⟩, and D_{ij} := w_i w_j, E_{in} := w_i v_n, F_{nm} := v_n v_m. Then

\[ (x_1, \ldots, x_K, y_1, \ldots, y_M) \sim \mathcal{N}\left(0, \begin{pmatrix} Q & R \\ R^T & T \end{pmatrix}\right). \]

The parameters Q_{ij}, R_{in}, T_{nm}, D_{ij}, E_{in} and F_{nm} introduced above capture the state of the system macroscopically; therefore they are called "order parameters." The first three represent the state of the first layers of the two networks (student and teacher), and the latter three represent the state of their second layers. Q describes the statistics of the student's first layer, T those of the teacher's first layer, and R is related to the similarity between the student's and teacher's first layers. D, E and F are the second-layer counterparts of Q, R and T. The values of Q_{ij}, R_{in}, D_{ij} and E_{in} change during learning; their dynamics are what is to be determined, from the dynamics of the microscopic variables, i.e. the connection weights. 
In contrast, T_{nm} and F_{nm} are constant during learning.

Figure 3: Overview of the student-teacher model formulation.

2.2.1 Higher-order order parameters

The important difference between our situation and that of Saad and Solla [1995] is that the covariance matrix Σ of the input ξ is not necessarily equal to the identity. This complicates matters, since higher-order terms Σ^e (e = 1, 2, ...) inevitably appear in the learning dynamics of the order parameters. In order to deal with these, we here define higher-order versions of the order parameters. Let Q^{(e)}_{ij} := J_i^T Σ^e J_j, R^{(e)}_{in} := J_i^T Σ^e B_n and T^{(e)}_{nm} := B_n^T Σ^e B_m for e = 0, 1, 2, .... Note that they are identical to Q_{ij}, R_{in} and T_{nm} in the case e = 1. Also we define higher-order versions of x_i and y_n, namely x^{(e)}_i := ξ^T Σ^e J_i and y^{(e)}_n := ξ^T Σ^e B_n. Note that x^{(0)}_i = x_i and y^{(0)}_n = y_n.

3 Derivation of dynamics of order parameters

At each iteration of online learning, the weights J_i and w_i of the student network are updated as

\[ \Delta J_i = -\frac{\eta}{N}\frac{d\varepsilon}{dJ_i} = \frac{\eta}{N}\,[(t-s)\,w_i]\,g'(x_i)\,\xi = \frac{\eta}{N}\Bigl[\Bigl(\sum_{n=1}^{M} v_n g(y_n) - \sum_{j=1}^{K} w_j g(x_j)\Bigr) w_i\Bigr] g'(x_i)\,\xi, \]
\[ \Delta w_i = -\frac{\eta}{N}\frac{d\varepsilon}{dw_i} = \frac{\eta}{N}\,g(x_i)(t-s) = \frac{\eta}{N}\,g(x_i)\Bigl(\sum_{n=1}^{M} v_n g(y_n) - \sum_{j=1}^{K} w_j g(x_j)\Bigr), \qquad (1) \]

in which we set the learning rate as η/N, so that our macroscopic system is N-independent.
Then, the order parameters Q^{(e)}_{ij} and R^{(e)}_{in} (e = 0, 1, 2, ...) are updated with

\[
\begin{aligned}
\Delta Q^{(e)}_{ij} &= (J_i + \Delta J_i)^T \Sigma^e (J_j + \Delta J_j) - J_i^T \Sigma^e J_j = J_i^T \Sigma^e \Delta J_j + J_j^T \Sigma^e \Delta J_i + \Delta J_i^T \Sigma^e \Delta J_j \\
&= \frac{\eta}{N}\Bigl[\sum_{p=1}^{M} E_{ip}\, g'(x_i)\, x^{(e)}_j g(y_p) - \sum_{p=1}^{K} D_{ip}\, g'(x_i)\, x^{(e)}_j g(x_p) + \sum_{p=1}^{M} E_{jp}\, g'(x_j)\, x^{(e)}_i g(y_p) - \sum_{p=1}^{K} D_{jp}\, g'(x_j)\, x^{(e)}_i g(x_p)\Bigr] \\
&\quad + \frac{\eta^2}{N^2}\, \xi^T \Sigma^e \xi \,\Bigl[\sum_{p,q}^{M,M} E_{ip}E_{jq}\, g'(x_i)g'(x_j)g(y_p)g(y_q) - \sum_{p,q}^{M,K} E_{ip}D_{jq}\, g'(x_i)g'(x_j)g(y_p)g(x_q) \\
&\qquad - \sum_{p,q}^{K,M} D_{ip}E_{jq}\, g'(x_i)g'(x_j)g(x_p)g(y_q) + \sum_{p,q}^{K,K} D_{ip}D_{jq}\, g'(x_i)g'(x_j)g(x_p)g(x_q)\Bigr], \qquad (2)
\end{aligned}
\]
\[ \Delta R^{(e)}_{in} = (J_i + \Delta J_i)^T \Sigma^e B_n - J_i^T \Sigma^e B_n = \Delta J_i^T \Sigma^e B_n = \frac{\eta}{N}\Bigl[\sum_{p=1}^{M} E_{ip}\, g'(x_i)\, y^{(e)}_n g(y_p) - \sum_{p=1}^{K} D_{ip}\, g'(x_i)\, y^{(e)}_n g(x_p)\Bigr]. \]

Since

\[ \xi^T \Sigma^e \xi \approx N \mu_{e+1}, \qquad \text{where } \mu_d := \frac{1}{N}\sum_{i=1}^{N} \lambda_i^d, \qquad \lambda_1, \ldots, \lambda_N: \text{eigenvalues of } \Sigma, \]

and the right-hand sides of the difference equations are O(N^{-1}), we can replace these difference equations with differential ones as N → ∞, by taking the expectation over all input vectors ξ:

\[
\begin{aligned}
\frac{dQ^{(e)}_{ij}}{d\tilde\alpha} &= \eta\Bigl[\sum_{p=1}^{M} E_{ip} I_3(x_i, x^{(e)}_j, y_p) - \sum_{p=1}^{K} D_{ip} I_3(x_i, x^{(e)}_j, x_p) + \sum_{p=1}^{M} E_{jp} I_3(x_j, x^{(e)}_i, y_p) - \sum_{p=1}^{K} D_{jp} I_3(x_j, x^{(e)}_i, x_p)\Bigr] \\
&\quad + \eta^2 \mu_{e+1} \Bigl[\sum_{p,q}^{M,M} E_{ip}E_{jq} I_4(x_i,x_j,y_p,y_q) - \sum_{p,q}^{M,K} E_{ip}D_{jq} I_4(x_i,x_j,y_p,x_q) - \sum_{p,q}^{K,M} D_{ip}E_{jq} I_4(x_i,x_j,x_p,y_q) + \sum_{p,q}^{K,K} D_{ip}D_{jq} I_4(x_i,x_j,x_p,x_q)\Bigr], \qquad (3) \\
\frac{dR^{(e)}_{in}}{d\tilde\alpha} &= \eta\Bigl[\sum_{p=1}^{M} E_{ip} I_3(x_i, y^{(e)}_n, y_p) - \sum_{p=1}^{K} D_{ip} I_3(x_i, y^{(e)}_n, x_p)\Bigr],
\end{aligned}
\]
where
\[ I_3(z_1,z_2,z_3) := \langle g'(z_1)\, z_2\, g(z_3)\rangle \quad \text{and} \quad I_4(z_1,z_2,z_3,z_4) := \langle g'(z_1) g'(z_2) g(z_3) g(z_4)\rangle. \qquad (4) \]

In these equations, α̃ := α/N represents time (the normalized number of steps), and the brackets ⟨·⟩ denote the expectation when the input ξ follows the input distribution p(ξ).
The differential equations for D and E are obtained in a similar way:

\[ \frac{dD_{ij}}{d\tilde\alpha} = \eta\Bigl[\sum_{p=1}^{M} E_{ip} I_2(x_j, y_p) - \sum_{p=1}^{K} D_{ip} I_2(x_j, x_p) + \sum_{p=1}^{M} E_{jp} I_2(x_i, y_p) - \sum_{p=1}^{K} D_{jp} I_2(x_i, x_p)\Bigr], \]
\[ \frac{dE_{in}}{d\tilde\alpha} = \eta\Bigl[\sum_{p=1}^{M} F_{pn} I_2(x_i, y_p) - \sum_{p=1}^{K} E_{pn} I_2(x_i, x_p)\Bigr], \qquad (5) \]
where
\[ I_2(z_1, z_2) := \langle g(z_1) g(z_2)\rangle. \qquad (6) \]

These differential equations (3) and (5) govern the macroscopic dynamics of learning. In addition, the generalization loss ε_g, the expectation of the loss value ε(ξ) = ½(s − t)² over all input vectors ξ, is represented as

\[ \varepsilon_g = \Bigl\langle \tfrac{1}{2}(s-t)^2 \Bigr\rangle = \frac{1}{2}\Bigl[\sum_{p,q}^{K,K} D_{pq} I_2(x_p,x_q) - 2\sum_{p,q}^{K,M} E_{pq} I_2(x_p,y_q) + \sum_{p,q}^{M,M} F_{pq} I_2(y_p,y_q)\Bigr]. \qquad (7) \]

3.1 Expectation terms

Above we determined the dynamics of the order parameters as (3), (5) and (7). However, they contain expectation terms I_2(z_1, z_2), I_3(z_1, z^{(e)}_2, z_3) and I_4(z_1, z_2, z_3, z_4), where the z's are either x_i or y_n. By studying what distribution z follows, we can show that these expectation terms depend only on the 1st and (e+1)-th order parameters, namely Q^{(1)}, R^{(1)}, T^{(1)} and Q^{(e+1)}, R^{(e+1)}, T^{(e+1)}; for example,

\[ I_3(x_i, x^{(e)}_j, y_p) = \int dz_1\, dz_2\, dz_3\; g'(z_1)\, z_2\, g(z_3)\; \mathcal{N}\left(z \,\middle|\, 0, \begin{pmatrix} Q^{(1)}_{ii} & Q^{(e+1)}_{ij} & R^{(1)}_{ip} \\ Q^{(e+1)}_{ij} & * & R^{(e+1)}_{jp} \\ R^{(1)}_{ip} & R^{(e+1)}_{jp} & T^{(1)}_{pp} \end{pmatrix}\right) \]

holds, where ∗ does not influence the value of this expression (see Supplementary Material A.1 for a more detailed discussion). Thus, the 'speed' of the e-th order parameters (i.e. (3) and (5)) depends only on the 1st and (e+1)-th order parameters, and the generalization error ε_g (equation (7)) depends only on the 1st order parameters. Therefore, denoting (Q^{(e)}, R^{(e)}, T^{(e)}) by Ω^{(e)} and (D, E, F) by χ, we can write

\[ \frac{d}{d\tilde\alpha}\Omega^{(e)} = f^{(e)}(\Omega^{(1)}, \Omega^{(e+1)}, \chi), \qquad \frac{d}{d\tilde\alpha}\chi = g(\Omega^{(1)}, \chi), \qquad \text{and} \qquad \varepsilon_g = h(\Omega^{(1)}, \chi) \]

with appropriate functions f^{(e)}, g and h. Additionally, the polynomial of Σ

\[ P(\Sigma) := \prod_{i=1}^{d} (\Sigma - \lambda'_i I_N) = \sum_{e=0}^{d} c_e \Sigma^e, \qquad \lambda'_1, \ldots, \lambda'_d: \text{distinct eigenvalues of } \Sigma, \]

equals 0 (here c_d = 1); thus we get

\[ \Omega^{(d)} = -\sum_{e=0}^{d-1} c_e\, \Omega^{(e)}. \qquad (8) \]

Using this relation, we can reduce Ω^{(d)} to expressions containing only Ω^{(0)}, ..., Ω^{(d−1)}; therefore we obtain a closed differential equation system in Ω^{(0)}, ..., Ω^{(d−1)} and χ.
In summary, our macroscopic system is closed with the following order parameters:

Order variables: Q^{(0)}_{ij}, Q^{(1)}_{ij}, ..., Q^{(d−1)}_{ij}, R^{(0)}_{in}, R^{(1)}_{in}, ..., R^{(d−1)}_{in}, D_{ij}, E_{in};
Order constants: T^{(0)}_{nm}, T^{(1)}_{nm}, ..., T^{(d−1)}_{nm}, F_{nm}
(d: number of distinct eigenvalues of Σ).

The order variables are governed by (3) and (5). For the lengthy full expressions of our macroscopic system for specific cases, see Supplementary Material A.2.

3.2 Dependency on input data covariance Σ

The differential equation system we derived depends on Σ in two ways: through the coefficient µ_{e+1} of the O(η²) term, and through how the d-th order parameters are expanded in terms of lower-order ones (as in (8)). Specifically, the system depends only on the eigenvalue distribution of Σ.

3.3 Evaluation of expectation terms for specific activation functions

The expectation terms I_2, I_3 and I_4 can be determined analytically for some activation functions g, including the sigmoid-like g(x) = erf(x/√2) (see Saad and Solla [1995]) and g(x) = ReLU(x) (see Yoshida et al. [2017]).

4 Analysis of numerical solutions of macroscopic differential equations

In this section, we numerically analyze the order parameter system derived in the previous section.¹ We assume that the second-layer weights of the student and the teacher, namely w_i and v_n, are fixed to 1 (i.e. we consider the learning of a soft-committee machine), and that K and M are equal to 2, for simplicity. Here we use the sigmoid-like activation g(x) = erf(x/√2).

4.1 Consistency between macroscopic system and microscopic system

First of all, we confirmed the consistency between the macroscopic system we derived and the original microscopic system. That is, we computed the dynamics of the generalization loss ε_g in two ways: (i) by iteratively updating the weights of the network with SGD (1), and (ii) by numerically solving the differential equations (3) and (5) which govern the order parameters; and we confirmed that they agree with each other very well (Figure 4). Note that we set the initial values of the order parameters in (ii) to the values corresponding to the initial weights used in (i). 
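The microscopic computation (i) can be sketched in a few lines of NumPy. The following is a toy reimplementation under stated assumptions (Σ = I_N, soft-committee second layers fixed to 1, and illustrative N, η and step count), not the authors' code:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)
N, K, M, eta, steps = 50, 2, 2, 0.5, 30000         # illustrative sizes; Sigma = I_N assumed

g = np.vectorize(lambda u: erf(u / np.sqrt(2.0)))          # sigmoid-like activation
dg = lambda u: np.sqrt(2.0 / np.pi) * np.exp(-u ** 2 / 2)  # its derivative

B = rng.normal(size=(M, N)) / np.sqrt(N)           # fixed teacher first layer
J = 0.01 * rng.normal(size=(K, N)) / np.sqrt(N)    # small random student init

def mc_generalization_loss(J, S=4000):
    xi = rng.normal(size=(S, N))                   # xi ~ N(0, I_N)
    s, t = g(xi @ J.T).sum(1), g(xi @ B.T).sum(1)  # soft committee: second-layer weights = 1
    return 0.5 * np.mean((s - t) ** 2)

loss_init = mc_generalization_loss(J)
for _ in range(steps):                             # online SGD: one fresh example per step
    xi = rng.normal(size=N)
    x, y = J @ xi, B @ xi
    delta = g(y).sum() - g(x).sum()                # t - s
    J += (eta / N) * delta * dg(x)[:, None] * xi   # update (1) with w_i = 1
loss_final = mc_generalization_loss(J)
print(loss_init, loss_final)
```

Recording the Monte Carlo loss during such a run typically reproduces the plateau-then-descent shape discussed in this paper.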
For the dependence of the learning trajectory on the initial condition, see Supplementary Material A.3.

¹ We executed all computations on a standard PC.

Figure 4: Example dynamics of the generalization error ε_g computed with (a) the microscopic and (b) the macroscopic system. Network size: N-2-1. Learning rate: η = 0.1. Eigenvalues of Σ: λ1 = 0.4 with multiplicity 0.5N, λ2 = 1.2 with multiplicity 0.3N, and λ3 = 1.6 with multiplicity 0.2N. Black lines: dynamics of ε_g. Blue lines: Q11, Q12, Q22. Green lines: R11, R12, R21, R22.

4.2 Case of scalar input covariance Σ = σI_N

As the simplest case, we here consider the case in which the covariance matrix Σ is proportional to the identity. In this case, Σ has only one eigenvalue λ = µ1 of multiplicity N, so our order parameter system contains only parameters of order e = 0. For various values of µ1, we numerically solved the differential equations of the order parameters and plotted the time evolution of the generalization loss ε_g (Figure 5(a)). From these plots, we quantified the lengths and heights of the plateaus as follows: we regarded the system as plateauing if the decreasing speed of the log-loss was smaller than half of its terminal converging speed, and we defined the height of the plateau as the median of the loss values during plateauing. The quantified lengths and heights are plotted in Figure 5(b)(c). They indicate that the plateau length and height depend heavily on µ1, the input scale. Specifically, as µ1 decreases, the plateau rapidly becomes longer and lower. Thus, although smaller input data lead to longer plateaus, the plateaus also become lower and hence inconspicuous. This tendency is consistent with Figure 2(a)(b), since the IRIS dataset has large µ1 (≈ 15.9) and MNIST has small µ1 (≈ 0.112). 
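Since the macroscopic system depends on the data only through the eigenvalue moments µ_d, these are cheap to estimate for any dataset. A sketch (the toy data matrix and its 0.3 scale are illustrative assumptions, standing in for a small-µ1 dataset such as MNIST):

```python
import numpy as np

rng = np.random.default_rng(1)
X = 0.3 * rng.normal(size=(2000, 50))    # toy data matrix (samples x features); scale is illustrative
Sigma = np.cov(X, rowvar=False)          # estimated input covariance
lam = np.linalg.eigvalsh(Sigma)          # its eigenvalue spectrum

mu = lambda d: np.mean(lam ** d)         # mu_d = (1/N) sum_i lambda_i^d
print(mu(1), mu(2))                      # small mu_1: long but low plateaus (Sec. 4.2)
```

Note that mu(1) equals trace(Sigma)/N, so the first moment needs no eigendecomposition at all.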
Considering this, the claim that the plateau phenomenon does not occur when learning MNIST is controversial; our result suggests the possibility that we are in fact observing quite long and low plateaus.
Note that Figure 5(b) shows that the plateau length grows faster than O(1/µ1) as µ1 decreases. This contrasts with the case of linear networks, which have no activation; in that case, as µ1 decreases, the speed of learning becomes exactly 1/µ1 times larger. In other words, this phenomenon is peculiar to nonlinear networks.

Figure 5: (a) Dynamics of the generalization error ε_g when the input covariance Σ has only one eigenvalue λ = µ1 of multiplicity N. Plots for various values of µ1 are shown. (b) Plateau length and (c) plateau height, quantified from (a).

4.3 Case of different input covariance Σ with fixed µ1

In the previous subsection we inspected the dependence of the learning dynamics on the first moment µ1 of the eigenvalues of the covariance matrix Σ. In this subsection, we explore the dependence of the dynamics on the higher moments of the eigenvalues, with the first moment µ1 fixed.
We consider the case in which the input covariance matrix Σ has two distinct nonzero eigenvalues, λ1 = µ1 − Δλ/2 and λ2 = µ1 + Δλ/2, of the same multiplicity N/2 (Figure 6). By changing the control parameter Δλ, we obtain eigenvalue distributions with various values of the second moment µ2 = ⟨λ_i²⟩.

Figure 6: Eigenvalue distribution with fixed µ1, parameterized by Δλ, which yields various µ2.

Figure 7(a) shows learning curves for various µ2, with µ1 fixed to 1. 
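The spectra of Figure 6 are easy to generate explicitly. A small sketch (the Δλ values are illustrative) confirming that µ1 stays fixed while µ2 = µ1² + Δλ²/4 grows with Δλ:

```python
import numpy as np

N, mu1 = 100, 1.0
for dlam in (0.0, 0.4, 0.8, 1.2):
    # two eigenvalues mu1 -/+ dlam/2, each with multiplicity N/2
    lam = np.r_[np.full(N // 2, mu1 - dlam / 2), np.full(N // 2, mu1 + dlam / 2)]
    m1, m2 = lam.mean(), np.mean(lam ** 2)
    print(dlam, m1, m2)      # m1 stays at 1.0 while m2 = 1 + dlam^2 / 4
```

Feeding such spectra into the macroscopic equations is what produces the family of curves in Figure 7(a).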
From these curves, we quantified the lengths and heights of the plateaus and plotted them in Figure 7(b)(c). They indicate that the plateau shortens as µ2 becomes large; that is, the broader the distribution of nonzero eigenvalues, the more the plateau is alleviated.

Figure 7: (a) Dynamics of the generalization error ε_g when the input covariance Σ has two eigenvalues λ1,2 = µ1 ± Δλ/2 of multiplicity N/2. Plots for various values of µ2 are shown. (b) Plateau length and (c) plateau height, quantified from (a).

5 Conclusion

Under the statistical mechanical formulation of learning in the two-layered perceptron, we showed that macroscopic equations can be derived even when the statistical properties of the input are generalized. We showed that the dynamics of learning depends only on the eigenvalue distribution of the covariance matrix of the input data. By numerically analyzing the macroscopic system, we showed that the statistics of the input data dramatically affect the plateau phenomenon.
Through this work, we explored the gap between theory and reality: although the plateau phenomenon is theoretically predicted to arise from the general symmetric structure of neural networks, it is seldom observed in practice. However, more extensive research is needed to fully understand the theory underlying the plateau phenomenon in practical cases.

Acknowledgement

This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (A) (No. 18H04106).

References

Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of learning in multilayer perceptrons near singularities. IEEE Transactions on Neural Networks, 19(8):1313–1328, 2008.

Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. 
Neural Networks, 13(3):317–327, 2000.

Sebastian Goldt, Madhu S Advani, Andrew M Saxe, Florent Krzakala, and Lenka Zdeborová. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. arXiv preprint arXiv:1906.08632, 2019.

Weili Guo, Yuan Yang, Yingjiang Zhou, Yushun Tan, Haikun Wei, Aiguo Song, and Guochen Pang. Influence area of overlap singularity in multilayer perceptrons. IEEE Access, 6:60214–60223, 2018.

Kazuyuki Hara, Daisuke Saitoh, and Hayaru Shouno. Analysis of dropout learning regarded as ensemble learning. In International Conference on Artificial Neural Networks, pages 72–79. Springer, 2016.

John Milnor. On the concept of attractor. In The Theory of Chaotic Attractors, pages 243–264. Springer, 1985.

A Emin Orhan and Xaq Pitkow. Skip connections eliminate singularities. arXiv preprint arXiv:1701.09175, 2017.

Hyeyoung Park, Shun-ichi Amari, and Kenji Fukumizu. Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13(7):755–764, 2000.

Peter Riegler and Michael Biehl. On-line backpropagation in two-layered neural networks. Journal of Physics A: Mathematical and General, 28(20):L507, 1995.

David Saad and Sara A Solla. On-line learning in soft committee machines. Physical Review E, 52(4):4225, 1995.

Michiel Straat and Michael Biehl. On-line learning dynamics of relu neural networks using statistical physics techniques. arXiv preprint arXiv:1903.07378, 2019.

Michiel Straat, Fthi Abadi, Christina Göpfert, Barbara Hammer, and Michael Biehl. Statistical mechanics of on-line learning under concept drift. Entropy, 20(10):775, 2018.

Shiro Takagi, Yuki Yoshida, and Masato Okada. Impact of layer normalization on single-layer perceptron—statistical mechanical analysis. 
Journal of the Physical Society of Japan, 88(7):\n074003, 2019.\n\nHaikun Wei, Jun Zhang, Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of\n\nlearning near singularities in layered networks. Neural computation, 20(3):813\u2013843, 2008.\n\nYuki Yoshida, Ryo Karakida, Masato Okada, and Shun-ichi Amari. Statistical mechanical analysis\nof online learning with weight normalization in single layer perceptron. Journal of the Physical\nSociety of Japan, 86(4):044002, 2017.\n\nYuki Yoshida, Ryo Karakida, Masato Okada, and Shun-ichi Amari. Statistical mechanical analysis\nof learning dynamics of two-layer perceptron with multiple output units. Journal of Physics A:\nMathematical and Theoretical, 2019.\n\n9\n\n\f", "award": [], "sourceid": 981, "authors": [{"given_name": "Yuki", "family_name": "Yoshida", "institution": "The University of Tokyo"}, {"given_name": "Masato", "family_name": "Okada", "institution": "The University of Tokyo"}]}