{"title": "Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice", "book": "Advances in Neural Information Processing Systems", "page_first": 4785, "page_last": 4795, "abstract": "It is well known that weight initialization in deep networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is O(1) is essential for avoiding exponentially vanishing or exploding gradients. Moreover, in deep linear networks, ensuring that all singular values of the Jacobian are concentrated near 1 can yield a dramatic additional speed-up in learning; this is a property known as dynamical isometry. However, it is unclear how to achieve dynamical isometry in nonlinear deep networks. We address this question by employing powerful tools from free probability theory to analytically compute the {\\it entire} singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Indeed, we show that properly-initialized deep sigmoidal networks consistently outperform deep ReLU networks. Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.", "full_text": "Resurrecting the sigmoid in deep learning through\n\ndynamical isometry: theory and practice\n\nJeffrey Pennington\n\nGoogle Brain\n\nSamuel S. 
Schoenholz\n\nGoogle Brain\n\nApplied Physics, Stanford University and Google Brain\n\nSurya Ganguli\n\nAbstract\n\nIt is well known that weight initialization in deep networks can have a dramatic\nimpact on learning speed. For example, ensuring the mean squared singular value\nof a network\u2019s input-output Jacobian is O(1) is essential for avoiding exponentially\nvanishing or exploding gradients. Moreover, in deep linear networks, ensuring that\nall singular values of the Jacobian are concentrated near 1 can yield a dramatic\nadditional speed-up in learning; this is a property known as dynamical isometry.\nHowever, it is unclear how to achieve dynamical isometry in nonlinear deep net-\nworks. We address this question by employing powerful tools from free probability\ntheory to analytically compute the entire singular value distribution of a deep\nnetwork\u2019s input-output Jacobian. We explore the dependence of the singular value\ndistribution on the depth of the network, the weight initialization, and the choice of\nnonlinearity. Intriguingly, we \ufb01nd that ReLU networks are incapable of dynamical\nisometry. On the other hand, sigmoidal networks can achieve isometry, but only\nwith orthogonal weight initialization. Moreover, we demonstrate empirically that\ndeep nonlinear networks achieving dynamical isometry learn orders of magnitude\nfaster than networks that do not. Indeed, we show that properly-initialized deep\nsigmoidal networks consistently outperform deep ReLU networks. Overall, our\nanalysis reveals that controlling the entire distribution of Jacobian singular values\nis an important design consideration in deep learning.\n\n1\n\nIntroduction\n\nDeep learning has achieved state-of-the-art performance in many domains, including computer\nvision [1], machine translation [2], human games [3], education [4], and neurobiological modelling [5,\n6]. 
A major determinant of success in training deep networks lies in appropriately choosing the initial weights. Indeed, the very genesis of deep learning rested upon the initial observation that unsupervised pre-training provides a good set of initial weights for subsequent fine-tuning through backpropagation [7]. Moreover, seminal work in deep learning suggested that appropriately-scaled Gaussian weights can prevent gradients from exploding or vanishing exponentially [8], a condition that has been found to be necessary to achieve reasonable learning speeds [9].
These random weight initializations were primarily driven by the principle that the mean squared singular value of a deep network's Jacobian from input to output should remain close to 1. This condition implies that on average, a randomly chosen error vector will preserve its norm under backpropagation; however, it provides no guarantees on the worst case growth or shrinkage of an error vector. A stronger requirement one might demand is that every Jacobian singular value remain close to 1. Under this stronger requirement, every single error vector will approximately preserve its norm, and moreover all angles between different error vectors will be preserved. Since error information backpropagates faithfully and isometrically through the network, this stronger requirement is called dynamical isometry [10].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

A theoretical analysis of exact solutions to the nonlinear dynamics of learning in deep linear networks [10] revealed that weight initializations satisfying dynamical isometry yield a dramatic increase in learning speed compared to initializations that do not. For such linear networks, orthogonal weight initializations achieve dynamical isometry, and, remarkably, their learning time, measured in number of learning epochs, becomes independent of depth.
In contrast, random Gaussian initializations do not achieve dynamical isometry, nor do they achieve depth-independent training times.
It remains unclear, however, how these results carry over to deep nonlinear networks. Indeed, empirically, a simple change from Gaussian to orthogonal initializations in nonlinear networks has yielded mixed results [11], raising important theoretical and practical questions. First, how does the entire distribution of singular values of a deep network's input-output Jacobian depend upon the depth, the statistics of random initial weights, and the shape of the nonlinearity? Second, what combinations of these ingredients can achieve dynamical isometry? And third, among the nonlinear networks that have neither vanishing nor exploding gradients, do those that in addition achieve dynamical isometry also achieve much faster learning compared to those that do not? Here we answer these three questions, and we provide a detailed summary of our results in the discussion.

2 Theoretical Results

In this section we derive expressions for the entire singular value density of the input-output Jacobian for a variety of nonlinear networks in the large-width limit. We compute the mean squared singular value of J (or, equivalently, the mean eigenvalue of JJ^T), and deduce a rescaling that sets it equal to 1. We then examine two metrics that help quantify the conditioning of the Jacobian: s_max, the maximum singular value of J (or, equivalently, λ_max, the maximum eigenvalue of JJ^T); and σ²_{JJ^T}, the variance of the eigenvalue distribution of JJ^T. If λ_max ≫ 1 and σ²_{JJ^T} ≫ 1 then the Jacobian is ill-conditioned and we expect the learning dynamics to be slow.

2.1 Problem setup

Consider an L-layer feed-forward neural network of width N with synaptic weight matrices W^l ∈ R^{N×N}, bias vectors b^l, pre-activations h^l and post-activations x^l, with l = 1, . . . , L.
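As a concrete illustration of these diagnostics, the network and its input-output Jacobian (eqns. (1) and (2) below) can be simulated directly. The following sketch is ours, not the authors' code: NumPy, with illustrative width and depth, and it verifies the linear orthogonal baseline in which every singular value of J is exactly 1.

```python
import numpy as np

def random_orthogonal(n, sigma_w, rng):
    # Haar-orthogonal matrix via QR of a Gaussian matrix (sign-corrected),
    # scaled so that W^T W = sigma_w^2 I.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return sigma_w * (q * np.sign(np.diag(r)))

def jacobian_singular_values(weights, biases, x0, phi, dphi):
    # Forward pass h^l = W^l x^{l-1} + b^l, x^l = phi(h^l),
    # accumulating J = prod_l D^l W^l with D^l_ii = phi'(h^l_i).
    J = np.eye(x0.shape[0])
    x = x0
    for W, b in zip(weights, biases):
        h = W @ x + b
        J = np.diag(dphi(h)) @ W @ J
        x = phi(h)
    return np.linalg.svd(J, compute_uv=False)

rng = np.random.default_rng(0)
N, L = 128, 16  # illustrative width and depth
Ws = [random_orthogonal(N, 1.0, rng) for _ in range(L)]
bs = [np.zeros(N)] * L

# Linear orthogonal network: J is itself orthogonal, so s_i = 1 for all i.
svals = jacobian_singular_values(Ws, bs, rng.standard_normal(N),
                                 phi=lambda h: h, dphi=np.ones_like)
print(svals.min(), svals.max())  # both 1.0 up to rounding

# Hard-tanh with the same weights: the spectrum spreads out.
svals_ht = jacobian_singular_values(Ws, bs, rng.standard_normal(N),
                                    phi=lambda h: np.clip(h, -1, 1),
                                    dphi=lambda h: (np.abs(h) < 1).astype(float))
print(svals_ht.min(), svals_ht.max())
```

Swapping in i.i.d. Gaussian weights of variance σ_w²/N reproduces the Gaussian ensemble analyzed below.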
The feed-forward dynamics of the network are governed by,

x^l = φ(h^l) ,    h^l = W^l x^{l-1} + b^l ,    (1)

where φ : R → R is a pointwise nonlinearity and the input is h^0 ∈ R^N. Now consider the input-output Jacobian J ∈ R^{N×N} given by

J = ∂x^L/∂h^0 = ∏_{l=1}^{L} D^l W^l .    (2)

Here D^l is a diagonal matrix with entries D^l_{ij} = φ'(h^l_i) δ_{ij}. The input-output Jacobian J is closely related to the backpropagation operator mapping output errors to weight matrices at a given layer; if the former is well-conditioned, then the latter tends to be well-conditioned for all weight layers. We therefore wish to understand the entire singular value spectrum of J for deep networks with randomly initialized weights and biases.
In particular, we will take the biases b^l_i to be drawn i.i.d. from a zero-mean Gaussian with standard deviation σ_b. For the weights, we will consider two random matrix ensembles: (1) random Gaussian weights in which each W^l_{ij} is drawn i.i.d. from a Gaussian with variance σ_w²/N, and (2) random orthogonal weights, drawn from a uniform distribution over scaled orthogonal matrices obeying (W^l)^T W^l = σ_w² I.

2.2 Review of signal propagation

The random matrices D^l in eqn. (2) depend on the empirical distribution of pre-activations h^l entering the nonlinearity in eqn. (1). The propagation of this empirical distribution through different layers l was studied in [12]. There, it was shown that in the large-N limit this empirical distribution converges to a Gaussian with zero mean and variance q^l, where q^l obeys a recursion relation induced by the dynamics in eqn. (1),

q^l = σ_w² ∫ Dh φ(√(q^{l-1}) h)² + σ_b² ,    (3)

with initial condition q^0 = (1/N) Σ_{i=1}^{N} (h^0_i)², and where Dh = (dh/√(2π)) exp(−h²/2) denotes the standard Gaussian measure. This recursion has a fixed point obeying,

q* = σ_w² ∫ Dh φ(√(q*) h)² + σ_b² .    (4)

If the input h^0 is chosen so that q^0 = q*, then we start at the fixed point, and the distribution of D^l becomes independent of l. Also, if we do not start at the fixed point, in many scenarios we rapidly approach it in a few layers (see [12]), so for large L, assuming q^l = q* at all depths l is a good approximation in computing the spectrum of J.
Another important quantity governing signal propagation through deep networks [12, 13] is

χ = (1/N) ⟨Tr (DW)^T DW⟩ = σ_w² ∫ Dh [φ'(√(q*) h)]² ,    (5)

where φ' is the derivative of φ. Here χ is the mean of the distribution of squared singular values of the matrix DW, when the pre-activations are at their fixed-point distribution with variance q*. As shown in [12, 13] and Fig. 1, χ(σ_w, σ_b) separates the (σ_w, σ_b) plane into two phases, chaotic and ordered, in which gradients exponentially explode or vanish, respectively. Indeed, the mean squared singular value of J was shown simply to be χ^L in [12, 13], so χ = 1 is a critical line of initializations with neither vanishing nor exploding gradients.

Figure 1: Order-chaos transition when φ(h) = tanh(h). The critical line χ(σ_w, σ_b) = 1 determines the boundary between two phases [12, 13]: (a) a chaotic phase when χ > 1, where forward signal propagation expands and folds space in a chaotic manner and back-propagated gradients exponentially explode, and (b) an ordered phase when χ < 1, where forward signal propagation contracts space in an ordered manner and back-propagated gradients exponentially vanish.
The value of q* along the critical line separating the two phases is shown as a heatmap (ranging from q* = 0 to q* = 1.5).

2.3 Free probability, random matrix theory and deep networks

While the previous section revealed that the mean squared singular value of J is χ^L, we would like to obtain more detailed information about the entire singular value distribution of J, especially when χ = 1. Since eqn. (2) consists of a product of random matrices, free probability [14, 15, 16] becomes relevant to deep learning as a powerful tool to compute the spectrum of J, as we now review.
In general, given a random matrix X, its limiting spectral density is defined as

ρ_X(λ) ≡ ⟨ (1/N) Σ_{i=1}^{N} δ(λ − λ_i) ⟩_X ,    (6)

where ⟨·⟩_X denotes the mean with respect to the distribution of the random matrix X. Also,

G_X(z) ≡ ∫_R ρ_X(t)/(z − t) dt ,    z ∈ C \ R ,    (7)

is the definition of the Stieltjes transform of ρ_X, which can be inverted using,

ρ_X(λ) = −(1/π) lim_{ε→0⁺} Im G_X(λ + iε) .    (8)

Figure 2: Examples of deep spectra at criticality for different nonlinearities at different depths (L = 2, 8, 32, 128 in panels (a)-(d)). Excellent agreement is observed between empirical simulations of networks of width 1000 (dashed lines) and theoretical predictions (solid lines). ReLU and hard tanh are with orthogonal weights, and linear is with Gaussian weights. Gaussian linear and orthogonal ReLU have similarly-shaped distributions, especially for large depths, where poor conditioning and many large singular values are observed.
On the other hand, orthogonal hard tanh is much better conditioned.

The Stieltjes transform G_X is related to the moment generating function M_X,

M_X(z) ≡ z G_X(z) − 1 = Σ_{k=1}^{∞} m_k / z^k ,    (9)

where m_k is the kth moment of the distribution ρ_X, m_k = ∫ dλ ρ_X(λ) λ^k = (1/N) ⟨tr X^k⟩_X. In turn, we denote the functional inverse of M_X by M_X^{-1}, which by definition satisfies M_X(M_X^{-1}(z)) = M_X^{-1}(M_X(z)) = z. Finally, the S-transform [14, 15] is defined as,

S_X(z) = (1 + z) / (z M_X^{-1}(z)) .    (10)

The utility of the S-transform arises from its behavior under multiplication. Specifically, if A and B are two freely-independent random matrices, then the S-transform of the product random matrix ensemble AB is simply the product of their S-transforms,

S_{AB}(z) = S_A(z) S_B(z) .    (11)

Our first main result will be to use eqn. (11) to write down an implicit definition of the spectral density of JJ^T. To do this we first note that (see Result 1 of the supplementary material),

S_{JJ^T} = ∏_{l=1}^{L} S_{W_l W_l^T} S_{D_l^2} = S_{WW^T}^L S_{D^2}^L ,    (12)

where we have used the identical distribution of the weights to define S_{WW^T} = S_{W_l W_l^T} for all l, and we have also used the fact that the pre-activations are distributed independently of depth as h^l ~ N(0, q*), which implies that S_{D_l^2} = S_{D^2} for all l.
Eqn. (12) provides a method to compute the spectrum ρ_{JJ^T}(λ). Starting from ρ_{W^T W}(λ) and ρ_{D^2}(λ), we compute their respective S-transforms through the sequence of equations eqns. (7), (9), and (10), take the product in eqn. (12), and then reverse the sequence of steps to go from S_{JJ^T} to ρ_{JJ^T}(λ) through the inverses of eqns. (10), (9), and (8). Thus we must calculate the S-transforms of WW^T and D^2, which we attack next for specific nonlinearities and weight ensembles in the following sections.
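To make this pipeline concrete, it can be checked on a case where every step is known in closed form: for a square Gaussian W with σ_w = 1, WW^T follows the Marchenko-Pastur law, whose kth moment is the kth Catalan number and whose S-transform is 1/(1+z) (the σ_w = 1 case of S_{WW^T} used below). The following numerical sketch is our illustration, not part of the paper:

```python
import numpy as np
from math import comb

# For square Gaussian W (sigma_w = 1), rho_{WW^T} is Marchenko-Pastur:
# its S-transform is S(z) = 1/(1+z), so by eqn. (10), M^{-1}(z) = (1+z)^2/z.
def m_inverse(m):
    return (1.0 + m) ** 2 / m

def M(z):
    # Invert z = (1+m)^2/m, i.e. m^2 + (2-z)m + 1 = 0, taking the branch
    # with M(z) ~ 1/z as z -> infinity (eqn. (9)).
    return ((z - 2.0) - np.sqrt((z - 2.0) ** 2 - 4.0)) / 2.0

def catalan(k):
    # kth moment of the Marchenko-Pastur law: the kth Catalan number
    return comb(2 * k, k) // (k + 1)

z = 10.0  # outside the support [0, 4], where the moment series converges
moment_series = sum(catalan(k) / z**k for k in range(1, 60))
print(M(z), moment_series)  # the two agree to high precision

# Consistency of eqn. (10): S(z) = (1+z)/(z M^{-1}(z)) recovers 1/(1+z).
S = (1.0 + 0.3) / (0.3 * m_inverse(0.3))
print(S, 1.0 / (1.0 + 0.3))
```

The same moment-series check can be repeated for any of the S-transforms derived below.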
In principle, this procedure can be carried out numerically for an arbitrary choice of nonlinearity, but we postpone this investigation to future work.

2.4 Linear networks

As a warm-up, we first consider a linear network in which J = ∏_{l=1}^{L} W_l. Since criticality (χ = 1 in eqn. (5)) implies σ_w² = 1, and eqn. (4) reduces to q* = σ_w² q* + σ_b², the only critical point is (σ_w, σ_b) = (1, 0). The case of orthogonal weights is simple: J is also orthogonal, and all its singular values are 1, thereby achieving perfect dynamical isometry. Gaussian weights behave very differently. The squared singular values s_i² of J equal the eigenvalues λ_i of JJ^T, which is a product Wishart matrix, whose spectral density was recently computed in [17]. The resulting singular value density of J is given by,

ρ(s(θ)) = (2/(π s)) sin³(θ) sin^{L−2}(Lθ) / sin^{L−1}((L+1)θ) ,    s(θ) = √( sin^{L+1}((L+1)θ) / (sin θ sin^L(Lθ)) ) ,    (13)

where θ ∈ (0, π/(L+1)) parametrizes the density. Fig. 2(a) demonstrates a match between this theoretical density and the empirical density obtained from numerical simulations of random linear networks. As the depth increases, this density becomes highly anisotropic, both concentrating about zero and developing an extended tail.
Note that θ = π/(L+1) corresponds to the minimum singular value s_min = 0, while θ = 0 corresponds to the maximum eigenvalue, λ_max = s_max² = L^{−L}(L+1)^{L+1}, which, for large L, scales as λ_max ~ eL. Both eqn. (13) and the methods of Section 2.5 yield the variance of the eigenvalue distribution of JJ^T to be σ²_{JJ^T} = L. Thus for linear Gaussian networks, both s_max and σ²_{JJ^T} grow linearly with depth, signalling poor conditioning and the breakdown of dynamical isometry.

2.5 ReLU and hard-tanh networks

We first discuss the criticality conditions (finite q* in eqn. (4) and χ = 1 in eqn. (5)) in these two nonlinear networks.
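These conditions hinge on the fixed point q* of eqn. (4), which for any nonlinearity can be obtained numerically by iterating the variance map of eqn. (3). A sketch (ours, using Gauss-Hermite quadrature and illustrative values of σ_w² and σ_b²):

```python
import numpy as np

# Probabilists' Gauss-Hermite nodes for integrals against the Gaussian Dh:
# integral Dh f(h) ~= (1/sqrt(2*pi)) * sum_i w_i f(x_i), weight exp(-x^2/2).
nodes, weights = np.polynomial.hermite_e.hermegauss(101)
weights = weights / np.sqrt(2.0 * np.pi)

def variance_map(q, phi, sw2, sb2):
    # One step of the length-map recursion, eqn. (3)
    return sw2 * np.sum(weights * phi(np.sqrt(q) * nodes) ** 2) + sb2

def fixed_point(phi, sw2, sb2, q0=1.0, iters=500):
    q = q0
    for _ in range(iters):
        q = variance_map(q, phi, sw2, sb2)
    return q

hard_tanh = lambda h: np.clip(h, -1.0, 1.0)
q_star = fixed_point(hard_tanh, sw2=1.25, sb2=0.01)  # illustrative parameters
residual = abs(variance_map(q_star, hard_tanh, 1.25, 0.01) - q_star)
print(q_star, residual)  # residual ~ 0: the fixed-point condition of eqn. (4) holds
```

For hard-tanh the iteration converges for any σ_w, since the right-hand side of eqn. (3) is bounded by σ_w² + σ_b².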
For both networks, since the slope of the nonlinearity φ'(h) only takes the values 0 and 1, χ in eqn. (5) reduces to χ = σ_w² p(q*), where p(q*) is the probability that a given neuron is in the linear regime with φ'(h) = 1. As discussed above, we take the large-width limit in which the distribution of the pre-activations h is a zero-mean Gaussian with variance q*. We therefore find that for ReLU, p(q*) = 1/2 is independent of q*, whereas for hard-tanh, p(q*) = ∫_{−1}^{1} dh e^{−h²/2q*}/√(2πq*) = erf(1/√(2q*)) depends on q*. In particular, it approaches 1 as q* → 0.
Thus for ReLU, χ = 1 if and only if σ_w² = 2, in which case eqn. (4) reduces to q* = (1/2) σ_w² q* + σ_b², implying that the only critical point is (σ_w, σ_b) = (√2, 0). For hard-tanh, in contrast, χ = σ_w² p(q*), where p(q*) itself depends on σ_w and σ_b through eqn. (4), and so the criticality condition χ = 1 yields a curve in the (σ_w, σ_b) plane similar to that shown for the tanh network in Fig. 1. As one moves along this curve in the direction of decreasing σ_w, the curve approaches the point (σ_w, σ_b) = (1, 0) with q* monotonically decreasing towards 0, i.e. q* → 0 as σ_w → 1.
The critical ReLU network and the one-parameter family of critical hard-tanh networks have neither vanishing nor exploding gradients, due to χ = 1. Nevertheless, the entire singular value spectrum of J of these networks can behave very differently. From eqn. (12), this spectrum depends on the nonlinearity φ(h) through S_{D²} in eqn. (10), which in turn only depends on the distribution of eigenvalues of D², or equivalently, the distribution of squared derivatives φ'(h)². As we have seen, this distribution is a Bernoulli distribution with parameter p(q*): ρ_{D²}(z) = (1 − p(q*)) δ(z) + p(q*) δ(z − 1). Inserting this distribution into the sequence eqn. (7), eqn. (9), eqn. (10) then yields

G_{D²}(z) = (1 − p(q*))/z + p(q*)/(z − 1) ,    M_{D²}(z) = p(q*)/(z − 1) ,    S_{D²}(z) = (z + 1)/(z + p(q*)) .    (14)

To complete the calculation of S_{JJ^T} in eqn. (12), we must also compute S_{WW^T}.
We do this for Gaussian and orthogonal weights in the next two subsections.

2.5.1 Gaussian weights

We re-derive the well-known expression for the S-transform of products of random Gaussian matrices with variance σ_w² in Example 3 of the supplementary material. The result is S_{WW^T} = σ_w^{−2}(1 + z)^{−1}, which, when combined with eqn. (14) for S_{D²}, eqn. (12) for S_{JJ^T}, and eqn. (10) for M_X^{−1}(z), yields

S_{JJ^T}(z) = σ_w^{−2L} (z + p(q*))^{−L} ,    M_{JJ^T}^{−1}(z) = ((z + 1)/z) (z + p(q*))^L σ_w^{2L} .    (15)

Using eqn. (15) and eqn. (9), we can define a polynomial that the Stieltjes transform G satisfies,

σ_w^{2L} G (Gz + p(q*) − 1)^L − (Gz − 1) = 0 .    (16)

The correct root of this equation is the one for which G ~ 1/z as z → ∞ [16]. From eqn. (8), the spectral density is obtained from the imaginary part of G(λ + iε) as ε → 0⁺.

Figure 3: The max singular value s_max of J versus L and q* for Gaussian (a,c) and orthogonal (b,d) weights, with ReLU (dashed) and hard-tanh (solid) networks (panels span L = 1 to 1024 and q* = 1/64 to 64). For Gaussian weights and for both ReLU and hard-tanh, s_max grows with L for all q* (see a,c) as predicted in eqn. (17). In contrast, for orthogonal hard-tanh, but not orthogonal ReLU, at small enough q*, s_max can remain O(1) even at large L (see b,d) as predicted in eqn. (22). In essence, at fixed small q*, if p(q*) is the large fraction of neurons in the linear regime, s_max only grows with L after L > p/(1 − p) (see d). As q* → 0, p(q*) → 1 and the hard-tanh networks look linear. Thus the lowest curve in (a) corresponds to the prediction of linear Gaussian networks in eqn.
(13), while the lowest curve in (b) is simply 1, corresponding to linear orthogonal networks.

The positions of the spectral edges, namely locations of the minimum and maximum eigenvalues of JJ^T, can be deduced from the values of z for which the imaginary part of the root of eqn. (16) vanishes, i.e. when the discriminant of the polynomial in eqn. (16) vanishes. After a detailed but unenlightening calculation, we find, for large L,

λ_max = s_max² = [σ_w² p(q*)]^L ( (e/p(q*)) L + O(1) ) .    (17)

Recalling that χ = σ_w² p(q*), we find exponential growth in λ_max if χ > 1 and exponential decay if χ < 1. Moreover, even at criticality when χ = 1, λ_max still grows linearly with depth.
Next, we obtain the variance σ²_{JJ^T} of the eigenvalue density of JJ^T by computing its first two moments m₁ and m₂. We employ the Lagrange inversion theorem [18],

M_{JJ^T}(z) = m₁/z + m₂/z² + ··· ,    M_{JJ^T}^{−1}(z) = m₁/z + m₂/m₁ + ··· ,    (18)

which relates the expansions of the moment generating function M_{JJ^T}(z) and its functional inverse M_{JJ^T}^{−1}(z). Substituting this expansion for M_{JJ^T}^{−1}(z) into eqn. (15), expanding the right hand side, and equating the coefficients of z, we find,

m₁ = (σ_w² p(q*))^L ,    m₂ = (σ_w² p(q*))^{2L} (L + p(q*))/p(q*) .    (19)

Both moments generically either exponentially grow or vanish.
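These moments can be checked numerically without any random sampling, by inverting M^{-1}_{JJ^T} from eqn. (15) at large z and reading off the leading coefficients of eqn. (18). The sketch below is our illustration, at the critical ReLU-like point σ_w² = 2, p(q*) = 1/2, L = 3, where eqn. (19) predicts m₁ = 1 and m₂ = (L + p)/p = 7:

```python
L, p, sw2 = 3, 0.5, 2.0  # critical ReLU-like parameters: chi = sw2 * p = 1

def m_inverse(m):
    # eqn. (15): M^{-1}(m) = ((1+m)/m) * sw2^L * (m + p)^L
    return (1.0 + m) / m * sw2**L * (m + p) ** L

def M(z, lo=1e-15, hi=1.0):
    # Solve z = m_inverse(m) by bisection on the branch with M(z) ~ m1/z;
    # m_inverse decreases from +infinity near m = 0, so the bracket is valid.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if m_inverse(mid) > z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Fit m1, m2 from M(z) ~ m1/z + m2/z^2 at two large values of z.
z1, z2 = 1e5, 2e5
y1, y2 = M(z1) * z1, M(z2) * z2        # = m1 + m2/z + O(1/z^2)
m2 = (y1 - y2) / (1.0 / z1 - 1.0 / z2)
m1 = y1 - m2 / z1
print(m1, m2)  # compare with the predictions m1 = 1, m2 = 7 of eqn. (19)
```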
However even at criticality, when χ = σ_w² p(q*) = 1, the variance σ²_{JJ^T} = m₂ − m₁² = L/p(q*) still exhibits linear growth with depth.
Note that p(q*) is the fraction of neurons operating in the linear regime, which is always less than 1. Thus for both ReLU and hard-tanh networks, no choice of Gaussian initialization can ever prevent this linear growth, both in σ²_{JJ^T} and λ_max, implying that even critical Gaussian initializations will always lead to a failure of dynamical isometry at large depth for these networks.

2.5.2 Orthogonal weights

For orthogonal W, we have WW^T = I, and the S-transform is S_I = 1 (see Example 2 of the supplementary material). After scaling by σ_w, we have S_{WW^T} = S_{σ_w² I} = σ_w^{−2} S_I = σ_w^{−2}. Combining this with eqn. (14) and eqn. (12) yields S_{JJ^T}(z) and, through eqn. (10), yields M_{JJ^T}^{−1}:

S_{JJ^T}(z) = σ_w^{−2L} ((z + 1)/(z + p(q*)))^L ,    M_{JJ^T}^{−1}(z) = ((z + 1)/z) ((z + p(q*))/(z + 1))^L σ_w^{2L} .    (20)

Now, combining eqn. (20) and eqn. (9), we obtain a polynomial that the Stieltjes transform G satisfies:

σ_w^{2L} G (Gz + p(q*) − 1)^L − (zG)^L (Gz − 1) = 0 .    (21)

Figure 4: Learning dynamics, measured by generalization performance on a test set, for networks of depth 200 and width 400 trained on CIFAR10 with different optimizers (SGD, Momentum, ADAM, RMSProp in panels (a)-(d)). Blue is tanh with σ_w² = 1.05, red is tanh with σ_w² = 2, and black is ReLU with σ_w² = 2. Solid lines are orthogonal and dashed lines are Gaussian initialization. The relative ordering of curves robustly persists across optimizers, and is strongly correlated with the degree to which dynamical isometry is present at initialization, as measured by s_max in Fig. 3. Networks with s_max closer to 1 learn faster, even though all networks are initialized critically with χ = 1.
The most isometric orthogonal tanh with small σ_w² trains several orders of magnitude faster than the least isometric ReLU network.

From this we can extract the eigenvalue and singular value density of JJ^T and J, respectively, through eqn. (8). Figs. 2(b) and 2(c) demonstrate an excellent match between our theoretical predictions and numerical simulations of random networks. We find that at modest depths, the singular values are peaked near their maximum, but at larger depths, the distribution both accumulates mass at 0 and spreads out, developing a growing tail. Thus at fixed critical values of σ_w and σ_b, both deep ReLU and hard-tanh networks have ill-conditioned Jacobians, even with orthogonal weight matrices.
As above, we can obtain the maximum eigenvalue of JJ^T by determining the values of z for which the discriminant of the polynomial in eqn. (21) vanishes. This calculation yields,

λ_max = s_max² = [σ_w² p(q*)]^L ((1 − p(q*))/p(q*)) L^L/(L − 1)^{L−1} .    (22)

For large L, λ_max either exponentially explodes or decays, except at criticality when χ = σ_w² p(q*) = 1, where it behaves as λ_max = ((1 − p(q*))/p(q*)) (eL − e/2 + O(L^{−1})). Also, as above, we can compute the variance σ²_{JJ^T} by expanding M_{JJ^T}^{−1} in eqn. (20) and applying eqn. (18). At criticality, we find σ²_{JJ^T} = ((1 − p(q*))/p(q*)) L for large L. Now the large-L asymptotic behavior of both λ_max and σ²_{JJ^T} depends crucially on p(q*), the fraction of neurons in the linear regime.
For ReLU networks, p(q*) = 1/2, and we see that λ_max and σ²_{JJ^T} grow linearly with depth and dynamical isometry is unachievable in ReLU networks, even for critical orthogonal weights. In contrast, for hard tanh networks, p(q*) = erf(1/√(2q*)).
Therefore, one can always move along the critical line in the (σ_w, σ_b) plane towards the point (1, 0), thereby reducing q*, increasing p(q*), and decreasing, to an arbitrarily small value, the prefactor (1 − p(q*))/p(q*) controlling the linear growth of both λ_max and σ²_{JJ^T}. So unlike either ReLU networks, or Gaussian networks, one can achieve dynamical isometry up to depth L by choosing q* small enough so that p(q*) ≈ 1 − 1/L. In essence, this strategy increases the fraction of neurons operating in the linear regime, enabling orthogonal hard-tanh nets to mimic the successful dynamical isometry achieved by orthogonal linear nets. However, this strategy is unavailable for orthogonal ReLU networks. A demonstration of these results is shown in Fig. 3.

3 Experiments

Having established a theory of the entire singular value distribution of J, and in particular of when dynamical isometry is present or not, we now provide empirical evidence that the presence or absence of this isometry can have a large impact on training speed. In our first experiment, summarized in Fig. 4, we compare three different classes of critical neural networks: (1) tanh with small σ_w² = 1.05 and σ_b² = 2.01 × 10⁻⁵; (2) tanh with large σ_w² = 2 and σ_b² = 0.104; and (3) ReLU with σ_w² = 2 and σ_b² = 2.01 × 10⁻⁵. In each case σ_b is chosen appropriately to achieve critical initial conditions at the

Figure 5: Empirical measurements of SGD training time τ, defined as number of steps to reach p ≈ 0.25 accuracy, for orthogonal tanh networks (panels vary L from 10 to 300 and q* from 1/64 to 64). In (a), curves reflect different depths L at fixed small q* = 0.025. Intriguingly, they all collapse onto a single universal curve when the learning rate η is rescaled by L and τ is rescaled by 1/√L.
This implies the optimal learning rate is O(1/L), and remarkably, the optimal learning time τ grows only as O(√L). (b) Now different curves reflect different q* at fixed L = 200, revealing that smaller q*, associated with increased dynamical isometry in J, enables faster training times by allowing a larger optimal learning rate η. (c) τ as a function of L for a few values of q*. (d) τ as a function of q* for a few values of L. We see qualitative agreement of (c,d) with Fig. 3(b,d), suggesting a strong connection between τ and s_max.

boundary between order and chaos [12, 13], with χ = 1. All three of these networks have a mean squared singular value of 1, with neither vanishing nor exploding gradients in the infinite-width limit. These experiments therefore probe the specific effect of dynamical isometry, or the entire shape of the spectrum of J, on learning. We also explore the degree to which more sophisticated optimizers can overcome poor initializations. We compare SGD, Momentum, RMSProp [19], and ADAM [20]. We train networks of depth L = 200 and width N = 400 for 10⁵ steps with a batch size of 10³. We additionally average our results over 30 different instantiations of the network to reduce noise. For each nonlinearity, initialization, and optimizer, we obtain the optimal learning rate through grid search. For SGD and SGD+Momentum we consider logarithmically spaced rates between [10⁻⁴, 10⁻¹] in steps of 10^{0.1}; for ADAM and RMSProp we explore the range [10⁻⁷, 10⁻⁴] at the same step size. To choose the optimal learning rate we select a threshold accuracy p and measure the first step when performance exceeds p. Our qualitative conclusions are fairly independent of p.
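For completeness, the scaled orthogonal ensemble used in these experiments can be sampled by orthogonalizing a Gaussian matrix (a standard recipe; the paper does not specify its exact routine), and the depth-dependent prescription p(q*) = 1 − 1/L of Section 2.5.2 fixes q* and the critical σ_w. A sketch of ours:

```python
import math
import numpy as np

def scaled_orthogonal(n, sigma_w, rng):
    # Haar-orthogonal via QR of a Gaussian matrix, sign-corrected, then scaled
    # so that W^T W = sigma_w^2 I.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return sigma_w * (q * np.sign(np.diag(r)))

def q_star_for_depth(L):
    # Solve erf(1/sqrt(2 q*)) = 1 - 1/L for q* by bisection
    # (p(q*) is monotone decreasing in q*).
    target = 1.0 - 1.0 / L
    lo, hi = 1e-8, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.erf(1.0 / math.sqrt(2.0 * mid)) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
L, N = 200, 400
q_star = q_star_for_depth(L)
sigma_w = 1.0 / math.sqrt(math.erf(1.0 / math.sqrt(2.0 * q_star)))  # chi = 1
W = scaled_orthogonal(N, sigma_w, rng)
print(q_star, sigma_w)
```

The resulting σ_w is only slightly above 1, consistent with the critical hard-tanh line approaching (σ_w, σ_b) = (1, 0) as q* → 0.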
Here we report results on a version of CIFAR10¹.
Based on our theory, we expect the performance advantage of orthogonal over Gaussian initializations to be significant in case (1) and somewhat negligible in cases (2) and (3). This prediction is verified in Fig. 4 (blue solid and dashed learning curves are well-separated, compared to red and black cases). Furthermore, the extent of dynamical isometry at initialization strongly predicts the speed of learning. The effect is large, with the most isometric case (orthogonal tanh with small σ_w²) learning faster than the least isometric case (ReLU networks) by several orders of magnitude. Moreover, these conclusions robustly persist across all optimizers. Intriguingly, in the case where dynamical isometry helps the most (tanh with small σ_w²), the effect of initialization (orthogonal versus Gaussian) has a much larger impact on learning speed than the choice of optimizer.
These insights suggest a more quantitative analysis of the relation between dynamical isometry and learning speed for orthogonal tanh networks, summarized in Fig. 5. We focus on SGD, given the lack of a strong dependence on optimizer. Intriguingly, Fig. 5(a) demonstrates the optimal training time is O(√L) and so grows sublinearly with depth L. Also Fig. 5(b) reveals that increased dynamical isometry enables faster training by making available larger (i.e. faster) learning rates. Finally, Fig. 5(c,d) and their similarity to Fig. 3(b,d) suggest a strong positive correlation between training time and the maximum singular value of J. Overall, these results suggest that dynamical isometry is correlated with learning speed, and controlling the entire distribution of Jacobian singular values may be an important design consideration in deep learning.
In Fig. 6, we explore the relationship between dynamical isometry and performance going beyond initialization by studying the evolution of singular values throughout training.
We find that if dynamical isometry is present at initialization, it persists for some time into training. Intriguingly,

¹We use the standard CIFAR10 dataset augmented with random flips and crops, and random saturation, brightness, and contrast perturbations.

Figure 6: Singular value evolution of J for orthogonal tanh networks during SGD training. (a) The average distribution, over 30 networks with q* = 1/64, at different SGD steps (legend from t = 0 to t = 10³). (b) A measure of eigenvalue ill-conditioning of JJ^T (⟨λ⟩²/⟨λ²⟩ ≤ 1, with equality if and only if ρ(λ) = δ(λ − λ₀)) over number of SGD steps for different initial q* (up to q* = 32). Interestingly, the optimal q* that best maintains dynamical isometry in later stages of training is not simply the smallest q*. (c) Test accuracy as a function of SGD step for those q* considered in (b). (d) Generalization accuracy as a function of initial q*. Together (b,c,d) reveal that the optimal nonzero q*, that best maintains dynamical isometry into training, also yields the fastest learning and best generalization accuracy.

perfect dynamical isometry at initialization (q* = 0) is not the best choice for preserving isometry throughout training; instead, some small but nonzero value of q* appears optimal. Moreover, both learning speed and generalization accuracy peak at this nonzero value. These results bolster the relationship between dynamical isometry and performance beyond simply the initialization.

4 Discussion

In summary, we have employed free probability theory to analytically compute the entire distribution of Jacobian singular values as a function of depth, random initialization, and nonlinearity shape. This analytic computation yielded several insights into which combinations of these ingredients enable nonlinear deep networks to achieve dynamical isometry.
In particular, deep linear Gaussian networks cannot; the maximum Jacobian singular value grows linearly with depth even if the second moment remains 1. The same is true for both orthogonal and Gaussian ReLU networks. Thus the ReLU nonlinearity destroys the dynamical isometry of orthogonal linear networks. In contrast, orthogonal, but not Gaussian, sigmoidal networks can achieve dynamical isometry; as the depth increases, the max singular value can remain O(1) in the former case but grows linearly in the latter. Thus orthogonal sigmoidal networks rescue the failure of dynamical isometry in ReLU networks. Correspondingly, we demonstrate, on CIFAR-10, that orthogonal sigmoidal networks can learn orders of magnitude faster than ReLU networks. This performance advantage is robust across a variety of optimizers, including SGD, momentum, RMSProp, and ADAM. Moreover, orthogonal sigmoidal networks have learning times that grow only sublinearly with depth. While not as fast as orthogonal linear networks, which have depth-independent training times [10], orthogonal sigmoidal networks have training times growing as the square root of depth. Finally, dynamical isometry, if present at initialization, persists for a large amount of time during training. Moreover, isometric initializations with longer persistence times yield both faster learning and better generalization.
Overall, these results yield the insight that the shape of the entire distribution of a deep network's Jacobian singular values can have a dramatic effect on learning speed; controlling only the second moment, to avoid exponentially vanishing and exploding gradients, can leave significant performance advantages on the table.
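Concretely, controlling the second moment is the criticality condition of [12, 13]: for a tanh network with pre-activation variance q*, one tunes σ_w so that the mean squared singular value of each layer's Jacobian, χ = σ_w² E[tanh'(√q* h)²] with h ~ N(0, 1), equals 1. A sketch of solving this condition by bisection follows; the q* value and sample count are illustrative, and q* is treated as given, whereas in the full theory it is the fixed point of the length map jointly determined by σ_w and σ_b:

```python
import numpy as np

def chi(sigma_w, qstar, n_samples=500_000, seed=0):
    # Mean squared singular value of a single layer's Jacobian D W:
    # chi = sigma_w^2 * E[tanh'(sqrt(q*) h)^2], h ~ N(0, 1),
    # estimated by Monte Carlo over standard Gaussian samples h.
    h = np.random.default_rng(seed).standard_normal(n_samples)
    dphi = 1.0 - np.tanh(np.sqrt(qstar) * h) ** 2
    return sigma_w ** 2 * np.mean(dphi ** 2)

def critical_sigma_w(qstar, lo=0.5, hi=2.0):
    # Bisect for the sigma_w at which chi = 1 (chi increases with sigma_w).
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if chi(mid, qstar) < 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

print(critical_sigma_w(1.0 / 64))  # slightly above 1 for small q*
```

For small q*, tanh'(√q* h)² ≈ 1 − 2 q* h², so the critical σ_w sits just above 1; this enforces χ = 1 but, as argued above, says nothing about the tails of the singular value distribution.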
Moreover, by pursuing the design principle of tightly concentrating the entire distribution around 1, we reveal that very deep feedforward networks with sigmoidal nonlinearities can actually outperform ReLU networks, the most popular type of nonlinear deep network used today. In future work, it would be interesting to extend our methods to other types of networks, including, for example, skip connections or convolutional architectures. More generally, the performance advantage in learning that accompanies dynamical isometry suggests it may be interesting to explicitly optimize for this property in reinforcement-learning-based searches over architectures [21].

Acknowledgments
S.G. thanks the Simons, McKnight, James S. McDonnell, and Burroughs Wellcome Foundations and the Office of Naval Research for support.

References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[2] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
[3] David Silver, Aja Huang, Chris J.
Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[4] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505–513, 2015.
[5] Daniel L. K. Yamins, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
[6] Lane McIntosh, Niru Maheswaranathan, Aran Nayebi, Surya Ganguli, and Stephen Baccus. Deep learning models of the retinal response to natural scenes. In Advances in Neural Information Processing Systems, pages 1369–1377, 2016.
[7] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[8] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 249–256, 2010.
[9] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
[10] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2014.
[11] Dmytro Mishkin and Jiri Matas.
All you need is a good init. CoRR, abs/1511.06422, 2015.
[12] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, 2016.
[13] S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations (ICLR), 2017.
[14] Roland Speicher. Multiplicative functions on the lattice of non-crossing partitions and free convolution. Mathematische Annalen, 298(1):611–628, 1994.
[15] Dan V. Voiculescu, Ken J. Dykema, and Alexandru Nica. Free Random Variables. Number 1. American Mathematical Society, 1992.
[16] Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Society, Providence, RI, 2012.
[17] Thorsten Neuschel. Plancherel–Rotach formulae for average characteristic polynomials of products of Ginibre random matrices and the Fuss–Catalan distribution. Random Matrices: Theory and Applications, 3(01):1450003, 2014.
[18] Joseph Louis Lagrange. Nouvelle méthode pour résoudre les problèmes indéterminés en nombres entiers. Chez Haude et Spener, Libraires de la Cour & de l'Académie royale, 1770.
[19] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning.
CoRR, abs/1611.01578, 2016.