{"title": "Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?", "book": "Advances in Neural Information Processing Systems", "page_first": 582, "page_last": 591, "abstract": "We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths. When beta is large, the gradients computed by N at initialization vary wildly. Our approach complements the mean field theory analysis of random networks. From this point of view, we rigorously compute finite width corrections to the statistics of gradients at the edge of chaos.", "full_text": "Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

Boris Hanin
Department of Mathematics
Texas A&M University
College Station, TX, USA
bhanin@math.tamu.edu

Abstract

We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant β, given by the sum of the reciprocals of the hidden layer widths. When β is large, the gradients computed by N at initialization vary wildly. Our approach complements the mean field theory analysis of random networks. From this point of view, we rigorously compute finite width corrections to the statistics of gradients at the edge of chaos.

1 Introduction

A fundamental obstacle in training deep neural nets using gradient based optimization is the exploding and vanishing gradient problem (EVGP), which has attracted much attention (e.g. 
[BSF94, HBF+01, MM15, XXP17, PSG17, PSG18]) after first being studied by Hochreiter [Hoc91]. The EVGP occurs when the derivative of the loss in the SGD update

\[
W \;\longleftarrow\; W - \lambda \frac{\partial L}{\partial W}, \tag{1}
\]

is very large for some trainable parameters W and very small for others:

\[
\left| \lambda \frac{\partial L}{\partial W} \right| \approx 0 \;\text{ or }\; \infty.
\]

This makes the increment in (1) either too small to be meaningful or too large to be precise. In practice, a number of ways of overcoming the EVGP have been proposed (see e.g. [Sch]). Let us mention three general approaches: (i) using architectures such as LSTMs [HS97], highway networks [SGS15], or ResNets [HZRS16] that are designed specifically to control gradients; (ii) precisely initializing weights (e.g. i.i.d. with properly chosen variances [MM15, HZRS15] or using orthogonal weight matrices [ASB16, HSL16]); (iii) choosing non-linearities that tend to compute numerically stable gradients or activations at initialization [KUMH17].

A number of articles (e.g. [PLR+16, RPK+17, PSG17, PSG18]) use mean field theory to show that even vanilla fully connected architectures can avoid the EVGP in the limit of infinitely wide hidden layers. In this article, we continue this line of investigation. We focus specifically on fully connected ReLU nets, and give a rigorous answer to the question of which combinations of depths d and hidden layer widths n_j give ReLU nets that suffer from the EVGP at initialization. In particular, we avoid approach (iii) to the EVGP by setting once and for all the activations in N to be ReLU, and we study approach (ii) in the limited sense that we consider only initializations in which weights and biases are independent (and properly scaled as in Definition 1) but do not investigate other initialization strategies. 

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Instead, we focus on rigorously understanding the effects of finite depth and width on gradients in randomly initialized networks. The main contributions of this work are:

1. We derive new exact formulas for the joint even moments of the entries of the input-output Jacobian in a fully connected ReLU net with random weights and biases. These formulas hold at finite depth and width (see Theorem 3).

2. We prove that the empirical variance of gradients in a fully connected ReLU net is exponential in the sum of the reciprocals of the hidden layer widths. This suggests that when this sum of reciprocals is too large, early training dynamics are very slow and it may take many epochs to achieve better-than-chance performance (see Figure 1).

3. We prove that, so long as weights and biases are initialized independently with the correct variance scaling (see Definition 1), whether the EVGP occurs (in the precise sense explained in §3) in fully connected ReLU nets is a function only of the architecture and not the distributions from which the weights and biases are drawn.

Figure 1: Comparison of early training dynamics on vectorized MNIST for fully connected ReLU nets with various architectures. Plot shows the mean number of epochs (over 100 independent training runs) that a given architecture takes to reach 20% accuracy as a function of the sum of reciprocals of hidden layer widths. (Figure reprinted with permission from [HR18] with caption modified.)

1.1 Practical Implications

The second of the listed contributions has several concrete consequences for architecture selection and for understanding initial training dynamics in ReLU nets. 
Specifically, our main results, Theorems 1-3, prove that the EVGP will occur in a ReLU net N (in either the annealed or the quenched sense described in §3) if and only if a single scalar parameter, the sum

\[
\beta \;=\; \sum_{j=1}^{d-1} \frac{1}{n_j}
\]

of reciprocals of the hidden layer widths of N, is large. Here n_j denotes the width of the jth hidden layer, and we prove in Theorem 1 that the variance of entries in the input-output Jacobian of N is exponential in β. Implications for architecture selection then follow from special cases of the power-mean inequality:

\[
\left( \frac{1}{d-1} \sum_{j=1}^{d-1} \frac{1}{n_j} \right)^{-1} \;\leq\; \frac{1}{d-1} \sum_{j=1}^{d-1} n_j \;\leq\; \left( \frac{1}{d-1} \sum_{j=1}^{d-1} n_j^2 \right)^{1/2}, \tag{2}
\]

in which equality is achieved if and only if the n_j are all equal. We interpret the leftmost inequality as follows. Fix d and a total budget Σ_j n_j of hidden layer neurons. Theorems 1 and 2 say that to avoid the EVGP in both the quenched and annealed senses, one should minimize β and hence make the leftmost expression in (2) as large as possible. This occurs precisely when the n_j are all equal. Fix instead d and a budget of trainable parameters, Σ_j n_j(n_{j-1} + 1), which is close to Σ_j n_j² if the n_j's don't fluctuate too much. Again using (2), we find that from the point of view of avoiding the EVGP, it is advantageous to take the n_j's to be equal.

In short, our theoretical results (Theorems 1 and 2) show that if β is large then, at initialization, N will compute gradients that fluctuate wildly, intuitively leading to slow initial training dynamics. This heuristic is corroborated by an experiment from [HR18] about the start of training on MNIST for fully connected neural nets with varying depths and hidden layer widths (the parameter β appeared in [HR18] in a different context). 
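The architecture-selection heuristic above is easy to check numerically. The following sketch (our own illustration, not code from the paper; the helper name `beta` is ours) compares β = Σ_j 1/n_j for three architectures with the same total budget of 120 hidden neurons spread over four hidden layers; the equal-width choice minimizes β, in line with the power-mean inequality (2).

```python
def beta(hidden_widths):
    """The paper's summary statistic: sum of reciprocals of the hidden layer widths."""
    return sum(1.0 / n for n in hidden_widths)

# Three architectures with the same budget of 120 hidden neurons over 4 hidden layers.
equal = [30, 30, 30, 30]
skewed = [60, 30, 20, 10]
extreme = [90, 10, 10, 10]

for widths in (equal, skewed, extreme):
    print(widths, "beta =", round(beta(widths), 4))
```

With the neuron budget fixed, any deviation from equal widths strictly increases β (here roughly 0.133 vs. 0.2 vs. 0.311), and hence, by Theorems 1 and 2, the fluctuations of the gradients at initialization.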
Figure 1 shows that β is a good summary statistic for predicting how quickly deep networks will start to train.

We conclude the introduction by mentioning what we see as the principal weaknesses of the present work. First, our analysis holds only for ReLU activations and assumes that all non-zero weights are independent and zero centered. Therefore, our conclusions do not directly carry over to convolutional, residual, and recurrent networks. Second, our results yield information about the fluctuations of the entries Z_{p,q} of the input-output Jacobian J_N at any fixed input to N. It would be interesting to have information about the joint distribution of the Z_{p,q}'s with inputs ranging over an entire dataset. Third, our techniques do not directly extend to initializations such as orthogonal weight matrices. We hope to address these issues in the future and, specifically, believe that the qualitative results of this article will generalize to convolutional networks in which the number of channels grows with the layer number.

2 Relation to Prior Work

To provide some context for our results, we contrast both our approach and contributions with the recent work [PSG17, PSG18]. These articles consider two senses in which a fully connected neural net N with random weights and biases can avoid the EVGP. The first is that the average singular value of the input-output Jacobian J_N remains approximately 1, while the second, termed dynamical isometry, requires that all the singular values of J_N are approximately 1. The authors of [PSG17, PSG18] study the full distribution of the singular values of the Jacobian J_N first in the infinite width limit n → ∞ and then in the infinite depth limit d → ∞.

Let us emphasize two particularly attractive features of [PSG17, PSG18]. 
First, neither the initialization nor the non-linearity in the neural nets N is assumed to be fixed, allowing the authors to consider solutions of types (ii) and (iii) above to the EVGP. The techniques used in these articles are also rather general, and point to the emergence of universality classes for singular values of the Jacobian of deep neural nets at initialization. Second, the results in these articles access the full distribution of singular values for the Jacobian J_N, providing significantly more refined information than simply controlling the mean singular value.

The neural nets considered in [PSG17, PSG18] are essentially assumed to be infinitely wide, however. This raises the question of whether there is any finite width at which the behavior of a randomly initialized network will resemble the infinite width regime, and moreover, if such a width exists, how wide is wide enough? In this work we give rigorous answers to such questions by quantifying finite width effects, leaving aside questions about both different choices of non-linearity and about good initializations that go beyond independent weights.

Instead of taking the singular value definition of the EVGP as in [PSG17, PSG18], we propose two non-spectral formulations of the EVGP, which we term annealed and quenched. Their precise definitions are given in §3.2 and §3.3, and we provide in §3.1 a discussion of the relation between the different senses in which the EVGP can occur.

Theorem 1 below implies, in the infinite width limit, that all ReLU nets avoid the EVGP in both the quenched and annealed sense. Hence, our definition of the EVGP (see §3.2 and §3.3) is weaker than the dynamical isometry condition from [PSG17, PSG18]. But, as explained in §3.1, it is stronger than the condition that the average singular value equal 1. 
Both the quenched and annealed versions of the EVGP concern the fluctuations of the partial derivatives

\[
Z_{p,q} := \frac{\partial (f_N)_q}{\partial \mathrm{Act}^{(0)}_p} \tag{3}
\]

of the qth component of the function f_N computed by N with respect to the pth component of its input (Act^{(0)} is an input vector - see (10)). The stronger, quenched version of the EVGP concerns the empirical variance of the squares of all the different Z_{p,q}:

\[
\widehat{\mathrm{Var}}\big[Z^2\big] := \frac{1}{M} \sum_{m=1}^{M} Z_{p_m,q_m}^{4} - \left( \frac{1}{M} \sum_{m=1}^{M} Z_{p_m,q_m}^{2} \right)^{2}, \qquad M = n_0 n_d. \tag{4}
\]

Here, n_0 is the input dimension to N, n_d is the output dimension, and the index m runs over all n_0 n_d possible input-output neuron pairs (p_m, q_m). Intuitively, since we will show in Theorem 1 that

\[
\mathbf{E}\big[Z_{p,q}^2\big] = \Theta(1),
\]

independently of the depth, having a large mean for V̂ar[Z²] means that for a typical realization of the weights and biases in N, the derivatives of f_N with respect to different trainable parameters will vary over several orders of magnitude, leading to inefficient SGD updates (1) for any fixed learning rate λ (see §3.1 - §3.3).

To avoid the EVGP (in the annealed or quenched senses described below) in deep feed-forward networks with ReLU activations, our results advise letting the widths of hidden layers grow as a function of the depth. In fact, as the width of a given hidden layer tends to infinity, the input to the next hidden layer can be viewed as a Gaussian process and can be understood using mean field theory (in which case one first considers the infinite width limit and only then the infinite depth limit). This point of view was taken in several interesting papers (e.g. [PLR+16, RPK+17, PSG17, PSG18] and references therein), which analyze the dynamics of signal propagation through such deep nets. 
In their notation, the fan-in normalization (condition (ii) in Definition 1) guarantees that we've initialized our neural nets at the edge of chaos (see e.g. around (7) in [PLR+16] and (5) in [RPK+17]). Indeed, writing μ^{(j)} for the weight distribution at layer j and using our normalization Var[μ^{(j)}] = 2/n_{j-1}, the order parameter χ_1 from [PLR+16, RPK+17] becomes

\[
\chi_1 = n_{j-1} \cdot \mathrm{Var}\big[\mu^{(j)}\big] \int_{\mathbb{R}} e^{-z^2/2}\, \phi'\!\big(\sqrt{q^*}\, z\big)^2 \, \frac{dz}{\sqrt{2\pi}} = 1,
\]

since φ = ReLU, making φ′(z) the indicator function 1_{[0,∞)}(z) and the value of φ′(√(q*) z) independent of the asymptotic length q* for activations. The condition χ_1 = 1 defines the edge of chaos regime. This gives a heuristic explanation for why the nets considered in the present article cannot have just one of vanishing and exploding gradients. It also allows us to interpret our results as a rigorous computation for ReLU nets of the 1/n_j corrections at the edge of chaos.

In addition to the mean field theory papers, we mention the article [SPSD17]. It does not deal directly with gradients, but it does treat the finite width corrections to the statistical distribution of pre-activations in a feed-forward network with Gaussian initialized weights and biases. A nice aspect of this work is that the results give the joint distribution not only over all the neurons but also over any number of inputs to the network. In a similar vein, we bring to the reader's attention [BFL+17], which gives interesting heuristic computations about the structure of correlations between gradients corresponding to different inputs in both fully connected and residual ReLU nets.

3 Defining the EVGP for Feed-Forward Networks

We now explain in exactly what sense we study the EVGP and contrast our definition, which depends on the behavior of the entries of the input-output Jacobian J_N, with the more usual definition, which depends on the behavior of its singular values (see §3.1). 
To do this, consider a feed-forward fully connected depth d network N with hidden layer widths n_0, . . . , n_d, and fix an input Act^{(0)} ∈ R^{n_0}. We denote by Act^{(j)} the corresponding vector of activations at layer j (see (10)). The exploding and vanishing gradient problem can be roughly stated as follows:

\[
\text{Exploding/Vanishing Gradients} \quad \Longleftrightarrow \quad Z_{p,q} \text{ has large fluctuations}, \tag{5}
\]

where Z_{p,q} are the entries of the Jacobian J_N (see (3)). A common way to formalize this statement is to interpret "Z_{p,q} has large fluctuations" to mean that the Jacobian J_N of the function computed by N has both very large and very small singular values [BSF94, HBF+01, PSG17]. We give in §3.1 a brief account of the reasoning behind this formulation of the EVGP and explain why it is also natural to define the EVGP via the moments of Z_{p,q}. Then, in §3.2 and §3.3, we define two precise senses, which we call annealed and quenched, in which the EVGP can occur, phrased directly in terms of the joint moments of Z_{p,q}.

3.1 Spectral vs. Entrywise Definitions of the EVGP

Let us recall the rationale behind using the spectral theory of J_N to define the EVGP. The gradient in (1) of the loss with respect to, say, a weight W^{(j)}_{α,β} connecting neuron α in layer j − 1 to neuron β in layer j is

\[
\partial L / \partial W^{(j)}_{\alpha,\beta} = \big\langle \nabla_{\mathrm{Act}^{(d)}} L,\; J_{N,\beta}(j \to d) \big\rangle \, \phi'\!\big(\mathrm{Act}^{(j)}_{\beta}\big) \, \mathrm{Act}^{(j-1)}_{\alpha}, \tag{6}
\]

where φ′(Act^{(j)}_β) is the derivative of the non-linearity, the derivative of the loss L with respect to the output Act^{(d)} of N is

\[
\nabla_{\mathrm{Act}^{(d)}} L = \Big( \partial L / \partial \mathrm{Act}^{(d)}_q, \quad q = 1, \ldots, n_d \Big),
\]

and we've denoted the βth row in the layer j to output Jacobian J_N(j → d) by

\[
J_{N,\beta}(j \to d) = \Big( \partial \mathrm{Act}^{(d)}_q / \partial \mathrm{Act}^{(j)}_{\beta}, \quad q = 1, \ldots, n_d \Big).
\]

Since J_N(j → d) is the product of d − j layer-to-layer Jacobians, its inner product with ∇_{Act^{(d)}} L is usually the term considered responsible for the EVGP. 
The worst case distortion it can achieve on the vector ∇_{Act^{(d)}} L is captured precisely by its condition number, the ratio of its largest and smallest singular values.

However, unlike the case of recurrent networks in which J_N(j → d) is a (d − j)-fold product of a fixed matrix, when the hidden layer widths grow with the depth d, the dimensions of the layer j to layer j′ Jacobians J_N(j → j′) are not fixed and it is not clear to what extent the vector ∇_{Act^{(d)}} L will actually be stretched or compressed by the worst case bounds coming from estimates on the condition number of J_N(j → d).

Moreover, on a practical level, the EVGP is about the numerical stability of the increments of the SGD updates (1) over all weights (and biases) in the network, which is directly captured by the joint distribution of the random variables

\[
\big\{ |\partial L / \partial W^{(j)}_{\alpha,\beta}|^2, \quad j = 1, \ldots, d, \;\; \alpha = 1, \ldots, n_{j-1}, \;\; \beta = 1, \ldots, n_j \big\}.
\]

Due to the relation (6), two terms influence the moments of |∂L/∂W^{(j)}_{α,β}|²: one coming from the activations at layer j − 1 and the other from the entries of J_N(j → d). We focus in this article on the second term and hence interpret the fluctuations of the entries of J_N(j → d) as a measure of the EVGP.

To conclude, we recall a simple relationship between the moments of the entries of the input-output Jacobian J_N and the distribution of its singular values, which can be used to directly compare spectral and entrywise definitions of the EVGP. Suppose for instance one is interested in the average singular value of J_N (as in [PSG17, PSG18]). The sum of the squares of the singular values of J_N is given by

\[
\mathrm{tr}\big(J_N^T J_N\big) = \sum_{j=1}^{n_0} \big\langle J_N^T J_N u_j, u_j \big\rangle = \sum_{j=1}^{n_0} \| J_N u_j \|^2,
\]

where {u_j} is any orthonormal basis. Hence, the average singular value can be obtained directly from the joint even moments of the entries of J_N. 
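The link between entrywise moments and spectral data in the last display is elementary to verify numerically. A quick check (our own illustration, not code from the paper) that tr(JᵀJ), i.e. the sum of the squared entries of J, equals the sum of the squared singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((3, 5))           # stand-in for an input-output Jacobian
squared_entries = (J ** 2).sum()          # = tr(J^T J)
sv = np.linalg.svd(J, compute_uv=False)   # singular values of J
print(squared_entries, (sv ** 2).sum())   # the two quantities agree up to rounding
```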
Both the quenched and annealed EVGP (see (7), (9)) entail that the average singular value for J_N equals 1, and we prove in Theorem 1 (specifically (11)) that even at finite depth and width the average singular value for J_N equals 1 for all the random ReLU nets we consider!

One can push this line of reasoning further. Namely, the singular values of any matrix M are determined by the Stieltjes transform of the empirical distribution ρ_M of the eigenvalues of M^T M:

\[
S_M(z) = \int_{\mathbb{R}} \frac{d\rho_M(x)}{z - x}, \qquad z \in \mathbb{C} \setminus \mathbb{R}.
\]

Writing (z − x)^{−1} as a power series in z shows that S_{J_N} is determined by traces of powers of J_N^T J_N and hence by the joint even moments of the entries of J_N. We hope to estimate S_{J_N}(z) directly in future work.

3.2 Annealed Exploding and Vanishing Gradients

Fix a sequence of positive integers n_0, n_1, . . . . For each d ≥ 1 write N_d for the depth d ReLU net with hidden layer widths n_0, . . . , n_d and random weights and biases (see Definition 1 below). As in (3), write Z_{p,q}(d) for the partial derivative of the qth component of the output of N_d with respect to the pth component of its input. We say that the family of architectures given by {n_0, n_1, . . .} avoids the exploding and vanishing gradient problem in the annealed sense if for each fixed input to N_d and every p, q we have

\[
\mathbf{E}\big[Z_{p,q}^2(d)\big] = 1, \qquad \mathrm{Var}\big[Z_{p,q}^2(d)\big] = \Theta(1), \qquad \sup_{d \geq 1} \mathbf{E}\big[Z_{p,q}^{2K}(d)\big] < \infty, \quad \forall K \geq 3. \tag{7}
\]

Here the expectation is over the weights and biases in N_d. Architectures that avoid the EVGP in the annealed sense are ones where the typical magnitude of the partial derivatives Z_{p,q}(d) have bounded (both above and below) fluctuations around a constant mean value. This allows for a reliable a priori selection of the learning rate λ from (1) even for deep architectures. 
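These quantities are cheap to probe by simulation. The sketch below (our own illustration, not code from the paper) draws a random ReLU net as in Definition 1 with Gaussian weights of variance 2/fan-in and small atomless biases, computes an input-output Jacobian entry by recording the ReLU masks during a forward pass, and estimates the mean and spread of Z_{0,0}²; the mean stays of order one as the depth grows at fixed widths, while the spread reflects the sum of reciprocal widths.

```python
import numpy as np

def jacobian_entry_sq(widths, rng):
    """Z_{0,0}^2 for one random ReLU net: forward pass recording the ReLU masks,
    then the chain-rule product of masked weight matrices."""
    x = rng.standard_normal(widths[0])          # a fixed random input
    J = np.eye(widths[0])
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        W = rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)  # Var = 2/fan-in
        b = 1e-2 * rng.standard_normal(fan_out)  # small atomless biases
        pre = W @ x + b
        mask = (pre > 0).astype(float)           # ReLU'(pre): indicator of positivity
        x = mask * pre
        J = (mask[:, None] * W) @ J              # D^{(j)} W^{(j)} ... D^{(1)} W^{(1)}
    return J[0, 0] ** 2

rng = np.random.default_rng(0)
z2 = np.array([jacobian_entry_sq([4, 32, 32, 32, 1], rng) for _ in range(5000)])
print("mean", z2.mean(), "std", z2.std())
```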
Our main result about the annealed EVGP is Theorem 1: a family of neural net architectures avoids the EVGP in the annealed sense if and only if

\[
\sum_{j=1}^{\infty} \frac{1}{n_j} < \infty. \tag{8}
\]

We prove in Theorem 1 that E[Z_{p,q}^{2K}(d)] is exponential in Σ_{j ≤ d} 1/n_j for every K.

3.3 Quenched Exploding and Vanishing Gradients

There is an important objection to defining the EVGP as in the previous section. Namely, if a neural net N suffers from the annealed EVGP, then it is impossible to choose an appropriate a priori learning rate λ that works for a typical initialization. However, it may still be that for a typical realization of the weights and biases there is some choice of λ (depending on the particular initialization) that works well for all (or most) trainable parameters in N. To study whether this is the case, we must consider the variation of the Z_{p,q}'s across different p, q in a fixed realization of weights and biases. This is the essence of the quenched EVGP.

To formulate the precise definition, we again fix a sequence of positive integers n_0, n_1, . . . and write N_d for a depth d ReLU net with hidden layer widths n_0, . . . , n_d. We write as in (4)

\[
\widehat{\mathrm{Var}}\big[Z(d)^2\big] := \frac{1}{M} \sum_{m=1}^{M} Z_{p_m,q_m}(d)^4 - \left( \frac{1}{M} \sum_{m=1}^{M} Z_{p_m,q_m}(d)^2 \right)^2, \qquad M = n_0 n_d,
\]

for the empirical variance of the squares of all the entries Z_{p,q}(d) of the input-output Jacobian of N_d. We will say that the family of architectures given by {n_0, n_1, . . .} avoids the exploding and vanishing gradient problem in the quenched sense if

\[
\mathbf{E}\big[Z_{p,q}(d)^2\big] = 1 \qquad \text{and} \qquad \mathbf{E}\Big[\widehat{\mathrm{Var}}\big[Z(d)^2\big]\Big] = \Theta(1). \tag{9}
\]

Just as in the annealed case (7), the expectation E[·] is with respect to the weights and biases of N. 
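The summability criterion (8) is easy to evaluate for concrete width schedules. A small sketch (ours, for illustration) of its partial sums: constant widths give a sum growing linearly in the depth, widths n_j = 10j give a slowly divergent harmonic sum, while n_j = 10j² keeps the sum bounded, so only the last family satisfies (8).

```python
def partial_sum(width_of, depth):
    """Partial sum of 1/n_j over hidden layers j = 1, ..., depth - 1 (criterion (8))."""
    return sum(1.0 / width_of(j) for j in range(1, depth))

schedules = [("constant  n_j = 50", lambda j: 50),
             ("linear    n_j = 10j", lambda j: 10 * j),
             ("quadratic n_j = 10j^2", lambda j: 10 * j * j)]

for name, width_of in schedules:
    print(name, [round(partial_sum(width_of, d), 3) for d in (10, 100, 1000)])
```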
In words, a neural net architecture suffers from the EVGP in the quenched sense if for a typical realization of the weights and biases the empirical variance of the squared partial derivatives {Z²_{p_m,q_m}} is large.

Our main result about the quenched sense of the EVGP is Theorem 2. It turns out, at least for the ReLU nets we study, that a family of neural net architectures avoids the quenched EVGP if and only if it also avoids the annealed exploding and vanishing gradient problem (i.e. if (8) holds).

4 Acknowledgements

I thank Leonid Hanin for a number of useful conversations and for his comments on an early draft. I am also grateful to Jeffrey Pennington for pointing out an important typo in the proof of Theorem 3 and to David Rolnick for several helpful conversations and, specifically, for pointing out the relevance of the power-mean inequality for understanding β. Finally, I would like to thank several anonymous referees for their help in improving the exposition. One referee in particular raised concerns about the annealed and quenched definitions of the EVGP. Addressing these concerns resulted in the discussion in §3.1.

5 Notation and Main Results

5.1 Definition of Random Networks

To formally state our results, we first give the precise definition of the random networks we study. For every d ≥ 1 and each n = (n_i)_{i=0}^{d} ∈ Z_{+}^{d+1}, write

\[
\mathcal{N}(n, d) = \big\{ \text{fully connected feed-forward nets with ReLU activations, depth } d, \text{ and whose } j\text{th hidden layer has width } n_j \big\}.
\]

The function f_N computed by N ∈ N(n, d) is determined by a collection of weights and biases

\[
\big\{ w^{(j)}_{\alpha,\beta},\; b^{(j)}_{\beta} : \; 1 \leq \alpha \leq n_{j-1},\; 1 \leq \beta \leq n_j,\; j = 1, \ldots, d \big\}.
\]

Specifically, given an input

\[
\mathrm{Act}^{(0)} = \big( \mathrm{Act}^{(0)}_i \big)_{i=1}^{n_0} \in \mathbb{R}^{n_0}
\]

to N, we define for every j = 1, . . . 
, d

\[
\mathrm{act}^{(j)}_{\beta} = b^{(j)}_{\beta} + \sum_{\alpha=1}^{n_{j-1}} \mathrm{Act}^{(j-1)}_{\alpha}\, w^{(j)}_{\alpha,\beta}, \qquad \mathrm{Act}^{(j)}_{\beta} = \phi\big(\mathrm{act}^{(j)}_{\beta}\big), \qquad 1 \leq \beta \leq n_j. \tag{10}
\]

The vectors act^{(j)}, Act^{(j)} therefore represent the vectors of inputs and outputs of the neurons in the jth layer of N. The function computed by N takes the form

\[
f_N\big(\mathrm{Act}^{(0)}\big) = f_N\big(\mathrm{Act}^{(0)}, w^{(j)}_{\alpha,\beta}, b^{(j)}_{\beta}\big) = \mathrm{Act}^{(d)}.
\]

A random network is obtained by randomizing weights and biases.

Definition 1 (Random Nets). Fix d ≥ 1, n = (n_0, . . . , n_d) ∈ Z_{+}^{d+1}, and two collections of probability measures μ = (μ^{(1)}, . . . , μ^{(d)}) and ν = (ν^{(1)}, . . . , ν^{(d)}) on R such that

(i) μ^{(j)}, ν^{(j)} are symmetric around 0 for every 1 ≤ j ≤ d.
(ii) the variance of μ^{(j)} is 2/n_{j-1}.
(iii) ν^{(j)} has no atoms.

A random net N ∈ N_{μ,ν}(n, d) is obtained by requiring that the weights and biases for neurons at layer j are drawn independently from μ^{(j)}, ν^{(j)}:

\[
w^{(j)}_{\alpha,\beta} \sim \mu^{(j)}, \qquad b^{(j)}_{\beta} \sim \nu^{(j)}, \qquad \text{i.i.d.}
\]

Remark 1. Condition (iii) is used when we apply Lemma 1 in the proof of Theorem 3. It can be removed under the restriction that d ≪ exp( Σ_{j=1}^{d} n_j ). Since this yields slightly messier but not meaningfully different results, we do not pursue this point.

5.2 Results

Our main theoretical results are Theorems 1 and 3. They concern the statistics of the slopes of the functions computed by a random neural net in the sense of Definition 1. To state them compactly, we define for any probability measure μ on R

\[
\widehat{e}_{\mu,2K} := \frac{\int_{\mathbb{R}} x^{2K}\, d\mu}{\big( \int_{\mathbb{R}} x^{2}\, d\mu \big)^{K}}, \qquad K \geq 0,
\]

and, given a collection of probability measures {μ^{(j)}}_{j=1}^{d} on R, set for any K ≥ 1

\[
\widehat{e}_{\mu,2K,\max} := \max_{1 \leq j \leq d} \widehat{e}_{\mu^{(j)},2K}.
\]

We also continue to write Z_{p,q} for the entries of the input-output Jacobian of a neural net (see (3)).

Theorem 1. Fix d ≥ 1 and a multi-index n = (n_0, . . . 
, n_d) ∈ Z_{+}^{d+1}. Let N ∈ N_{μ,ν}(n, d) be a random network as in Definition 1. For any fixed input to N, we have

\[
\mathbf{E}\big[Z_{p,q}^2\big] = \frac{1}{n_0}. \tag{11}
\]

In contrast, the fourth moment of Z_{p,q} is exponential in Σ_j 1/n_j:

\[
\frac{2}{n_0^2} \exp\!\Big( \sum_{j=1}^{d-1} \frac{1}{n_j} \Big) \;\leq\; \mathbf{E}\big[Z_{p,q}^4\big] \;\leq\; \frac{6\,\widehat{e}_{\mu,4,\max}}{n_0^2} \exp\!\Big( 6\,\widehat{e}_{\mu,4,\max} \sum_{j=1}^{d-1} \frac{1}{n_j} \Big). \tag{12}
\]

Moreover, there exists a constant C_{K,μ} > 0 depending only on K and the first 2K moments of the measures {μ^{(j)}}_{j=1}^{d} such that if K < min_{1 ≤ j ≤ d-1} {n_j}, then

\[
\mathbf{E}\big[Z_{p,q}^{2K}\big] \;\leq\; \frac{C_{K,\mu}}{n_0^{K}} \exp\!\Big( C_{K,\mu} \sum_{j=1}^{d-1} \frac{1}{n_j} \Big). \tag{13}
\]

Remark 2. In (11), (12), and (13), the bias distributions ν^{(j)} play no role. However, in the derivation of these relations, we use in Lemma 1 that ν^{(j)} has no atoms (see Remark 1). Also, the condition K < min_{1 ≤ j ≤ d-1} {n_j} can be relaxed by allowing K to violate this inequality a fixed finite number ℓ of times. This causes the constant C_{K,μ} to depend on ℓ as well.

We prove Theorem 1 in Appendix B. The constant factor multiplying Σ_j 1/n_j in the exponent on the right hand side of (12) is not optimal and can be reduced by a more careful analysis along the same lines as the proof of Theorem 1 given below. We do not pursue this here, however, since we are primarily interested in fixing K and understanding the dependence of E[Z_{p,q}^{2K}] on the widths n_j and the depth d. Although we've stated Theorem 1 only for the even moments of Z_{p,q}, the same techniques will give analogous estimates for any mixed even moments E[Z^{2K_1}_{p_1,q_1} ··· Z^{2K_m}_{p_m,q_m}] when K is set to Σ_m K_m (see Remark 3). In particular, we can estimate the mean of the empirical variance of gradients.

Theorem 2. Fix n_0, . . . , n_d ∈ Z_+, and let N be a random fully connected depth d ReLU net with hidden layer widths n_0, . . . , n_d and random weights and biases as in Definition 1. 
Write M = n_0 n_d and write V̂ar[Z²] for the empirical variance of the squares {Z²_{p_m,q_m}} of all M input-output neuron pairs as in (4). We have

\[
\mathbf{E}\Big[\widehat{\mathrm{Var}}\big[Z^2\big]\Big] \;\leq\; \Big(1 - \frac{1}{M}\Big) \frac{6\,\widehat{e}_{\mu,4,\max}}{n_0^2} \exp\!\Big( 6\,\widehat{e}_{\mu,4,\max} \sum_{j=1}^{d-1} \frac{1}{n_j} \Big) \tag{14}
\]

and

\[
\mathbf{E}\Big[\widehat{\mathrm{Var}}\big[Z^2\big]\Big] \;\geq\; \frac{1}{n_0^2} \Big(1 - \frac{1}{M}\Big) \Big( (1 - \eta) + \frac{\eta}{n_1} \big( \widehat{e}_{\mu^{(1)},4} - 1 \big)\, e^{-1/n_1} \Big) \exp\!\Big( \sum_{j=1}^{d-1} \frac{1}{n_j} \Big), \tag{15}
\]

where

\[
\eta := \frac{\#\{ m_1, m_2 \mid m_1 \neq m_2,\ q_{m_1} = q_{m_2} \}}{M(M-1)} = \frac{n_0 - 1}{n_0 n_d - 1}.
\]

Hence, the family N_d of ReLU nets avoids the exploding and vanishing gradient problem in the quenched sense if and only if

\[
\sum_{j=1}^{\infty} \frac{1}{n_j} < \infty.
\]

We prove Theorem 2 in Appendix C. The results in Theorems 1 and 2 are based on exact expressions, given in Theorem 3, for the even moments E[Z_{p,q}^{2K}] in terms only of the moments of the weight distributions μ^{(j)}. To give the formal statement, we introduce the following notation. For any n = (n_i)_{i=0}^{d} and any 1 ≤ p ≤ n_0, 1 ≤ q ≤ n_d, we say that a path from the pth input neuron to the qth output neuron in N ∈ N(n, d) is a sequence

\[
\{\gamma(j)\}_{j=0}^{d}, \qquad 1 \leq \gamma(j) \leq n_j, \qquad \gamma(0) = p, \qquad \gamma(d) = q,
\]

so that γ(j) represents a neuron in the jth layer of N. Similarly, given any collection of K ≥ 1 paths Γ = (γ_k)_{k=1}^{K} that connect (possibly different) neurons in the input of N with neurons in its output and any 1 ≤ j ≤ d, denote by

\[
\Gamma(j) = \bigcup_{\gamma \in \Gamma} \{\gamma(j)\}
\]

the neurons in the jth layer of N that belong to at least one element of Γ. Finally, for every α ∈ Γ(j − 1) and β ∈ Γ(j), denote by

\[
|\Gamma_{\alpha,\beta}(j)| = \#\{ \gamma \in \Gamma \mid \gamma(j-1) = \alpha,\ \gamma(j) = \beta \}
\]

the number of paths in Γ that pass through neuron α at layer j − 1 and through neuron β at layer j. In Theorem 3 below, the sum in (16) runs over ordered tuples Γ = (γ_1, . . . 
, γ_{2K}) of 2K paths in N from p to q.

Theorem 3. Fix d ≥ 1 and n = (n_0, . . . , n_d) ∈ Z_{+}^{d+1}. Let N ∈ N_{μ,ν}(n, d) be a random network as in Definition 1. For every K ≥ 1 and all 1 ≤ p ≤ n_0, 1 ≤ q ≤ n_d, we have

\[
\mathbf{E}\big[Z_{p,q}^{2K}\big] = \sum_{\Gamma} \prod_{j=1}^{d} C_j(\Gamma), \qquad C_j(\Gamma) = \Big(\frac{1}{2}\Big)^{|\Gamma(j)|} \prod_{\alpha \in \Gamma(j-1)} \; \prod_{\beta \in \Gamma(j)} \mu^{(j)}_{|\Gamma_{\alpha,\beta}(j)|}, \tag{16}
\]

where the sum is over the ordered tuples Γ of paths described above and where, for every r ≥ 0, the quantity μ^{(j)}_r denotes the rth moment of the measure μ^{(j)}.

Remark 3. The expression (16) can be generalized to the case of mixed even moments. Namely, given M ≥ 1 and for each 1 ≤ m ≤ M integers K_m ≥ 0 and 1 ≤ p_m ≤ n_0, 1 ≤ q_m ≤ n_d, we have

\[
\mathbf{E}\Big[ \prod_{m=1}^{M} Z_{p_m,q_m}^{2K_m} \Big] = \sum_{\Gamma} \prod_{j=1}^{d} C_j(\Gamma), \tag{17}
\]

where now the sum is over collections Γ of 2K = Σ_m 2K_m paths in N with exactly 2K_m paths from p_m to q_m. The proof is identical up to the addition of several well-placed subscripts.

See Appendix A for the proof of Theorem 3.

References

[ASB16] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

[BFL+17] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? arXiv preprint arXiv:1702.08591, 2017.

[BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[CHM+15] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. 
In Artificial Intelligence and Statistics, pages 192–204, 2015.

[HBF+01] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[Hoc91] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Technische Universität München, 91, 1991.

[HR18] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In Advances in Neural Information Processing Systems 32, 2018.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[HSL16] Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2034–2042, 2016.

[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[KUMH17] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981, 2017.

[MM15] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.

[PLR+16] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. 
In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.

[PSG17] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pages 4788–4798, 2017.

[PSG18] Jeffrey Pennington, Samuel S Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. arXiv preprint arXiv:1802.09979, 2018.

[RPK+17] Maithra Raghu, Ben Poole, Jon M. Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pages 2847–2854, 2017.

[Sch] Sepp Hochreiter's fundamental deep learning problem (1991). http://people.idsia.ch/~juergen/fundamentaldeeplearningproblem.html. Accessed: 2017-12-26.

[SGS15] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. 2015.

[SPSD17] Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. A correspondence between random neural networks and statistical field theory. arXiv preprint arXiv:1710.06570, 2017.

[XXP17] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. arXiv preprint arXiv:1703.01827, 2017.
", "award": [], "sourceid": 346, "authors": [{"given_name": "Boris", "family_name": "Hanin", "institution": "Texas A&M"}]}