{"title": "Exponential expressivity in deep neural networks through transient chaos", "book": "Advances in Neural Information Processing Systems", "page_first": 3360, "page_last": 3368, "abstract": "We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in deep neural networks with random weights. Our results reveal a phase transition in the expressivity of random deep networks, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth, but not with width. We prove that this generic class of random functions cannot be efficiently computed by any shallow network, going beyond prior work that restricts their analysis to single functions. Moreover, we formally quantify and demonstrate the long conjectured idea that deep networks can disentangle exponentially curved manifolds in input space into flat manifolds in hidden space. Our theoretical framework for analyzing the expressive power of deep networks is broadly applicable and provides a basis for quantifying previously abstract notions about the geometry of deep functions.", "full_text": "Exponential expressivity in deep neural networks\n\nthrough transient chaos\n\nBen Poole1, Subhaneil Lahiri1, Maithra Raghu2, Jascha Sohl-Dickstein2, Surya Ganguli1\n\n{benpoole,sulahiri,sganguli}@stanford.edu, {maithra,jaschasd}@google.com\n\n1Stanford University, 2Google Brain\n\nAbstract\n\nWe combine Riemannian geometry with the mean \ufb01eld theory of high dimensional\nchaos to study the nature of signal propagation in generic, deep neural networks\nwith random weights. Our results reveal an order-to-chaos expressivity phase\ntransition, with networks in the chaotic phase computing nonlinear functions whose\nglobal curvature grows exponentially with depth but not width. 
We prove this\ngeneric class of deep random functions cannot be ef\ufb01ciently computed by any shal-\nlow network, going beyond prior work restricted to the analysis of single functions.\nMoreover, we formalize and quantitatively demonstrate the long conjectured idea\nthat deep networks can disentangle highly curved manifolds in input space into \ufb02at\nmanifolds in hidden space. Our theoretical analysis of the expressive power of deep\nnetworks broadly applies to arbitrary nonlinearities, and provides a quantitative\nunderpinning for previously abstract notions about the geometry of deep functions.\n\n1\n\nIntroduction\n\nDeep feedforward neural networks have achieved remarkable performance across many domains\n[1\u20136]. A key factor thought to underlie their success is their high expressivity. This informal notion\nhas manifested itself primarily in two forms of intuition. The \ufb01rst is that deep networks can compactly\nexpress highly complex functions over input space in a way that shallow networks with one hidden\nlayer and the same number of neurons cannot. The second piece of intuition, which has captured\nthe imagination of machine learning [7] and neuroscience [8] alike, is that deep neural networks can\ndisentangle highly curved manifolds in input space into \ufb02attened manifolds in hidden space. These\nintuitions, while attractive, have been dif\ufb01cult to formalize mathematically and thus test rigorously.\nFor the \ufb01rst intuition, seminal works have exhibited examples of particular functions that can be\ncomputed with a polynomial number of neurons (in the input dimension) in a deep network but\nrequire an exponential number of neurons in a shallow network [9\u201313]. This raises a central open\nquestion: are such functions merely rare curiosities, or is any function computed by a generic deep\nnetwork not ef\ufb01ciently computable by a shallow network? 
The theoretical techniques employed in\nprior work both limited the applicability of theory to speci\ufb01c nonlinearities and dictated the particular\nmeasure of deep functional complexity involved. For example, [9] focused on ReLU nonlinearities\nand number of linear regions as a complexity measure, while [10] focused on sum-product networks\nand the number of monomials as complexity measure, and [14] focused on Pfaf\ufb01an nonlinearities and\ntopological measures of complexity, like the sum of Betti numbers of a decision boundary (however,\nsee [15] for an interesting analysis of a general class of compositional functions). The limits of\nprior theoretical techniques raise another central question: is there a unifying theoretical framework\nfor deep neural expressivity that is simultaneously applicable to arbitrary nonlinearities, generic\nnetworks, and a natural, general measure of functional complexity?\n\nCode to reproduce all results available at: https://github.com/ganguli-lab/deepchaos\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fHere we attack both central problems of deep neural expressivity by combining Riemannian geometry\n[16] and dynamical mean \ufb01eld theory [17]. This novel combination of tools enables us to show that\nfor very broad classes of nonlinearities, even random deep neural networks can construct hidden\ninternal representations whose global extrinsic curvature grows exponentially with depth but not width.\nOur geometric framework enables us to quantitatively de\ufb01ne a notion of disentangling and verify\nthis notion in deep random networks. Furthermore, our methods yield insights into the emergent,\ndeterministic nature of signal propagation through large random feedforward networks, revealing the\nexistence of an order to chaos transition as a function of the statistics of weights and biases. 
We \ufb01nd\nthat the transient, \ufb01nite depth evolution in the chaotic regime underlies the origins of exponential\nexpressivity in deep random networks. In a companion paper [18], we study several related measures\nof expressivity in deep random neural networks with piecewise linear activations.\n\n2 A mean \ufb01eld theory of deep nonlinear signal propagation\n\nConsider a deep feedforward network with D layers of weights W1, . . . , WD and D + 1 layers of\nneural activity vectors x0, . . . , xD, with Nl neurons in each layer l, so that xl \u2208 RNl and Wl is an\nNl \u00d7 Nl\u22121 weight matrix. The feedforward dynamics elicited by an input x0 is given by\n\nhl = Wl xl\u22121 + bl\n\nxl = \u03c6(hl)\n\nfor l = 1, . . . , D,\n\nij are drawn i.i.d. from a zero mean Gaussian with variance \u03c32\n\n(1)\nwhere bl is a vector of biases, hl is the pattern of inputs to neurons at layer l, and \u03c6 is a single\nneuron scalar nonlinearity that acts component-wise to transform inputs hl to activities xl. We\nwish to understand the nature of typical functions computable by such networks, as a consequence\nof their depth. We therefore study ensembles of random networks in which each of the synaptic\nweights Wl\nw/Nl\u22121, while the biases\nb . This weight scaling ensures that the\nare drawn i.i.d. from a zero mean Gaussian with variance \u03c32\ninput contribution to each individual neuron at layer l from activities in layer l \u2212 1 remains O(1),\nindependent of the layer width Nl\u22121. This ensemble constitutes a maximum entropy distribution over\ndeep neural networks, subject to constraints on the means and variances of weights and biases. 
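To make the ensemble concrete, the feedforward dynamics of (1) with this weight scaling can be sampled in a few lines of numpy. This is a minimal sketch, not code from the paper's repository; the depth, widths, σw, σb, seed, and the tanh nonlinearity are illustrative choices.

```python
import numpy as np

def forward(x0, D=10, N=1000, sigma_w=2.0, sigma_b=0.3, phi=np.tanh, seed=0):
    """Sample one network from the ensemble and run the dynamics of eq. (1):
    h^l = W^l x^{l-1} + b^l, x^l = phi(h^l), with W^l_ij ~ N(0, sigma_w^2/N_{l-1})
    and b^l_i ~ N(0, sigma_b^2). Returns the pre-activations h^l of every layer."""
    rng = np.random.default_rng(seed)
    x, hs = x0, []
    for _ in range(D):
        n_prev = x.shape[0]
        W = rng.normal(0.0, sigma_w / np.sqrt(n_prev), size=(N, n_prev))
        b = rng.normal(0.0, sigma_b, size=N)
        h = W @ x + b
        hs.append(h)
        x = phi(h)
    return hs

# normalized squared length of eq. (2) at each layer
qs = [np.mean(h**2) for h in forward(np.ones(1000))]
```

Running this for several seeds shows the near-deterministic behavior of the length sequence qs at large width, which the mean field theory below predicts.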
This\nensemble induces no further structure in the resulting set of deep functions, so its analysis provides\nan opportunity to understand the speci\ufb01c contribution of depth alone to the nature of typical functions\ncomputed by deep networks.\nIn the limit of large layer widths, Nl (cid:29) 1, certain aspects of signal propagation through deep\nrandom neural networks take on an essentially deterministic character. This emergent determinism\nin large random neural networks enables us to understand how the Riemannian geometry of simple\nmanifolds in the input layer x0 is typically modi\ufb01ed as the manifold propagates into the deep layers.\nFor example, consider the simplest case of a single input vector x0. As it propagates through the\nnetwork, its length in downstream layers will change. We track this changing length by computing\nthe normalized squared length of the input vector at each layer:\n\nql =\n\n1\nNl\n\n(hl\n\ni)2.\n\n(2)\n\nNl(cid:88)\n\ni=1\n\n(cid:16)(cid:112)\n\n(cid:90)\n\n(cid:17)2\n\ni =(cid:80)\n\nThis length is the second moment of the empirical distribution of inputs hl\ni across all Nl neurons\nin layer l. For large Nl, this empirical distribution converges to a zero mean Gaussian since each\nij\u03c6(hl\u22121\ni is a weighted sum of a large number of uncorrelated random variables\nhl\n- i.e. the weights Wl\ni, which are independent of the activity in previous layers. By\npropagating this Gaussian distribution across one layer, we obtain an iterative map for ql in (2):\n\n) + bl\nij and biases bl\n\nj Wl\n\nj\n\nql = V(ql\u22121 | \u03c3w, \u03c3b) \u2261 \u03c32\n\nw\n\nDz \u03c6\n\nql\u22121z\n\n+ \u03c32\nb ,\n\nfor\n\nl = 2, . . . , D,\n\n(3)\n\n2\u03c0\n\n2 is the standard Gaussian measure, and the initial condition is q1 = \u03c32\n\nwhere Dz = dz\u221a\ne\u2212 z2\nb ,\nwq0 + \u03c32\nx0 \u00b7 x0 is the length in the initial activity layer. See Supplementary Material (SM)\nwhere q0 = 1\nN0\nfor a derivation of (3). 
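The length map V in (3) can be evaluated numerically by replacing the Gaussian integral with quadrature. A sketch, assuming φ = tanh and Gauss-Hermite quadrature for the standard normal measure (the node count and the example (σw, σb) values are arbitrary choices):

```python
import numpy as np

def length_map(q, sigma_w, sigma_b, phi=np.tanh, n=101):
    """One application of the length map V in eq. (3). The Gaussian integral
    over z uses Gauss-Hermite quadrature; hermegauss gives nodes/weights for
    the weight exp(-z^2/2), so the weights are normalized to sum to 1."""
    z, w = np.polynomial.hermite_e.hermegauss(n)
    Dz = w / np.sqrt(2.0 * np.pi)
    return sigma_w**2 * np.sum(Dz * phi(np.sqrt(q) * z)**2) + sigma_b**2

def length_fixed_point(sigma_w, sigma_b, q0=1.0, iters=50):
    """Iterate the length map from q0 to numerically reach q*(sigma_w, sigma_b)."""
    q = q0
    for _ in range(iters):
        q = length_map(q, sigma_w, sigma_b)
    return q
```

Iterating `length_map` reproduces the behavior described next: with σb = 0 and σw < 1 the iterates shrink to q* = 0, while large σw or nonzero σb yields a nonzero stable fixed point.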
Intuitively, the integral over z in (3) replaces an average over the empirical\ndistribution of hl\nThe function V in (3) is an iterative variance, or length, map that predicts how the length of an input in\n(2) changes as it propagates through the network. This length map is plotted in Fig. 1A for the special\n\ni across neurons i in layer l at large layer width Nl.\n\n2\n\n\fFigure 1: Dynamics of the squared length ql for a sigmoidal network (\u03c6(h) = tanh(h)) with 1000\nhidden units. (A) The iterative length map in (3) for 3 different \u03c3w at \u03c3b = 0.3. Theoretical\npredictions (solid lines) match well with individual network simulations (dots). Stars re\ufb02ect \ufb01xed\npoints q\u2217 of the map. (B) The iterative dynamics of the length map yields rapid convergence of ql\nto its \ufb01xed point q\u2217 , independent of initial condition (lines=theory; dots=simulation). (C) q\u2217 as a\nfunction of \u03c3w and \u03c3b. (D) Number of iterations required to achieve \u2264 1% fractional deviation off\nthe \ufb01xed point. The (\u03c3b, \u03c3w) pairs in (A,B) are marked with color matched circles in (C,D).\n\ncase of a sigmoidal nonlinearity, \u03c6(h) = tanh(h). For monotonic nonlinearities, this length map is\na monotonically increasing, concave function whose intersections with the unity line determine its\n\ufb01xed points q\u2217(\u03c3w, \u03c3b). For \u03c3b = 0 and \u03c3w < 1, the only intersection is at q\u2217 = 0. In this bias-free,\nsmall weight regime, the network shrinks all inputs to the origin. For \u03c3w > 1 and \u03c3b = 0, the q\u2217 = 0\n\ufb01xed point becomes unstable and the length map acquires a second nonzero \ufb01xed point, which is\nstable. In this bias-free, large weight regime, the network expands small inputs and contracts large\ninputs. Also, for any nonzero bias \u03c3b, the length map has a single stable non-zero \ufb01xed point. 
In such\na regime, even with small weights, the injected biases at each layer prevent signals from decaying to\n0. The dynamics of the length map leads to rapid convergence of length to its \ufb01xed point with depth\n(Fig. 1B,D), often within only 4 layers. The \ufb01xed points q\u2217(\u03c3w, \u03c3b) are shown in Fig. 1C.\n\n3 Transient chaos in deep networks\n\nNow consider the layer-wise propagation of two inputs x0,1 and x0,2. The geometry of these two\ninputs as they propagate through the network is captured by the 2 by 2 matrix of inner products:\n\nNl(cid:88)\n\ni=1\n\nql\nab =\n\n1\nNl\n\nhl\ni(x0,a) hl\n\ni(x0,b)\n\na, b \u2208 {1, 2}.\n\n(4)\n\n(cid:113)\n12 , ql\u22121\n11 , ql\u22121\nql\u22121\n11 z1,\n\n(cid:113)\n22 | \u03c3w, \u03c3b) \u2261 \u03c32\n\n(cid:20)\n\n(cid:113)\n\n(cid:21)\n\nThe dynamics of the two diagonal terms are each theoretically predicted by the length map in (3). We\nderive (see SM) a correlation map C that predicts the layer-wise dynamics of ql\n12:\nDz1 Dz2 \u03c6 (u1) \u03c6 (u2) + \u03c32\nb ,\n\n12 = C(cl\u22121\nql\n\n(5)\n\nw\n\n(cid:90)\n\nql\u22121\n\n22\n\ncl\u22121\n12 z1 +\n\n1 \u2212 (cl\u22121\n\n,\n\n12 )2z2\n\n11ql\n\nu2 =\n\n12(ql\n\nu1 =\n22)\u22121/2 is the correlation coef\ufb01cient. Here z1 and z2 are independent standard\nwhere cl\n12 = ql\nGaussian variables, while u1 and u2 are correlated Gaussian variables with covariance matrix\n(cid:104)uaub(cid:105) = ql\u22121\nab . Together, (3) and (5) constitute a theoretical prediction for the typical evolution of\nthe geometry of 2 points in (4) in a \ufb01xed large network.\nAnalysis of these equations reveals an interesting order to chaos transition in the \u03c3w and \u03c3b plane. In\nparticular, what happens to two nearby points as they propagate through the layers? Their relation to\n12 between the two points, which approaches\neach other can be tracked by the correlation coef\ufb01cient cl\na \ufb01xed point c\u2217(\u03c3w, \u03c3b) at large depth. 
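The correlation map C in (5) can be evaluated the same way, with a tensor-product quadrature grid for the two independent Gaussians z1 and z2. A sketch under the same assumptions as before (φ = tanh, arbitrary grid size):

```python
import numpy as np

def corr_map(c12, q11, q22, sigma_w, sigma_b, phi=np.tanh, n=61):
    """One application of the correlation map C in eq. (5). z1, z2 are
    integrated on a tensor-product Gauss-Hermite grid; u1 and u2 then have
    the covariance <u_a u_b> = q_ab described in the text."""
    z, w = np.polynomial.hermite_e.hermegauss(n)
    Dz = w / np.sqrt(2.0 * np.pi)
    z1, z2 = np.meshgrid(z, z, indexing="ij")
    u1 = np.sqrt(q11) * z1
    u2 = np.sqrt(q22) * (c12 * z1 + np.sqrt(1.0 - c12**2) * z2)
    integral = np.einsum("i,j,ij->", Dz, Dz, phi(u1) * phi(u2))
    return sigma_w**2 * integral + sigma_b**2
```

Note that at c12 = 1 the two arguments coincide and C reduces to the length map V, which is the direct-calculation check that c* = 1 is always a fixed point of the correlation coefficient map.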
Since the length of each point rapidly converges to q∗(σw, σb), as shown in Fig. 1BD, we can compute c∗ by simply setting q^l_11 = q^l_22 = q∗(σw, σb) in (5) and dividing by q∗ to obtain an iterative correlation coefficient map, or C-map, for c^l_12:\n\nc^l_12 = (1/q∗) C(c^{l−1}_12, q∗, q∗ | σw, σb). (6)\n\nThis C-map is shown in Fig. 2A. It always has a fixed point at c∗ = 1, as can be checked by direct calculation. However, the stability of this fixed point depends on the slope of the map at 1, which is\n\nχ1 ≡ ∂c^l_12 / ∂c^{l−1}_12 |_{c=1} = σw² ∫ Dz [φ′(√q∗ z)]². (7)\n\nSee SM for a derivation of (7). If the slope χ1 is less than 1, then the C-map is above the unity line, the fixed point at 1 under the C-map in (6) is stable, and nearby points become more similar over time.\n\nFigure 2: Dynamics of correlations, c^l_12, in a sigmoidal network with φ(h) = tanh(h). (A) The C-map in (6) for the same σw and σb = 0.3 as in Fig. 1A. (B) The C-map dynamics, derived from both theory, through (6) (solid lines), and numerical simulations of (1) with Nl = 1000 (dots). (C) Fixed points c∗ of the C-map. (D) The slope of the C-map at 1, χ1, partitions the space (black dotted line at χ1 = 1) into chaotic (χ1 > 1, c∗ < 1) and ordered (χ1 < 1, c∗ = 1) regions.\n\nConversely, if χ1 > 1 then this fixed point is unstable, and nearby points separate as they propagate through the layers. Thus we can intuitively understand χ1 as a multiplicative stretch factor. This intuition can be made precise by considering the Jacobian J^l_ij = W^l_ij φ′(h^{l−1}_j) at a point h^{l−1} with length q∗. J^l is a linear approximation of the network map from layer l−1 to l in the vicinity of h^{l−1}. Therefore a small random perturbation h^{l−1} + u will map to h^l + Ju. The growth of the perturbation, ||Ju||²/||u||², becomes χ1(q∗) after averaging over the random perturbation u, the weight matrix W^l, and the Gaussian distribution of h^{l−1}_i across i. Thus χ1 directly reflects the typical multiplicative growth or shrinkage of a random perturbation across one layer.\n\nThe dynamics of the iterative C-map and its agreement with network simulations is shown in Fig. 2B. The correlation dynamics are much slower than the length dynamics because the C-map is closer to the unity line (Fig. 2A) than the length map (Fig. 1A). Thus correlations typically take about 20 layers to approach the fixed point, while lengths need only 4. The fixed point c∗ and slope χ1 of the C-map are shown in Fig. 2CD. For any fixed, finite σb, as σw increases three qualitative regimes occur. For small σw, c∗ = 1 is the only fixed point, and it is stable because χ1 < 1. In this strong bias regime, any two input points converge to each other as they propagate through the network. As σw increases, χ1 increases and crosses 1, destabilizing the c∗ = 1 fixed point. In this intermediate regime, a new stable fixed point c∗ appears, which decreases as σw increases. Here a competition on equal footing between the weights and nonlinearities (which de-correlate inputs) and the biases (which correlate them) leads to a finite c∗.
At larger σw, the strong weights overwhelm the biases and maximally de-correlate inputs to make them orthogonal, leading to a stable fixed point at c∗ = 0. Thus the equation χ1(σw, σb) = 1 yields a phase transition boundary in the (σw, σb) plane, separating it into a chaotic (or ordered) phase, in which nearby points separate (or converge). In dynamical systems theory, the logarithm of χ1 is related to the well known Lyapunov exponent, which is positive (or negative) for chaotic (or ordered) dynamics. However, in a feedforward network the dynamics are truncated at a finite depth D, and hence the dynamics are a form of transient chaos.\n\n4 The propagation of manifold geometry through deep networks\n\nNow consider a 1 dimensional manifold x0(θ) in input space, where θ is an intrinsic scalar coordinate on the manifold. This manifold propagates to a new manifold hl(θ) = hl(x0(θ)) in the vector space of inputs to layer l. The typical geometry of the manifold in the l'th layer is summarized by ql(θ1, θ2), which for any θ1 and θ2 is defined by (4) with the choice x0,a = x0(θ1) and x0,b = x0(θ2). The theory for the propagation of pairs of points applies to all pairs of points on the manifold, so intuitively, we expect that in the chaotic phase of a sigmoidal network the manifold should in some sense de-correlate and become more complex, while in the ordered phase the manifold should contract around a central point. This theoretical prediction of equations (3) and (5) is quantitatively confirmed in simulations in Fig. 3, when the input is a simple manifold, the circle h1(θ) = √(N1 q) [u0 cos(θ) + u1 sin(θ)], where u0 and u1 form an orthonormal basis for a 2 dimensional subspace of RN1 in which the circle lives. The scaling is chosen so that each neuron has input activity O(1). Also, for simplicity, we choose the fixed point radius q = q∗ in Fig. 3.\n\nFigure 3: Propagating a circle through three random sigmoidal networks with varying σw and fixed σb = 0.3. (A) Projection of hidden inputs of simulated networks at layer 5 and 10 onto their first three principal components. Insets show the fraction of variance explained by the first 5 singular values. For large weights (bottom), the distribution of singular values gets flatter and the projected curve is more tangled. (B) The autocorrelation, c^l_12(Δθ) = ∫ dθ q^l(θ, θ + Δθ)/q∗, of hidden inputs as a function of layer for simulated networks. (C) The theoretical predictions from (6) (solid lines) compared to the average (dots) and standard deviation across θ (shaded) in a simulated network.\n\nTo quantitatively understand the layer-wise growth of complexity of this manifold, it is useful to turn to concepts in Riemannian geometry [16]. First, at each point θ, the manifold h(θ) (we temporarily suppress the layer index l) has a tangent, or velocity, vector v(θ) = ∂θh(θ). Intuitively, curvature is related to how quickly this tangent vector rotates in the ambient space RN as one moves along the manifold, or in essence the acceleration vector a(θ) = ∂θv(θ). Now at each point θ, when both are nonzero, v(θ) and a(θ) span a 2 dimensional subspace of RN. Within this subspace, there is a unique circle of radius R(θ) that has the same position, velocity and acceleration vector as the curve h(θ) at θ. This circle is known as the osculating circle (Fig. 4A), and the extrinsic curvature κ(θ) of the curve is defined as κ(θ) = 1/R(θ). Thus, intuitively, small radii of curvature R(θ) imply high extrinsic curvature κ(θ).
The extrinsic curvature of a curve depends only on its image in RN and is invariant with respect to the particular parameterization θ → h(θ). For any parameterization, an explicit expression for κ(θ) is given by κ(θ) = (v·v)^(−3/2) √((v·v)(a·a) − (v·a)²) [16]. Note that under a unit speed parameterization of the curve, so that v(θ)·v(θ) = 1, we have v(θ)·a(θ) = 0, and κ(θ) is simply the norm of the acceleration vector.\n\nAnother measure of the curve's complexity is the length LE of its image in the ambient Euclidean space. The Euclidean metric in RN induces a metric gE(θ) = v(θ)·v(θ) on the curve, so that the distance dLE moved in RN as one moves from θ to θ + dθ on the curve is dLE = √(gE(θ)) dθ. The total curve length is LE = ∫ √(gE(θ)) dθ. However, even straight line segments can have a large Euclidean length. Another interesting measure of length that takes into account curvature is the length of the image of the curve under the Gauss map. For a K dimensional manifold M embedded in RN, the Gauss map (Fig. 4B) maps a point θ ∈ M to its K dimensional tangent plane TθM ∈ GK,N, where GK,N is the Grassmannian manifold of all K dimensional subspaces in RN. In the special case of K = 1, GK,N is the sphere SN−1 with antipodal points identified, since a 1-dimensional subspace can be identified with a unit vector, modulo sign. The Gauss map takes a point θ on the curve and maps it to the unit velocity vector v̂(θ) = v(θ)/√(v(θ)·v(θ)). In particular, the natural metric on SN−1 induces a Gauss metric on the curve, given by gG(θ) = (∂θv̂(θ))·(∂θv̂(θ)), which measures how quickly the unit tangent vector v̂(θ) changes as θ changes. Thus the distance dLG moved in the Grassmannian GK,N as one moves from θ to θ + dθ on the curve is dLG = √(gG(θ)) dθ, and the length of the curve under the Gauss map is LG = ∫ √(gG(θ)) dθ. Furthermore, the Gauss metric is related to the extrinsic curvature and the Euclidean metric via the relation gG(θ) = κ(θ)² gE(θ) [16].\n\nFigure 4: Propagation of extrinsic curvature and length in a network with 1000 hidden units. (A) An osculating circle. (B) A curve with unit tangent vectors at 4 points in ambient space, and the image of these points under the Gauss map. (C-E) Propagation of curvature metrics based on both theory derived from iterative maps in (3), (6) and (8) (solid lines) and simulations using (1) (dots). (F) Schematic of the normal vector, tangent plane, and principal curvatures for a 2D manifold embedded in R3. (G) Average principal curvatures for the largest and smallest 4 principal curvatures (κ±1, . . . , κ±4) across locations θ within one network. The principal curvatures all grow exponentially as we backpropagate to the input layer. Panels F,G are discussed in Sec. 5.\n\nTo illustrate these concepts, it is useful to compute all of them for the circle h1(θ) defined above: gE(θ) = Nq, LE = 2π√(Nq), κ(θ) = 1/√(Nq), gG(θ) = 1, and LG = 2π. As expected, κ(θ) is the inverse of the radius of curvature, which is √(Nq). Now consider how these quantities change if the circle is scaled up so that h(θ) → χh(θ). The length LE and radius scale up by χ, but the curvature κ scales down as χ−1, and so LG does not change.
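These circle values (gE = Nq, κ = 1/√(Nq), LE = 2π√(Nq), LG = 2π) can be checked numerically with a small finite-difference sketch; the ambient dimension N, radius parameter q, and grid size below are arbitrary illustrative choices, and the derivative estimates are only approximate at the grid endpoints.

```python
import numpy as np

def curve_geometry(h, dtheta):
    """Finite-difference estimates of the Euclidean metric g_E = v.v, the
    extrinsic curvature kappa = (v.v)^(-3/2) sqrt((v.v)(a.a) - (v.a)^2),
    and the lengths L_E = int sqrt(g_E) dtheta, L_G = int kappa sqrt(g_E) dtheta,
    for a curve h(theta) sampled on a uniform grid (rows of h)."""
    v = np.gradient(h, dtheta, axis=0)   # velocity dh/dtheta
    a = np.gradient(v, dtheta, axis=0)   # acceleration
    vv = np.sum(v * v, axis=1)
    aa = np.sum(a * a, axis=1)
    va = np.sum(v * a, axis=1)
    kappa = vv**-1.5 * np.sqrt(np.maximum(vv * aa - va**2, 0.0))
    L_E = np.sum(np.sqrt(vv)) * dtheta
    L_G = np.sum(kappa * np.sqrt(vv)) * dtheta
    return kappa, L_E, L_G

# the input circle h1(theta) = sqrt(N q) [u0 cos(theta) + u1 sin(theta)]
N, q = 1000, 1.5
theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
u0 = np.zeros(N); u0[0] = 1.0
u1 = np.zeros(N); u1[1] = 1.0
h = np.sqrt(N * q) * (np.outer(np.cos(theta), u0) + np.outer(np.sin(theta), u1))
kappa, L_E, L_G = curve_geometry(h, theta[1] - theta[0])
```

The estimates recover κ ≈ 1/√(Nq), LE ≈ 2π√(Nq), and LG ≈ 2π to within discretization error, and the same routine can be applied to the simulated hidden manifolds hl(θ).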
Thus linear expansion increases length\nand decreases curvature, thereby maintaining constant Grassmannian length LG.\nWe now show that nonlinear propagation of this same circle through a deep network can behave very\ndifferently from linear expansion: in the chaotic regime, length can increase without any decrease\nin extrinsic curvature! To remove the scaling with N in the above quantities, we will work with the\nLE. Thus, 1/(\u00af\u03ba)2 can be thought\nrenormalized quantities \u00af\u03ba =\nof as a radius of curvature squared per neuron of the osculating circle, while ( \u00afLE)2 is the squared\nEuclidean length of the curve per neuron. For the circle, these quantities are q and 2\u03c0q respectively.\nFor simplicity, in the inputs to the \ufb01rst layer of neurons, we begin with a circle h1(\u03b8) with squared\nradius per neuron q1 = q\u2217, so this radius is already at the \ufb01xed point of the length map in (3). In the\nSM, we derive an iterative formula for the extrinsic curvature and Euclidean metric of this manifold\nas it propagates through the layers of a deep network:\n\nN gE, and \u00afLE = 1\u221a\n\nN \u03ba, \u00afgE = 1\n\n\u221a\n\nN\n\n\u00afgE,l = \u03c71 \u00afgE,l\u22121\n\n(\u00af\u03bal)2 = 3\n\n\u03c72\n\u03c72\n1\n\n(\u00af\u03bal\u22121)2,\n\n1\n\u03c71\n\n\u00afgE,1 = q\u2217,\n\n(\u00af\u03ba1)2 = 1/q\u2217.\n\n(8)\n\n+\n\n(cid:90)\n\nDz(cid:2)\u03c6(cid:48)(cid:48)(cid:0)\u221a\n\nq\u2217z(cid:1)(cid:3)2\n\n6\n\nwhere \u03c71 is the stretch factor de\ufb01ned in (7) and \u03c72 is de\ufb01ned analogously as\n\n\u03c72 = \u03c32\nw\n\n.\n\n(9)\n\n\f12 = 1; this second derivative is\n\n\u03c72 is closely related to the second derivative of the C-map in (6) at cl\u22121\n\u03c72q\u2217. See SM for a derivation of the evolution equations for extrinsic geometry in (8).\nIntriguingly for a sigmoidal neural network, these evolution equations behave very differently in\nthe chaotic (\u03c71 > 1) versus ordered (\u03c71 < 1) phase. 
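The evolution equations (8), with χ1 from (7) and χ2 from (9), can be iterated directly. A numerical sketch for φ = tanh; the quadrature helper and the particular (σw, σb, q∗, depth) values used below are illustrative assumptions, with q∗ supplied by iterating the length map (3).

```python
import numpy as np

def gauss_avg(f, q, n=101):
    """int Dz f(sqrt(q) z), by Gauss-Hermite quadrature for the standard normal."""
    z, w = np.polynomial.hermite_e.hermegauss(n)
    return np.sum(w * f(np.sqrt(q) * z)) / np.sqrt(2.0 * np.pi)

def curvature_flow(q_star, sigma_w, depth):
    """Iterate the evolution equations (8) for phi = tanh, from the circle
    initial conditions gE = q*, kappa^2 = 1/q*, with chi1 as in eq. (7) and
    chi2 as in eq. (9)."""
    dphi = lambda h: 1.0 - np.tanh(h)**2                          # phi'
    d2phi = lambda h: -2.0 * np.tanh(h) * (1.0 - np.tanh(h)**2)   # phi''
    chi1 = sigma_w**2 * gauss_avg(lambda h: dphi(h)**2, q_star)
    chi2 = sigma_w**2 * gauss_avg(lambda h: d2phi(h)**2, q_star)
    gE, k2 = q_star, 1.0 / q_star
    traj = [(gE, np.sqrt(k2))]
    for _ in range(depth):
        gE = chi1 * gE                           # metric stretched by chi1
        k2 = 3.0 * chi2 / chi1**2 + k2 / chi1    # curvature attenuated, then re-injected
        traj.append((gE, np.sqrt(k2)))
    return chi1, chi2, traj
```

In the chaotic regime (χ1 > 1) the iteration shows ḡE growing exponentially while κ̄ settles at the fixed point κ̄∗ = √(3χ2 / (χ1(χ1 − 1))) obtained by solving the second update equation at its fixed point.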
In the chaotic phase, the Euclidean metric \u00afgE\ngrows exponentially with depth due to multiplicative stretching through \u03c71. This stretching does\nmultiplicatively attenuate any curvature in layer l \u2212 1 by a factor 1/\u03c71 (see the update equation for\n\u00af\u03bal in (8)), but new curvature is added in due to a nonzero \u03c72, which originates from the curvature of\nthe single neuron nonlinearity in (9). Thus, unlike in linear expansion, extrinsic curvature is not lost,\nbut maintained, and ultimately approaches a \ufb01xed point \u00af\u03ba\u2217. This implies that the global curvature\nmeasure \u00afLG grows exponentially with depth. These highly nontrivial predictions of the metric and\ncurvature evolution equations in (8) are quantitatively con\ufb01rmed in simulations in Figure 4C-E.\nIntuitively, this exponential growth of global curvature \u00afLG in the chaotic phase implies that the curve\nexplores many different tangent directions in hidden representation space. This further implies that\nthe coordinate functions of the embedding hl\ni(\u03b8) become highly complex curved basis functions\non the input manifold coordinate \u03b8, allowing a deep network to compute exponentially complex\nfunctions over simple low dimensional manifolds (Figure 5A-C, details in SM).\n\nFigure 5: Deep networks in the chaotic regime are more expressive than shallow networks. (A)\nActivity of four different neurons in the output layer as a function of the input, \u03b8 for three networks\nof different depth (width Nl = 1, 000). (B) Linear regression of the output activity onto a random\nfunction (black) shows closer predictions (blue) with deeper networks (bottom) than shallow networks\n(top). (C) Decomposing the prediction error by frequency shows shallow networks cannot capture high\nfrequency content in random functions but deep networks can (yellow=high error). 
(D) Increasing\nthe width of a one hidden layer network up to 10, 000 does not decrease error at high frequencies.\n5 Shallow networks cannot achieve exponential expressivity\n\nConsider a shallow network with 1 hidden layer x1, one input layer x0, with x1 = \u03c6(W1x0) + b1,\nand a linear readout layer. How complex can the hidden representation be as a function of its width\nN1, relative to the results above for depth? We prove a general upper bound on LE (see SM):\nTheorem 1. Suppose \u03c6(h) is monotonically non-decreasing with bounded dynamic range R, i.e.\nmaxh \u03c6(h) \u2212 minh \u03c6(h) = R. Further suppose that x0(\u03b8) is a curve in input space such that no 1D\nprojection of \u2202\u03b8x(\u03b8) changes sign more than s times over the range of \u03b8. Then for any choice of W1\nand b1 the Euclidean length of x1(\u03b8), satis\ufb01es LE \u2264 N1(1 + s)R.\n\u221a\nFor the circle input, s = 1 and for the tanh nonlinearity, R = 2, so in this special case, the normalized\nlength \u00afLE \u2264 2\nN1. In contrast, for deep networks in the chaotic regime \u00afLE grows exponentially\nwith depth in h space, and so consequently also in x space. Therefore the length of curves typically\nexpand exponentially in depth even for random deep networks, but can only expand as the square\nroot of width no matter what shallow network is chosen. Moreover, as we have seen above, it is the\nexponential growth of \u00afLE that fundamentally drives the exponential growth of \u00afLG with depth. 
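Theorem 1 is easy to probe numerically: sample one shallow tanh network on the circle input and compare the discretized Euclidean length of x1(θ) against the bound N1(1 + s)R with s = 1 and R = 2. A sketch; the widths, weight scale, discretization, and placing the bias inside φ are arbitrary illustrative choices, not prescribed by the theorem.

```python
import numpy as np

def euclidean_length(x):
    """Euclidean length of a discretized curve whose samples are the rows of x."""
    return np.sum(np.linalg.norm(np.diff(x, axis=0), axis=1))

rng = np.random.default_rng(0)
N0, N1 = 100, 2000
theta = np.linspace(0.0, 2.0 * np.pi, 4000)
x0 = np.zeros((theta.size, N0))
x0[:, 0] = np.sqrt(N0) * np.cos(theta)  # the circle, in a 2D subspace of R^{N0}
x0[:, 1] = np.sqrt(N0) * np.sin(theta)

W1 = rng.normal(0.0, 5.0 / np.sqrt(N0), size=(N1, N0))  # deliberately large weights
b1 = rng.normal(0.0, 0.3, size=N1)
x1 = np.tanh(x0 @ W1.T + b1)  # shallow hidden representation x1(theta)

L_E = euclidean_length(x1)
bound = N1 * (1 + 1) * 2.0  # Theorem 1: L_E <= N1 (1 + s) R with s = 1, R = 2
```

Even with large weights chosen to maximize stretching, the measured length stays below the theorem's bound, which grows only linearly in N1, so the normalized length grows at most like √N1.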
Indeed\nshallow random networks exhibit minimal growth in expressivity even at large widths (Figure 5D).\n\n6 Classi\ufb01cation boundaries acquire exponential local curvature with depth\nWe have focused so far on how simple manifolds in input space can acquire both exponential\nEuclidean and Grassmannian length with depth, thereby exponentially de-correlating and \ufb01lling up\n\n7\n\n\f2 P \u22022G\n\n\u2202x\u2202xT P, where P = I \u2212(cid:98)\u2207G(cid:98)\u2207GT is the projection operator\nonto Tx\u2217M and (cid:98)\u2207G is the unit normal vector [16]. Intuitively, near x\u2217, the decision boundary M\n\nhidden representation space. Another natural question is how the complexity of a decision boundary\ngrows as it is backpropagated to the input layer. Consider a linear classi\ufb01er y = sgn(\u03b2 \u00b7 xD \u2212 \u03b20)\nacting on the \ufb01nal layer. In this layer, the N \u2212 1 dimensional decision boundary is the hyperplane\n\u03b2\u00b7xD\u2212\u03b20 = 0. However, in the input layer x0, the decision boundary is a curved N \u22121 dimensional\nmanifold M that arises as the solution set of the nonlinear equation G(x0) \u2261 \u03b2 \u00b7 xD(x0) \u2212 \u03b20 = 0,\nwhere xD(x0) is the nonlinear feedforward map from input to output.\nAt any point x\u2217 on the decision boundary in layer l, the gradient (cid:126)\u2207G is perpendicular to the N \u2212 1\ndimensional tangent plane Tx\u2217M (see Fig. 4F). The normal vector (cid:126)\u2207G, along with any unit tangent\nvector \u02c6v \u2208 Tx\u2217M, spans a 2 dimensional subspace whose intersection with M yields a geodesic\ncurve in M passing through x\u2217 with velocity vector \u02c6v. This geodesic will have extrinsic curvature\n\u03ba(x\u2217, \u02c6v). Maximizing this curvature over \u02c6v yields the \ufb01rst principal curvature \u03ba1(x\u2217). 
A sequence\nof successive maximizations of \u03ba(x\u2217, \u02c6v), while constraining \u02c6v to be perpendicular to all previous\nsolutions, yields the sequence of principal curvatures \u03ba1(x\u2217) \u2265 \u03ba2(x\u2217) \u2265 \u00b7\u00b7\u00b7 \u2265 \u03baN\u22121(x\u2217). These\nprincipal curvatures arise as the eigenvalues of a normalized Hessian operator projected onto the\ntangent plane Tx\u2217M: H = ||(cid:126)\u2207G||\u22121\ncan be approximated as a paraboloid with a quadratic form H whose N \u2212 1 eigenvalues are the\nprincipal curvatures \u03ba1, . . . , \u03baN\u22121 (Fig. 4F).\nWe compute these curvatures numerically as a function of depth in Fig. 4G (see SM for details).\nWe \ufb01nd, remarkably, that a subset of principal curvatures grow exponentially with depth. Here\nthe principal curvatures are signed, with positive (negative) curvature indicating that the associated\ngeodesic curves towards (away from) the normal vector (cid:126)\u2207G. Thus the decision boundary can\nbecome exponentially curved with depth, enabling highly complex classi\ufb01cations. Moreover, this\nexponentially curved boundary is disentangled and mapped to a \ufb02at boundary in the output layer.\n7 Discussion\nFundamentally, neural networks compute nonlinear maps between high dimensional spaces, for\nexample from RN1 \u2192 RND, and it is unclear what the most appropriate mathematics is for under-\nstanding such daunting spaces of maps. Previous works have attacked this problem by restricting\nthe nature of the nonlinearity involved (e.g. piecewise linear, sum-product, or Pfaf\ufb01an) and thereby\nrestricting the space of maps to those amenable to special theoretical analysis methods (combinatorics,\npolynomial relations, or topological invariants). 
We have begun a preliminary exploration of the expressivity of such deep functions based on Riemannian geometry and dynamical mean field theory. We demonstrate that networks in a chaotic phase compactly express functions that exponentially grow the global curvature of simple one dimensional manifolds from input to output, and the local curvature of simple co-dimension one manifolds from output to input. The former captures the notion that deep neural networks can efficiently compute highly expressive functions in ways that shallow networks cannot, while the latter quantifies and demonstrates the power of deep neural networks to disentangle curved input manifolds, an attractive idea that has eluded formal quantification.

Moreover, our analysis of a maximum entropy distribution over deep networks constitutes an important null model of deep signal propagation that can be used to assess and understand different behavior in trained networks. For example, the metrics we have adapted from Riemannian geometry, combined with an understanding of their behavior in random networks, may provide a basis for understanding what is special about trained networks. Furthermore, while we have focused on the notion of input-output chaos, the duality between inputs and synaptic weights implies a form of weight chaos, in which deep neural networks rapidly traverse function space as weights change (see SM). Indeed, just as autocorrelation lengths between outputs as a function of inputs shrink exponentially with depth, so too will autocorrelations between outputs as a function of weights. Finally, while our length and correlation maps can be applied directly to piecewise linear nonlinearities (e.g. ReLUs), deep piecewise linear functions have zero local curvature.
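The input-output de-correlation invoked above, and its absence in the ordered phase, can be illustrated with a minimal simulation. This is a hedged sketch rather than the paper's experiment: the function name `final_overlap`, the width N = 1000, the depth 30, and the perturbation size 0.1 are illustrative choices, with weight scale sigma_w and zero biases:

```python
import numpy as np

def final_overlap(sigma_w, depth=30, N=1000, seed=0):
    """Cosine overlap of two nearby inputs after propagating through a
    random deep tanh network with i.i.d. weights of scale sigma_w/sqrt(N)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)
    y = x + 0.1 * rng.standard_normal(N)   # small input perturbation
    for _ in range(depth):
        W = rng.standard_normal((N, N)) * sigma_w / np.sqrt(N)
        x, y = np.tanh(W @ x), np.tanh(W @ y)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(final_overlap(0.5))  # ordered phase: overlap typically stays near 1
print(final_overlap(4.0))  # chaotic phase: nearby inputs de-correlate with depth
```

The same loop run with perturbed weights instead of perturbed inputs would illustrate the dual notion of weight chaos.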
To characterize how such functions twist across input space, our methods can compute tangent vector auto-correlations instead of curvature. But more generally, to understand functions, we often look to their graphs. The graph of a map from R^{N_1} to R^{N_D} is an N_1 dimensional submanifold of R^{N_1 + N_D}, and therefore has both high dimension and co-dimension. We speculate that many of the secrets of deep learning may be uncovered by studying the geometry of this graph as a Riemannian manifold, and understanding how it changes with both depth and learning.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

[3] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

[4] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

[5] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505-513, 2015.

[6] Lane T. McIntosh, Niru Maheswaranathan, Aran Nayebi, Surya Ganguli, and Stephen A. Baccus. Deep learning models of the retinal response to natural scenes. In Advances in Neural Information Processing Systems, 2016.

[7] Yoshua Bengio, Aaron Courville, and Pascal Vincent.
Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

[8] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333-341, 2007.

[9] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924-2932, 2014.

[10] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666-674, 2011.

[11] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.

[12] Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.

[13] James Martens, Arkadev Chattopadhya, Toni Pitassi, and Richard Zemel. On the representational efficiency of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2877-2885, 2013.

[14] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553-1565, 2014.

[15] Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning real and boolean functions: When is deep better than shallow. arXiv preprint arXiv:1603.00988, 2016.

[16] John M Lee. Riemannian Manifolds: An Introduction to Curvature, volume 176. Springer Science & Business Media, 2006.

[17] Haim Sompolinsky, A Crisanti, and HJ Sommers. Chaos in random neural networks. Physical Review Letters, 61(3):259, 1988.

[18] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks.
arXiv preprint arXiv:1606.05336, 2016.", "award": [], "sourceid": 1670, "authors": [{"given_name": "Ben", "family_name": "Poole", "institution": "Stanford University"}, {"given_name": "Subhaneil", "family_name": "Lahiri", "institution": "Stanford University"}, {"given_name": "Maithra", "family_name": "Raghu", "institution": "Cornell University"}, {"given_name": "Jascha", "family_name": "Sohl-Dickstein", "institution": "Google Brain"}, {"given_name": "Surya", "family_name": "Ganguli", "institution": "Stanford"}]}