{"title": "On the Convergence Rate of Training Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6676, "page_last": 6688, "abstract": "How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper?\n\nIn this paper, we focus on recurrent neural networks (RNNs) which are multi-layer networks widely used in natural language processing.\nThey are harder to analyze than feedforward neural networks, because the \\emph{same} recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$. We show when the number of neurons is sufficiently large, meaning polynomial in the training data size and in $L$, then SGD is capable of minimizing the regression loss in the linear convergence rate. This gives theoretical evidence of how RNNs can memorize data.\n\nMore importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and build a perturbation theory to analyze first-order approximation of multi-layer networks.", "full_text": "On the Convergence Rate of\n\nTraining Recurrent Neural Networks\u2217\n\nZeyuan Allen-Zhu\nMicrosoft Research AI\n\nzeyuan@csail.mit.edu\n\nYuanzhi Li\n\nCarnegie Mellon University\nyuanzhil@andrew.cmu.edu\n\nZhao Song\nUT-Austin\n\nzhaos@utexas.edu\n\nAbstract\n\nHow can local-search methods such as stochastic gradient descent (SGD) avoid\nbad local minima in training multi-layer neural networks? Why can they \ufb01t ran-\ndom labels even given non-convex and non-smooth architectures? 
Most existing theory only covers networks with one hidden layer, so can we go deeper?\nIn this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the same recurrent unit is repeatedly applied across the entire time horizon of length L, which is analogous to feedforward networks of depth L. We show that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in L, SGD is capable of minimizing the regression loss at a linear convergence rate. This gives theoretical evidence of how RNNs can memorize data.\nMore importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and build a perturbation theory to analyze the first-order approximation of multi-layer networks.\n\n1 Introduction\n\nNeural networks have been one of the most powerful tools in machine learning over the past few decades. The multi-layer structure of neural networks gives them great power in expressibility and learning performance. However, it raises complexity concerns: the training objective is generally non-convex and non-smooth. In practice, local-search algorithms such as stochastic gradient descent (SGD) are capable of finding global optima, at least on the training data [19, 59]. How SGD avoids local minima for such objectives has remained an open theoretical question since Goodfellow et al. [19].\nIn recent years, there have been a number of theoretical results aiming at a better understanding of this phenomenon. Many of them focus on two-layer (thus one-hidden-layer) neural networks and assume that the inputs are random Gaussian or sufficiently close to Gaussian [7, 15, 18, 32, 37, 50, 55, 60, 61]. 
Some study deep neural networks but assume the activation function is linear [5, 6, 22]. Some study the convex task of training essentially only the last layer of the network [13].\nMore recently, Safran and Shamir [43] provided evidence that, even when inputs are standard Gaussians, two-layer neural networks can indeed have spurious local minima, and suggested that over-parameterization (i.e., increasing the number of neurons) may be the key to avoiding spurious local minima. Li and Liang [31] showed that, for two-layer networks with the cross-entropy loss, in the over-parameterization regime, gradient descent (GD) is capable of finding nearly-global optimal solutions on the training data. This result was later extended to the ℓ2 loss by Du et al. [16].\nIn this paper, we show GD and SGD are capable of training multi-layer neural networks (with ReLU activation) to global minima on any non-degenerate training data set. Furthermore, the running time is polynomial in the number of layers and the number of data points. Since there are many different types of multi-layer networks (convolutional, feedforward, recurrent, etc.), in the present paper we focus on recurrent neural networks (RNN) as our choice of multi-layer networks, and feedforward networks are only a “special case” of them (see our follow-up work [3]).\n\n∗Full version and future updates can be found on https://arxiv.org/abs/1810.12065.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nRecurrent Neural Networks. Among different architectures of neural networks, one of the least theoretically-understood structures is the recurrent one [17]. A recurrent neural network recurrently applies the same network unit to a sequence of input tokens, such as a sequence of words in a language sentence. RNN is particularly useful when there are long-term, non-linear interactions between input tokens in the same sequence. 
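In code form, this defining feature (one shared unit applied across the whole sequence) can be sketched as follows; this is a schematic with a hypothetical `unit` function, not code from the paper:

```python
from typing import Callable, Sequence, TypeVar

H = TypeVar("H")
X = TypeVar("X")

def run_rnn(unit: Callable[[H, X], H], h0: H, tokens: Sequence[X]) -> list[H]:
    """Apply the *same* unit at every time step; a feedforward network
    would instead use a different unit (different weights) per layer."""
    states = []
    h = h0
    for x in tokens:      # time horizon of length L
        h = unit(h, x)    # shared parameters across all steps
        states.append(h)
    return states

# toy usage: a "unit" that just accumulates token values
states = run_rnn(lambda h, x: h + x, 0, [1, 2, 3])
print(states)  # [1, 3, 6]
```

Because the same `unit` (and hence the same weight matrix) is reused L times, an error or blow-up in one step can compound across the whole horizon, which is exactly why the analysis below is harder than for feedforward networks.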
These networks are widely used in practice for natural language processing, language generation, machine translation, speech recognition, video and music processing, and many other tasks [11, 12, 27, 35, 36, 44, 52, 54]. On the theory side, while there are some attempts to show that an RNN is more expressive than a feedforward neural network [28], when and how an RNN can be efficiently learned has almost no theoretical explanation.\nIn practice, RNN is usually trained by simple local-search algorithms such as SGD. However, unlike shallow networks, the training process of RNN often runs into the trouble of vanishing or exploding gradients [53]. That is, the value of the gradient becomes exponentially small or large in the time horizon, even when the training objective is still constant. In practice, one of the popular ways to resolve this is the long short-term memory (LSTM) structure [24]. However, one can also use rectified linear units (ReLUs) as activation functions to avoid vanishing or exploding gradients [45]. In fact, one of the earliest adoptions of ReLUs, twenty years ago, was in applications of RNNs for exactly this purpose [21, 46]. For a detailed survey on RNN, we refer the readers to Salehinejad et al. [45].\n\n1.1 Our Question\n\nIn this paper, we study the following general questions:\n• Can ReLU provably stabilize the training process and avoid vanishing/exploding gradients?\n• Can RNN be trained close to zero training error efficiently under mild assumptions?\n(When there is no activation function, RNN is known as a linear dynamical system, and Hardt et al. [23] proved the convergence for such linear dynamical systems.)\nMotivations. One may also want to study whether RNN can be trained close to zero test error. However, unlike for feedforward networks, the training error, or the ability to memorize examples, may actually be desirable for RNN. 
After all, many tasks involving RNN are related to memories, and certain RNN units are even referred to as memory cells. Since RNN applies the same network unit to all input tokens in a sequence, the following question can possibly be of independent interest:\n• How does RNN learn mappings (say from token 3 to token 7) without destroying others?\nAnother motivation is the following. An RNN can be viewed as a space-constrained, differentiable Turing machine, except that the input is only allowed to be read in a fixed order. It was shown in Siegelmann and Sontag [49] that all Turing machines can be simulated by recurrent networks built of neurons with non-linear activations. In practice, RNN is also used as a tool to build neural Turing machines [20], equipped with a grand goal of automatically learning an algorithm based on the observation of the inputs and outputs. To this end, we believe the task of understanding trainability, as a first step towards understanding RNN, can be meaningful on its own.\nOur Result. To present the simplest result, we focus on the classical Elman network with ReLU activation:\nh_ℓ = φ(W·h_{ℓ−1} + A·x_ℓ) ∈ R^m, where W ∈ R^{m×m}, A ∈ R^{m×d_x};\ny_ℓ = B·h_ℓ ∈ R^d, where B ∈ R^{d×m}.\nWe denote by φ the ReLU activation function: φ(x) = max(x, 0). We note that (fully-connected) feedforward networks are only “special cases” of this, obtained by replacing W with W_ℓ for each layer.²\nWe consider a regression task where each sequence of inputs consists of vectors x_1, ..., x_L ∈ R^{d_x}, and we perform least-squares regression with respect to y*_1, ..., y*_L ∈ R^d. We assume there are n training sequences, each of length L. We assume the training sequences are δ-separable (say, the vectors x_1 differ by relative distance δ > 0 for every pair of training sequences). Our main theorem can be stated as follows.\nTheorem. If the number of neurons m ≥ poly(n, d, L, δ⁻¹, log ε⁻¹) is polynomially large, we can find weight matrices W, A, B where the RNN gives ε training error\n• if gradient descent (GD) is applied for T = Ω(poly(n, d, L)·δ⁻²·log(1/ε)) iterations, starting from random Gaussian initialization; or\n• if (mini-batch or regular) stochastic gradient descent (SGD) is applied for T = Ω(poly(n, d, L)·δ⁻²·log(1/ε)) iterations, starting from random Gaussian initialization.³\nOur Contribution. We summarize our contributions as follows.\n• We believe this is the first proof of convergence of GD/SGD for training the hidden layers of recurrent neural networks (or even for any multi-layer networks of more than two layers) when activation functions are present.⁴\n• Our results provide arguably the first theoretical evidence towards the empirical finding of Goodfellow et al. [19] on multi-layer networks, regarding the ability of SGD to avoid (spurious) local minima. Our theorem does not exclude the existence of bad local minima.\n• We build new technical toolkits to analyze multi-layer networks with ReLU activation, which have now found many applications [1–3, 9]. For instance, combining this paper with new techniques, one can derive guarantees on testing error for RNN in the PAC-learning language [1].\n²Most of the technical lemmas of this paper remain to hold (and become much simpler) once W is replaced with W_ℓ. This is carefully treated by [3].\nExtension: DNN. A feedforward neural network of depth L is similar to Elman RNN with the main difference being that the weights across layers are separately trained. 
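To make the Elman recursion above concrete, here is a minimal NumPy sketch of the forward pass and the ℓ2 regression objective; this is purely illustrative (the dimensions are made up and far below the poly(n, d, L) overparameterization the theorem requires):

```python
import numpy as np

def init_params(m, d_x, d, rng):
    # Random initialization used in the paper: W, A ~ N(0, 2/m), B ~ N(0, 1/d).
    W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
    A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d_x))
    B = rng.normal(0.0, np.sqrt(1.0 / d), (d, m))
    return W, A, B

def elman_forward(W, A, B, xs):
    # h_0 = 0;  h_l = phi(W h_{l-1} + A x_l);  y_l = B h_l.
    h = np.zeros(W.shape[0])
    hs, ys = [], []
    for x in xs:
        h = np.maximum(W @ h + A @ x, 0.0)   # ReLU phi
        hs.append(h)
        ys.append(B @ h)
    return hs, ys

def l2_objective(ys, ys_star):
    # f = (1/2) * sum over steps l >= 2 of ||y_l - y*_l||^2
    return 0.5 * sum(np.sum((y - t) ** 2) for y, t in zip(ys[1:], ys_star[1:]))

rng = np.random.default_rng(0)
m, d_x, d, L = 512, 16, 4, 10
W, A, B = init_params(m, d_x, d, rng)
xs = [x / np.linalg.norm(x) for x in rng.normal(size=(L, d_x))]  # unit-norm tokens
ys_star = list(rng.normal(size=(L, d)))
hs, ys = elman_forward(W, A, B, xs)
print(l2_objective(ys, ys_star))
```

Note that the loop reuses the same W at every step ℓ; the feedforward "special case" would instead draw a fresh W_ℓ per layer, which is exactly the extra independence referred to next.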
As one shall see, this only makes our proofs simpler, because we have more independence in the randomness. Our theorems also apply to feedforward neural networks, and we have written a separate follow-up paper [3] to address feedforward (fully-connected, residual, and convolutional) neural networks.\nExtension: Deep RNN. The Elman RNN is also referred to as a three-layer RNN, and one may also study the convergence of RNNs with more hidden layers. This is referred to as deep RNN [45]. Our theorem also applies to deep RNNs (by combining this paper together with [3]).\nExtension: Loss Functions. For simplicity, in this paper we have adopted the ℓ2 regression loss. Our results generalize to other Lipschitz-smooth (but possibly nonconvex) loss functions, by combining with the techniques of [3].\n\n1.2 Other Related Works\n\nAnother relevant work is Brutzkus et al. [8], where the authors studied over-parameterization in the case of two-layer neural networks under a linear-separability assumption.\nInstead of using randomly initialized weights like this paper, there is a line of work proposing algorithms using weights generated from some “tensor initialization” process [4, 26, 48, 55, 61].\nThere is a huge literature on using mean-field theory to study neural networks [10, 14, 25, 30, 34, 38–40, 42, 47, 56–58]. At a high level, they study the network dynamics at random initialization when the number of hidden neurons grows to infinity, and use such initialization theory to predict performance after training. 
However, they do not provide a theoretical convergence rate for the training process (at least when the number of neurons is finite).\n³At a first glance, one may question how it is possible for SGD to enjoy a logarithmic time dependency in ε⁻¹; after all, even when minimizing strongly-convex and Lipschitz-smooth functions, the typical convergence rate of SGD is T ∝ 1/ε as opposed to T ∝ log(1/ε). We quickly point out that there is no contradiction here if the stochastic pieces of the objective enjoy a common global minimizer.\n⁴Our theorem holds even when A, B are at random initialization and only the hidden weight matrix W is trained. This is much more difficult to analyze than the convex task of training only the last layer B [13]. Training only the last layer can significantly reduce the learning power of (recurrent or not) neural networks in practice.\n\n2 Notations and Preliminaries\n\nWe denote by ‖·‖₂ (or sometimes ‖·‖) the Euclidean norm of vectors, and by ‖·‖₂ the spectral norm of matrices. We denote by ‖·‖_∞ the infinity norm of vectors, ‖·‖₀ the sparsity of vectors or diagonal matrices, and ‖·‖_F the Frobenius norm of matrices. Given a matrix W, we denote by W_k or w_k the k-th row vector of W. We denote the row ℓ_p norm for W ∈ R^{m×d} as ‖W‖_{2,p} := (∑_{i∈[m]} ‖w_i‖₂^p)^{1/p}. By definition, ‖W‖_{2,2} = ‖W‖_F.\nWe use N(µ, σ) to denote the Gaussian distribution with mean µ and variance σ, or N(µ, Σ) to denote a Gaussian vector with mean µ and covariance Σ. We use 1_{event} to denote the indicator function of whether event is true. We denote by e_k the k-th standard basis vector. We use φ(·) to denote the ReLU function, namely φ(x) = max{x, 0} = 1_{x≥0}·x. Given a univariate function f: R → R, we also use f to denote the same function over vectors: f(x) = (f(x_1), ..., f(x_m)) if x ∈ R^m.\nGiven vectors v_1, ..., v_n ∈ R^m, we define U = GS(v_1, ..., v_n) as their Gram-Schmidt orthonormalization. Namely, U = [v̂_1, ..., v̂_n] ∈ R^{m×n}, where v̂_1 = v_1/‖v_1‖ and, for i ≥ 2, v̂_i = ∏_{j=1}^{i−1}(I − v̂_j v̂_j^⊤) v_i / ‖∏_{j=1}^{i−1}(I − v̂_j v̂_j^⊤) v_i‖. In the occasion that ∏_{j=1}^{i−1}(I − v̂_j v̂_j^⊤) v_i is the zero vector, we let v̂_i be an arbitrary unit vector that is orthogonal to v̂_1, ..., v̂_{i−1}.\nWe assume n training inputs are given: (x_{i,1}, x_{i,2}, ..., x_{i,L}) ∈ (R^{d_x})^L for each input i ∈ [n]. We assume n training labels are given: (y*_{i,1}, y*_{i,2}, ..., y*_{i,L}) ∈ (R^d)^L for each input i ∈ [n]. Without loss of generality, we assume ‖x_{i,ℓ}‖ ≤ 1 for every i ∈ [n] and ℓ ∈ [L]. Also without loss of generality, we assume ‖x_{i,1}‖ = 1 and its last coordinate [x_{i,1}]_{d_x} = 1/√2 for every i ∈ [n].⁵\nWe make the following assumption on the input data (see Footnote 9 for how to relax it):\nAssumption 2.1. ‖x_{i,1} − x_{j,1}‖ ≥ δ for some parameter δ ∈ (0, 1] and every pair of i ≠ j ∈ [n].\n\n2.1 Elman Recurrent Neural Network\n\nGiven weight matrices W ∈ R^{m×m}, A ∈ R^{m×d_x}, B ∈ R^{d×m}, we introduce the following notations to describe the evaluation of RNN on the input sequences. For each i ∈ [n] and ℓ ∈ [L]:\nh_{i,0} = 0 ∈ R^m,  g_{i,ℓ} = W·h_{i,ℓ−1} + A·x_{i,ℓ} ∈ R^m,  h_{i,ℓ} = φ(W·h_{i,ℓ−1} + A·x_{i,ℓ}) ∈ R^m,  y_{i,ℓ} = B·h_{i,ℓ} ∈ R^d.\nA very important notion that this entire paper relies on is the following:\nDefinition 2.2. For each i ∈ [n] and ℓ ∈ [L], let D_{i,ℓ} ∈ R^{m×m} be the diagonal matrix where (D_{i,ℓ})_{k,k} = 1_{(W·h_{i,ℓ−1}+A·x_{i,ℓ})_k ≥ 0} = 1_{(g_{i,ℓ})_k ≥ 0}. As a result, we can write h_{i,ℓ} = D_{i,ℓ}·g_{i,ℓ}.\nWe consider the following random initialization distributions for W, A and B.\nDefinition 2.3. We say that W, A, B are at random initialization if the entries of W and A are i.i.d. generated from N(0, 2/m), and the entries B_{i,j} are i.i.d. generated from N(0, 1/d).\nThroughout this paper, for notational simplicity, we refer to index ℓ as the ℓ-th layer of RNN, and to h_{i,ℓ}, x_{i,ℓ}, y_{i,ℓ} respectively as the hidden neurons, input, output on the ℓ-th layer. We acknowledge that in certain literature, one may regard the Elman network as a three-layer RNN.\nAssumption 2.4. We assume m ≥ poly(n, d, L, 1/δ, log(1/ε)) for some sufficiently large polynomial. Without loss of generality, we assume δ ≤ 1/(C·L²·log³ m) for some sufficiently large constant C (if this is not satisfied, one can decrease δ). Throughout the paper, except the detailed appendix, we use Õ, Ω̃ and Θ̃ notions to hide polylogarithmic dependency in m. To simplify notations, we denote by
To simplify notations, we denote by\n\nCL2 log3 m for some suf\ufb01ciently large constant C (if this\n\n\u03b5 ) for some suf\ufb01ciently large polynomial.\n\n\u03b4 , log 1\n\n1\n\n\u03c1 := nLd log m and \u0001 := nLd\u03b4\u22121 log(m/\u03b5) .\n\n5If it only satis\ufb01es (cid:107)xi,1(cid:107) \u2264 1 one can pad it with an additional coordinate to make (cid:107)xi,1(cid:107) = 1 hold. As\n\n, this is equivalent to adding a bias term N (0, 1\n\nm ) for the \ufb01rst layer.\n\nfor the assumption [xi,1]dx = 1\u221a\n\n2\n\n4\n\n\ff (W ) :=(cid:80)n\n\u2207kf (W ) =(cid:80)n\n\u2207f (W ) =(cid:80)n\n\n(cid:80)L\n(cid:80)L\n\ni=1\n\na=2\n\ni=1\n\na=2\n\n(cid:80)a\u22121\n(cid:80)a\u22121\n\n2.2 Objective and Gradient\nFor simplicity, we only optimize over the weight matrix W \u2208 Rm\u00d7m and let A and B be at random\ninitialization. As a result, our (cid:96)2-regression objective is a function over W :6\n\ni=1 fi(W )\n\nand fi(W ) := 1\n2\n\n2 where\n\nlossi,(cid:96) := Bhi,(cid:96) \u2212 y\u2217\n\ni,(cid:96) .\n\n(cid:80)L\n(cid:96)=2 (cid:107) lossi,(cid:96) (cid:107)2\n\nUsing chain rule, one can write down a closed form of the (sub-)gradient:\nFact 2.5. For k \u2208 [m], the gradient with respect to Wk (denoted by \u2207k) and the full gradient are\n\ni,(cid:96)+1\u2192a \u00b7 lossi,a)k \u00b7 hi,(cid:96) \u00b7 1(cid:104)Wk,hi,(cid:96)(cid:105)+(cid:104)Ak,xi,(cid:96)+1(cid:105)\u22650\n(cid:62)\n(cid:96)=1 (Back\n(cid:96)=1 Di,(cid:96)+1\n\ni,(cid:96)+1\u2192a \u00b7 lossi,a\n(cid:62)\n\ni,(cid:96)\n\n(cid:0) Back\n\n(cid:1) \u00b7 h(cid:62)\n\nwhere for every i \u2208 [n], (cid:96) \u2208 [L], and a = (cid:96) + 1, (cid:96) + 2, . . . , L:\n\nBacki,(cid:96)\u2192(cid:96) := B \u2208 Rd\u00d7m and Backi,(cid:96)\u2192a := BDi,aW \u00b7\u00b7\u00b7 Di,(cid:96)+1W \u2208 Rd\u00d7m .\n\n3 Our Results\n\nOur main results can be formally stated as follows.\n\nTheorem 1 (GD). 
Suppose \u03b7 = (cid:101)\u0398(cid:0) \u03b4\n\nm poly(n, d, L)(cid:1) and m \u2265 poly(n, d, L, \u03b4\u22121, log \u03b5\u22121). Let\n\nW (0), A, B be at random initialization. With high probability over the randomness of W (0), A, B,\nif we apply gradient descent for T steps W (t+1) = W (t) \u2212 \u03b7\u2207f (W (t)), then it satis\ufb01es\n\nTheorem 2 (SGD). Suppose \u03b7 = (cid:101)\u0398(cid:0) \u03b4\n\nf (W (T )) \u2264 \u03b5\n\nfor\n\n\u03b42\n\n1\n\u03b5\n\nlog\n\n(cid:1).\n\nT =(cid:101)\u2126(cid:0) poly(n, d, L)\nm poly(n, d, L)(cid:1) and m \u2265 poly(n, d, L, \u03b4\u22121, log \u03b5\u22121).\nT =(cid:101)\u2126(cid:0) poly(n, d, L)\n\n(cid:1).\n\nlog\n\n\u03b42\n\n1\n\u03b5\n\nLet W (0), A, B be at random initialization.\nIf we apply stochastic gradient descent for T steps\nW (t+1) = W (t)\u2212 \u03b7\u2207fi(W (t)) for a random index i \u2208 [n] per step, then with high probability (over\nW (0), A, B and the randomness of SGD), it satis\ufb01es\n\nf (W (T )) \u2264 \u03b5\n\nfor\n\nIn both cases, we essentially have linear convergence rates. Notably, our results show that the\ndependency of the number of layers L, is polynomial. Thus, even when RNN is applied to sequences\nof long input data, it does not suffer from exponential gradient explosion or vanishing (e.g., 2\u2126(L)\nor 2\u2212\u2126(L)) through the entire training process.\nMain Technical Theorems. Our main Theorem 1 and Theorem 2 are in fact natural consequences\nof the following two technical theorems. They both talk about the \ufb01rst-order behavior of RNNs\nwhen the weight matrix W is suf\ufb01ciently close to some random initialization.\nThe \ufb01rst theorem is similar to the classical Polyak-\u0141ojasiewicz condition [33, 41], and says that\n(cid:107)\u2207f (W )(cid:107)2\n\nTheorem 3. 
With high probability over random initialization(cid:102)W , A, B, it satis\ufb01es\n\u2200W \u2208 Rm\u00d7m with (cid:107)W \u2212(cid:102)W(cid:107)2 \u2264 poly(\u0001)\u221a\n\nF is at least as large as the objective value.\n\n(cid:107)\u2207f (W )(cid:107)2\nF ,(cid:107)\u2207fi(W )(cid:107)2\n\nF \u2265\n\u00d7 m \u00d7 f (W ) ,\nF \u2264 poly(\u03c1) \u00d7 m \u00d7 f (W ) .\n(Only the \ufb01rst statement is the Polyak-\u0141ojasiewicz condition; the second is a simple-to-proof gradi-\nTheorem 4. With high probability over random initialization(cid:102)W , A, B, it satis\ufb01es for every \u02d8W \u2208\nent upper bound.) The second theorem shows a special \u201csemi-smoothness\u201d property of the objective.\nRm\u00d7m with (cid:107) \u02d8W \u2212(cid:102)W(cid:107) \u2264 poly(\u0001)\u221a\nf ( \u02d8W + W (cid:48)) \u2264 f ( \u02d8W ) + (cid:104)\u2207f ( \u02d8W ), W (cid:48)(cid:105) + poly(\u0001)m1/3 \u00b7(cid:112)f (W ) \u00b7 (cid:107)W (cid:48)(cid:107)2 + poly(\u03c1)m(cid:107)W (cid:48)(cid:107)2\n\nm , and for every W (cid:48) \u2208 Rm\u00d7m with (cid:107)W (cid:48)(cid:107) \u2264 \u03c40\u221a\nm ,\n\n2 .\n6The index (cid:96) starts from 2, because Bhi,1 = B\u03c6(Axi,1) remains constant if we are not optimizing over A\n\n(cid:107)\u2207f (W )(cid:107)2\n\npoly(\u03c1)\n\nm\n\n\u03b4\n\n:\n\nand B.\n\n5\n\n\fAt a high level, the convergence of GD and SGD are careful applications of the two technical the-\norems above: indeed, Theorem 3 shows that as long as the objective value is high, the gradient is\nlarge; and Theorem 4 shows that if one moves in the (negative) gradient direction, then the objective\nvalue can be suf\ufb01ciently decreased. These two technical theorems together ensure that GD/SGD\ndoes not hit any saddle point or (bad) local minima along its training trajectory. This was practically\nobserved by Goodfellow et al. [19] and a theoretical justi\ufb01cation was open since then.\nAn Open Question. 
We did not try to tighten the polynomial dependencies on (n, d, L) in the proofs. When m is sufficiently large, we make use of the randomness at initialization to argue that, for all points within a certain radius from the initialization, Theorem 3 (for instance) holds. In practice, however, SGD can create additional randomness as time goes on; also, in practice, it suffices for those points on the SGD trajectory to satisfy Theorem 3. Unfortunately, such randomness can, in principle, be correlated with the SGD trajectory, so we do not know how to use it in the proofs. Analyzing such correlated randomness is certainly beyond the scope of this paper, but it could possibly explain why, in practice, the size of m needed is not that large.\n\n3.1 Conclusion\n\nOverall, we provide the first proof of convergence of GD/SGD for non-linear neural networks that have more than two layers. We show that with overparameterization, GD/SGD can avoid hitting any (bad) local minima along its training trajectory. This was practically observed by Goodfellow et al. [19], and a theoretical justification was open since then. We present our result using recurrent neural networks (as opposed to the simpler feedforward networks [3]) in this very first paper, because memorization in RNN could be of independent interest. Also, our result proves that RNN can learn mappings from different input tokens to different output tokens simultaneously, using the same recurrent unit.\nLast but not least, we build new tools to analyze multi-layer networks with ReLU activations that could facilitate much new research on deep learning. For instance, our techniques in Section 4 provide a general theory for why ReLU activations avoid exponential exploding (see e.g. 
(4.1), (4.3)); and our techniques in Section 5 give a general theory for the stability of multi-layer networks against adversarial weight perturbations, which is at the heart of showing the semi-smoothness Theorem 4 and is used by all the follow-up works [1–3, 9].\n\nPROOF SKETCH\n\nThe main difficulty of this paper is to prove Theorems 3 and 4, and we sketch the proof ideas in Sections 4 through 8. In this main body, we only include Sections 4 and 5, because they already give some insight into how the proof proceeds. We put our emphasis on\n• how to avoid exponential blow-up in L, and\n• how to deal with the issue of randomness dependence across layers.\nWe genuinely hope that this high-level sketch can (1) give readers a clear overview of the proof without the necessity of going to the appendix, and (2) help readers appreciate our proof and understand why it is necessarily long.⁷\n\n4 Basic Properties at Random Initialization\n\nIn this section we derive basic properties of the RNN when the weight matrices W, A, B are all at random initialization. The corresponding precise statements and proofs are in Appendix B.\nThe first one says that the forward propagation neither explodes nor vanishes; that is,\n1/2 ≤ ‖h_{i,ℓ}‖₂, ‖g_{i,ℓ}‖₂ ≤ O(L).   (4.1)\n⁷For instance, proving the gradient norm lower bound in Theorem 3 for a single neuron k ∈ [m] is easy, but how do we apply concentration across neurons? Crucially, due to the recurrent structure, these quantities are never independent, so we have to build the necessary probabilistic tools to tackle this. If one is willing to ignore such subtleties, then our sketched proof is sufficiently short and gives a good overview.\nIntuitively, (4.1) is very reasonable. Since the weight matrix W is randomly initialized with entries i.i.d. from N(0, 2/m), the norm ‖Wz‖₂ is around √2 for any fixed unit vector z. Equipped with ReLU activation, which “shuts down” roughly half of the coordinates of Wz, the norm ‖φ(Wz)‖ reduces to one. Since in each layer ℓ there is an additional unit-norm signal x_{i,ℓ} coming in, we should expect the final norm of the hidden neurons to be at most O(L).\nUnfortunately, the above argument cannot be directly applied, since the weight matrix W is reused L times, so there is no fresh new randomness across layers. Let us explain how we deal with this issue carefully, because it is at the heart of all of our proofs in this paper. Recall that each time W is applied to some vector h_{i,ℓ}, it only uses “one column of randomness” of W. Mathematically, letting U_ℓ := GS(h_{1,1}, ..., h_{n,1}, h_{1,2}, ..., h_{n,2}, ..., h_{1,ℓ}, ..., h_{n,ℓ}) ∈ R^{m×nℓ} denote the column orthonormal matrix obtained by Gram-Schmidt, we have W·h_{i,ℓ} = W·U_{ℓ−1}U_{ℓ−1}^⊤·h_{i,ℓ} + W·(I − U_{ℓ−1}U_{ℓ−1}^⊤)·h_{i,ℓ}.\n• The term W·(I − U_{ℓ−1}U_{ℓ−1}^⊤)·h_{i,ℓ} has new randomness independent of the previous layers.⁸\n• The term W·U_{ℓ−1}U_{ℓ−1}^⊤·h_{i,ℓ} relies on the randomness of W in the directions of the h_{i,a} for a < ℓ of the previous layers. We cannot rely on the randomness of this term, because when applying the inductive argument up to layer ℓ, the randomness of W·U_{ℓ−1} has already been used. Fortunately, W·U_{ℓ−1} ∈ R^{m×n(ℓ−1)} is a rectangular matrix with m ≫ n(ℓ−1) (thanks to overparameterization!), so one can bound its spectral norm by roughly √2. 
This ensures that\nno matter how hi,(cid:96) behaves (even arbitrarily correlated with W U(cid:96)\u22121), the norm of the \ufb01rst term\ncannot be too large. It is crucial here that W U(cid:96)\u22121 is a rectangular matrix, because for a square\nrandom matrix such as W , its spectral norm is 2 and using that, the forward propagation bound\nwill exponentially blow up.\n\n(cid:96)\u22121)hi,(cid:96)(cid:107)2 \u2265(cid:101)\u2126(\n\n1\nL2 ) .\n\nThis summarizes the main idea for proving (cid:107)hi,(cid:96)(cid:107) \u2264 O(L) in (4.1); the lower bound 1\nOur next property says in each layer, the amount of \u201cfresh new randomness\u201d is non-negligible:\n\n2 is similar.\n\n(cid:107)(I \u2212 U(cid:96)\u22121U(cid:62)\n\n(I \u2212 U(cid:96)U(cid:62)\n\n(4.2)\nThis relies on a more involved inductive argument than (4.1). At high level, one needs to show that\nin each layer, the amount of \u201cfresh new randomness\u201d reduces only by a factor at most 1 \u2212 1\n10L.\nUsing (4.1) and (4.2), we obtain the following property about the data separability:\n\nHere, we say two vectors x and y are \u03b4-separable if (cid:13)(cid:13)(I \u2212 yy(cid:62)/(cid:107)y(cid:107)2\n\n(cid:96) )hi,(cid:96)+1 and (I \u2212 U(cid:96)U(cid:62)\n\n(cid:96) )hj,(cid:96)+1 are (\u03b4/2)-separable, \u2200i, j \u2208 [n] with i (cid:54)= j\n\n2)x(cid:13)(cid:13) \u2265 \u03b4 and vice versa.\n\n(4.3)\n\nProperty (4.3) shows that the separability information (say on input token 1) does not diminish by\nmore than a polynomial factor even if the information is propagated for L layers.\nWe prove (4.3) by induction. In the \ufb01rst layer (cid:96) = 1 we have hi,1 and hj,1 are \u03b4-separable which is a\nconsequence of Assumption 2.1. If having fresh new randomness, given two \u03b4 separable vectors x, y,\none can show that \u03c6(W x) and \u03c6(W y) are also \u03b4(1 \u2212 o( 1\nL ))-separable. 
Again, in RNN, we do not\nhave fresh new randomness, so we rely on (4.2) to give us reasonably large fresh new randomness.\nApplying a careful induction helps us to derive that (4.3) holds for all layers.9\nIntermediate Layers and Backward Propagation. Training neural network is not only about\nforward propagation. We also have to bound intermediate layers and backward propagation.\nThe \ufb01rst two results we derive are the following. For every (cid:96)1 \u2265 (cid:96)2 and diagonal matrices D(cid:48) of\nsparsity s \u2208 [\u03c12, m0.49]:\n\n(cid:107)D(cid:48)W Di,(cid:96)1 \u00b7\u00b7\u00b7 Di,(cid:96)2W D(cid:48)(cid:107)2 \u2264 (cid:101)O(\n(cid:107)W Di,(cid:96)1 \u00b7\u00b7\u00b7 Di,(cid:96)2W(cid:107)2 \u2264 O(L3)\n\u221a\n\n\u221a\n\nm)\n(cid:96)\u22121)hi,(cid:96), we have W (I \u2212 U(cid:96)\u22121U(cid:62)\nm I) and is independent of all {hi,a | i \u2208 [n], a < (cid:96)}.\n\n8More precisely, letting v = (I \u2212 U(cid:96)\u22121U(cid:62)\nW v(cid:107)v(cid:107) is a random Gaussian vector in N (0, 2\n9This is the only place that we rely on Assumption 2.1. This assumption is somewhat necessary in the\nfollowing sense. If xi,(cid:96) = xj,(cid:96) for some pair i (cid:54)= j for all the \ufb01rst ten layers (cid:96) = 1, 2, . . . , 10, and if y\u2217\ni,(cid:96) (cid:54)= y\u2217\ni,(cid:96)\nfor even just one of these layers, then there is no hope in having the training objective decrease to zero. Of\ncourse, one can make more relaxed assumption on the input data, involving both xi,(cid:96) and y\u2217\ni,(cid:96). While this is\npossible, it complicates the statements so we do not present such results in this paper.\n\ns/\n\n(cid:96)\u22121)hi,(cid:96) =(cid:0)W v(cid:107)v(cid:107)\n\n(4.4)\n(4.5)\n\n(cid:1)(cid:107)v(cid:107). 
Intuitively, one cannot use a spectral bound argument to derive (4.4) or (4.5): the spectral norm of $W$ is 2, and even if ReLU activations cancel half of its mass, the spectral norm $\|D W\|_2$ remains $\sqrt{2}$. When stacked together, this grows exponentially in $L$.

Instead, we use an argument analogous to (4.1) to show that, for each fixed vector $z$, the norm $\|W D_{i,\ell_1} \cdots D_{i,\ell_2} W z\|_2$ is at most $O(1)$ with extremely high probability $1 - e^{-\Omega(m/L^2)}$. By a standard $\varepsilon$-net argument, $\|W D_{i,\ell_1} \cdots D_{i,\ell_2} W z\|_2$ is at most $O(1)$ for all $\frac{m}{L^3}$-sparse vectors $z$. Finally, for a possibly dense vector $z$, we divide it into $L^3$ chunks, each of sparsity $\frac{m}{L^3}$, and apply the upper bound $L^3$ times. This proves (4.4). One can use a similar argument to prove (4.5).

Remark 4.1. We did not try to tighten the polynomial factor in $L$ here. We conjecture that proving an $O(1)$ bound may be possible, but that question may be a sufficiently interesting random matrix theory problem on its own.

The next result is for back propagation. For every $\ell_1 \ge \ell_2$ and diagonal matrices $D'$ of sparsity $s \in [\rho^2, m^{0.49}]$:

$\|B D_{i,\ell_1} \cdots D_{i,\ell_2} W D'\|_2 \le \widetilde{O}(\sqrt{s})$    (4.6)

Its proof is in the same spirit as (4.5), with the only difference being that the spectral norm of $B$ is around $\sqrt{m/d}$ as opposed to $O(1)$.

5 Stability After Adversarial Perturbation

In this section we study the behavior of the RNN after adversarial perturbation. The corresponding precise statements and proofs are in Appendix C.
Letting $\widetilde{W}, A, B$ be at random initialization, we consider some matrix $W = \widetilde{W} + W'$ with $\|W'\|_2 \le \mathrm{poly}(\varrho)/\sqrt{m}$. Here, $W'$ may depend on the randomness of $\widetilde{W}$, $A$ and $B$, so we say it can be adversarially chosen. The results of this section will later be applied essentially twice:
• Once for the updates generated by GD or SGD, where $W'$ is how much the algorithm has moved away from the random initialization.
• The other time (see Section 7.3) for a technique that we call "randomness decomposition", where we decompose the true random initialization $W$ into $W = \widetilde{W} + W'$, where $\widetilde{W}$ is a "fake" random initialization but is identically distributed as $W$. Such a technique comes from smoothed analysis [51].

To illustrate our high-level idea, from this section on (so in Sections 5, 7 and 8) we ignore the polynomial dependency in $\varrho$ and hide it in the big-O notation.

We denote by $\widetilde{D}_{i,\ell}, \widetilde{g}_{i,\ell}, \widetilde{h}_{i,\ell}$ respectively the values of $D_{i,\ell}$, $g_{i,\ell}$ and $h_{i,\ell}$ determined by $\widetilde{W}$ and $A$ at random initialization; and by $D_{i,\ell} = \widetilde{D}_{i,\ell} + D'_{i,\ell}$, $g_{i,\ell} = \widetilde{g}_{i,\ell} + g'_{i,\ell}$ and $h_{i,\ell} = \widetilde{h}_{i,\ell} + h'_{i,\ell}$ respectively those determined by $W = \widetilde{W} + W'$ after the adversarial perturbation.
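One concrete way to realize such a decomposition is a Gaussian coupling: setting $\widetilde{W} = \sqrt{1-\theta^2}\, W + \theta G$ for an independent copy $G$ keeps $\widetilde{W}$ distributed exactly as $W$, while $W' = W - \widetilde{W}$ has spectral norm $O(\theta)$. The sketch below only illustrates this flavor of the technique; it is not the paper's actual construction (which appears in Section 7.3), and all dimensions are toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, theta = 500, 0.01
sigma = np.sqrt(2.0 / m)                   # entry scale of the initialization

W = rng.standard_normal((m, m)) * sigma    # the "true" random initialization
G = rng.standard_normal((m, m)) * sigma    # an independent Gaussian copy

# Coupling: W_tilde again has i.i.d. N(0, sigma^2) entries (the variances
# add up to sigma^2), yet it differs from W only by a spectrally small W'.
W_tilde = np.sqrt(1 - theta ** 2) * W + theta * G
W_prime = W - W_tilde

same_scale = abs(W_tilde.std() - sigma) < 0.1 * sigma
small = np.linalg.norm(W_prime, 2) < 0.1   # ‖W'‖₂ = O(theta)
print(same_scale, small)
```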
Our \ufb01rst, and most technical result is the following:\n(cid:107)g(cid:48)\nIntuitively, one may hope to prove (5.1) by induction, because we have (ignoring subscripts in i)\n\ni,(cid:96) and hi,(cid:96) =(cid:101)hi,(cid:96) +h(cid:48)\n\ni,(cid:96), gi,(cid:96) =(cid:101)gi,(cid:96) +g(cid:48)\n\ni,(cid:96)(cid:107)2 \u2264 O(m\u22121/2) ,\n\ni,(cid:96) respectively\n\n(cid:107)D(cid:48)\n\n(5.1)\n\nand\n\n(cid:107)D(cid:48)\n\ni,(cid:96)(cid:107)0 \u2264 O(m2/3)\n+(cid:102)W D(cid:48)\n(cid:125)\n(cid:123)(cid:122)\n(cid:124)\n(cid:96)(cid:48) = W (cid:48)D(cid:96)(cid:48)\u22121g(cid:96)(cid:48)\u22121\ng(cid:48)\n\n(cid:123)(cid:122)\n\n(cid:124)\n\n(cid:96)(cid:48)\u22121g(cid:96)(cid:48)\u22121\n\n(cid:125)\n\ni,(cid:96)gi,(cid:96)(cid:107)2 \u2264 O(m\u22121/2) .\n+(cid:102)W(cid:101)D(cid:96)(cid:48)\u22121g(cid:48)\n(cid:124)\n(cid:123)(cid:122)\n\n(cid:96)(cid:48)\u22121\n\n(cid:125)\n\n.\n\nx\n\nThe main issue here is that, the spectral norm of(cid:102)W(cid:101)D(cid:96)(cid:48)\u22121 in z is greater than 1, so we cannot apply\n\nnaive induction due to exponential blow up in L. Neither can we apply techniques from Section 4,\nbecause the changes such as g(cid:96)(cid:48)\u22121 can be adversarial.\nIn our actual proof of (5.1), instead of applying induction on z, we recursively expand z by the\nabove formula. This results in a total of L terms of x type and L terms of y type. 
The main difficulty is to bound a term of $y$ type, that is:

$\big\| \widetilde{W} \widetilde{D}_{\ell_1} \cdots \widetilde{D}_{\ell_2+1} \widetilde{W} D'_{\ell_2} g_{\ell_2} \big\|_2$ .

Our argument consists of two conceptual steps.

(1) Suppose $g_{\ell_2} = \widetilde{g}_{\ell_2} + g'_{\ell_2,1} + g'_{\ell_2,2}$ where $\|g'_{\ell_2,1}\|_2 \le m^{-1/2}$ and $\|g'_{\ell_2,2}\|_\infty \le m^{-1}$; then we argue that $\|D'_{\ell_2} g_{\ell_2}\|_2 \le O(m^{-1/2})$ and $\|D'_{\ell_2} g_{\ell_2}\|_0 \le O(m^{2/3})$.

(2) Suppose $x \in \mathbb{R}^m$ with $\|x\|_2 \le m^{-1/2}$ and $\|x\|_0 \le m^{2/3}$; then we show that $y = \widetilde{W} \widetilde{D}_{\ell_1} \cdots \widetilde{D}_{\ell_2+1} \widetilde{W} x$ can be written as $y = y_1 + y_2$ with $\|y_1\|_2 \le O(m^{-2/3})$ and $\|y_2\|_\infty \le O(m^{-1})$.

The two steps above enable us to perform induction without exponential blow up. Indeed, together they enable us to go through the following logic chain:

$\|\cdot\|_2 \le m^{-1/2}$ and $\|\cdot\|_\infty \le m^{-1}$ $\;\stackrel{(1)}{\Longrightarrow}\;$ $\|\cdot\|_2 \le m^{-1/2}$ and $\|\cdot\|_0 \le m^{2/3}$ $\;\stackrel{(2)}{\Longrightarrow}\;$ $\|\cdot\|_2 \le m^{-2/3}$ and $\|\cdot\|_\infty \le m^{-1}$ .

Since there is a gap between $m^{-1/2}$ and $m^{-2/3}$, we can make sure that all blow-up factors are absorbed into this gap, using the property that $m$ is polynomially large. This enables us to perform induction to prove (5.1) without exponential blow-up.
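The conclusion (5.1) can also be sanity-checked numerically: perturbing $W$ by spectral norm $\varepsilon/\sqrt{m}$ moves the hidden states only slightly, with no exponential amplification in the depth. Below is a rough simulation under assumed toy dimensions, using a plain ReLU recurrence without input tokens, so it only illustrates the flavor of the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
m, L, eps = 1000, 6, 1e-3
relu = lambda z: np.maximum(z, 0.0)

W = rng.standard_normal((m, m)) * np.sqrt(2.0 / m)   # random initialization
Wp = rng.standard_normal((m, m))
Wp *= eps / (np.sqrt(m) * np.linalg.norm(Wp, 2))     # ‖W'‖₂ = eps / sqrt(m)

h0 = rng.standard_normal(m)
h0 /= np.linalg.norm(h0)
h, ht = h0.copy(), h0.copy()
for _ in range(L):
    ht = relu(W @ ht)          # hidden states at initialization
    h = relu((W + Wp) @ h)     # hidden states after the perturbation

drift = np.linalg.norm(h - ht)
print(drift < 0.1 * np.linalg.norm(ht))  # drift stays far below ‖h_L‖
```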
Intermediate Layers and Backward Stability. Using (5.1), and especially the sparsity $\|D'\|_0 \le m^{2/3}$ from (5.1), one can apply the results in Section 4 to derive the following stability bounds for intermediate layers and backward propagation:

$\big\| D_{i,\ell_1} W \cdots D_{i,\ell_2} W - \widetilde{D}_{i,\ell_1} \widetilde{W} \cdots \widetilde{D}_{i,\ell_2} \widetilde{W} \big\|_2 \le O(L^7)$    (5.2)
$\big\| B D_{i,\ell_1} W \cdots D_{i,\ell_2} W - B \widetilde{D}_{i,\ell_1} \widetilde{W} \cdots \widetilde{D}_{i,\ell_2} \widetilde{W} \big\|_2 \le O(m^{1/3})$ .    (5.3)

Special Rank-1 Perturbation. For technical reasons, we also need two bounds in the special case of $W' = y z^\top$ for some unit vector $z$ and sparse $y$ with $\|y\|_0 \le \mathrm{poly}(\varrho)$. We prove that this type of rank-one adversarial perturbation satisfies, for every $k \in [m]$:

$\big| \big( (\widetilde{W} + W') h'_{i,\ell} \big)_k \big| \le O(m^{-2/3})$    (5.4)
$\big\| B D_{i,\ell_1} W \cdots D_{i,\ell_2} W e_k - B \widetilde{D}_{i,\ell_1} \widetilde{W} \cdots \widetilde{D}_{i,\ell_2} \widetilde{W} e_k \big\|_2 \le O(m^{-1/6})$ .    (5.5)

6 Conclusion and What's After Page 8

We conclude the paper here because Sections 4 and 5 have already given some insights into how the proof proceeds and how to avoid exponential blow up in $L$. In the supplementary material, within another 3 pages we also sketch the proof ideas for Theorems 3 and 4 (see Sections 7 and 8). We genuinely hope that this high-level sketch can (1) give readers a clear overview of the proof without the necessity of going to the appendix, and (2) help them appreciate our proof and understand why it is necessarily long.10

Overall, we provide the first proof of convergence of GD/SGD for non-linear neural networks that have more than two layers.
We show that with overparameterization, GD/SGD can avoid hitting any (bad) local minima along its training trajectory. This was practically observed by Goodfellow et al. [19], and a theoretical justification had been open since then. We present our result using recurrent neural networks (as opposed to the simpler feedforward networks [3]) in this very first paper, because memorization in RNNs could be of its own independent interest. Also, our result proves that an RNN can indeed learn mappings from different input tokens to different output labels simultaneously. Last but not least, we build new tools to analyze multi-layer networks with ReLU activations that could facilitate much new research on deep learning. For instance, our techniques in Section 4 provide a general theory for why ReLU activations avoid exponential gradient explosion (see e.g. (4.1), (4.4)) or exponential vanishing (see e.g. (4.1), (4.3)); and our techniques in Section 5 give a general theory for the stability of multi-layer networks against adversarial weight perturbations, which is at the heart of showing the semi-smoothness Theorem 4, and is used by all the follow-up works [1-3, 9].

10For instance, proving the gradient norm lower bound in Theorem 3 for a single neuron $k \in [m]$ is easy, but how can one apply concentration across neurons? Crucially, due to the recurrent structure, these quantities are never independent, so we have to build the necessary probabilistic tools to tackle this. If one is willing to ignore such subtleties, then our sketched proof is sufficiently short and gives a good overview.

References

[1] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD Learn Recurrent Neural Networks with Provable Generalization? In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1902.01028.

[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. In NeurIPS, 2019.
Full version available at http://arxiv.org/abs/1811.04918.

[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019. Full version available at http://arxiv.org/abs/1811.03962.

[4] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning (ICML), pages 584-592, 2014.

[5] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.

[6] Peter Bartlett, Dave Helmbold, and Phil Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations. In International Conference on Machine Learning (ICML), pages 520-529, 2018.

[7] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In International Conference on Machine Learning (ICML). http://arxiv.org/abs/1702.07966, 2017.

[8] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In ICLR, 2018. URL https://arxiv.org/abs/1710.10174.

[9] Yuan Cao and Quanquan Gu. A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks. arXiv preprint arXiv:1902.01384, 2019.

[10] Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv:1806.05394, 2018. URL http://arxiv.org/abs/1806.05394.

[11] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches.
arXiv preprint arXiv:1409.1259, 2014.

[12] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[13] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems (NeurIPS), pages 2422-2430, 2017.

[14] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems (NeurIPS), pages 2253-2261, 2016.

[15] Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabás Póczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: don't be afraid of spurious local minima. In ICML, 2018.

[16] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. ArXiv e-prints, 2018.

[17] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179-211, 1990.

[18] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In ICLR, 2017. URL http://arxiv.org/abs/1711.00501.

[19] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. In ICLR, 2015.

[20] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014. URL http://arxiv.org/abs/1410.5401.

[21] Richard LT Hahnloser. On the piecewise analysis of networks of linear threshold neurons. Neural Networks, 11(4):691-697, 1998.

[22] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In ICLR, 2017. URL http://arxiv.org/abs/1611.04231.

[23] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems.
Journal of Machine Learning Research (JMLR), 19(29):1-44, 2018.

[24] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.

[25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571-8580, 2018.

[26] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.

[27] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700-1709, 2013.

[28] Valentin Khrulkov, Alexander Novikov, and Ivan Oseledets. Expressive power of recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1WRibb0Z.

[29] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.

[30] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv:1711.00165, 2017. URL http://arxiv.org/abs/1711.00165.

[31] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[32] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems (NeurIPS). http://arxiv.org/abs/1705.09886, 2017.

[33] S Lojasiewicz. A topological property of real analytic subsets. Coll.
du CNRS, Les équations aux dérivées partielles, 117:87-89, 1963.

[34] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665-E7671, 2018.

[35] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

[36] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5528-5531. IEEE, 2011.

[37] Rina Panigrahy, Ali Rahimi, Sushant Sachdeva, and Qiuyi Zhang. Convergence results for neural networks via electrodynamics. In ITCS, 2018.

[38] Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning (ICML), International Convention Centre, Sydney, Australia, 2017.

[39] Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[40] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. arXiv:1711.04735, November 2017. URL http://arxiv.org/abs/1711.04735.

[41] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643-653, 1963.

[42] Ben Poole, Subhaneil Lahiri, Maithreyi Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pages 3360-3368, 2016.

[43] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning (ICML). http://arxiv.org/abs/1712.08968, 2018.

[44] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association, 2014.

[45] Hojjat Salehinejad, Julianne Baarbe, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh Valaee. Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078, 2017.

[46] Emilio Salinas and Laurence F. Abbott. A model of multiplicative neural responses in parietal cortex. Proceedings of the National Academy of Sciences, 93(21):11956-11961, 1996.

[47] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In ICLR, 2017. URL https://openreview.net/pdf?id=H1W1UN9gg.

[48] Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. In ICLR. arXiv preprint arXiv:1412.2693, 2015.

[49] Hava T Siegelmann and Eduardo D Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77-80, 1991.

[50] Mahdi Soltanolkotabi. Learning ReLUs via gradient descent. CoRR, abs/1705.04591, 2017. URL http://arxiv.org/abs/1705.04591.

[51] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time.
Journal of the ACM, 51(3):385-463, 2004.

[52] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, 2012.

[53] Ilya Sutskever, James Martens, and Geoffrey Hinton. Generating text with recurrent neural networks. In International Conference on Machine Learning (ICML), pages 1017-1024, 2011.

[54] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems (NeurIPS), pages 3104-3112, 2014.

[55] Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning (ICML). http://arxiv.org/abs/1703.00560, 2017.

[56] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning (ICML), 2018.

[57] Greg Yang and Sam S. Schoenholz. Deep mean field theory: Layerwise variance and width variation as methods to control gradient explosion. ICLR open review, 2018. URL https://openreview.net/forum?id=rJGY8GbR-.

[58] Greg Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems (NeurIPS), pages 7103-7114, 2017.

[59] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.

[60] Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels.
arXiv preprint arXiv:1711.03440, 2017.

[61] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning (ICML). arXiv preprint arXiv:1706.03175, 2017.