{"title": "Can SGD Learn Recurrent Neural Networks with Provable Generalization?", "book": "Advances in Neural Information Processing Systems", "page_first": 10331, "page_last": 10341, "abstract": "Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis. Yet, in the foundational PAC learning language, what concept class can it learn? Moreover, how can the same recurrent unit simultaneously learn functions from different input tokens to different output tokens, without affecting each other?\nExisting generalization bounds for RNN scale exponentially with the input length, significantly limiting their practical implications.\n\nIn this paper, we show using the vanilla stochastic gradient descent (SGD), RNN can actually learn some notable concept class \\emph{efficiently}, meaning that both time and sample complexity scale \\emph{polynomially} in the input length (or almost polynomially, depending on the concept).\nThis concept class at least includes functions where each output token is generated from inputs of earlier tokens using a smooth two-layer neural network.", "full_text": "Can SGD Learn Recurrent Neural Networks\n\nwith Provable Generalization?\u2217\n\nZeyuan Allen-Zhu\nMicrosoft Research AI\n\nzeyuan@csail.mit.edu\n\nYuanzhi Li\n\nCarnegie Mellon University\nyuanzhil@andrew.cmu.edu\n\nAbstract\n\nRecurrent Neural Networks (RNNs) are among the most popular models in se-\nquential data analysis. Yet, in the foundational PAC learning language, what con-\ncept class can it learn? Moreover, how can the same recurrent unit simultaneously\nlearn functions from different input tokens to different output tokens, without af-\nfecting each other? Existing generalization bounds for RNN scale exponentially\nwith the input length, signi\ufb01cantly limiting their practical implications.\nIn this paper, we show using the vanilla stochastic gradient descent (SGD), RNN\ncan actually learn some notable concept class ef\ufb01ciently, meaning that both time\nand sample complexity scale polynomially in the input length (or almost polyno-\nmially, depending on the concept). This concept class at least includes functions\nwhere each output token is generated from inputs of earlier tokens using a smooth\ntwo-layer neural network.\n\nIntroduction\n\n1\nRecurrent neural networks (RNNs) is one of the most popular models in sequential data analy-\nsis [25]. When processing an input sequence, RNNs repeatedly and sequentially apply the same\noperation to each input token. The recurrent structure of RNNs allows it to capture the dependen-\ncies among different tokens inside each sequence, which is empirically shown to be effective in\nmany applications such as natural language processing [28], speech recognition [12] and so on.\nThe recurrent structure in RNNs shows great power in practice, however, it also imposes great\nchallenge in theory. Until now, RNNs remains to be one of the least theoretical understood models\nin deep learning. Many fundamental open questions are still largely unsolved in RNNs, including\n\n1. (Optimization). When can RNNs be trained ef\ufb01ciently?\n2. (Generalization). When do the results learned by RNNs generalize to test data?\n\nQuestion 1 is technically challenging due to the notorious question of vanishing/exploding gradients,\nand the non-convexity of the training objective induced by non-linear activation functions.\nQuestion 2 requires even deeper understanding of RNNs. 
For example, in natural language processing, "Juventus beats Barcelona" and "Barcelona beats Juventus" have completely different meanings. How can the same operation in an RNN encode a different rule for "Juventus" at token 1 vs. "Juventus" at token 3, instead of merely memorizing each training example?
There has been some recent progress towards obtaining more principled understandings of these questions. On the optimization side, Hardt, Ma, and Recht [13] show that over-parameterization can help in the training process of a linear dynamical system, which is a special case of RNNs with linear activation functions. Allen-Zhu, Li, and Song [2] show that over-parameterization also helps in training RNNs with ReLU activations. This latter result gives no generalization guarantee.

∗Full version and future updates can be found on https://arxiv.org/abs/1902.01028.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

On the generalization side, our understanding of RNNs is even more limited. The VC-dimension bounds [10] and [17] depend polynomially on the size of the network, and either apply only to linear (or threshold) networks or to networks with one-dimensional input. However, a bound scaling with the total number of parameters usually cannot be applied to modern neural networks, which are heavily over-parameterized. Others [9, 31] (or the earlier work [14]) establish sample complexity bounds that grow exponentially in the input length. In particular, they depend on the operator norm of the recurrent unit, which we denote by β. If β > 1, their bounds scale exponentially with the input length. Since most applications do not regularize β and allow β > 1 for richer expressibility,² their bounds are still insufficient.
Indeed, bridging the gap between optimization (question 1) and generalization (question 2) can be quite challenging in neural networks. The case of RNNs is particularly so due to the (potentially) exponential blowup in the input length.

• Generalization ⇏ Optimization. One could imagine adding a strong regularizer to ensure β ≤ 1 for generalization purposes; however, it is unclear how an optimization algorithm such as stochastic gradient descent (SGD) finds a network that both minimizes the training loss and maintains β ≤ 1. One could also use a very small network so that the number of parameters is limited; however, it is not clear how SGD finds a small network with small training loss.
• Optimization ⇏ Generalization. One could try to train RNNs without any regularization; however, it is then quite possible that the number of parameters needs to be large and β > 1 after training. This is so both in practice (since "memory implies larger spectral radius" [24]) and in theory [2]. All known generalization bounds fail to apply in this regime.

In this paper, we give arguably the first theoretical analysis of RNNs that captures optimization and generalization simultaneously. Given any set of input sequences, as long as the outputs are (approximately) realizable by some smooth function in a certain concept class, then after training a vanilla RNN with ReLU activations, SGD provably finds a solution that has both small training and generalization error.
Our result allows β to be larger than 1 by a constant, but is still efficient, meaning that the iteration complexity of SGD, the sample complexity, and the time complexity scale only polynomially (or almost polynomially) with the length of the input.

2 Notations

We denote by ‖·‖₂ (or sometimes ‖·‖) the Euclidean norm of vectors, and by ‖·‖₂ the spectral norm of matrices. We denote by ‖·‖∞ the infinity norm of vectors, ‖·‖₀ the sparsity of vectors or diagonal matrices, and ‖·‖_F the Frobenius norm of matrices. Given a matrix W, we denote by W_k or w_k the k-th row vector of W. We denote the row ℓ_p norm for W ∈ R^{m×d} as ‖W‖_{2,p} := (Σ_{i∈[m]} ‖w_i‖₂^p)^{1/p}. By definition, ‖W‖_{2,2} = ‖W‖_F. We use N(µ, σ) to denote the Gaussian distribution with mean µ and variance σ, or N(µ, Σ) to denote a Gaussian vector with mean µ and covariance Σ. We use x = y ± z to denote that x ∈ [y − z, y + z]. We use 1_event to denote the indicator function of whether event is true. We denote by e_k the k-th standard basis vector. We use σ(·) to denote the ReLU function σ(x) = max{x, 0} = 1_{x≥0} · x. Given a univariate function f : R → R, we also use f to denote the same function over vectors: f(x) = (f(x₁), . . . , f(x_m)) if x ∈ R^m.
Given vectors v₁, . . . , v_n ∈ R^m, we define U = GS(v₁, . . . , v_n) as their Gram-Schmidt orthonormalization. Namely, U = [v̂₁, . . . , v̂_n] ∈ R^{m×n}, where v̂₁ = v₁/‖v₁‖ and, for every i ≥ 2,

    v̂_i = (∏_{j=1}^{i−1}(I − v̂_j v̂_j^⊤) v_i) / ‖∏_{j=1}^{i−1}(I − v̂_j v̂_j^⊤) v_i‖ .

In the occasion that ∏_{j=1}^{i−1}(I − v̂_j v̂_j^⊤) v_i is the zero vector, we let v̂_i be an arbitrary unit vector that is orthogonal to v̂₁, . . . , v̂_{i−1}.
We say a function f : R^d → R is L-Lipschitz continuous if |f(x) − f(y)| ≤ L‖x − y‖₂; and we say it is L-smooth if its gradient is L-Lipschitz continuous, that is, ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂.

²For instance, if W ∈ R^{m×m} is the recurrent weight matrix and is followed with a ReLU activation σ, then under the standard random initialization N(0, 2/m) the combined operator σ(Wx) : R^m → R^m has operator norm √2 with high probability. If instead one uses N(0, 1/m), then β becomes 1 but gradients will vanish exponentially fast in L.

Function complexity. The following notions from [1] measure the complexity of any infinite-order smooth function φ : R → R. Suppose φ(z) = Σ_{i=0}^∞ c_i z^i is its Taylor expansion. Given non-negative R,

    C_ε(φ, R) := Σ_{i=0}^∞ ( (C*R)^i + ( (√(log(1/ε))/√i) · C*R )^i ) |c_i|
    C_s(φ, R) := C* Σ_{i=0}^∞ (i + 1)^{1.75} R^i |c_i|

where C* is a sufficiently large constant (e.g., 10⁴).
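To make these complexity measures concrete, below is a minimal numerical sketch (ours, not code from the paper) that evaluates truncated versions of C_s(φ, R) and C_ε(φ, R) from the Taylor coefficients c_i of a given φ. The truncation degree, the choice C* = 10⁴, and the example φ(z) = sin z are illustrative assumptions only.

import math

def complexity_measures(taylor_coeffs, R, eps, c_star=1e4):
    """Truncated C_s(phi, R) and C_eps(phi, R) from Section 2.

    taylor_coeffs[i] is the coefficient c_i of z^i in the Taylor expansion of phi;
    the infinite sums are truncated at len(taylor_coeffs), so this is only a sketch.
    """
    c_s, c_eps = 0.0, 0.0
    for i, c in enumerate(taylor_coeffs):
        c = abs(c)
        # C_s(phi, R) := C* * sum_i (i + 1)^1.75 * R^i * |c_i|
        c_s += (i + 1) ** 1.75 * R ** i * c
        # C_eps(phi, R) := sum_i ((C* R)^i + (sqrt(log(1/eps)/i) * C* R)^i) * |c_i|
        term = (c_star * R) ** i
        term += 1.0 if i == 0 else (math.sqrt(math.log(1.0 / eps) / i) * c_star * R) ** i
        c_eps += c * term
    return c_star * c_s, c_eps

# phi(z) = sin z has Taylor coefficients 0, 1, 0, -1/3!, 0, 1/5!, ...
sin_coeffs = [0.0 if i % 2 == 0 else (-1.0) ** ((i - 1) // 2) / math.factorial(i)
              for i in range(20)]
print(complexity_measures(sin_coeffs, R=1.0, eps=0.01))

As Example 2.1 below notes, for functions such as sin z it suffices to truncate the Taylor expansion at degree O(log(1/ε)), which is what the finite list of coefficients above mimics.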
It holds Cs(\u03c6, R) \u2264 C\u03b5(\u03c6, R) \u2264 Cs(\u03c6, O(R))\u00d7\npoly(1/\u03b5), and for sin z, ez or low degree polynomials, they only differ by o(1/\u03b5). [1]\nExample 2.1. If \u03c6(z) = zd for constant d then Cs(\u03c6, R) \u2264 O(Rd), C\u03b5(\u03c6, R) \u2264 O(Rdpolylog( 1\n\u03b5 )).\nFor functions such as \u03c6(z) = ez \u2212 1, sin z, sigmoid(z) or tanh(z), it suf\ufb01ces to consider \u03b5-\napproximations of them so we can truncate their Taylor expansions to degree O(log(1/\u03b5)). This\ngives Cs(\u03c6, R), C\u03b5(\u03c6, R) \u2264 (1/\u03b5)O(log R).\n\n3 Problem Formulation\nThe data are generated from an unknown distribution D over (x(cid:63), y(cid:63)) \u2208 (Rdx )(L\u22122) \u00d7 Y (L\u22122).\n(cid:96)(cid:107) = 1 and [x(cid:63)\n2 without\nEach input sequence x(cid:63) consists of x(cid:63)\nL \u2208 Y. The training dataset Z =\nloss of generality.3 Each label sequence y(cid:63) consists of y(cid:63)\n{((x(cid:63))(i), (y(cid:63))(i))}i\u2208[N ] is given as N i.i.d. samples from D. When (x(cid:63), y(cid:63)) is generated from D,\nwe call x(cid:63) the true input sequence and y(cid:63) the true label.\nDe\ufb01nition 3.1. Without loss of generality (see Remark 3.4), for each true input x(cid:63), we transform it\ninto an actual input sequence x1, x2, . . . , xL \u2208 Rdx+1 as follows.\n\nL\u22121 \u2208 Rdx with (cid:107)x(cid:63)\n3, . . . , y(cid:63)\n\n(cid:96) ]dx = 1\n\n2, . . . , x(cid:63)\n\nx1 = (0dx , 1)\n\nand x(cid:96) = (\u03b5xx(cid:63)\n\n(cid:96) , 0) for (cid:96) = 2, 3, . . . , L \u2212 1\n\nand xL = (0dx , \u03b5x)\n\nwhere \u03b5x \u2208 (0, 1) is a parameter to be chosen later. We then feed this actual sequence x into RNN.\nDe\ufb01nition 3.2. We say the sequence x1, . . . , xL \u2208 Rdx+1 is normalized if\n\n(cid:107)x1(cid:107) = 1 and\n\n(cid:107)x(cid:96)(cid:107) = \u03b5x\n\nfor all (cid:96) = 2, 3, . . . , L.\n\n3.1 Our Learner Network: Elman RNN\nTo present the simplest result, we focus on the classical Elman RNN with ReLU activation. Let\nW \u2208 Rm\u00d7m, A \u2208 Rm\u00d7(dx+1), and B \u2208 Rd\u00d7m be the weight matrices.\nDe\ufb01nition 3.3. Our Elman RNN can be described as follows. On input x1, . . . , xL \u2208 Rdx+1,\n\nh0 = 0 \u2208 Rm\ny(cid:96) = B \u00b7 h(cid:96) \u2208 Rd\n\ng(cid:96) = W \u00b7 h(cid:96)\u22121 + Ax(cid:96) \u2208 Rm\nh(cid:96) = \u03c3(W \u00b7 h(cid:96)\u22121 + Ax(cid:96)) \u2208 Rm\n\nm ), and the entries of B are i.i.d. generated from N (0, 1\nd ).\n\nWe say that W, A, B are at random initialization, if the entries of W and A are i.i.d. generated from\nN (0, 2\nFor simplicity, in this paper we only update W and let A and B be at their random initialization.\nThus, we write F(cid:96)(x(cid:63); W ) = y(cid:96) = Bh(cid:96) as the output of the (cid:96)-th layer.\nL \u2208 Y using some loss function\nOur goal is to use y3, . . . , yL \u2208 Rd to \ufb01t the true label y(cid:63)\nG : Rd \u00d7 Y \u2192 R. In this paper we assume, for every y(cid:63) \u2208 Y, G(0d, y(cid:63)) \u2208 [\u22121, 1] is bounded,\nand G(\u00b7, y(cid:63)) is convex and 1-Lipschitz continuous in its \ufb01rst variable. This includes for instance the\ncross-entropy loss and (cid:96)2-regression loss (for y(cid:63) being bounded).4\nRemark 3.4. Since we only update W , the label sequence y(cid:63)\nL is off from the input sequence\nL\u22121 by one. The last xL can be made zero, but we keep it normalized for notational\n2, . . . , x(cid:63)\nx(cid:63)\nsimplicity. 
The \ufb01rst x1 gives a random seed fed into the RNN (one can equivalently put it into h0).\nWe have scaled down the input signals by \u03b5x, which can be equivalently thought as scaling down A.\n\n3, . . . , y(cid:63)\n\n3, . . . , y(cid:63)\n\n(cid:96)(cid:107)2 \u2264 1 by padding(cid:112)1 \u2212 (cid:107)x(cid:63)\n\n3This is without loss of generality, since 1\n\ncan always be ensured from (cid:107)x(cid:63)\nassumption to simplify our notations: for instance, (x(cid:63)\nconcept class without bias.\n\n(cid:96)(cid:107)2 = 1\n(cid:96)(cid:107)2\n2 to the second-last coordinate. We make this\n2 allows us to focus only on networks in the\n(cid:96) )dx = 1\n4We use [\u22121, 1] and 1-Lipschitzness for notation simplicity. In generally, our \ufb01nal time and sample com-\n\n2 can always be padded to the last coordinate, and (cid:107)x(cid:63)\n\nplexity bounds only scale polynomially with the boundedness and Lipschitzness parameters.\n\n3\n\n\f3.2 Concept Class\nLet {\u03a6i\u2192j,r,s : R \u2192 R}i,j\u2208[L],r\u2208[p],s\u2208[d] be in\ufb01nite-order differentiable functions, and {w\u2217\ni\u2192j,r,s \u2208\nRdx}i,j\u2208[L],r\u2208[p],s\u2208[d] be unit vectors. Then, for every j = 3, 4, . . . , L, we consider target functions\nj : Rdx \u2192 Rd where F \u2217\nF \u2217\n\n(cid:1) can be written as\n\nj,1, . . . , F \u2217\n\nj,d\n\nj =(cid:0)F \u2217\nj,s(x(cid:63)) :=(cid:80)j\u22121\n\n(cid:80)\n\n(3.1)\n\nr\u2208[p] \u03a6i\u2192j,r,s((cid:104)w\u2217\nFor proof simplicity, we assume \u03a6i\u2192j,r,s(0) = 0. We also use\n\nF \u2217\n\ni=2\n\ni\u2192j,r,s, x(cid:63)\n\ni (cid:105)) \u2208 R .\n\nC\u03b5(\u03a6, R) = max\ni,j,r,s\n\n{C\u03b5(\u03a6i\u2192j,r,s, R)}\n\nand Cs(\u03a6, R) = max\ni,j,r,s\n\n{Cs(\u03a6i\u2192j,r,s, R)}\n\nto denote the complexity of F \u2217.\nAgnostic PAC-learning language. Our concept class consists of all functions F \u2217 in the form of\n(3.1) with complexity bounded by threshold C and parameter p bounded by threshold p0. Let OPT\nbe the population risk achieved by the best target function in this concept class. Then, our goal is to\nlearn this concept class with population risk OPT + \u03b5 using sample and time complexity polynomial\nin C, p0 and 1/\u03b5. In the remainder of this paper, to simplify notations, we do not explicitly de\ufb01ne\nthis concept class parameterized by C and p. Instead, we equivalently state our theorem with respect\nto any (unknown) target function F \u2217 with speci\ufb01c parameters C and p.\nExample 3.5. Our concept class is general enough and contains functions where the output at each\ntoken is generated from inputs of previous tokens using any two-layer neural network. Indeed, one\ncan verify that our general form (3.1) includes functions of the following:\n\nj (x(cid:63)) =(cid:80)j\u22121\nconstant complexity). The target function can be(cid:80)\nn = m, then(cid:80)\n\ni \u03c6(xi) = 0, otherwise it is non-zero.\n\ni=2 A\u2217\n\nF \u2217\n\nExample 3.6. Counting is an example task that falls into our concept class. Speci\ufb01cally, one can\nde\ufb01ne \u03c6 such that \u03c6(a) = 1 and \u03c6(b) = \u22121 (this can be achieved by a quadratic function with\ni \u03c6(xi). 
If the sequence is x = anbm such that\n\nj\u2212i\u03c6(W \u2217\n\nj\u2212ix(cid:63)\n\ni ) .\n\n4 Our Result: RNN Provably Learns the Concept Class\nSuppose the distribution D is generated by some (unknown) target function F \u2217 of the form (3.1) in\nthe concept class with population risk OPT, namely,\n\nand suppose we are given training dataset Z consisting of N i.i.d. samples from D. We consider the\nfollowing stochastic training objective\n\nE(x(cid:63),y(cid:63))\u223cD\n\n(cid:1)(cid:105) \u2264 OPT ,\n(cid:104)(cid:80)L\nj=3 G(cid:0)F \u2217\nObj(W (cid:48)) := E(x(cid:63),y(cid:63))\u223cZ(cid:2)Obj(x(cid:63), y(cid:63); W (cid:48))(cid:3)\nj=3 G(cid:0)\u03bbFj(x(cid:63); W + W (cid:48)), y(cid:63)\nwhere Obj(x(cid:63), y(cid:63); W (cid:48)) :=(cid:80)L\n\nj (x(cid:63)), y(cid:63)\nj\n\nj\n\n(cid:1)\n\nAbove, W \u2208 Rm\u00d7m is random initialization, W (cid:48) \u2208 Rm\u00d7m is the additional shift, and \u03bb \u2208 (0, 1) is\na constant scaling factor on the network output.5 We consider the vanilla stochastic gradient descent\n(SGD) algorithm with step size \u03b7. In each iteration t = 1, 2, . . . , T , it updates\n\nWt = Wt\u22121 \u2212 \u03b7\u2207W (cid:48)Obj(x(cid:63), y(cid:63); Wt\u22121)\n\nfor a random sample (x(cid:63), y(cid:63)) from the training set Z.\n\nTheorem 1. For every 0 < \u03b5 < (cid:101)O(cid:0)\nand \u03bb = (cid:101)\u0398(cid:0) \u03b5\nN = |Z| \u2265 poly(C, \u03b5\u22121, log m), then SGD with \u03b7 = (cid:101)\u0398(cid:0)\n(cid:16) p2C 2poly(L, d)\nT = (cid:101)\u0398\n\n(cid:1), de\ufb01ne complexity C = C\u03b5(\u03a6,\n(cid:1), if the number of neurons m \u2265 poly(C, \u03b5\u22121) and the number of samples is\n(cid:1) and\n(cid:17)\n\npoly(L,d)\u00b7p\u00b7Cs(\u03a6,O(\n\n\u03b5L2d2m\n\nL)\n\nL2d\n\nL))\n\n\u221a\n\n1\n\n1\n\n(SGD)\n\u221a\n\n\u03b52\n\n5Equivalently, one can scale matrix B by factor \u03bb. For notational simplicity, we split the matrix into W +W (cid:48)\nbut this does not change the algorithm since gradient with respect to W + W (cid:48) is the same with respect to W (cid:48).\n\n4\n\n\fsatis\ufb01es that, with probability at least 1 \u2212 e\u2212\u2126(\u03c12) over the random initialization\n\n(cid:104) 1\n\nT\u22121(cid:88)\n\nT\n\nt=0\n\nE\nsgd\n\nE\n\n(x(cid:63),y(cid:63))\u223cD\n\n(cid:2)Obj(x(cid:63), y(cid:63); W + Wt)(cid:3)(cid:105) \u2264 OPT + \u03b5 .\n\nAbove, Esgd takes expectation with respect to the randomness of SGD. Since SGD takes only one\nexample per iteration, the sample complexity N is also bounded by T .\n\n4.1 Our Contribution, Interpretation, and Discussion\nSample complexity. Our sample complexity only scales with log(m), making the result applicable\nto over-parameterized RNNs that have m (cid:29) N. Following Example 2.1, if \u03c6(z) is constant degree\npolynomial we have C = poly(L, log \u03b5\u22121) so Theorem 1 says that RNN learns such concept class\n\nwith size m =\n\npoly(L, d, p)\n\npoly(\u03b5)\n\nand sample complexity min{N, T} =\n\np2poly(L, d, log m)\n\n\u03b52\n\nIf \u03c6(z) is a function with good Taylor truncation, such as ez \u2212 1, sin z, sigmoid(z) or tanh(z), then\nC = LO(log(1/\u03b5)) is almost polynomial.\nNon-linear measurements. Our result shows that vanilla RNNs can ef\ufb01ciently learn a weighted\naverage of non-linear measurements of the input. As we argued in Example 3.5, this at least includes\nfunctions where the output at each token is generated from inputs of previous tokens using any two-\nlayer neural networks. 
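For concreteness before the discussion continues, here is a minimal runnable sketch (ours, not code from the paper) of the learner in Definition 3.3 together with one update of (SGD): A and B stay at their random initialization, only the shift W′ is trained, and the network output is scaled by λ. The tiny dimensions, the squared loss standing in for the generic convex 1-Lipschitz loss G, and the finite-difference gradient are illustrative assumptions; the [x*_ℓ]_{d_x} = 1/2 convention of Section 3 is omitted.

import numpy as np

def init_rnn(m, d_x, d, rng):
    # Random initialization from Definition 3.3: entries of W, A ~ N(0, 2/m), B ~ N(0, 1/d).
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d_x + 1))
    B = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, m))
    return W, A, B

def forward(W, A, B, xs, lam):
    # Elman RNN of Definition 3.3: h_0 = 0, h_l = relu(W h_{l-1} + A x_l), y_l = B h_l.
    h = np.zeros(W.shape[0])
    ys = []
    for x in xs:
        h = np.maximum(W @ h + A @ x, 0.0)
        ys.append(lam * (B @ h))          # lambda-scaled output, as in Obj
    return ys

def objective(W0, W_prime, A, B, xs, y_star, lam):
    # Obj(x*, y*; W') with a squared loss in place of G, summed over tokens j = 3, ..., L.
    ys = forward(W0 + W_prime, A, B, xs, lam)
    return sum(0.5 * np.sum((ys[j] - y_star[j]) ** 2) for j in range(2, len(xs)))

def numerical_grad(f, X, h=1e-4):
    # Finite-difference gradient: fine for a toy m, not a substitute for backpropagation.
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2.0 * h)
    return G

# One step of (SGD) on a single sample (x*, y*), with the actual input built as in Definition 3.1.
rng = np.random.default_rng(0)
m, d_x, d, L = 32, 8, 4, 6
eps_x, lam, eta = 0.1, 0.05, 0.01
W0, A, B = init_rnn(m, d_x, d, rng)

x_star = [v / np.linalg.norm(v) for v in rng.normal(size=(L - 2, d_x))]   # ||x*_l|| = 1
xs = [np.append(np.zeros(d_x), 1.0)]                     # x_1 = (0, 1)
xs += [np.append(eps_x * v, 0.0) for v in x_star]        # x_l = (eps_x x*_l, 0), l = 2..L-1
xs += [np.append(np.zeros(d_x), eps_x)]                  # x_L = (0, eps_x)
y_star = [None, None] + [rng.normal(size=d) for _ in range(L - 2)]        # labels y*_3..y*_L

W_prime = np.zeros((m, m))
g = numerical_grad(lambda Wp: objective(W0, Wp, A, B, xs, y_star, lam), W_prime)
W_prime = W_prime - eta * g
print(objective(W0, W_prime, A, B, xs, y_star, lam))

Training then repeats this single-sample update for T iterations with fresh samples; Theorem 1 bounds the resulting population risk.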
Averages of non-linear measurements can be quite powerful, achieving state-of-the-art performance in some sequential applications such as sentence embedding [4] and many others [23], and they form the basis of the attention mechanism in RNNs [5].
Adapt to tokens. In the target function, Φ_{i→j,r,s} can be different at each token, meaning that they can adapt to the positions of the input tokens. We emphasize that the positions of the tokens (namely, the values i, j) are not directly fed into the network; rather, they are discovered through sequentially reading the input. As one can see from our proofs, the ability to adapt to the tokens comes from the inhomogeneity of the hidden layers h_ℓ: even when x_ℓ = x_ℓ′ for different tokens ℓ′ ≠ ℓ, there is still a big difference between h_ℓ and h_ℓ′. Although the same operator is applied to x_ℓ and x_ℓ′, RNNs can still use this crucial inhomogeneity to learn different functions at different tokens.
In our result, the function Φ_{i→j,r,s} adapts only to the positions of the input tokens, but in many applications we would like the function to adapt to the values of the past tokens x*_1, . . . , x*_{i−1} as well. We believe a study of other models (such as LSTM [15]) can potentially settle these questions.
Long term memory. It is commonly believed that vanilla RNNs cannot capture long term dependencies in the input. This does not contradict our result. Our complexity parameter C_ε(Φ, √L) of the learning process in Theorem 1 indeed suffers from L, the length of the input sequence. This is because, in a vanilla RNN, the hidden neurons h_ℓ incorporate more and more noise as the time horizon ℓ increases, making the new signal Ax_ℓ less and less significant.
Comparison to feed-forward networks. Recently there have been many interesting results on analyzing the learning process of feed-forward neural networks [7, 8, 11, 16, 18–20, 26, 27, 29, 30, 32]. Most of them either assume that the input is structured (e.g., Gaussian or separable) or only consider linear networks. Allen-Zhu, Li, and Liang [1] show a result in the same flavor as this paper but for two- and three-layer feedforward networks. Since RNNs apply the same unit repeatedly to each input token in a sequence, our analysis is significantly different from [1] and involves many additional difficulties.⁶

⁶More specifically, Allen-Zhu, Li, and Liang [1] study two (or three) layer feedforward networks, which use one (or two) hidden weight matrices to learn one target function. Here in RNNs, there is only one weight matrix shared across the entire time horizon to learn L target functions at different input positions. In other words, using "one weight" to learn "one target function" is known from prior work, but using "one weight" to efficiently learn "L different target functions" is substantially more difficult, especially when the position information is not given as input to the network. For example, our theorem implies that an RNN can distinguish the sequence "AAAB" from "AABA", since the order of A and B differs. This requires the RNN to keep track, using one weight matrix, of the position information of the symbols in the sequence.
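To make the preceding discussion concrete, the following toy sketch (ours, not from the paper) instantiates targets of the form (3.1): a counting target in the spirit of Example 3.6, and a position-dependent target that separates "AAAB" from "AABA" as in the footnote above. The token embeddings, the measurement direction w*, and the specific Φ's are illustrative assumptions, and the simplifying condition Φ(0) = 0 of Section 3.2 is ignored here.

import numpy as np

# Illustrative unit-vector token embeddings; the paper only assumes ||x*_i|| = 1.
EMB = {"A": np.array([1.0, 0.0]), "B": np.array([0.0, 1.0])}
w_star = np.array([1.0, 0.0])                  # a fixed unit measurement direction

def phi_count(z):
    # Quadratic measurement with phi(<w*, A>) = +1 and phi(<w*, B>) = -1 (Example 3.6):
    # phi(z) = 2 z^2 - 1 gives phi(1) = 1 and phi(0) = -1.
    return 2.0 * z ** 2 - 1.0

def target_counting(tokens):
    # F*(x*) = sum_i phi(<w*, x*_i>): zero iff the sequence has as many A's as B's.
    return sum(phi_count(EMB[t] @ w_star) for t in tokens)

def target_position_aware(tokens):
    # A target of the form (3.1) with position-dependent measurements
    # Phi_{i -> j}(z) = c_i * z, so the same token contributes differently
    # depending on where it appears in the sequence.
    return sum((i + 2.0) * (EMB[t] @ w_star) for i, t in enumerate(tokens))

print(target_counting(list("AABB")))                                              # 0.0
print(target_counting(list("AAAB")))                                              # 2.0
print(target_position_aware(list("AAAB")), target_position_aware(list("AABA")))   # 9.0 vs 10.0

Learning such position-dependent targets with a single shared recurrent weight matrix is exactly what Theorem 1 guarantees.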
4.2 Conclusion

We show that RNNs can learn a notable concept class efficiently, using the simple SGD method, with sample complexity polynomial or almost-polynomial in the input length. This concept class at least includes functions where each output token is generated from inputs of earlier tokens using a smooth neural network. To the best of our knowledge, this is the first proof that some non-trivial concept class is efficiently learnable by RNNs. Our sample complexity is almost independent of m, making the result applicable to over-parameterized settings. On a separate note, our proof explains why the same recurrent unit is capable of learning various functions from different input tokens to different output tokens.
Sections 6 through 9 give proof sketches. Our final proofs rely on many other technical properties of RNNs that may be of independent interest, such as properties of RNNs at random initialization (which we include in Sections B and C) and stability properties of RNNs (which we include in Sections D, E and F). Some of these properties are simple modifications of prior work, but some are completely new and require new proof techniques (namely, Sections C, D and E).

PROOF SKETCH

Our proof of Theorem 1 divides into four conceptual steps.

1. We obtain a first-order approximation of how much the outputs of the RNN change if we move from W to W + W′. This change (up to small error) is a linear function in W′ (see Section 6). (This step can be derived from prior work [2] without much difficulty.)
2. We construct some (unknown) matrix W⊤ ∈ R^{m×m} so that this "linear function", when evaluated on W⊤, approximately gives the target F* in the concept class (see Section 5). (This step is the most interesting part of this paper.)
3. We argue that the SGD method moves in a direction nearly as good as W⊤ and thus efficiently decreases the training objective (see Section 7). (This is a routine analysis of SGD in the non-convex setting, given Steps 1 and 2.)
4. We use the first-order linear approximation to derive a Rademacher complexity bound that does not grow exponentially in L (see Section 8). By feeding the output of SGD into this Rademacher complexity bound, we finish the proof of Theorem 1 (see Section 9). (This is a one-page proof given Steps 1, 2 and 3.)

Although our proofs are technical, to help the reader we write seven pages of proof sketches for Steps 1 through 4; these can be found in Sections 5 through 9. Due to space limitations, we only include Section 5 in the main body. We introduce some notation for analysis purposes.
Definition 4.1. For each ℓ ∈ [L], let D_ℓ ∈ R^{m×m} be the diagonal matrix where

As a result, we can write h_ℓ = D_ℓ(W h_{ℓ−1} + A x_ℓ).
For each 1 \u2264 (cid:96) \u2264 a \u2264 L, we de\ufb01ne\n\n(D(cid:96))k,k = 1(W\u00b7h(cid:96)\u22121+Ax(cid:96))k\u22650 = 1(g(cid:96))k\u22650 .\n\nBack(cid:96)\u2192a = BDaW \u00b7\u00b7\u00b7 D(cid:96)+1W \u2208 Rd\u00d7m.\n\nwith the understanding that Back(cid:96)\u2192(cid:96) = B \u2208 Rd\u00d7m.\nThroughout the proofs, to simplify notations when specifying polynomial factors, we introduce\n\n\u03c1 := 100Ld log m and\n\n\u0001 :=\n\nWe assume m \u2265 poly(\u0001) for some suf\ufb01ciently large polynomial factor.\n\n\u221a\n\nL) \u00b7 log m\n\n100Ldp \u00b7 C\u03b5(\u03a6,\n\u03b5\n\n5 Existence of Good Network Through Backward\nOne of our main contributions is to show the existence of some \u201cgood linear network\u201d to approximate\nany target function. Let us explain what this means. Suppose W, A, B are at random initialization.\nWe consider a linear function over W\n\nfj(cid:48) :=(cid:80)j(cid:48)\n\n(cid:62) \u2208 Rm\u00d7m:\ni(cid:48)=2 Backi(cid:48)\u2192j(cid:48) Di(cid:48)W\n\n(cid:62)\n\nhi(cid:48)\u22121 .\n\n(5.1)\n\n6\n\n\fAs we shall see later, in \ufb01rst-order approximation, this linear function captures how much the output\nof the RNN changes at token j(cid:48), if one we move W to W + W (cid:48). The goal in this section is to\n(cid:62) \u2208 Rm\u00d7m satisfying that, for any true input x(cid:63) in the support of D, if we de\ufb01ne\nconstruct some W\nthe actual input x according to x(cid:63) (see De\ufb01nition 3.1), then with high probability,\n\n(cid:80)\nr\u2208[p] \u03a6i\u2192j(cid:48),r,s(cid:48)((cid:104)w\u2217\n(5.2)\n(cid:62) can simultaneously\nIn our sketched proof below, it shall become clear how this same matrix W\nrepresent functions \u03a6i\u2192j(cid:48) that come from different input tokens i. Since SGD can be shown to\ndescend in a direction \u201ccomparable\u201d to W\n\n(cid:62), it converges to a matrix W with similar guarantees.\n\nj(cid:48),s(cid:48)(x(cid:63)) =(cid:80)j(cid:48)\u22121\n\ni (cid:105))\ni\u2192j(cid:48),r,s(cid:48), x(cid:63)\n\nfj(cid:48),s(cid:48) \u2248 F \u2217\n\n\u2200s(cid:48) \u2208 [d]\n\ni=2\n\nIndicator to Function\n\n5.1\nIn order to show (5.2), we \ufb01rst show a variant of the \u201cindicator to function\u201d lemma from [1].\nLemma 5.1 (indicator to function). For every smooth function \u03a6 : R \u2192 R, every unit vector\nw\u2217 \u2208 Rdx with w\u2217\n= 0, every constant \u03c3 \u2265 0.1, every constant \u03b3 > 1, every constant \u03b5e \u2208\n\n1\n\ndx\n\nCs(\u03a6,O(\u03c3))\n\n(cid:1), there exists\n(cid:0)0,\n(a) (cid:12)(cid:12)Ea\u223cN (0,I),n\u223cN (0,\u03c32)\n(b) (cid:12)(cid:12)Ea\u223cN (0,I),n\u223cN (0,\u03c32)\n\nsuch that for every \ufb01xed unit vectors x(cid:63) \u2208 Rdx with x(cid:63)\n\nC(cid:48) = C\u03b5e (\u03a6, \u03c3) and a function H : R \u2192 [\u2212C(cid:48), C(cid:48)],\n\n(cid:2)1(cid:104)a,x(cid:63)(cid:105)+n\u22650H (a)(cid:3) \u2212 \u03a6((cid:104)w\u2217, x(cid:63)(cid:105))(cid:12)(cid:12) \u2264 \u03b5e\n(cid:2)1(cid:104)a,x(cid:63)(cid:105)+\u03b3n\u22650H (a)(cid:3) \u2212 \u03a6(0)(cid:12)(cid:12) \u2264 \u03b5e + O(cid:0) C(cid:48) log(\u03b3\u03c3)\n\n2 ,\n= 1\n\ndx\n\n\u03b3\u03c3\n\n(cid:1)\n\n(on target)\n\n(off target)\n\nAbove, Lemma 5.1a says that we can use a bounded function 1(cid:104)a,x(cid:63)(cid:105)+n\u22650H (a) to \ufb01t a target func-\ntion \u03a6((cid:104)w\u2217, x(cid:63)(cid:105)), and Lemma 5.1b says that if the magnitude of n is large then this function is close\nto being constant. 
For such reason, we can view n as \u201cnoise.\u201d While the proof of 5.1a is from\nprior work [1], our new property 5.1b is completely new and it requires some technical challenge to\nsimultaneously guarantee 5.1a and 5.1b. The proof is in Appendix G.1\n\n5.2 Fitting a Single Function\ni (cid:105)). For this\nWe now try to apply Lemma 5.1 to approximate a single function \u03a6i\u2192j,r,s((cid:104)w\u2217\npurpose, let us consider two (normalized) input sequences. The \ufb01rst (null) sequence x(0) is given as\n\ni\u2192j,r,s, x(cid:63)\n\nThe second sequence x is generated from an input x(cid:63) in the support of D (recall De\ufb01nition 3.1). Let\n\nx(0)\n1 = (0dx , 1)\n\nand x(0)\n\n(cid:96) = (0dx , \u03b5x) for (cid:96) = 2, 3, . . . , L\n\n\u2022 h(cid:96), D(cid:96), Backi\u2192j be de\ufb01ned with respect to W, A, B and input sequence x, and\n\u2022 h(0)\ni\u2192j be de\ufb01ned with respect to W, A, B and input sequence x(0)\n\n, Back(0)\n\n, D(0)\n\n(cid:96)\n\n(cid:96)\n\n(cid:96)\n\nWe remark that h(0)\nhas the good property that it does not depend x(cid:63) but somehow stays \u201cclose\nenough\u201d to the true h(cid:96) (see Appendix D for a full description).\nLemma 5.2 (\ufb01t single function). For every 2 \u2264 i < j \u2264 L, r \u2208 [p], s \u2208 [d] and every constant\n\u221a\n\n(cid:1), there exists C(cid:48) = C\u03b5e(\u03a6i\u2192j,r,s,\n\nL) so that, for every\n\n\u221a\n\n1\n\n\u03b5c = \u03b5e\u03b5x\n4C(cid:48)\n\n(cid:3), such that, let\n\n,\n\n1\n\nL))\n\n\u03c14C(cid:48)\n\nCs(\u03a6i\u2192j,r,s,O(\n\n\u03b5x \u2208 (0,\n\nand\n, 4(C(cid:48))2\n\n\u03b5e \u2208(cid:0)0,\n(cid:1)\nthere exists a function Hi\u2192j,r,s : R \u2192(cid:2) \u2212 4(C(cid:48))2\n\u2022 (cid:101)wk,(cid:101)ak \u223c N(cid:0)0, 2I\n(cid:12)(cid:12)(cid:12)(cid:12) E(cid:101)wk,(cid:101)ak\n\nwith probability at least 1 \u2212 e\u2212\u2126(\u03c12) over W and A,\n(a) (on target)\n\n(cid:20)\n1|(cid:104)(cid:101)wk,h(0)\n\n(cid:1) be freshly new random vectors,\n\ni\u22121(cid:105)|\u2264 \u03b5c\u221a\n\n(cid:21)\n1(cid:104)(cid:101)wk,hi\u22121(cid:105)+(cid:104)(cid:101)ak,xi(cid:105)\u22650Hi\u2192j,r,s((cid:101)ak)\n\n\u03b5e\u03b5x\n\n\u03b5e\u03b5x\n\nm\n\nm\n\n\u2022 x be a \ufb01xed input sequence de\ufb01ned by some x(cid:63) in the support of D (see De\ufb01nition 3.1),\n\u2022 W, A be at random initialization,\n\u2022 h(cid:96) be generated by W ,A,x and h(0)\n\n(cid:96) be generated by W ,A,x(0), and\n\n\u2212 \u03a6i\u2192j,r,s((cid:104)w\u2217\n\ni (cid:105))\ni\u2192j,r,s, x(cid:63)\n\n(cid:12)(cid:12)(cid:12)(cid:12) \u2264 \u03b5e\n\n7\n\n\f(cid:20)\n(b) (off target), for every i(cid:48) (cid:54)= i\n1|(cid:104)(cid:101)wk,h(0)\n\n(cid:12)(cid:12)(cid:12)(cid:12) E(cid:101)wk,(cid:101)ak\n\ni\u22121(cid:105)|\u2264 \u03b5c\u221a\n\nm\n\n1(cid:104)(cid:101)wk,hi(cid:48)\u22121(cid:105)+(cid:104)(cid:101)ak,xi(cid:48)(cid:105)\u22650Hi\u2192j,r,s((cid:101)ak)\n\n(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) \u2264 \u03b5e\n\nLemma 5.2 implies there is a quantity 1|(cid:104)(cid:101)wk,h(0)\nfunction and the random initialization (namely, (cid:101)wk,(cid:101)ak) such that,\n\u2022 when multiplying 1(cid:104)(cid:101)wk,hi\u22121(cid:105)+(cid:104)(cid:101)ak,xi(cid:105)\u22650 gives the target \u03a6i\u2192j,r,s((cid:104)w\u2217\n\u2022 when multiplying 1(cid:104)(cid:101)wk,hi(cid:48)\u22121(cid:105)+(cid:104)(cid:101)ak,xi(cid:48)(cid:105)\u22650 gives near zero.\n\ni\u22121(cid:105)|\u2264 \u03b5c\u221a\n\nm\n\ni\u2192j,r,s, x(cid:63)\n\ni (cid:105), 
but\n\nHi\u2192j,r,s((cid:101)ak) that only depends on the target\n\nThe full proof is in Appendix G.2 but we sketch why Lemma 5.2 can be derived from Lemma 5.1.\n\nx\n\ni(cid:48), 0)(cid:11); but\nm ) because (cid:107)hi(cid:48)\u22121(cid:107) \u2248 1 by random init. (see Lemma B.1a).\ntimes larger than (cid:104)(cid:101)ak, xi(cid:48)(cid:105).\nm because i = i(cid:48). Since h(0) can\nm. Condition-\n\nSketch proof of Lemma 5.2. Let us focus on indicator 1(cid:104)(cid:101)wk,hi(cid:48)\u22121(cid:105)+(cid:104)(cid:101)ak,xi(cid:48)(cid:105)\u22650:\nm ) because (cid:104)(cid:101)ak, xi(cid:48)(cid:105) =(cid:10)((cid:101)ak, (\u03b5xx(cid:63)\n\u2022 (cid:104)(cid:101)ak, xi(cid:48)(cid:105) is distributed like N (0, 2\u03b52\n\u2022 (cid:104)(cid:101)wk, hi(cid:48)\u22121(cid:105) is roughly N (0, 2\nThus, if we treat (cid:104)(cid:101)wk, hi(cid:48)\u22121(cid:105) as the \u201cnoise n\u201d in Lemma 5.1 it can be 1\nTo show Lemma 5.2a, we only need to focus on |(cid:104)(cid:101)wk, h(0)\nbe shown close to h (see Lemma D.1), this is almost equivalent to |(cid:104)(cid:101)wk, hi(cid:48)\u22121(cid:105)| \u2264 \u03b5c\u221a\ni(cid:48)\u22121(cid:105)| \u2264 \u03b5c\u221a\nTo show Lemma 5.2a, we can show when i(cid:48) (cid:54)= i, the indicator on |(cid:104)(cid:101)wk, hi\u22121(cid:105)| \u2264 \u03b5c\u221a\ninformation about the true noise (cid:104)(cid:101)wk, hi(cid:48)\u22121(cid:105). This is so because hi\u22121 and hi(cid:48)\u22121 are somewhat\nuncorrelated (details in Lemma B.1k). As a result, the \u201cnoise n\u201d is still large and thus Lemma 5.1b\n(cid:3)\napplies with \u03a6i\u2192j,r,s(0) = 0.\n\ning on this happens, the \u201cnoise n\u201d must be small so we can apply Lemma 5.1a.\n\nm gives little\n\n\u03b5x\n\n5.3 Fitting the Target Function\nWe are now ready to design W\n\nDe\ufb01nition 5.3. Suppose \u03b5e \u2208(cid:0)0,\n\n\u03b5c :=\n\nL\u22121(cid:88)\n\n\u03b5e\u03b5x\n4C(cid:48)\n\nL(cid:88)\n\n, C :=\n\n(cid:88)\n\ni=2\n\nj=i+1\n\nr\u2208[p],s\u2208[d]\n\n(cid:62)\nk :=\n\nw\n\n(cid:62) \u2208 Rm\u00d7m using Lemma 5.2.\n\n\u221a\n\nL))\n\n\u221a\n\n(cid:1), C(cid:48) = C\u03b5e(\u03a6,\n(cid:13)(cid:13)(cid:13)e(cid:62)\n\n1\nm\n\n, Ci\u2192j,s :=\n\n(cid:1), we choose\n\n1\n\n\u03c14C(cid:48)\n(cid:107)h(0)\ni\u22121(cid:107)2 .\n\n(cid:13)(cid:13)(cid:13)2\n\n2\n\nL), \u03b5x \u2208 (0,\n\ns Back(0)\ni\u2192j\n\n1\nCs(\u03a6,O(\n\n4(C(cid:48))2\n\u03b5e\u03b5x\n\n1\n\nmCi\u2192j,s\n\ne(cid:62)\ns Back(0)\ni\u2192j\n\n1|(cid:104)wk,h(0)\n\ni\u22121(cid:105)|\u2264 \u03b5c\u221a\n\nm\n\nHi\u2192j,r,s(ak)h(0)\ni\u22121\n\nk\n\nWe construct W\n\n(cid:62) \u2208 Rm\u00d7m by de\ufb01ning its k-th row vector as follows:\n\n(cid:104)\n(cid:13)(cid:13)(cid:13)e(cid:62)\n\n(cid:105)\n(cid:13)(cid:13)(cid:13)2\n\nAbove, functions Hi\u2192j,r,s : R \u2192(cid:2) \u2212 C, C(cid:3) come from Lemma 5.2.\n\nwhere Ci\u2192j,s :=\n\ns Back(0)\ni\u2192j\n\ni\u22121(cid:107)2\n\n(cid:107)h(0)\n\n1\nm\n\n2\n\nThe following lemma that says fj(cid:48),s(cid:48) is close to the target function F \u2217\nj(cid:48),s(cid:48).\n(cid:62) in De\ufb01nition 5.3 satis\ufb01es the\nLemma 5.4 (existence through backward). The construction of W\nfollowing. 
For every normalized input sequence x generated from x(cid:63) in the support of D, we have\nwith probability at least 1 \u2212 e\u2212\u2126(\u03c12) over W, A, B, it holds for every 3 \u2264 j(cid:48) \u2264 L and s(cid:48) \u2208 [d]\n\nfj(cid:48),s(cid:48) =\n\n\u03a6i\u2192j(cid:48),r,s(cid:48)((cid:104)w\u2217\n\ni\u2192j(cid:48),r,s(cid:48), x(cid:63)\n\np\u03c111 \u00b7 O(\u03b5e + Cs(\u03a6, 1)\u03b51/3\n\nx + Cm\u22120.05)\n\n.\n\n(cid:17)\n\nj(cid:48)\u22121(cid:88)\n\n(cid:88)\n\ni=2\n\nr\u2208[p]\n\ni (cid:105)) \u00b1(cid:16)\n\n8\n\n\fProof sketch of Lemma 5.4. Using de\ufb01nition of fj(cid:48),s(cid:48) in (5.1) and W\ne(cid:62)\ns Back(0)\ni\u2192j\n\n(cid:2)e(cid:62)\ns(cid:48) Backi(cid:48)\u2192j(cid:48)(cid:3)\n\nfj(cid:48),s(cid:48) =\n\n(cid:88)\n\n(cid:88)\n\n(cid:88)\n\n1\n\n(cid:104)\n\n(cid:105)\n\n(cid:18)\n\n(cid:62), one can write down\n\nmCi\u2192j(cid:48),s\n\u00d7 1|(cid:104)wk,h(0)\n\ni\u22121(cid:105)|\u2264 \u03b5c\u221a\n\nm\n\nk\n\n(cid:19)\ni\u22121(cid:105)\n1[gi(cid:48) ]k\u22650Hi\u2192j,r,s(ak)(cid:104)hi(cid:48)\u22121, h(0)\n\nk\n\ni(cid:48),j(cid:48),j\n\nr\u2208[p],s\u2208[d]\n\nk\u2208[m]\n\n(5.3)\n\nNow,\n\n\u2022 The summands in (5.3) with i (cid:54)= i(cid:48) are negligible owing to Lemma 5.2b.\n\u2022 The summands in (5.3) with i = i(cid:48) but j (cid:54)= j(cid:48) are negligible, after proving that Backi\u2192j and\n\u2022 The summands in (5.3) with s (cid:54)= s(cid:48) are negligible using the randomness of B.\n\u2022 One can also prove Backi(cid:48)\u2192j(cid:48) \u2248 Back(0)\n\nBacki\u2192j(cid:48) are very uncorrelated (details in Lemma C.1).\n\ni(cid:48)\u2192j(cid:48) and hi(cid:48)\u22121 \u2248 h(0)\n\ni(cid:48)\u22121 (details in Lemma D.1).\n\n(cid:19)\n\nTogether,\n\nfj(cid:48),s(cid:48) \u2248 j(cid:48)\u22121(cid:88)\n\ni(cid:48)=2\n\n(cid:88)\n\n(cid:88)\n\nr\u2208[p]\n\nk\u2208[m]\n\n(cid:18)\n\n(cid:0)(cid:104)\n\ne(cid:62)\ns(cid:48) Back(0)\n\ni(cid:48)\u2192j(cid:48)\n\n(cid:1)2\n\n(cid:105)\n\nk\n\n1\n\nmCi(cid:48)\u2192j(cid:48),s(cid:48)\n\n\u00b7 1|(cid:104)wk,h(0)\n\ni(cid:48)\u22121\n\n1[gi(cid:48) ]k\u22650Hi(cid:48)\u2192j,r,s(cid:48)(ak)(cid:107)h(0)\n\ni(cid:48)\u22121(cid:107)2\n\n(cid:105)|\u2264 \u03b5c\u221a\n\nm\n\nApplying Lemma 5.2a and using our choice of Ci(cid:48)\u2192j(cid:48),s(cid:48), this gives (in expectation)\nj(cid:48),s(cid:48)(x(cid:63)) .\n\ni (cid:105)) = F \u2217\n\ni\u2192j(cid:48),r,s(cid:48), x(cid:63)\n\n(cid:80)\nr\u2208[p] \u03a6i\u2192j(cid:48),r,s(cid:48)((cid:104)w\u2217\n\nfj(cid:48),s(cid:48) \u2248(cid:80)j(cid:48)\u22121\n\ni=2\n\nProving concentration (with respect to k \u2208 [m]) is a lot more challenging due to the sophisticated\nfresh new samples (cid:101)wk,(cid:101)ak for all k \u2208 N and apply concentration only with respect to k \u2208 N . Here,\ncorrelations across different indices k. To achieve this, we replace some of the pairs wk, ak with\nN is a random subset of [m] with cardinality m0.1. We show that the network stabilizes (details in\n(cid:3)\n(cid:1) (see Claim G.1). Crucially, this Frobenius norm scales\nSection E) against such re-randomization. Full proof is in Section G.3.\nFinally, one can show (cid:107)W\nin m\u22121/2 so standard SGD analysis shall ensure that our sample complexity does not depend on m\n(up to log factors).\n\n(cid:62)(cid:107)F \u2264 O(cid:0) p\u03c13C\u221a\n\nm\n\nReferences\n[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and Generalization in Overpa-\nrameterized Neural Networks, Going Beyond Two Layers. 
In NeurIPS, 2019. Full version\navailable at http://arxiv.org/abs/1811.04918.\n\n[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent\nneural networks. In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1810.\n12065.\n\n[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via\nover-parameterization. In ICML, 2019. Full version available at http://arxiv.org/abs/\n1811.03962.\n\n[4] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sen-\n\ntence embeddings. In ICLR, 2017.\n\n[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by\n\njointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[6] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds\n\nand structural results. Journal of Machine Learning Research, 3(Nov):463\u2013482, 2002.\n\n[7] Digvijay Boob and Guanghui Lan. Theoretical properties of the global optimizer of two layer\n\nneural network. arXiv preprint arXiv:1710.11241, 2017.\n\n9\n\n\f[8] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with\n\ngaussian inputs. arXiv preprint arXiv:1702.07966, 2017.\n\n[9] Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent\n\nneural networks, 2019. URL https://openreview.net/forum?id=Skf-oo0qt7.\n\n[10] Bhaskar Dasgupta and Eduardo D Sontag. Sample complexity for learning recurrent perceptron\n\nmappings. In Advances in Neural Information Processing Systems, pages 204\u2013210, 1996.\n\n[11] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with\n\nlandscape design. arXiv preprint arXiv:1711.00501, 2017.\n\n[12] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep\nIn Acoustics, speech and signal processing (icassp), 2013 ieee\n\nrecurrent neural networks.\ninternational conference on, pages 6645\u20136649. IEEE, 2013.\n\n[13] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical\n\nsystems. The Journal of Machine Learning Research, 19(1):1025\u20131068, 2018.\n\n[14] David Haussler. Decision theoretic generalizations of the pac model for neural net and other\n\nlearning applications. Information and Computation, 100(1):78\u2013150, 1992.\n\n[15] Sepp Hochreiter and J\u00a8urgen Schmidhuber. Long short-term memory. Neural computation, 9\n\n(8):1735\u20131780, 1997.\n\n[16] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Informa-\n\ntion Processing Systems, pages 586\u2013594, 2016.\n\n[17] Pascal Koiran and Eduardo D Sontag. Vapnik-chervonenkis dimension of recurrent neural\n\nnetworks. Discrete Applied Mathematics, 86(1):63\u201379, 1998.\n\n[18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic\ngradient descent on structured data. In Advances in Neural Information Processing Systems\n(NIPS), 2018.\n\n[19] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu\n\nactivation. In Advances in Neural Information Processing Systems, pages 597\u2013607, 2017.\n\n[20] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang.\n\nAlgorithmic regularization in over-\nparameterized matrix sensing and neural networks with quadratic activations. In COLT, 2018.\n\n[21] Percy Liang. CS229T/STAT231: Statistical Learning Theory (Winter 2016). 
https://web.\n\nstanford.edu/class/cs229t/notes.pdf, April 2016. accessed January 2019.\n\n[22] Andreas Maurer. A vector-contraction inequality for rademacher complexities. In International\n\nConference on Algorithmic Learning Theory, pages 3\u201317. Springer, 2016.\n\n[23] Jared Ostmeyer and Lindsay Cowell. Machine learning on sequential data using a recurrent\n\nweighted average. Neurocomputing, 2018.\n\n[24] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the dif\ufb01culty of training recurrent\nneural networks. In International Conference on Machine Learning, pages 1310\u20131318, 2013.\n\n[25] Hojjat Salehinejad, Julianne Baarbe, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh\nValaee. Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078, 2017.\n\n[26] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the\narXiv preprint\n\noptimization landscape of over-parameterized shallow neural networks.\narXiv:1707.04926, 2017.\n\n[27] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guar-\n\nantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.\n\n[28] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural\n\nnetworks. In Advances in neural information processing systems, pages 3104\u20133112, 2014.\n\n10\n\n\f[29] Yuandong Tian. An analytical formula of population gradient for two-layered relu network and\nits applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560,\n2017.\n\n[30] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks.\n\narXiv preprint Arxiv:1611.03131, 2016.\n\n[31] Jiong Zhang, Qi Lei, and Inderjit S Dhillon. Stabilizing gradients for deep neural networks via\n\nef\ufb01cient svd parameterization. arXiv preprint arXiv:1803.09327, 2018.\n\n[32] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guar-\n\nantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.\n\n11\n\n\f", "award": [], "sourceid": 5449, "authors": [{"given_name": "Zeyuan", "family_name": "Allen-Zhu", "institution": "Microsoft Research"}, {"given_name": "Yuanzhi", "family_name": "Li", "institution": "Princeton"}]}