{"title": "Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers", "book": "Advances in Neural Information Processing Systems", "page_first": 6158, "page_last": 6169, "abstract": "The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized?\n\nIn this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network.\n\nOn the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network, and connect it to the SGD theory of escaping saddle points.", "full_text": "Learning and Generalization in Overparameterized\n\nNeural Networks, Going Beyond Two Layers\u2217\n\nZeyuan Allen-Zhu\nMicrosoft Research AI\n\nzeyuan@csail.mit.edu\n\nYuanzhi Li\n\nCarnegie Mellon University\nyuanzhil@andrew.cmu.edu\n\nYingyu Liang\n\nUniversity of Wisconsin-Madison\n\nyliang@cs.wisc.edu\n\nAbstract\n\nThe fundamental learning theory behind neural networks remains largely open.\nWhat classes of functions can neural networks actually learn? Why doesn\u2019t the\ntrained network over\ufb01t when it is overparameterized?\nIn this work, we prove that overparameterized neural networks can learn some\nnotable concept classes, including two and three-layer networks with fewer pa-\nrameters and smooth activations. 
Moreover, the learning can be simply done by\nSGD (stochastic gradient descent) or its variants in polynomial time using poly-\nnomially many samples. The sample complexity can also be almost independent\nof the number of parameters in the network.\nOn the technique side, our analysis goes beyond the so-called NTK (neural tan-\ngent kernel) linearization of neural networks in prior works. We establish a new\nnotion of quadratic approximation of the neural network, and connect it to the\nSGD theory of escaping saddle points.\n\nIntroduction\n\n1\nNeural network learning has become a key machine learning approach and has achieved remarkable\nsuccess in a wide range of real-world domains, such as computer vision, speech recognition, and\ngame playing [25, 26, 30, 41].\nIn contrast to the widely accepted empirical success, much less\ntheory is known. Despite a recent boost of theoretical studies, many questions remain largely open,\nincluding fundamental ones about the optimization and generalization in learning neural networks.\nOne key challenge in analyzing neural networks is that the corresponding optimization is non-convex\nand is theoretically hard in the general case [40, 55]. This is in sharp contrast to the fact that simple\noptimization algorithms like stochastic gradient descent (SGD) and its variants usually produce good\nsolutions in practice even on both training and test data. 
Therefore,\n\nwhat functions can neural networks provably learn?\n\nAnother key challenge is that, in practice, neural networks are heavily overparameterized (e.g., [53]):\nthe number of learnable parameters is much larger than the number of the training samples.\nIt\nis observed that overparameterization empirically improves both optimization and generalization,\nappearing to contradict traditional learning theory.2 Therefore,\n\nwhy do overparameterized networks (found by those training algorithms) generalize?\n\n\u2217Full version and future updates can be found on https://arxiv.org/abs/1811.04918.\n2For example, Livni et al. [36] observed that on synthetic data generated from a target network, SGD con-\nverges faster when the learned network has more parameters than the target. Perhaps more interestingly, Arora\net al. [6] found that overparameterized networks learned in practice can often be compressed to simpler ones\nwith much fewer parameters, without hurting their ability to generalize; however, directly learning such simpler\nnetworks runs into worse results due to the optimization dif\ufb01culty. We also have experiments in Figure 1(a).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f1.1 What Can Neural Networks Provably Learn?\nMost existing works analyzing the learnability of neural networks [9, 12, 13, 20, 21, 28, 33, 34, 42,\n43, 47, 49, 50, 56] make unrealistic assumptions about the data distribution (such as being random\nGaussian), and/or make strong assumptions about the network (such as using linear activations).\nLi and Liang [32] show that two-layer ReLU networks can learn classi\ufb01cation tasks when the data\ncome from mixtures of arbitrary but well-separated distributions.\nA theorem without distributional assumptions on data is often more desirable. 
Indeed, how to obtain a result that does not depend on the data distribution, but only on the concept class itself, lies at the center of PAC-learning, which is one of the foundations of machine learning theory [48]. Also, studying non-linear activations is critical because otherwise one can only learn linear functions, which can also be easily learned via linear models without neural networks.
Brutzkus et al. [14] prove that two-layer networks with ReLU activations can learn linearly-separable data (and thus the class of linear functions) using just SGD. This is an (improper) PAC-learning type of result because it makes no assumption on the data distribution. Andoni et al. [5] prove that two-layer networks can learn polynomial functions of degree $r$ over $d$-dimensional inputs in sample complexity $O(d^r)$. Their learner networks use exponential activation functions, whereas in practice the rectified linear unit (ReLU) activation has been the dominant choice across vastly different domains.
On a separate note, if one treats all but the last layer of a neural network as generating a random feature map, then training only the last layer is a convex task, so one can learn the class of linear functions in this implicit feature space [15, 16]. This result implies that low-degree polynomials and compositional kernels can be learned by neural networks in polynomial time. Empirically, however, training only the last layer greatly weakens the power of neural networks (see Figure 1).
Our Result. We prove that an important concept class that contains three-layer (resp. two-layer) neural networks equipped with smooth activations can be efficiently learned by three-layer (resp. two-layer) ReLU neural networks via SGD or its variants.
Specifically, suppose that in the aforementioned class the best network (called the target function or target network) achieves a population risk OPT with respect to some convex loss function.
We show that one can learn up to population risk $\mathsf{OPT} + \varepsilon$, using three-layer (resp. two-layer) ReLU networks of size greater than a fixed polynomial in the size of the target network, in $1/\varepsilon$, and in the "complexity" of the activation function used in the target network. Furthermore, the sample complexity is also polynomial in these parameters, and only poly-logarithmic in the size of the learner ReLU network.
We stress here that this is agnostic PAC-learning because we allow the target function to have error (e.g., OPT can be positive for regression), and it is improper learning because the concept class consists of neural networks smaller than the networks being trained.
Our Contributions. We believe our result gives further insight into the fundamental questions about the learning theory of neural networks.
• To the best of our knowledge, this is the first result showing that using the hidden layers of neural networks one can provably learn a concept class containing two-layer (or even three-layer) neural networks with non-trivial activation functions.3
• Our three-layer result gives the first theoretical proof that learning neural networks, even with non-convex interactions across layers, can still be plausible.
In contrast, in the two-layer case the optimization landscape with overparameterization is almost convex [17, 32]; and in previous studies of the multi-layer case, researchers have weakened the network by applying the so-called NTK (neural tangent kernel) linearization to remove all non-convex interactions [4, 27].
• To some extent we explain why overparameterization improves testing accuracy: with larger overparameterization, one can hope to learn better target functions with possibly larger size, more complex activations, smaller risk OPT, and up to a smaller error $\varepsilon$.
• We establish new tools to tackle the learning process of neural networks in general, which can be useful for studying other network architectures and learning tasks. (E.g., the new tools here have allowed researchers to also study the learning of recurrent neural networks [2].)
3In contrast, Daniely [15] focuses on training essentially only the last layer (and the hidden-layer movement is negligible). After this paper appeared online, Arora et al. [8] showed that neural networks can provably learn two-layer networks with a slightly weaker class of smooth activation functions, namely, activation functions that are either linear functions or even functions.
Other Related Works. We acknowledge a different line of research using kernels as improper learners to learn the concept class of neural networks [22, 23, 36, 54]. This is very different from us because we use "neural networks" as learners. In other words, we study the question of "what can neural networks learn," while they study "what alternative methods can replace neural networks."
There is also a line of work studying the relationship between neural networks and NTKs (neural tangent kernels) [3, 4, 7, 27, 31, 51].
These works study neural networks by considering their "linearized approximations." There is a known performance gap between the power of real neural networks and the power of their linearized approximations. For instance, ResNet achieves 96% test accuracy on the CIFAR-10 data set, but NTK (even with infinite width) achieves only 77% [7]. We also illustrate this in Figure 1.

1.2 Why Do Overparameterized Networks Generalize?
Our result above assumes that the learner network is sufficiently overparameterized. So, why does it generalize to the population risk and give small test error? More importantly, why does it generalize with a number of samples that is (almost) independent of the number of parameters?
This question cannot be studied under traditional VC-dimension learning theory since the VC dimension grows with the number of parameters. Several works [6, 11, 24, 39] explain generalization by studying some other "complexity" of the learned networks. Most related to the discussion here is [11], where the authors prove a generalization bound in terms of the norms (of the weight matrices) of each layer, as opposed to the number of parameters. There are two main concerns with those results.
• Learnability = Trainability + Generalization. It is not clear from those results how a network with both low "complexity" and small training loss can be found by the training method. Therefore, they do not directly imply PAC-learnability for non-trivial concept classes (at least for those concept classes studied by this paper).
• Their norms are "sparsity-induced norms": for the norm not to scale with the number of hidden neurons $m$, essentially, it requires the number of neurons with non-zero weights not to scale with $m$.
This more or less reduces the problem to the non-overparameterized case.
At a high level, our generalization is made possible by the following sequence of conceptual steps.
• Good networks with small risks are plentiful: thanks to overparameterization, with high probability over the random initialization, there exists a good network in the close neighborhood of any point on the SGD training trajectory. (This corresponds to Sections 6.2 and 6.3.)
• The optimization in overparameterized neural networks has benign properties: essentially along the training trajectory, there are no second-order critical points for learning three-layer networks, and no first-order critical points for two-layer networks. (This corresponds to Section 6.4.)
• In the learned networks, information is also evenly distributed among neurons, by utilizing either implicit or explicit regularization. This structure allows a new generalization bound that is (almost) independent of the number of neurons. (This corresponds to Sections 6.5 and 6.6, and we also empirically verify it in Section 7.1.)
Since practical neural networks are typically overparameterized, we genuinely hope that our results can provide theoretical insights into networks used in various applications.

1.3 Roadmap
In the main body of this paper, we introduce notations in Section 2, present our main results and contributions for two and three-layer networks in Sections 3 and 4, and conclude in Section 5.
For readers interested in our novel techniques, we present in Section 6 an 8-page proof sketch of our three-layer result. For readers more interested in the practical relevance, we give more experiments in Section 7. In the appendix, we begin with mathematical preliminaries in Appendix A. Our full three-layer proof is in Appendix C.
Our two-layer proof is much simpler and is given in Appendix B.

Figure 1: Performance comparison. (a) $N = 1000$ and vary $m$; (b) $m = 2000$ and vary $N$. 3layer/2layer stands for training (the hidden weights in) three and two-layer neural networks. (last) stands for the conjugate kernel [15], meaning training only the output layer. (NTK) stands for the neural tangent kernel [27] with finite width. We also implemented other direct kernels such as [54], but they perform much worse.
Setup. We consider an $\ell_2$ regression task on synthetic data where the feature vectors $x \in \mathbb{R}^4$ are generated as normalized random Gaussians, and the label is generated by the target function $F^*(x) = (\sin(3x_1) + \sin(3x_2) + \sin(3x_3) - 2)^2 \cdot \cos(7x_4)$. We use $N$ training samples, and SGD with mini-batch size 50 and best-tuned learning rates and weight-decay parameters. See Appendix 7 for our experiment setup, how we chose this target function, and more experiments.

2 Notations
$\sigma(\cdot)$ denotes the ReLU function $\sigma(x) = \max\{x, 0\}$. Given $f : \mathbb{R} \to \mathbb{R}$ and a vector $x \in \mathbb{R}^m$, $f(x)$ denotes $(f(x_1), \dots, f(x_m))$. For a vector $w$, $\|w\|_p$ denotes its $\ell_p$ norm, and when clear from the context, we abbreviate $\|w\| = \|w\|_2$. For a matrix $W \in \mathbb{R}^{m \times d}$, we use $W_i$ (or sometimes $w_i$) to denote the $i$-th row of $W$. The row $\ell_p$ norm is $\|W\|_{2,p} := \big(\sum_{i \in [m]} \|W_i\|_2^p\big)^{1/p}$, the spectral norm is $\|W\|_2$, and the Frobenius norm is $\|W\|_F = \|W\|_{2,2}$. We say $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz continuous if $|f(x) - f(y)| \le L\|x - y\|_2$, and $L$-Lipschitz smooth if $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$.
Function complexity.
The following notion measures the complexity of any smooth activation function $\phi(z)$. Suppose $\phi(z) = \sum_{i=0}^{\infty} c_i z^i$. Given a non-negative $R$, the complexity
$$C_\varepsilon(\phi, R) := \sum_{i=0}^{\infty} \Big( (C^* R)^i + \Big(\tfrac{\sqrt{\log(1/\varepsilon)}}{\sqrt{i}}\, C^* R\Big)^i \Big) |c_i|, \qquad C_s(\phi, R) := C^* \sum_{i=0}^{\infty} (i+1)^{1.75} R^i |c_i|,$$
where $C^*$ is a sufficiently large constant (e.g., $10^4$). Intuitively, $C_s$ measures the sample complexity: how many samples are required to learn $\phi$ correctly; while $C_\varepsilon$ bounds the network size: how much overparameterization is needed for the algorithm to efficiently learn $\phi$ up to $\varepsilon$ error. It always holds that $C_s(\phi, R) \le C_\varepsilon(\phi, R) \le C_s(\phi, O(R)) \times \mathrm{poly}(1/\varepsilon)$,⁴ while for $\sin z$, $\exp(z)$, or low-degree polynomials, $C_s(\phi, O(R))$ and $C_\varepsilon(\phi, R)$ differ only by $o(1/\varepsilon)$.
Example 2.1. If $\phi(z) = e^{c \cdot z} - 1$, $\phi(z) = \sin(c \cdot z)$, $\phi(z) = \cos(c \cdot z)$ for constant $c$, or $\phi(z)$ is a low-degree polynomial, then $C_\varepsilon(\phi, 1) = o(1/\varepsilon)$ and $C_s(\phi, 1) = O(1)$. If $\phi(z) = \mathrm{sigmoid}(z)$ or $\tanh(z)$, we can truncate their Taylor series at degree $\Theta(\log \frac{1}{\varepsilon})$ to get an $\varepsilon$-approximation. One can verify that this gives $C_\varepsilon(\phi, 1) \le \mathrm{poly}(1/\varepsilon)$ and $C_s(\phi, 1) \le O(1)$.

3 Result for Two-Layer Networks
We consider learning some unknown distribution $\mathcal{D}$ of data points $z = (x, y) \in \mathbb{R}^d \times \mathcal{Y}$, where $x$ is the input point and $y$ is the label. Without loss of generality, assume $\|x\|_2 = 1$ and $x_d = \frac{1}{2}$.⁵ Consider a loss function $L : \mathbb{R}^k \times \mathcal{Y} \to \mathbb{R}$ such that for every $y \in \mathcal{Y}$, the function $L(\cdot, y)$ is non-negative, convex, 1-Lipschitz continuous, 1-Lipschitz smooth, and $L(0, y) \in [0, 1]$.
This includes both the cross-entropy loss and the $\ell_2$-regression loss (for bounded $\mathcal{Y}$).
⁴Recall $\big(\tfrac{\sqrt{\log(1/\varepsilon)}}{\sqrt{i}}\, C^*\big)^i \le e^{O(\log(1/\varepsilon))} = \frac{1}{\mathrm{poly}(\varepsilon)}$ for every $i \ge 1$.
⁵$\frac{1}{2}$ can always be padded to the last coordinate, and $\|x\|_2 = 1$ can always be ensured from $\|x\|_2 \le 1$ by padding $\sqrt{1 - \|x\|_2^2}$. This assumption is for simplifying the presentation.

Concept class and target function $F^*(x)$. Consider target functions $F^* : \mathbb{R}^d \to \mathbb{R}^k$, $F^* = (f^*_1, \dots, f^*_k)$, of the form
$$f^*_r(x) = \sum_{i=1}^{p} a^*_{r,i}\, \phi_i(\langle w^*_{1,i}, x\rangle)\, \langle w^*_{2,i}, x\rangle \qquad (3.1)$$
where each $\phi_i : \mathbb{R} \to \mathbb{R}$ is infinite-order smooth and the weights satisfy $w^*_{1,i} \in \mathbb{R}^d$, $w^*_{2,i} \in \mathbb{R}^d$ and $a^*_{r,i} \in \mathbb{R}$. We assume for simplicity $\|w^*_{1,i}\|_2 = \|w^*_{2,i}\|_2 = 1$ and $|a^*_{r,i}| \le 1$.⁶ We denote by
$$C_\varepsilon(\phi, R) := \max_{j \in [p]}\{C_\varepsilon(\phi_j, R)\} \quad\text{and}\quad C_s(\phi, R) := \max_{j \in [p]}\{C_s(\phi_j, R)\}$$
the complexity of $F^*$ and assume they are bounded.
In the agnostic PAC-learning language, our concept class consists of all functions $F^*$ in the form of (3.1) with complexity bounded by a threshold $C$ and parameter $p$ bounded by a threshold $p_0$. Let $\mathsf{OPT} = \mathbb{E}[L(F^*(x), y)]$ be the population risk achieved by the best target function in this concept class. Then, our goal is to learn this concept class with population risk $\mathsf{OPT} + \varepsilon$ using sample and time complexity polynomial in $C$, $p_0$ and $1/\varepsilon$.
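To make the concept class concrete, here is a minimal numpy sketch of one target function of the form (3.1). The dimensions, the choice $\phi_i = \sin$, and the random weights are illustrative assumptions of ours, not values from the paper:

```python
import numpy as np

def make_target(d, p, k, rng):
    """Illustrative target network of the form (3.1):
        f*_r(x) = sum_{i=1}^p a*_{r,i} * phi_i(<w*_{1,i}, x>) * <w*_{2,i}, x>,
    with unit-norm rows ||w*_{1,i}||_2 = ||w*_{2,i}||_2 = 1 and |a*_{r,i}| <= 1."""
    W1 = rng.standard_normal((p, d))
    W1 /= np.linalg.norm(W1, axis=1, keepdims=True)   # ||w*_{1,i}||_2 = 1
    W2 = rng.standard_normal((p, d))
    W2 /= np.linalg.norm(W2, axis=1, keepdims=True)   # ||w*_{2,i}||_2 = 1
    A = rng.uniform(-1.0, 1.0, size=(k, p))           # |a*_{r,i}| <= 1
    phi = np.sin  # a smooth activation with C_s(phi, 1) = O(1), cf. Example 2.1
    def F_star(x):
        # shape (k,): one output per coordinate f*_r
        return A @ (phi(W1 @ x) * (W2 @ x))
    return F_star

rng = np.random.default_rng(0)
F_star = make_target(d=4, p=3, k=2, rng=rng)
x = rng.standard_normal(4)
x /= np.linalg.norm(x)   # inputs are assumed to satisfy ||x||_2 = 1
y = F_star(x)
```

Since $|\phi_i| \le 1$, $|\langle w^*_{2,i}, x\rangle| \le 1$ and $|a^*_{r,i}| \le 1$ here, each output coordinate is at most $p$ in absolute value.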
In the remainder of this paper, to simplify notations, we do not explicitly define this concept class parameterized by $C$ and $p$. Instead, we equivalently state our theorems with respect to any (unknown) target function $F^*$ with specific parameters $C$ and $p$ satisfying $\mathsf{OPT} = \mathbb{E}[L(F^*(x), y)]$. We assume $\mathsf{OPT} \in [0, 1]$ for simplicity.
Remark. Standard two-layer networks $f^*_r(x) = \sum_{i=1}^{p} a^*_{r,i}\phi(\langle w^*_{1,i}, x\rangle)$ are special cases of (3.1) (by setting $w^*_{2,i} = (0, \dots, 0, 1)$ and $\phi_i = \phi$). Our formulation (3.1) additionally captures combinations of correlations between non-linear and linear measurements of different directions of $x$.
Learner network $F(x; W)$. Using a data set $\mathcal{Z} = \{z_1, \dots, z_N\}$ of $N$ i.i.d. samples from $\mathcal{D}$, we train a network $F = (f_1, \cdots, f_k) : \mathbb{R}^d \to \mathbb{R}^k$ with
$$f_r(x) := \sum_{i=1}^{m} a_{r,i}\,\sigma(\langle w_i, x\rangle + b_i) = a_r^\top \sigma(Wx + b) \qquad (3.2)$$
where $\sigma$ is the ReLU activation, $W = (w_1, \dots, w_m) \in \mathbb{R}^{m \times d}$ is the hidden weight matrix, $b \in \mathbb{R}^m$ is the bias vector, and $a_r \in \mathbb{R}^m$ is the output weight vector. To simplify the analysis, we only update $W$ and keep $b$ and $a_r$ at their initialization values. For this reason, we write the learner network as $f_r(x; W)$ and $F(x; W)$. We sometimes use $b^{(0)} = b$ and $a_r^{(0)} = a_r$ to emphasize that they are randomly initialized. Our goal is to learn a weight matrix $W$ with population risk $\mathbb{E}\big[L(F(x; W), y)\big] \le \mathsf{OPT} + \varepsilon$.
Learning Process. Let $W^{(0)}$ denote the initial value of the hidden weight matrix, and let $W^{(0)} + W_t$ denote its value at time $t$. (Note that $W_t$ is the matrix of increments.) The weights are initialized with Gaussians and then $W$ is updated by vanilla SGD. More precisely,
• entries of $W^{(0)}$ and $b^{(0)}$ are i.i.d.
random Gaussians from $\mathcal{N}(0, 1/m)$,
• entries of each $a_r^{(0)}$ are i.i.d. random Gaussians from $\mathcal{N}(0, \varepsilon_a^2)$ for some fixed $\varepsilon_a \in (0, 1]$.⁷
At time $t$, SGD samples $z = (x, y) \sim \mathcal{Z}$ and updates $W_{t+1} = W_t - \eta \nabla L(F(x; W^{(0)} + W_t), y)$.

3.1 Main Theorem
For notational simplicity, with high probability (or w.h.p.) means with probability $1 - e^{-c \log^2 m}$ for a sufficiently large constant $c$, and $\tilde{O}$ hides factors of $\mathrm{polylog}(m)$.
Theorem 1 (two-layer). For every $\varepsilon \in \big(0, \frac{1}{pk\,C_s(\phi,1)}\big)$, there exist
$$M_0 = \mathrm{poly}(C_\varepsilon(\phi, 1), 1/\varepsilon) \quad\text{and}\quad N_0 = \mathrm{poly}(C_s(\phi, 1), 1/\varepsilon)$$
such that for every $m \ge M_0$ and every $N \ge \tilde{\Omega}(N_0)$, choosing $\varepsilon_a = \varepsilon/\tilde{\Theta}(1)$ for the initialization, and choosing learning rate $\eta = \tilde{\Theta}\big(\frac{1}{\varepsilon k m}\big)$ and $T = \tilde{\Theta}\big(\frac{(C_s(\phi,1))^2 \cdot k^3 p^2}{\varepsilon^2}\big)$, with high probability over the random initialization, SGD after $T$ iterations satisfies
$$\mathbb{E}_{\mathrm{sgd}}\Big[\tfrac{1}{T}\textstyle\sum_{t=0}^{T-1} \mathbb{E}_{(x,y)\sim\mathcal{D}}\, L(F(x; W^{(0)} + W_t), y)\Big] \le \mathsf{OPT} + \varepsilon.$$
⁶For general $\|w^*_{1,i}\|_2 \le B$, $\|w^*_{2,i}\|_2 \le B$, $|a^*_{r,i}| \le B$, the scaling factor $B$ can be absorbed into the activation function $\phi'(x) = \phi(Bx)$. Our results then hold by replacing the complexity of $\phi$ with that of $\phi'$.
⁷We shall choose $\varepsilon_a = \tilde{\Theta}(\varepsilon)$ in the proof for technical reasons. As we shall see in the three-layer case, if weight decay is used, one can relax this to $\varepsilon_a = 1$.
Example 3.1.
For functions such as $\phi(z) = e^z$, $\sin z$, $\mathrm{sigmoid}(z)$, $\tanh(z)$, or low-degree polynomials, using Example 2.1, our theorem indicates that for target networks with such activation functions, we can learn them using two-layer ReLU networks with
$$\text{size } m = \frac{\mathrm{poly}(k, p)}{\mathrm{poly}(\varepsilon)} \quad\text{and sample complexity}\quad \min\{N, T\} = \frac{\mathrm{poly}(k, p, \log m)}{\varepsilon^2}.$$
We note that the sample complexity $T$ is (almost) independent of $m$, the amount of overparameterization.

3.2 Our Interpretations
Overparameterization improves generalization. By increasing $m$, Theorem 1 supports more target functions with possibly larger size, more complex activations, and smaller population risk OPT. In other words, when $m$ is fixed, among the class of target functions whose complexities are captured by $m$, SGD can learn the best function approximator of the data, with the smallest population risk. This gives intuition for how overparameterization improves test error; see Figure 1(a).
Large-margin non-linear classifier. Theorem 1 is a non-linear analogue of the margin theory for linear classifiers. A target function with a small population risk (and of bounded norm) can be viewed as a "large-margin non-linear classifier." In this view, Theorem 1 shows that, assuming the existence of such a large-margin classifier, SGD finds a good solution with sample complexity mostly determined by the margin, instead of the dimension of the data.
Inductive bias. Recent works (e.g., [4, 32]) show that when the network is heavily overparameterized (that is, $m$ is polynomial in the number of training samples) and no two training samples are identical, then SGD can find a global optimum with 0 classification error (or find a solution with $\varepsilon$ training loss) in polynomial time. This does not come with generalization, since such a network can even fit random labels.
Our theorem, combined with [4], confirms the inductive bias of SGD for two-layer networks: when the labels are random, SGD finds a network that memorizes the training data; when the labels are (even only approximately) realizable by some target network, then SGD learns and generalizes. This gives an explanation for the well-known empirical observations of such inductive bias (e.g., [53]) in the two-layer setting, and is more general than Brutzkus et al. [14], in which the target network is only linear.

4 Result for Three-Layer Networks
Concept class and target function $F^*(x)$. This time we consider more powerful target functions $F^* = (f^*_1, \cdots, f^*_k)$ of the form
$$f^*_r(x) := \sum_{i \in [p_1]} a^*_{r,i}\, \Phi_i\Big(\sum_{j \in [p_2]} v^*_{1,i,j}\, \phi_{1,j}(\langle w^*_{1,j}, x\rangle)\Big) \Big(\sum_{j \in [p_2]} v^*_{2,i,j}\, \phi_{2,j}(\langle w^*_{2,j}, x\rangle)\Big) \qquad (4.1)$$
where each $\phi_{1,j}, \phi_{2,j}, \Phi_i : \mathbb{R} \to \mathbb{R}$ is infinite-order smooth, and the weights $w^*_{1,j}, w^*_{2,j} \in \mathbb{R}^d$, $v^*_{1,i}, v^*_{2,i} \in \mathbb{R}^{p_2}$ and $a^*_{r,i} \in \mathbb{R}$ satisfy $\|w^*_{1,j}\|_2 = \|w^*_{2,j}\|_2 = \|v^*_{1,i}\|_2 = \|v^*_{2,i}\|_2 = 1$ and $|a^*_{r,i}| \le 1$. Let
$$C_\varepsilon(\phi, R) = \max_{j \in [p_2],\, s \in \{1,2\}}\{C_\varepsilon(\phi_{s,j}, R)\}, \qquad C_\varepsilon(\Phi, R) = \max_{j \in [p_1]}\{C_\varepsilon(\Phi_j, R)\}$$
$$C_s(\phi, R) = \max_{j \in [p_2],\, s \in \{1,2\}}\{C_s(\phi_{s,j}, R)\}, \qquad C_s(\Phi, R) = \max_{j \in [p_1]}\{C_s(\Phi_j, R)\}$$
denote the complexity of the two layers, and assume they are bounded.
Our concept class contains measures of correlations between composite non-linear functions and non-linear functions of the input. There are plenty of functions in this new concept class that may not necessarily have
small-complexity representation in the previous formulation (3.1), and as we shall see in Figure 1(a), this is the critical advantage of using three-layer networks compared to two-layer ones or their NTKs. The learnability of this correlation is due to the non-convex interactions between hidden layers. As a comparison, [15] studies the regime where the changes in the hidden layers are negligible, and thus cannot show how to learn this concept class with a three-layer network.
Remark 4.1. Standard three-layer networks
$$f^*_r(x) = \sum_{i \in [p_1]} a^*_{r,i}\, \Phi_i\Big(\sum_{j \in [p_2]} v^*_{i,j}\, \phi_j(\langle w^*_j, x\rangle)\Big)$$
are only special cases of (4.1). Also, even in the special case of $\Phi_i(z) = z$, the target
$$f^*_r(x) = \sum_{i \in [p_1]} a^*_{r,i}\Big(\sum_{j \in [p_2]} v^*_{1,i,j}\, \phi_1(\langle w^*_{1,j}, x\rangle)\Big)\Big(\sum_{j \in [p_2]} v^*_{2,i,j}\, \phi_2(\langle w^*_{2,j}, x\rangle)\Big)$$
captures combinations of correlations of non-linear measurements in different directions of $x$.
Learner network $F(x; W, V)$. Our learners are three-layer networks $F = (f_1, \dots, f_k)$ with
$$f_r(x) = \sum_{i \in [m_2]} a_{r,i}\,\sigma(n_i(x) + b_{2,i}) \quad\text{where each}\quad n_i(x) = \sum_{j \in [m_1]} v_{i,j}\,\sigma(\langle w_j, x\rangle + b_{1,j}).$$
The first and second layers have $m_1$ and $m_2$ hidden neurons, respectively. Let $W \in \mathbb{R}^{m_1 \times d}$ and $V \in \mathbb{R}^{m_2 \times m_1}$ represent the weights of the first and second hidden layers, $b_1 \in \mathbb{R}^{m_1}$ and $b_2 \in \mathbb{R}^{m_2}$ the corresponding bias vectors, and $a_r \in \mathbb{R}^{m_2}$ the output weight vector.

4.1 Learning Process
Again for simplicity, we only update $W$ and $V$. The weights are randomly initialized as follows:
• entries of $W^{(0)}$ and $b_1 = b_1^{(0)}$ are i.i.d.
from $\mathcal{N}(0, 1/m_1)$,
• entries of $V^{(0)}$ and $b_2 = b_2^{(0)}$ are i.i.d. from $\mathcal{N}(0, 1/m_2)$,
• entries of each $a_r = a_r^{(0)}$ are i.i.d. from $\mathcal{N}(0, \varepsilon_a^2)$ for $\varepsilon_a = 1$.
As for the optimization algorithm, we use SGD with weight decay and an explicit regularizer.
For some $\lambda \in (0, 1]$, we will use $\lambda F(x; W, V)$ as the learner network, i.e., we linearly scale $F$ down by $\lambda$. This is equivalent to replacing $W, V$ with $\sqrt{\lambda}W, \sqrt{\lambda}V$, since a ReLU network is positively homogeneous. The SGD will start with $\lambda = 1$ and slowly decrease it, similar to weight decay.⁸
We also use an explicit regularizer, for some $\lambda_w, \lambda_v > 0$:⁹
$$R(\sqrt{\lambda}W, \sqrt{\lambda}V) := \lambda_v \|\sqrt{\lambda}V\|_F^2 + \lambda_w \|\sqrt{\lambda}W\|_{2,4}^4.$$
Now, in each round $t = 1, 2, \dots, T$, we use (noisy) SGD to minimize the following stochastic objective for some fixed $\lambda_{t-1}$:
$$L_2(\lambda_{t-1}; W', V') := L\Big(\lambda_{t-1} F\big(x;\, W^{(0)} + W^\rho + \Sigma W',\, V^{(0)} + V^\rho + V'\Sigma\big),\, y\Big) + R(\sqrt{\lambda_{t-1}}W', \sqrt{\lambda_{t-1}}V') \qquad (4.2)$$
Above, the objective is stochastic because (1) $z = (x, y) \sim \mathcal{Z}$ is a random sample from the training set, (2) $W^\rho$ and $V^\rho$ are two small random perturbation matrices with entries i.i.d. drawn from $\mathcal{N}(0, \sigma_w^2)$ and $\mathcal{N}(0, \sigma_v^2)$ respectively, and (3) $\Sigma \in \mathbb{R}^{m_1 \times m_1}$ is a random diagonal matrix with diagonal entries i.i.d. uniformly drawn from $\{+1, -1\}$. We note that the use of $W^\rho$ and $V^\rho$ is standard for Gaussian smoothing of the objective (and is not needed in practice).¹⁰ The use of $\Sigma$ may be reminiscent of the Dropout technique [46] in practice, which randomly masks out neurons, and can also be removed.¹¹
⁸We illustrate the technical necessity of adding weight decay.
During training, it is easy to add new information to the current network, but hard to forget "false" information that is already in the network. Such false information can accumulate from the randomness of SGD, non-convex landscapes, and so on. Thus, by scaling down the network we can effectively forget false information.
⁹This $\|\cdot\|_{2,4}$ norm on $W$ encourages the weights to be more evenly distributed across neurons. It can be replaced with $\|\sqrt{\lambda_{t-1}}W_{t-1}\|_{2,2+\alpha}^{2+\alpha}$ for any constant $\alpha > 0$ for our theoretical purpose. We choose $\alpha = 2$ for simplicity, and observe that in practice, weights are automatically spread out due to data randomness, so this explicit regularization may not be needed. See Section 7.1 for an experiment.
¹⁰Similar to the known non-convex literature [19] and smoothed analysis, we introduce the Gaussian perturbations $W^\rho$ and $V^\rho$ for theoretical purposes; they are not needed in practice. Also, we apply noisy SGD, which is vanilla SGD plus Gaussian perturbation; this again is needed in theory but believed unnecessary in practice [19].
¹¹In the full paper we study two variants of SGD. The present version is the "second variant," and the first variant $L_1(\lambda_{t-1}; W', V')$ is the same as (4.2) with $\Sigma$ removed. Due to a technical difficulty, the best sample complexity we can prove for $L_1$ is a bit higher.

Algorithm 1 SGD for three-layer networks (second variant (4.2))
Input: Data set $\mathcal{Z}$, initialization $W^{(0)}, V^{(0)}$, step size $\eta$, number of inner steps $T_w$, and $\sigma_w, \sigma_v, \lambda_w, \lambda_v$.
1: $W_0 = 0$, $V_0 = 0$, $\lambda_1 = 1$, $T = \Theta\big(\eta^{-1} \log \frac{\log(m_1 m_2)}{\varepsilon_0}\big)$.
2: for $t = 1, 2, \dots$
, T do
3:   Apply noisy SGD with step size η on the stochastic objective L_2(λ_{t−1}; W, V) for T_w steps, starting from W = W_{t−1}, V = V_{t−1}; suppose it reaches W_t, V_t.    ▷ see Lemma A.9
4:   λ_{t+1} = (1 − η)λ_t.    ▷ weight decay
5: end for
6: Randomly sample Σ̂ with diagonal entries i.i.d. uniform on {+1, −1}.
7: Randomly sample Θ̃(1/ε_0^2) many noise matrices {W^{ρ,j}, V^{ρ,j}}. Let
     j* = arg min_j { E_{z∈Z} L( λ_T F(x; W(0) + W^{ρ,j} + Σ̂ W_T, V(0) + V^{ρ,j} + V_T Σ̂) ) }.
8: Output W_T^{(out)} = W(0) + W^{ρ,j*} + Σ̂ W_T,  V_T^{(out)} = V(0) + V^{ρ,j*} + V_T Σ̂.

Algorithm 1 presents the details. Specifically, in each round t, Algorithm 1 starts with weight matrices W_{t−1}, V_{t−1} and performs T_w iterations. In each iteration it moves in the negative direction of the stochastic gradient ∇_{W′,V′} L_2(λ_t; W′, V′). Let the final matrices be W_t, V_t. At the end of round t, Algorithm 1 performs weight decay by setting λ_t = (1 − η)λ_{t−1} for some η > 0.

4.2 Main Theorems
For notational simplicity, "with high probability" (or w.h.p.) means with probability 1 − e^{−c log^2(m1 m2)}, and Õ hides factors of polylog(m1, m2).

Theorem 2 (three-layer, second variant). Consider Algorithm 1.
For every constant γ ∈ (0, 1/4], every ε_0 ∈ (0, 1/100], and every

    ε = ε_0 / ( k p_1 p_2^2 · C_s(Φ, √p_2 · C_s(φ, 1)) · C_s(φ, 1)^2 ),

there exists

    M = poly( C_ε(Φ, √p_2 · C_ε(φ, 1)), 1/ε )

such that for every m_2 = m_1 = m ≥ M, and properly set λ_w, λ_v, σ_w, σ_v as in Table 1, as long as

    N ≥ Ω̃( ( C_ε(Φ, √p_2 · C_ε(φ, 1)) · C_ε(φ, 1) · √p_2 · p_1 k^2 / ε_0 )^2 ),

there is a choice η = 1/poly(m_1, m_2) and T = poly(m_1, m_2) such that with probability ≥ 99/100,

    E_{(x,y)∼D} L( λ_T F(x; W_T^{(out)}, V_T^{(out)}), y ) ≤ (1 + γ)·OPT + ε_0.

4.3 Our Contributions
Our sample complexity N scales polynomially with the complexity of the target network, and is (almost) independent of m, the amount of overparameterization. This itself can be quite surprising, because recent results on neural network generalization [6, 11, 24, 39] require N to be polynomial in m. Furthermore, Theorem 2 shows that three-layer networks can efficiently learn a bigger concept class (4.1) compared with what we know for two-layer networks (3.1).
From a practical standpoint, one can construct target functions of the form (4.1) that cannot be (efficiently) approximated by any two-layer target function in (3.1). If data is generated according to such functions, then it may be necessary to use three-layer networks as learners (see Figure 1).
From a theoretical standpoint, even in the special case of Φ(z) = z, our target function can capture correlations between non-linear measurements of the data (recall Remark 4.1). This means C_ε(Φ, C_ε(φ, 1)·√p_2) ≈ O(C_ε(φ, 1)·√p_2), so learning it is essentially of the same complexity as learning each φ_{s,j}.
For example, a three-layer network can learn cos(100·⟨w_1*, x⟩) · e^{100·⟨w_2*, x⟩} up to accuracy ε in complexity poly(1/ε), while it is unclear how to do so using two-layer networks.
Technical Contributions. We highlight some technical contributions in the proof of Theorem 2.
In recent results on the training convergence of neural networks with more than two layers [3, 4], the optimization process stays in a close neighborhood of the initialization, so that with heavy overparameterization the network becomes "linearized" and the interactions across layers are negligible. In our three-layer case, this means that the matrix W never interacts with V. They then argue that SGD simulates a neural tangent kernel, so the learning process is almost convex [27]. In our analysis, we directly tackle the non-convex interactions between W and V by studying a "quadratic approximation" of the network. (See Remark 6.1 for a mathematical comparison.) We develop new proof techniques that could be useful in future theoretical applications.
Also, for the results [3, 4] and our two-layer Theorem 1 to hold, it suffices to analyze a regime where the "sign pattern" of the ReLUs can be replaced with that of the random initialization. (Recall σ(x) = I_{x≥0}·x, and we call I_{x≥0} the "sign pattern.") In our three-layer analysis, the optimization process has moved sufficiently away from the initialization, so that the sign pattern changes can significantly affect the output. This brings in an additional technical challenge, because we have to tackle the non-convex interactions between W and V together with changing sign patterns.12
Comparison to Daniely [15]. Daniely [15] studies the learnability of multi-layer networks when (essentially) only the output layer is trained, which reduces to a convex task.
He shows that multi-layer networks can learn a compositional kernel space, which implies that two/three-layer networks can efficiently learn low-degree polynomials. He did not derive general sample/time complexity bounds for more complex functions such as those in our concept classes (3.1) and (4.1), but showed that they are finite.
In contrast, our learnability result for concept class (4.1) is due to the non-convex interaction between hidden layers. Since Daniely [15] studies the regime where the changes in the hidden layers are negligible, to the best of our knowledge, if three-layer networks are used, their theorem cannot lead to sample complexity bounds similar to Theorem 2 by only training the last layer of a three-layer network. Empirically, one can also observe that training the hidden layers is better than training only the last layer (see Figure 1).

5 Conclusion and Discussion
We show that by training the hidden layers of two-layer (resp. three-layer) overparameterized neural networks, one can efficiently learn some important concept classes, including two-layer (resp. three-layer) networks equipped with smooth activation functions. Our result is in the agnostic PAC-learning language and is thus distribution-free. We believe our work opens up a new direction in both the algorithmic and generalization perspectives of overparameterized neural networks, and pushing forward can possibly lead to more understanding of deep learning.
Our results apply to other, more structured neural networks. As a concrete example, consider convolutional neural networks (CNNs). Suppose the input is a two-dimensional matrix x ∈ R^{d×s}, which can be viewed as d-dimensional vectors in s channels; a convolutional layer on top of x is then defined as follows. There are d′ fixed subsets {S_1, S_2, . . . , S_{d′}} of [d], each of size k′.
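To make this layer concrete, here is a minimal NumPy sketch of such a convolutional layer: it takes the fixed patches S_i, the m channel weights w_j ∈ R^{k′×s}, and an activation φ, and produces the d′ × m output whose (i, j)-th entry is φ(⟨w_j, x_{S_i}⟩). The function name, the sliding-window patch choice in the usage note, and the default φ = tanh are illustrative assumptions, not part of the paper's construction.

```python
import numpy as np

def conv_layer(x, patches, W, phi=np.tanh):
    """Convolutional layer sketch.

    x       : (d, s) input, i.e., d-dimensional data in s channels.
    patches : list of d' index arrays S_i (subsets of [d], each of size k').
    W       : (m, k', s) array; W[j] is the weight matrix w_j of channel j.
    phi     : activation function (tanh is an illustrative smooth choice).

    Returns the (d', m) output whose (i, j)-th entry is phi(<w_j, x_{S_i}>),
    where x_{S_i} is the (k', s) submatrix of x with rows indexed by S_i.
    """
    out = np.empty((len(patches), W.shape[0]))
    for i, S in enumerate(patches):
        x_S = x[S]                               # (k', s) submatrix of x
        for j in range(W.shape[0]):
            out[i, j] = phi(np.sum(W[j] * x_S))  # inner product <w_j, x_{S_i}>
    return out
```

For instance, sliding windows correspond to S_i = {i, i+1, . . . , i+k′−1}; overparameterizing the learner then simply means increasing the number of channels m.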
The output of the convolutional layer is a matrix of size d′ × m, whose (i, j)-th entry is φ(⟨w_j, x_{S_i}⟩), where x_{S_i} ∈ R^{k′×s} is the submatrix of x with rows indexed by S_i; w_j ∈ R^{k′×s} is the weight matrix of the j-th channel; and φ is the activation function. Overparameterization then means a larger number of channels m in our learner network compared with the target. Our analysis can be adapted to show a similar result for this type of network.
One can also combine this paper with properties of recurrent neural networks (RNNs) [3] to derive PAC-learning results for RNNs [2], or use the existential tools of this paper to derive PAC-learning results for three-layer residual networks (ResNets) [1]. The latter gives a provable separation between neural networks and kernels in the efficient PAC-learning regime.

Acknowledgements
This work was supported in part by FA9550-18-1-0166. Y. Liang would also like to acknowledge that support for this research was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation.

12 For instance, the number of sign changes can be m^{0.999} for the second hidden layer (see Lemma 6.5). In this regime, the network output can be affected by m^{0.499}, since each neuron has value roughly m^{−1/2}. Therefore, if after training we replaced the sign pattern with that of the random initialization, the output would be meaningless.

References
[1] Zeyuan Allen-Zhu and Yuanzhi Li. What Can ResNet Learn Efficiently, Going Beyond Kernels? In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1905.10337.

[2] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD Learn Recurrent Neural Networks with Provable Generalization? In NeurIPS, 2019.
Full version available at http://arxiv.org/abs/1902.01028.

[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1810.12065.

[4] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019. Full version available at http://arxiv.org/abs/1811.03962.

[5] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In International Conference on Machine Learning, pages 1908–1916, 2014.

[6] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

[7] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

[8] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

[9] Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. arXiv preprint arXiv:1811.01885, 2018.

[10] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[11] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.

[12] Digvijay Boob and Guanghui Lan. Theoretical properties of the global optimizer of two layer neural network.
arXiv preprint arXiv:1710.11241, 2017.

[13] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

[14] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations, 2018.

[15] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.

[16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems (NIPS), pages 2253–2261, 2016.

[17] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[18] Ronen Eldan, Dan Mikulincer, and Alex Zhai. The CLT in high dimensions: quantitative bounds via martingale embedding. arXiv preprint arXiv:1806.09087, 2018.

[19] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: Online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

[20] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

[21] Rong Ge, Rohith Kuditipudi, Zhize Li, and Xiang Wang. Learning two-layer neural networks with symmetric inputs. In International Conference on Learning Representations, 2019.

[22] Surbhi Goel and Adam Klivans. Learning neural networks with two nonlinear layers in polynomial time. arXiv preprint arXiv:1709.06010v4, 2018.

[23] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler.
Reliably learning the ReLU in polynomial time. In Conference on Learning Theory, pages 1004–1042, 2017.

[24] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Proceedings of the Conference on Learning Theory, 2018.

[25] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.

[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[27] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[28] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[29] Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? arXiv preprint arXiv:1802.06175, 2018.

[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[31] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

[32] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, 2018.

[33] Yuanzhi Li and Yang Yuan.
Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.

[34] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.

[35] Percy Liang. CS229T/STAT231: Statistical Learning Theory (Winter 2016). https://web.stanford.edu/class/cs229t/notes.pdf, April 2016. Accessed January 2019.

[36] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.

[37] Martin J. Wainwright. Basic tail and concentration bounds. https://www.stat.berkeley.edu/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf, 2015. Online; accessed October 2018.

[38] Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.

[39] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

[40] Ohad Shamir. Distribution-specific hardness of learning neural networks. Journal of Machine Learning Research, 19(32), 2018.

[41] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[42] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

[43] Daniel Soudry and Yair Carmon.
No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[44] Daniel Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 296–305. ACM, 2001.

[45] Karthik Sridharan. Machine Learning Theory (CS 6783). http://www.cs.cornell.edu/courses/cs6783/2014fa/lec7.pdf, 2014. Accessed January 2019.

[46] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[47] Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.

[48] Leslie Valiant. A theory of the learnable. Communications of the ACM, 1984.

[49] Santosh Vempala and John Wilmes. Polynomial convergence of gradient descent for training one-hidden-layer neural networks. arXiv preprint arXiv:1805.02677, 2018.

[50] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.

[51] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.

[52] Alex Zhai. A high-dimensional CLT in W2 distance with near optimal convergence rate. Probability Theory and Related Fields, 170(3-4):821–845, 2018.

[53] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
arXiv 1611.03530.

[54] Yuchen Zhang, Jason D Lee, and Michael I Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001, 2016.

[55] Yuchen Zhang, Jason Lee, Martin Wainwright, and Michael Jordan. On the learnability of fully-connected neural networks. In Artificial Intelligence and Statistics, pages 83–91, 2017.

[56] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.