{"title": "What Can ResNet Learn Efficiently, Going Beyond Kernels?", "book": "Advances in Neural Information Processing Systems", "page_first": 9017, "page_last": 9028, "abstract": "How can neural networks such as ResNet \\emph{efficiently} learn CIFAR-10 with test accuracy more than $96 \\%$, while other methods, especially kernel methods, fall relatively behind? Can we more provide theoretical justifications for this gap?\n\nRecently, there is an influential line of work relating neural networks to kernels in the over-parameterized regime, proving they can learn certain concept class that is also learnable by kernels with similar test error. Yet, can neural networks provably learn some concept class \\emph{better} than kernels?\n\n\nWe answer this positively in the distribution-free setting. We prove neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption.\nAt the same time, we prove there are simple functions in this class such that with the same number of training examples, the test error obtained by neural networks can be \\emph{much smaller} than \\emph{any} kernel method, including neural tangent kernels (NTK).\n\nThe main intuition is that \\emph{multi-layer} neural networks can implicitly perform hierarchal learning using different layers, which reduces the sample complexity comparing to ``one-shot'' learning algorithms such as kernel methods.\n\nIn the end, we also prove a computation complexity advantage of ResNet with respect to other learning methods including linear regression over arbitrary feature mappings.", "full_text": "What Can ResNet Learn Ef\ufb01ciently,\n\nGoing Beyond Kernels?\u2217\n\nZeyuan Allen-Zhu\nMicrosoft Research AI\n\nzeyuan@csail.mit.edu\n\nYuanzhi Li\n\nCarnegie Mellon University\nyuanzhil@andrew.cmu.edu\n\nAbstract\n\nHow can neural networks such as ResNet ef\ufb01ciently learn CIFAR-10 with 
test accuracy more than 96%, while other methods, especially kernel methods, fall relatively behind? Can we provide more theoretical justification for this gap?
Recently, there has been an influential line of work relating neural networks to kernels in the over-parameterized regime, proving that they can learn certain concept classes that are also learnable by kernels with similar test error. Yet, can neural networks provably learn some concept class better than kernels?
We answer this positively in the distribution-free setting. We prove that neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption. At the same time, we prove that there are simple functions in this class such that with the same number of training examples, the test error obtained by neural networks can be much smaller than that of any kernel method, including neural tangent kernels (NTK).
The main intuition is that multi-layer neural networks can implicitly perform hierarchical learning using different layers, which reduces the sample complexity compared to "one-shot" learning algorithms such as kernel methods. In a follow-up work [2], this theory of hierarchical learning is further strengthened to incorporate the "backward feature correction" process when training deep networks.
Finally, we also prove a computational complexity advantage of ResNet with respect to other learning methods, including linear regression over arbitrary feature mappings.

1 Introduction

Neural network learning has become a key practical machine learning approach and has achieved remarkable success in a wide range of real-world domains, such as computer vision, speech recognition, and game playing [18, 19, 22, 31]. 
On the other hand, from a theoretical standpoint, it is less well understood how large-scale, non-convex, non-smooth neural networks can be optimized efficiently over the training data and generalize to the test data with relatively few training examples. There has been a sequence of research works trying to address this question, showing that under certain conditions neural networks can be learned efficiently [8-10, 15, 16, 21, 24, 25, 32-35, 37, 40]. These provable guarantees typically come with strong assumptions, and the proofs rely heavily on them. One common assumption among them is on the input distribution, which is usually taken to be random Gaussian or sufficiently close to Gaussian. While providing great insights into the optimization side of neural networks, it is not clear whether these works, which emphasize Gaussian inputs, reflect the neural network learning process in practice. Indeed, in nearly all real-world data to which deep learning is applied, the input distributions are not close to Gaussian; even worse, there may be no simple model that captures such distributions.

*Full version and future updates can be found at https://arxiv.org/abs/1905.10337.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The difficulty of modeling real-world distributions brings us back to the traditional PAC-learning language, which is distribution-free. In this language, one of the most popular provable learning methods is kernel methods, defined with respect to kernel functions K(x, x') over pairs of data (x, x'). 
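To make the kernel-method setup concrete, here is a minimal sketch of ℓ_2 kernel regression in the form just described (an illustration only, not the construction analyzed in this paper; the Gaussian kernel, bandwidth h, ridge parameter lam, and toy target are arbitrary placeholder choices):

```python
import numpy as np

def gaussian_kernel(X1, X2, h=1.0):
    # K(x, x') = exp(-||x - x'||_2^2 / h), evaluated for all pairs of rows
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / h)

def kernel_ridge_fit(X, y, lam=1e-3, h=1.0):
    # The regression objective is convex in w; with an l2 regularizer the
    # optimum has the closed form w = (K + lam * I)^{-1} y.
    N = X.shape[0]
    K = gaussian_kernel(X, X, h)
    return np.linalg.solve(K + lam * np.eye(N), y)

def kernel_ridge_predict(X_train, w, X_new, h=1.0):
    # The learned function is a weighted sum of kernel evaluations
    # against the N training points -- "one-shot" learning.
    return gaussian_kernel(X_new, X_train, h) @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0])                      # a toy target for illustration
w = kernel_ridge_fit(X, y)
pred = kernel_ridge_predict(X, w, X)     # near-interpolates the training set
```

Because the objective is convex, these weights are the global optimum of the regularized training problem, which is exactly why convergence rates and generalization bounds for kernel methods are classical.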
The optimization task associated with kernel methods is convex; hence the convergence rate and the generalization error bound are well-established in theory.
Recently, there has been a line of work studying the convergence of neural networks in the PAC-learning language, especially for over-parameterized neural networks [1, 3-7, 12-14, 20, 23, 41], putting neural network theory back into the distribution-free setting. Most of these works rely on the so-called Neural Tangent Kernel (NTK) technique [12, 20], relating the training process of sufficiently over-parameterized (or even infinite-width) neural networks to the learning process over a kernel whose features are defined by the randomly initialized weights of the neural network. In other words, on the same training data set, these works prove that neural networks can efficiently learn a concept class with as good generalization as kernels, but nothing more is known.2
In contrast, in many practical tasks, neural networks give much better generalization error compared to kernels, although both methods can achieve zero training error. For example, ResNet achieves 96% test accuracy on the CIFAR-10 data set, but NTKs achieve 77% [6] and random feature kernels achieve 85% [29]. This gap becomes larger on more complicated data sets.
To separate the generalization power of neural networks from kernel methods, the recent work [36] tries to identify conditions under which the solutions found by neural networks provably generalize better than kernels. This approach assumes that the optimization converges to minimal complexity solutions (i.e., the ones minimizing the value of the regularizer, usually the sum of squared Frobenius norms of the weight matrices) of the training objective. 
However, for most practical applications, it is unclear how, when training neural networks, minimal complexity solutions can be found efficiently by local search algorithms such as stochastic gradient descent. In fact, this is not the case even for rather simple problems (see Figure 1).3 Towards this end, the following fundamental question is largely unsolved:

Can neural networks efficiently and distribution-freely learn a concept class, with better generalization than kernel methods?

Figure 1: d = 40, N = 5000; after exhaustive search in network size, learning rate, and weight decay, randomly initialized SGD still cannot find solutions with Frobenius norm comparable to what we construct by hand. Details and more experiments in Section 9.2. [Plot: error against Frobenius norm (0 to 400), showing the solution we construct by hand (train/test error) versus the best train error and best test error found by SGD.]

In this paper, we give arguably the first positive answer to this question for neural networks with ReLU activations. We show that, without any distributional assumption, a three-layer residual network (ResNet) can (improperly) learn a concept class that includes three-layer ResNets of smaller size and smooth activations. This learning process can be done efficiently by stochastic gradient descent (SGD), and the generalization error is also small if polynomially many training examples are given.
More importantly, we give a provable separation between the generalization error obtained by neural networks and kernel methods. For some δ ∈ (0, 1), with N = O(δ^{-2}) training samples, we prove that neural networks can efficiently achieve generalization error δ for this concept class over any distribution; in contrast, there exist rather simple distributions such that any kernel method (including NTK, recursive kernels, etc.) cannot have generalization error better than √δ for this class. To the best of our knowledge, this is the first work that gives a provable, efficiently achievable separation between neural networks with ReLU activations and kernels in the distribution-free setting. Finally, we also prove a computational complexity advantage of neural networks with respect to linear regression over arbitrary feature mappings.

2 Technically speaking, the three-layer learning theorem of [3] is beyond NTK, because the learned weights across different layers interact with each other, while in NTK the learned weights of each layer only interact with the random weights of other layers. However, there exist other kernels, such as recursive kernels [39], that can more or less efficiently learn the same concept class proposed in [3].
3 Consider the class of degree-6 polynomials over 6 coordinates of the d-dimensional input. There exist two-layer networks with F-norm O(√d) implementing this function (thus having near-zero training and test error). By Rademacher complexity, O(d) samples suffice to learn if we are able to find a minimal complexity solution. Unfortunately, due to the non-convexity of the optimization landscape, two-layer networks cannot be trained to match this F-norm even with O(d^2) samples; see Figure 1.

Roadmap. 
We present a detailed overview of our positive and negative results in Sections 2 and 3. Then, we introduce notation in Section 4, formally define our concept class in Section 5, and give proof overviews in Sections 6 and 8.

2 Positive Result: The Learnability of Three-Layer ResNet
In this paper, we consider learner networks that are single-skip three-layer ResNets with ReLU activation, defined as a function out : R^d → R^k:

out(x) = A(σ(Wx + b_1) + σ(Uσ(Wx + b_1) + b_2))    (2.1)

Here, σ is the ReLU function, W ∈ R^{m×d} and U ∈ R^{m×m} are the hidden weights, A ∈ R^{k×m} is the output weight, and b_1, b_2 ∈ R^m are two bias vectors. We wish to learn a concept class given by target functions that can be written as

H(x) = F(x) + αG(F(x))    (2.2)

where α ∈ [0, 1) and G : R^k → R^k, F : R^d → R^k are two functions that can be written as two-layer networks with smooth activations (see Section 5 for the formal definition). Intuitively, the target function is a mixture of two parts: the base signal F, which is simpler and contributes more to the target, and the composite signal G(F), which is more complicated but contributes less. As an analogy, F could capture the signal by which "85%" of the examples in CIFAR-10 can be learned by kernel methods, and G(F) could capture the additional "11%" of examples that are more complicated.
The goal is to use the three-layer ResNet (2.1) to improperly learn this concept class (2.2), meaning learning "both" the base and composite signals with as few samples as possible. In this paper, we consider a simple ℓ_2 regression task where the features x ∈ R^d and labels y ∈ R^k are sampled from some unknown distribution D. 
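As a concrete reference, the learner network (2.1) can be transcribed directly (a minimal numpy sketch; the sizes d = 10, m = 64, k = 3 and the Gaussian weight scales are placeholder choices, not the initialization analyzed in Section 6):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def resnet_out(x, W, b1, U, b2, A):
    # out(x) = A( sigma(Wx + b1) + sigma(U sigma(Wx + b1) + b2) )   -- eq (2.1)
    h1 = relu(W @ x + b1)        # first hidden layer
    h2 = relu(U @ h1 + b2)       # second hidden layer, fed by h1
    return A @ (h1 + h2)         # the single skip connection adds h1 back

d, m, k = 10, 64, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(m, d)) / np.sqrt(d)    # hidden weights
U = rng.normal(size=(m, m)) / np.sqrt(m)
A = rng.normal(size=(k, m)) / np.sqrt(m)    # output weight
b1, b2 = np.zeros(m), np.zeros(m)
x = rng.normal(size=d)
x /= np.linalg.norm(x)                      # inputs are assumed unit-norm
out = resnet_out(x, W, b1, U, b2, A)        # a vector in R^k
```

Setting U = 0 (with b_2 = 0) recovers the plain two-layer network Aσ(Wx + b_1); the skip connection is what lets the second hidden layer act as a correction on top of the first.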
Thus, given a network out(x), the population risk is

E_{(x,y)∼D} (1/2)‖out(x) − y‖_2^2.

To illustrate our result, we first assume for simplicity that y = H(x) for some H of the form (2.2) (so the optimal target has zero regression error). Our main theorem can be sketched as follows. Let C_F and C_G respectively be the individual "complexity" of F and G, which at a high level capture the size and smoothness of F and G. This complexity notion shall be formally introduced in Section 4, and is used by prior works such as [3, 7, 39].

Theorem 1 (ResNet, sketched). For any distribution over x, for every δ ∈ ((αC_G)^4, 1), with probability at least 0.99, SGD efficiently learns a network out(x) in the form (2.1) satisfying

E_{(x,y)∼D} (1/2)‖out(x) − y‖_2^2 ≤ δ    using N = Õ(C_F^2/δ^2) samples.

The running time of SGD is poly(C_G, C_F, α^{−1}).

In other words, ResNet is capable of achieving population risk α^4, or equivalently of learning the output H(x) up to α^2 error. In our full theorem, we also allow the label y to be generated from H(x) with error; thus our result also holds in the agnostic learning framework.

2.1 Our Contributions
Our main contribution is to obtain time and sample complexity depending on C_F and C_G, without any dependency on the composed function G(F) as in prior work [3, 39]. We illustrate this crucial difference with an example. Suppose x ∼ N(0, I/d), k = 2 and F : R^d → R^2 consists of two linear functions: F(x) = (⟨w*_1, x⟩, ⟨w*_2, x⟩) with ‖w*_1‖_2 = ‖w*_2‖_2 = √d, and G is a degree-10 polynomial with constant coefficients. As we shall see, C_F = O(√d) and C_G = Õ(1). Theorem 1 implies:
• we need Õ(d) samples to efficiently learn H = F + αG(F) up to accuracy Õ(α^2).
In contrast, the complexity of G(F) is Õ((√d)^{10}), so
• prior works [3, 39] need Ω̃(d^{10}) samples to efficiently learn H up to any accuracy o(α), even if G(x) is of some simple form such as ⟨w*_1, x⟩^{10} − ⟨w*_2, x⟩^{10} (see Footnote 4).

[Figure: diagram of the learner network, with input x passing through weights W, U, A and ReLU activations.]

Inductive Bias. Our network is over-parameterized; thus, intuitively, in the example above, with only O(d) training examples the learner network could over-fit to the training data, since it has to decide from a set of d^{10} many possible coefficients to learn the degree-10 polynomial G. This is indeed the case if we learn the target function using kernels, or possibly even learn it with a two-layer network. However, the three-layer ResNet poses a completely different inductive bias, and manages to avoid over-fitting to G(F) with the help of F.
Implicit Hierarchical Learning. Since H(x) = F(x) + αG(F(x)), if we only learn F but not G(F), we will have regression error ≈ (αC_G)^2. Thus, to get to regression error (αC_G)^4, Theorem 1 shows that ResNet is also capable of learning G(F) up to some good accuracy with relatively few training examples. This is also observed in practice, where with this number of training examples, three-layer fully-connected networks and kernel methods can indeed fail to learn G(F) up to any non-trivial accuracy; see Figure 2.
Intuitively, there is a hierarchy in the learning process: we would like to first learn F, and then we can learn G(F) much more easily with the help of F using the residual link. In our learner network (2.1), the first hidden layer serves to learn F and the second hidden layer serves to learn G with the help of F, which reduces the sample complexity. 
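The separating example above can be instantiated concretely as follows (a hypothetical toy instance for illustration; d = 40, α = 0.1, and the particular degree-10 polynomial are arbitrary choices consistent with the setup x ∼ N(0, I/d) and ‖w*_1‖_2 = ‖w*_2‖_2 = √d):

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 40, 0.1
w1 = rng.normal(size=d); w1 *= np.sqrt(d) / np.linalg.norm(w1)  # ||w1||_2 = sqrt(d)
w2 = rng.normal(size=d); w2 *= np.sqrt(d) / np.linalg.norm(w2)  # ||w2||_2 = sqrt(d)

def F(x):
    # base signal: two linear functions; each output is ~N(0,1) when x ~ N(0, I/d)
    return np.array([w1 @ x, w2 @ x])

def G(h):
    # composite signal: a simple degree-10 polynomial of the base signal,
    # e.g. <w1*, x>^10 - <w2*, x>^10 in the first output coordinate
    return np.array([h[0] ** 10 - h[1] ** 10, 0.0])

def H(x):
    # target H(x) = F(x) + alpha * G(F(x))   -- eq (2.2)
    return F(x) + alpha * G(F(x))

X = rng.normal(size=(1000, d)) / np.sqrt(d)   # x ~ N(0, I/d)
Y = np.stack([H(x) for x in X])               # labels y = H(x), no label noise
```

A learner that recovers F first only has to fit the degree-10 part over the 2-dimensional output of F, instead of over the roughly d^10 coefficients of G(F) viewed as a polynomial in x.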
However, the important message is that F and G are not given as separate data to the network; rather, the learning algorithm has to disentangle them from the "combined" function H = F + αG(F) automatically during the training process. Moreover, since we train both layers simultaneously, the learning algorithm also has to distribute the learning tasks of F and G onto different layers automatically.
We also emphasize that our result cannot be obtained by layer-wise training of the ResNet, that is, first training the hidden layer close to the input, and then training the hidden layer close to the output. It could be the case that the first layer incurs some α error (since it cannot learn G(F) directly); then it could be really hard, or perhaps impossible, for the second layer to fix it using only inputs of the form F(x) ± α. In other words, it is crucial that the two hidden layers are trained simultaneously.5
A follow-up work. In a follow-up work [2], this theory of hierarchical learning is significantly strengthened to further incorporate "backward feature correction" when training deep neural networks. In the language of this paper, when the two layers are trained together, given enough samples, the accuracy of the first layer can actually be improved from F ± α to arbitrarily close to F during the training process. As a consequence, the final training and generalization errors can be arbitrarily small as well, as opposed to α^4 in this work. The new "backward feature correction" is also critical for extending the hierarchical learning process from 3 layers to an arbitrary number of layers.

3 Negative Results
3.1 Limitation of Kernel Methods
Given (Mercer) kernels K_1, . . . 
, K_k : R^{d×d} → R and training examples {(x^{(i)}, y^{(i)})}_{i∈[N]} from D, a kernel method tries to learn a function K : R^d → R^k where each

K_j(x) = Σ_{n∈[N]} K_j(x, x^{(n)}) · w_{j,n}    (3.1)

is parameterized by a weight vector w_j ∈ R^N. Usually, for the ℓ_2 regression task, a kernel method finds the optimal weights w_1, . . . , w_k ∈ R^N by solving the following convex minimization problem

(1/N) Σ_{i=1}^{N} Σ_{j∈[k]} (Σ_{n∈[N]} K_j(x^{(i)}, x^{(n)}) w_{j,n} − y_j^{(i)})^2 + R(w)    (3.2)

for some convex regularizer R(w).6 In this paper, however, we do not make assumptions about how K(x) is found as the optimal solution of the training objective. Instead, we focus on any kernel regression function that can be written in the form (3.1).
Most of the widely-used kernels are Mercer kernels.7 This includes: (1) the Gaussian kernel K(x, y) = e^{−‖x−y‖_2^2/h}; (2) the arcsin kernel K(x, y) = arcsin(⟨x, y⟩/(‖x‖_2 ‖y‖_2)); (3) the recursive kernel with any recursive function [39]; (4) the random feature kernel K(x, y) = E_{w∼W} φ_w(x) φ_w(y) for any function φ_w(·) and distribution W; (5) the conjugate kernel defined by the last hidden layer of randomly initialized neural networks [11]; and (6) the neural tangent kernels (NTK) for fully-connected networks [20], convolutional networks [6, 38], or more generally for any architecture [38].
Our theorem can be sketched as follows:

Theorem 3 (kernel, sketched). For every constant k ≥ 2 and every sufficiently large d ≥ 2, there exist concept classes consisting of functions H(x) = F(x) + αG(F(x)) with complexities C_F, C_G and α ∈ (0, 1/C_G) such that, letting

N_res be the sample complexity from Theorem 1 to achieve α^{3.9} population risk,

there exist simple distributions D over (x, H(x)) such that, for at least 99% of the functions H in this concept class, even given N = O((N_res)^{k/2}) training samples from D, any function K(x) of the form (3.1) has to suffer population risk

E_{(x,y)∼D} (1/2)‖K(x) − y‖_2^2 > α^2

even if the label y = H(x) has no error.

4 Of course, if one knew a priori the form H(x) = ⟨w*_1, x⟩^{10} − ⟨w*_2, x⟩^{10}, one could also try to solve it directly by minimizing the objective (⟨w*_1, x⟩^{10} − ⟨w*_2, x⟩^{10} + ⟨w_2, x⟩^{10} − ⟨w_1, x⟩^{10})^2 over w_1, w_2 ∈ R^d. Unfortunately, the underlying optimization process is highly non-convex and it remains unclear how to minimize it efficiently. Using matrix sensing [25], one can efficiently learn such H(x) with sample complexity Õ(d^5).
5 However, this does not mean that the error of the first layer can be reduced on its own, since it is still possible for the first layer to learn F + αR(x) ± α^2 and the second layer to learn G(F)(x) − R(x) ± α, for an arbitrary (bounded) function R.

Contribution and Intuition. Let us compare this to Theorem 1. 
While both algorithms are efficient, neural networks (trained by SGD) achieve population risk α^{3.9} using N_res samples for any distribution over x, while kernel methods cannot achieve any population risk better than α^2 for some simple distributions even with N = (N_res)^{k/2} ≫ N_res samples.8 Our two theorems together give a provable separation between the generalization error of the solutions found by neural networks and by kernel methods, in the efficiently computable regime.
More specifically, recall that C_F and C_G only depend on the individual complexities of F and G, but not on G(F). In Theorem 3, we will construct F as linear functions and G as degree-k polynomials. This ensures C_F = O(√d) and C_G = O(1) for constant k, but the combined complexity of G(F) is as high as Ω(d^{k/2}). Since ResNet can perform hierarchical learning, it only needs sample complexity N_res = O(d/α^8) instead of paying (the square of) the combined complexity Ω(d^k).
In contrast, a kernel method is not hierarchical: rather than discovering F first and then learning G(F) with its guidance, a kernel method tries to learn everything in one shot. This unavoidably requires the sample complexity to be at least Ω(d^k). Intuitively, as the kernel method tries to learn G(F) from scratch, it has to take into account all Ω(d^k) many possible choices of G(F) (recall that G is a degree-k polynomial over dimension d). On the other hand, a kernel method with N samples only has N degrees of freedom (for each output dimension). This means that if N ≤ o(d^k), a kernel method simply does not have enough degrees of freedom to distinguish between different G(F), and so has to pay Ω(α^2) in population risk. Choosing for instance α = d^{−0.1}, we have
Choosing for instance \u03b1 = d\u22120.1, we have\n\nthe desired negative result for all N \u2264 O(cid:0)(Nres)k/2(cid:1) (cid:28) o(dk).\n6In many cases, R(w) = \u03bb \u00b7(cid:80)\n\nj Kjwj is the norm associated with the kernel, for matrix Kj \u2208\n7Recall a Mercer kernel K : Rd\u00d7d \u2192 R can be written as K(x, y) = (cid:104)\u03c6(x), \u03c6(y)(cid:105) where \u03c6 : Rd \u2192 V is a\n\nj\u2208[k] w(cid:62)\nRN\u00d7N de\ufb01ned as [Kj]i,n = Kj(x(i), x(n)).\nfeature mapping to some inner product space V.\n\n8It is necessary the negative result of kernel methods is distribution dependent, since for trivial distributions\nwhere x is non-zero only on the \ufb01rst constantly many coordinates, both neural networks and kernel methods\ncan learn it with constantly many samples.\n\n5\n\n\f3.2 Limitation of Linear Regression Over Feature Mappings\nGiven an arbitrary feature mapping \u03c6 : Rd \u2192 RD, one may consider learning a linear function over\n\u03c6. Namely, to learn a function F : Rd \u2192 Rk where each\n\n(3.3)\nis parameterized by a weight vector wj \u2208 RD. Usually, these weights are determined by minimizing\nthe following regression objective:9\n\nj \u03c6(x)\n\nFj(x) = w(cid:62)\n\n(cid:80)\n\n(cid:80)\n\n1\nN\n\ni\u2208[N ]\n\nj\u2208[k]\n\n(cid:16)\n\nj \u03c6(cid:0)x(i)(cid:1) \u2212 y(i)\n\nj\n\nw(cid:62)\n\n(cid:17)2\n\n+ R(w)\n\nfor some regularizer R(w). In this paper, we do not make assumptions about how the weighted are\nfound. Instead, we focus on any linear function over such feature mapping in the form (3.3).\nTheorem 4 (feature mapping, sketched). 
For sufficiently large integers d and k, there exist concept classes consisting of functions H(x) = F(x) + αG(F(x)) with complexities C_F, C_G and α ∈ (0, 1/C_G) such that, letting

T_res be the time complexity from Theorem 1 to achieve α^{3.9} population risk,

for at least 99% of the functions H in this concept class, even with an arbitrary D = (T_res)^2 dimensional feature mapping, any function F(x) of the form (3.3) has to suffer population risk

E_{(x,y)∼D} (1/2)‖F(x) − y‖_2^2 > α^2

even if the label y = H(x) has no error.

Interpretation. Since any algorithm that optimizes linear functions over a D-dimensional feature mapping has to run in time Ω(D), this proves a time complexity separation between neural networks (say, for achieving population risk α^{3.9}) and linear regression over feature mappings (for achieving even any population risk better than α^2 ≫ α^{3.9}). Usually, such an algorithm also has to suffer Ω(D) space complexity; if that happens, we also have a space complexity separation. Our hard instance for proving Theorem 4 is the same as that of Theorem 3, and the proof is analogous.

4 Notations
We denote by ‖w‖_2 and ‖w‖_∞ the Euclidean and infinity norms of a vector w, and by ‖w‖_0 the number of non-zeros of w. We also abbreviate ‖w‖ = ‖w‖_2 when it is clear from the context. We denote the row ℓ_p norm of W ∈ R^{m×d} (for p ≥ 1) as

‖W‖_{2,p} := (Σ_{i∈[m]} ‖w_i‖_2^p)^{1/p}.

By definition, ‖W‖_{2,2} = ‖W‖_F is the Frobenius norm of W. We use ‖W‖_2 to denote the matrix spectral norm. For a diagonal matrix D we use ‖D‖_0 to denote its sparsity. 
For a matrix W ∈ R^{m×d}, we use W_i or w_i to denote the i-th row of W.
We use N(µ, σ) to denote the Gaussian distribution with mean µ and variance σ, and N(µ, Σ) to denote a Gaussian vector with mean µ and covariance Σ. We use 1_event or 1[event] to denote the indicator function of whether event is true. We use σ(·) to denote the ReLU function, namely σ(x) = max{x, 0} = 1[x ≥ 0] · x. Given a univariate function f : R → R, we also use f to denote the same function over vectors: f(x) = (f(x_1), . . . , f(x_m)) if x ∈ R^m.
For notational simplicity, throughout this paper "with high probability" (or w.h.p.) means with probability 1 − e^{−c log^2 m} for a sufficiently large constant c. We use Õ to hide polylog(m) factors.
Function complexity. The following notions, introduced in [3], measure the complexity of any infinite-order smooth function φ : R → R. Suppose φ(z) = Σ_{i=0}^∞ c_i z^i is its Taylor expansion. Then

C_ε(φ) = C_ε(φ, 1) := Σ_{i=0}^∞ ((C*)^i + (√(log(1/ε)) · √(C*))^i) |c_i|
C_s(φ) = C_s(φ, 1) := C* Σ_{i=0}^∞ (i + 1)|c_i|

where C* is a sufficiently large constant (e.g., 10^4).
Example 4.1. If φ(z) = e^{c·z} − 1, sin(c·z), cos(c·z) or a degree-c polynomial for constant c, then C_ε(φ, 1) = o(1/ε) and C_s(φ, 1) = O(1). If φ(z) = sigmoid(z) or tanh(z), to get an ε-approximation we can truncate their Taylor series at degree Θ(log(1/ε)). One can verify that C_ε(φ, 1) ≤ poly(1/ε) by the fact that (log(1/ε)/i)^i ≤ poly(ε^{−1}) for every i ≥ 1, and C_s(φ, 1) ≤ O(1).

9 If R(w) is the ℓ_2 regularizer, then this becomes a kernel method again, since the minimizer can be written in the form (3.1). For other regularizers, this may not be the case.

5 Concept Class
We consider learning some unknown distribution D of data points z = (x, y) ∈ R^d × R^k, where x ∈ R^d is the input vector and y is the associated label. Let us consider target functions H : R^d → R^k coming from the following concept class.
Concept 1. H is given by two smooth functions F : R^d → R^k, G : R^k → R^k and a value α ∈ R_+:

H(x) = F(x) + αG(F(x)),    (5.1)

where for each output coordinate r,

F_r(x) = Σ_{i∈[p_F]} a*_{F,r,i} · F_{r,i}(⟨w*_{r,i}, x⟩)    and    G_r(h) = Σ_{i∈[p_G]} a*_{G,r,i} · G_{r,i}(⟨v*_{r,i}, h⟩)    (5.2)

for some parameters a*_{F,r,i}, a*_{G,r,i} ∈ [−1, 1] and vectors w*_{r,i} ∈ R^d and v*_{r,i} ∈ R^k. We assume for simplicity ‖w*_{r,i}‖_2 = ‖v*_{r,i}‖_2 = 1/√2.10 For simplicity, we also assume ‖x‖_2 = 1 and ‖F(x)‖_2 = 1 for (x, y) ∼ D, and in Appendix A we state a more general Concept 2 without these assumptions.11
We denote by C_ε(F) = max_{r,i}{C_ε(F_{r,i})} and C_s(F) = max_{r,i}{C_s(F_{r,i})}. Intuitively, F and G are both generated by two-layer neural networks with smooth activation functions F_{r,i} and G_{r,i}.
Borrowing the agnostic PAC-learning language, our concept class consists of all functions H(x) in the form of Concept 1 with complexity bounded by the tuple (p_F, C_F, p_G, C_G). Let OPT be the population risk achieved by the best target function in this concept class. Then, our goal is to learn this concept class with population risk O(OPT) + ε using sample and time complexity polynomial in p_F, C_F, p_G, C_G and 1/ε. In the remainder of this paper, to simplify notation, we do not explicitly define this concept class parameterized by (p_F, C_F, p_G, C_G). 
Instead, we equivalently state our theorem with respect to any (unknown) fixed target function H with population risk OPT:

E_{(x,y)∼D}[(1/2)‖H(x) − y‖_2^2] ≤ OPT.

In the analysis we adopt the following notation. For every (x, y) ∼ D, it satisfies ‖F(x)‖_2 ≤ B_F and ‖G(F(x))‖_2 ≤ B_{F∘G}. We assume G(·) is L_G-Lipschitz continuous. It is a simple exercise (see Fact A.3) to verify that L_G ≤ √k p_G C_s(G), B_F ≤ √k p_F C_s(F), and B_{F∘G} ≤ L_G B_F + √k p_G C_s(G) ≤ k p_F C_s(F) p_G C_s(G).

6 Overview of Theorem 1
We learn the unknown distribution D using the three-layer ResNet with ReLU activation (2.1) as the learner. For notational simplicity, we absorb the bias vector into the weight matrix: that is, given W ∈ R^{m×d} and bias b_1 ∈ R^m, we rewrite Wx + b_1 as W(x, 1) for a new weight matrix W ∈ R^{m×(d+1)}. We also re-parameterize U as U = VA, and we find this parameterization (similar to the "bottleneck" structure in ResNet) simplifies the proof and also works well empirically for our concept class. After such notational simplification and re-parameterization, we can rewrite out(x) : R^d → R^k as

out(W, V; x) = out(x) = out_1(x) + Aσ((V(0) + V)(out_1(x), 1))
out_1(W, V; x) = out_1(x) = Aσ((W(0) + W)(x, 1)).

Above, A ∈ R^{k×m}, V(0) ∈ R^{m×(k+1)}, W(0) ∈ R^{m×(d+1)} are weight matrices corresponding to random initialization, and V ∈ R^{m×(k+1)}, W ∈ R^{m×(d+1)} are the additional weights to be learned by the algorithm. 
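The bias absorption and the re-parameterized network above can be sketched directly (a minimal numpy illustration; the sizes d = 10, m = 128, k = 3 and the values σ_w = 0.1, σ_v = 1.0 are placeholders, only loosely following the initialization ranges discussed below):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def append_one(v):
    # absorb the bias: Wx + b is rewritten as W'(x, 1)
    return np.concatenate([v, [1.0]])

def out1(x, A, W0, W):
    # out_1(x) = A sigma((W(0) + W)(x, 1))
    return A @ relu((W0 + W) @ append_one(x))

def out(x, A, W0, W, V0, V):
    # out(x) = out_1(x) + A sigma((V(0) + V)(out_1(x), 1))
    h = out1(x, A, W0, W)
    return h + A @ relu((V0 + V) @ append_one(h))

d, m, k = 10, 128, 3
sigma_w, sigma_v = 0.1, 1.0
rng = np.random.default_rng(0)
A  = rng.normal(0.0, np.sqrt(1.0 / m), size=(k, m))          # A_{i,j} ~ N(0, 1/m)
W0 = rng.normal(0.0, sigma_w, size=(m, d + 1))               # ~ N(0, sigma_w^2)
V0 = rng.normal(0.0, sigma_v / np.sqrt(m), size=(m, k + 1))  # ~ N(0, sigma_v^2/m)
W = np.zeros((m, d + 1))   # trainable deviations start at W_0 = 0
V = np.zeros((m, k + 1))   # and V_0 = 0
x = rng.normal(size=d)
x /= np.linalg.norm(x)
pred = out(x, A, W0, W, V0, V)   # a vector in R^k
```

Only W and V would be updated by SGD here; A, W0, V0 stay fixed at their random initialization.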
To prove the strongest result, we only train W, V and do not train A.^12

^10 For general ‖w*_{1,i}‖_2 ≤ B, ‖w*_{2,i}‖_2 ≤ B, |a*_{r,i}| ≤ B, the scaling factor B can be absorbed into the activation function φ′(x) = φ(Bx). Our results then hold by replacing the complexity of φ with that of φ′.

^11 Since we use ReLU networks as learners, they are positively homogeneous; so to learn general functions F, G, which may not be positively homogeneous, it is in some sense necessary that the inputs are scaled properly.

^12 This can be more meaningful than training all the layers together, in which case, if one is not careful with parameter choices, the training process can degenerate as if only the last layer is trained [11]. (That is a convex kernel method.) Of course, as a simple corollary, our result also applies to training all the layers together, with appropriately chosen random initialization and learning rates.

We consider random Gaussian initialization where the entries of A, W(0), V(0) are independently generated as follows:

    [W(0)]_{i,j} ∼ N(0, σ_w^2)    [V(0)]_{i,j} ∼ N(0, σ_v^2/m)    A_{i,j} ∼ N(0, 1/m) .

In this paper we focus on the ℓ_2 loss function between H and out, given as:

    Obj(W, V; (x, y)) = (1/2)‖y − out(W, V; x)‖_2^2 .    (6.1)

We consider the vanilla SGD algorithm. Starting from W_0 = 0, V_0 = 0, in each iteration t = 0, 1, ..., T − 1, it receives a random sample (x_t, y_t) ∼ D and performs the SGD updates^13

    W_{t+1} ← W_t − η_w · ∂Obj(W, V; (x_t, y_t))/∂W |_{W=W_t, V=V_t}
    V_{t+1} ← V_t − η_v · ∂Obj(W, V; (x_t, y_t))/∂V |_{W=W_t, V=V_t} .

^13 Performing SGD with respect to W(0) + W and V(0) + V is the same as that with respect to W and V; we introduce the W(0), V(0) notation for analysis purposes. Note also, one can alternatively consider having a training set and then performing SGD on this training set with multiple passes; similar results can be obtained.

Theorem 1. Under Concept 1 or Concept 2, for every α ∈ (0, Θ̃(1/(k p_G C_s(G)))) and δ ≥ OPT + Θ̃(α^4 (k p_G C_s(G))^4 (1 + B_F)^2), there exists M = poly(C_α(F), C_α(G), p_F, α^{−1}) satisfying that for every m ≥ M, with high probability over A, W(0), V(0), for a wide range of random initialization parameters σ_w, σ_v (see Table 1), choosing

    η_w = Θ̃(min{1, δ})    η_v = η_w · Θ̃( (α p_G C_s(G) / (p_F C_s(F)))^2 )    T = Θ̃( (k p_F C_s(F))^2 / min{1, δ^2} ) ,

with high probability the SGD algorithm satisfies

    (1/T) Σ_{t=0}^{T−1} E_{(x,y)∼D} ‖H(x) − out(W_t, V_t; x)‖_2^2 ≤ O(δ) .

As a corollary, under Concept 1, we can achieve population risk

    O(OPT) + Θ̃(α^4 (k p_G C_s(G))^4)    (6.2)

using sample complexity T.

Remark 6.1. Our Theorem 1 is almost in the PAC-learning language, except that the final error has an additive α^4 term that cannot be made arbitrarily small.

6.1 Proof Overview

In the analysis, let us define diagonal matrices

    D_{W(0)} = diag{ 1_{W(0)(x,1) ≥ 0} }            D_{V(0),W} = diag{ 1_{V(0)(out_1(x),1) ≥ 0} }
    D_W = diag{ 1_{(W(0)+W)(x,1) ≥ 0} }          D_{V,W} = diag{ 1_{(V(0)+V)(out_1(x),1) ≥ 0} }

which satisfy out_1(x) = A D_W (W(0) + W)(x, 1) and out(x) = out_1(x) + A D_{V,W} (V(0) + V)(out_1(x), 1). The proof of Theorem 1 can be divided into three simple steps, with the parameter choices given in Table 1.

Table 1: Three-layer ResNet parameter choices. In this paper, we assume 0 < α ≤ Õ(1/(k p_G C_s(G))) and choose parameters

    σ_w ∈ [m^{−1/2+0.01}, m^{−0.01}]        σ_v ∈ [polylog(m), m^{3/8−0.01}]
    τ_w := Θ̃(k p_F C_s(F)) ≥ 1             τ_v := Θ̃(α k p_G C_s(G)) ≤ 1/polylog(m)

and they satisfy

    τ_w ∈ [m^{1/8+0.001} σ_w, m^{1/8−0.001} σ_w^{1/4}]    τ_v ∈ [σ_v · (k/m)^{3/8}, σ_v/polylog(m)] .

(σ_w, σ_v: recall the entries of W(0) and V(0) are from N(0, σ_w^2) and N(0, σ_v^2/m). τ_w, τ_v: the proofs work with respect to ‖W‖_2 ≤ τ_w and ‖V‖_2 ≤ τ_v.)

In the first step, we prove that for all weight matrices not very far from the random initialization (namely, all ‖W‖_2 ≤ τ_w and ‖V‖_2 ≤ τ_v), many good “coupling properties” occur. This includes upper bounds on the number of sign changes (i.e., on ‖D_{W(0)} − D_W‖_0 and ‖D_{V(0),W} − D_{V,W}‖_0) as well as vanishing properties such as A D_W W(0) and A D_{V,W} V(0) being negligible. We prove such properties using techniques from prior works [3, 5]. Details are in Section C.1.

In the second step, we prove the existence of W^⊤ with ‖W^⊤‖_F ≤ τ_w/10 and V^⊤ with ‖V^⊤‖_F ≤ τ_v/10 satisfying

    A D_{W(0)} W^⊤(x, 1) ≈ F(x)    and    A D_{V(0),W} V^⊤(out_1(x), 1) ≈ α G(out_1(x)) .
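Stepping back, the overall training procedure of this section (random Gaussian initialization, then vanilla SGD on W and V only, with the ℓ_2 loss) can be sketched end-to-end. The analytic gradients below just apply the chain rule through the two ReLU blocks; the dimensions, learning rates, initialization scales, and the synthetic linear target are illustrative choices rather than the paper's parameter settings:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, m = 6, 2, 128          # toy dimensions
eta_w, eta_v = 0.05, 0.05    # toy learning rates (Theorem 1 tunes eta_v relative to eta_w)

relu = lambda z: np.maximum(z, 0.0)
aug = lambda z: np.append(z, 1.0)

A  = rng.normal(0, np.sqrt(1.0 / m), size=(k, m))   # fixed: A is not trained
W0 = rng.normal(0, 0.05,             size=(m, d + 1))
V0 = rng.normal(0, np.sqrt(1.0 / m), size=(m, k + 1))

def forward(W, V, x):
    p1 = (W0 + W) @ aug(x); h = A @ relu(p1)      # out_1(x)
    p2 = (V0 + V) @ aug(h); o = h + A @ relu(p2)  # out(x), with the residual link
    return p1, h, p2, o

def grads(W, V, x, y):
    # gradients of Obj = 0.5 * ||o - y||^2 by the chain rule
    p1, h, p2, o = forward(W, V, x)
    r = o - y
    dp2 = (A.T @ r) * (p2 > 0)        # back through the second ReLU block
    gV = np.outer(dp2, aug(h))
    dh = r + (V0 + V)[:, :k].T @ dp2  # residual path plus second-block path
    dp1 = (A.T @ dh) * (p1 > 0)
    gW = np.outer(dp1, aug(x))
    return gW, gV

# vanilla SGD from W = V = 0 on fresh samples of a synthetic linear target
target = rng.normal(size=(k, d))
W = np.zeros((m, d + 1)); V = np.zeros((m, k + 1))
for t in range(200):
    x = rng.normal(size=d); x /= np.linalg.norm(x)
    gW, gV = grads(W, V, x, target @ x)
    W -= eta_w * gW; V -= eta_v * gV
```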
This existential proof relies on an “indicator to function” lemma from [3]; for the purpose of this paper we have to revise it to include a trainable bias term (or equivalently, to support vectors of the form (x, 1)). Combining it with the aforementioned vanishing properties, we derive (details are in Section C.2):

    A D_W W^⊤(x, 1) ≈ F(x)    and    A D_{V,W} V^⊤(out_1(x), 1) ≈ α G(out_1(x)) .    (6.3)

In the third step, consider iteration t of SGD with sample (x_t, y_t) ∼ D. For simplicity we assume OPT = 0 so y_t = H(x_t). One can carefully write down the gradient formula, and plug in (6.3) to derive

    Ξ_t := ⟨∇_{W,V} Obj(W_t, V_t; (x_t, y_t)), (W_t − W^⊤, V_t − V^⊤)⟩ ≥ (1/2)‖H(x_t) − out(W_t, V_t; x_t)‖_2^2 − 2‖Err_t‖_2^2

with E[‖Err_t‖_2^2] ≤ Θ̃(α^4 (k p_G C_s(G))^4). This quantity Ξ_t is quite famous in classical mirror descent analysis: for appropriately chosen learning rates, Ξ_t must converge to zero.^14 In other words, by concentration, SGD is capable of finding solutions W_t, V_t so that the population risk ‖H(x_t) − out(W_t, V_t; x_t)‖_2^2 is as small as E[‖Err_t‖_2^2]. This is why we can obtain population risk Θ̃(α^4 (k p_G C_s(G))^4) in (6.2). Details are in Sections C.3 and C.4.

7 Overview of Theorem 3

Let us consider the following simple distribution of functions H(x) = F(x) + α G(F(x)), with F(x) = W*x and G(y) = ( Π_{j∈[k]} y_j )_{i∈[k]}. Here, W* = √d (e_{i_1}, e_{i_2}, ..., e_{i_k})^⊤ for i_1 ≠ i_2 ≠ ... ≠ i_k being uniformly at random chosen from [d].

Theorem 2.
For every constant k ≥ 4, for sufficiently large d, for every N ≤ O(d^{k/2}/log^k d), for every α ∈ [Ω(d^{−k/4}), 1], let X = {x^{(1)}, ..., x^{(N)}} be i.i.d. drawn from the uniform distribution over {±1/√d}^d and y^{(i)} = H(x^{(i)}), and let K(x) represent the optimal solution to (3.2) using any correlation kernel. Then we have, with probability at least 0.99 over the randomness of W* and X,

    E_{x∼U({−1/√d, +1/√d}^d)} ‖H(x) − K(x)‖_2^2 > α^2/4 .

Remark. One can relax the assumption on K(x) to being any approximate minimizer of (3.2).

In contrast, using Theorem 1, one can show C_s(F) = O(√d) and C_s(G) = O(1), so SGD can find

    E_x ‖H(x) − out(x)‖_2^2 ≤ Õ(α^4)    in    N = Θ̃(d/α^8) samples .

As an example, when k ≥ 4 and α = d^{−0.01}, ResNet achieves regression error α^4 in N = Õ(d^{1.08}) samples, but kernel methods cannot achieve α^2 error even with N = Θ̃(d^{k/2}) = Ω̃(d^2) samples.

We sketch the proof of Theorem 3 in half a page on Page 13.

Conclusion. We give the first provable separation between neural networks and kernel methods, in the efficient and distribution-free learning regime. We show that neural networks can implicitly perform hierarchical learning of functions G(F(x)) with the help of F(x) using residual links, without paying the sample complexity of “one-shot” learning algorithms that directly learn G(F(x)).
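For concreteness, the hard distribution of Section 7 can be sampled as follows. The coordinate-wise product form of G follows the construction above as far as it can be recovered from the text, and the sizes are toy:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, alpha = 100, 4, 0.5  # toy sizes; Theorem 2 takes constant k >= 4 and large d

# W* = sqrt(d) * (e_{i1}, ..., e_{ik})^T for k distinct coordinates chosen from [d]
idx = rng.choice(d, size=k, replace=False)

def H(x):
    Fx = np.sqrt(d) * x[idx]        # F(x) = W* x; each entry is +-1 on the cube below
    Gy = np.full(k, np.prod(Fx))    # every coordinate of G(y) equals prod_{j in [k]} y_j
    return Fx + alpha * Gy

# N i.i.d. samples from the uniform distribution over {+-1/sqrt(d)}^d
N = 5
X = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)
Y = np.array([H(x) for x in X])
print(Y.shape)  # (5, 4)
```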
Finally, we would like to point out that there are kernels not captured by our correlation kernels, such as convolutional networks with “global average pooling.” Proving bounds for them is an interesting future direction.

^14 Indeed, one can show Σ_{t=0}^{T−1} Ξ_t ≤ O(η_w + η_v)·T + ‖W^⊤‖_F^2/η_w + ‖V^⊤‖_F^2/η_v, and thus the right-hand side can be made O(√T) ignoring other factors.

References

[1] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD learn recurrent neural networks with provable generalization? CoRR, abs/1902.01028, 2019. URL http://arxiv.org/abs/1902.01028.

[2] Zeyuan Allen-Zhu and Yuanzhi Li. Backward Feature Correction: How Deep Learning Performs Deep Learning. arXiv preprint, January 2020.

[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1811.04918.

[4] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1810.12065.

[5] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019. Full version available at http://arxiv.org/abs/1811.03962.

[6] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

[7] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. CoRR, abs/1901.08584, 2019. URL http://arxiv.org/abs/1901.08584.

[8] Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff.
Learning two layer rectified neural networks in polynomial time. arXiv preprint arXiv:1811.01885, 2018.

[9] Digvijay Boob and Guanghui Lan. Theoretical properties of the global optimizer of two layer neural network. arXiv preprint arXiv:1710.11241, 2017.

[10] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

[11] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.

[12] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems (NIPS), pages 2253–2261, 2016.

[13] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[14] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[15] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

[16] Rong Ge, Rohith Kuditipudi, Zhize Li, and Xiang Wang. Learning two-layer neural networks with symmetric inputs. In International Conference on Learning Representations, 2019.

[17] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Proceedings of the Conference on Learning Theory, 2018.

[18] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649.
IEEE, 2013.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[20] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[21] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[23] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, 2018.

[24] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.

[25] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In COLT, 2018.

[26] Tengyu Ma. CS229T/STAT231: Statistical Learning Theory (Fall 2017). https://web.stanford.edu/class/cs229t/scribe_notes/10_17_final.pdf, October 2017. Accessed May 2019.

[27] Martin J. Wainwright. Basic tail and concentration bounds. https://www.stat.berkeley.edu/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf, 2015. Online; accessed October 2018.

[28] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks.
In Conference on Learning Theory, pages 1376–1401, 2015.

[29] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do CIFAR-10 Classifiers Generalize to CIFAR-10? arXiv preprint arXiv:1806.00451, 2018.

[30] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010), pages 1576–1602. World Scientific, 2010.

[31] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[32] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

[33] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[34] Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.

[35] Santosh Vempala and John Wilmes. Polynomial convergence of gradient descent for training one-hidden-layer neural networks. arXiv preprint arXiv:1805.02677, 2018.

[36] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369, 2018.

[37] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.

[38] Greg Yang.
Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.

[39] Yuchen Zhang, Jason D Lee, and Michael I Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001, 2016.

[40] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.

[41] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.