{"title": "Tight Sample Complexity of Learning One-hidden-layer Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 10612, "page_last": 10622, "abstract": "We study the sample complexity of learning one-hidden-layer convolutional neural networks (CNNs) with non-overlapping filters. We propose a novel algorithm called approximate gradient descent for training CNNs, and show that, with high probability, the proposed algorithm with random initialization grants a linear convergence to the ground-truth parameters up to statistical precision. Compared with existing work, our result applies to general non-trivial, monotonic and Lipschitz continuous activation functions including ReLU, Leaky ReLU, Sigmod and Softplus etc. Moreover, our sample complexity beats existing results in the dependency of the number of hidden nodes and filter size. In fact, our result matches the information-theoretic lower bound for learning one-hidden-layer CNNs with linear activation functions, suggesting that our sample complexity is tight. Our theoretical analysis is backed up by numerical experiments.", "full_text": "Tight Sample Complexity of Learning\n\nOne-hidden-layer Convolutional Neural Networks\n\nYuan Cao\n\nQuanquan Gu\n\nDepartment of Computer Science\n\nUniversity of California, Los Angeles\n\nDepartment of Computer Science\n\nUniversity of California, Los Angeles\n\nCA 90095, USA\n\nyuancao@cs.ucla.edu\n\nCA 90095, USA\n\nqgu@cs.ucla.edu\n\nAbstract\n\nWe study the sample complexity of learning one-hidden-layer convolutional neural\nnetworks (CNNs) with non-overlapping \ufb01lters. We propose a novel algorithm\ncalled approximate gradient descent for training CNNs, and show that, with high\nprobability, the proposed algorithm with random initialization grants a linear con-\nvergence to the ground-truth parameters up to statistical precision. 
Compared with existing work, our result applies to general non-trivial, monotonic and Lipschitz continuous activation functions including ReLU, Leaky ReLU, Sigmoid and Softplus, etc. Moreover, our sample complexity beats existing results in its dependency on the number of hidden nodes and the filter size. In fact, our result matches the information-theoretic lower bound for learning one-hidden-layer CNNs with linear activation functions, suggesting that our sample complexity is tight. Our theoretical analysis is backed up by numerical experiments.

1 Introduction

Deep learning is one of the key research areas in modern artificial intelligence. Deep neural networks have been successfully applied to various fields including image processing [25], speech recognition [20] and reinforcement learning [33]. Despite the remarkable success in a broad range of applications, theoretical understanding of neural network models remains largely incomplete: the high non-convexity of neural networks makes convergence analysis of learning algorithms very difficult; numerous practically successful choices of the activation function, twists of the training process and variants of the network structure make neural networks even more mysterious.

One of the fundamental problems in learning neural networks is parameter recovery, where we assume the data are generated from a "teacher" network, and the task is to estimate the ground-truth parameters of the teacher network based on the generated data. Recently, a line of research [41, 16, 38] gives parameter recovery guarantees for gradient descent based on the analysis of local convexity and smoothness properties of the square loss function. The results of Zhong et al. [41] and Fu et al. [16] hold for various activation functions except the ReLU activation function, while Zhang et al. [38] prove the corresponding result for ReLU. 
Their results are for fully connected neural networks, and their analysis requires accurate knowledge of the second-layer parameters. For instance, Fu et al. [16] and Zhang et al. [38] directly assume that the second-layer parameters are known, while Zhong et al. [41] reduce the second-layer parameters to ±1's with the homogeneity assumption, and then exactly recover them with a tensor initialization algorithm. Moreover, it may not be easy to generalize the local convexity and smoothness argument to other algorithms that are not based on the exact gradient of the loss function. Another line of research [6, 13, 18, 10] focuses on convolutional neural networks with ReLU activation functions. Brutzkus and Globerson [6], Du et al. [13] provide convergence analyses for gradient descent on the parameters of both layers, while Goel et al. [18], Du and Goel [10] propose new algorithms to learn single-hidden-layer CNNs. However, these results heavily rely on the exact calculation of the population gradient for ReLU networks, and do not provide tight sample complexity guarantees.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Comparison with related work [13, 14, 18, 10]. Note that Du et al. [14] did not study any specific learning algorithms. All sample complexity results are calculated for standard Gaussian inputs and non-overlapping filters. D. Convotron stands for Double Convotron, an algorithm proposed by Du and Goel [10].

Method            | Conv. rate   | Sample comp.          | Act. fun.    | Data input   | Overlap | Sec. layer
------------------|--------------|-----------------------|--------------|--------------|---------|-----------
Du et al. [13]    | linear       | -                     | ReLU         | Gaussian     | no      | yes
Du et al. [14]    | -            | Õ((k + r)·ε^{-2})     | Linear       | sub-Gaussian | yes     | -
Convotron [18]    | (sub)linear¹ | Õ(k²r·ε^{-2})         | (leaky) ReLU | symmetric    | yes     | no
D. Convotron [10] | sublinear    | Õ(poly(k, r, ε^{-1})) | (leaky) ReLU | symmetric    | yes     | yes
This paper        | linear       | Õ((k + r)·ε^{-2})     | general      | Gaussian     | no      | yes

In this paper, we study the parameter recovery problem for non-overlapping convolutional neural networks. We aim to develop a new convergence analysis framework for neural networks that: (i) works for a class of general activation functions, (ii) does not rely on ad hoc initialization, and (iii) can potentially be applied to different variants of the gradient descent algorithm. The main contributions of this paper are as follows:

• We propose an approximate gradient descent algorithm that learns the parameters of both layers in a non-overlapping convolutional neural network. With weak requirements on initialization that can be easily satisfied, the proposed algorithm converges to the ground-truth parameters linearly up to statistical precision.

• Our convergence result holds for all non-trivial, monotonic and Lipschitz continuous activation functions. Compared with the results in Brutzkus and Globerson [6], Du et al. [13], Goel et al. [18], Du and Goel [10], our analysis does not rely on any analytic calculation related to the activation function. We also do not require the activation function to be smooth, which is assumed in the work of Zhong et al. [41] and Fu et al. [16].

• We consider the empirical version of the problem, where the estimation of parameters is based on n independent samples. We avoid the usual analysis with sample splitting by proving uniform concentration results. Our method outperforms the state-of-the-art results in terms of sample complexity. In fact, our result for general non-trivial, monotonic and Lipschitz continuous activation functions matches the lower bound given for linear activation functions in Du et al. 
[14], which implies the statistical optimality of our algorithm.

A detailed comparison between our results and the state of the art on learning one-hidden-layer CNNs is given in Table 1. We compare the convergence rates and sample complexities obtained by recent work with our result. We also summarize the applicable activation functions and data input distributions, and whether overlapping/non-overlapping filters and second-layer training are considered in each work.

Notation: Let A = [A_ij] ∈ R^{d×d} be a matrix and x = (x_1, ..., x_d)^T ∈ R^d be a vector. We use ||x||_q = (Σ_{i=1}^d |x_i|^q)^{1/q} to denote the ℓ_q vector norm for 0 < q < +∞. The spectral and Frobenius norms of A are denoted by ||A||_2 and ||A||_F. For a symmetric matrix A, we denote by λ_max(A), λ_min(A) and λ_i(A) the maximum, minimum and i-th largest eigenvalues of A. We denote by A ⪰ 0 that A is positive semidefinite (PSD). Given two sequences {a_n} and {b_n}, we write a_n = O(b_n) if there exists a constant 0 < C < +∞ such that a_n ≤ C b_n, and a_n = Ω(b_n) if a_n ≥ C b_n for some constant C. We use the notations Õ(·), Ω̃(·) to hide logarithmic factors. Finally, we denote a ∧ b = min{a, b} and a ∨ b = max{a, b}.

¹ [18] provided a general sublinear convergence result as well as a linear convergence rate for the noiseless case. We only list their sample complexity result for the noisy case in the table for proper comparison.

2 Related Work

There has been a vast body of literature on the theory of deep learning. We review in this section the work most relevant to ours.

It is well known that neural networks have remarkable expressive power due to the universal approximation theorem [22]. However, even learning a one-hidden-layer neural network with a sign activation function can be NP-hard [5] in the realizable case. 
In order to explain the success of deep learning in various applications, additional assumptions on the data generating distribution have been explored, such as symmetric distributions [4] and log-concave distributions [24]. More recently, a line of research has focused on Gaussian distributed input for one-hidden-layer or two-layer networks with different structures [23, 35, 6, 27, 41, 40, 17, 38, 16]. Compared with these results, our work aims at providing tighter sample complexity for more general activation functions.

A recent line of work [28, 32, 29, 26, 15, 2, 11, 42, 1, 3, 7] studies the training of neural networks in the over-parameterized regime. Mei et al. [28], Shamir [32], Mei et al. [29] studied the optimization landscape of over-parameterized neural networks. Li and Liang [26], Du et al. [15], Allen-Zhu et al. [2], Du et al. [11], Zou et al. [42] proved that gradient descent can find the global minima of over-parameterized neural networks. Generalization bounds under the same setting are studied in Allen-Zhu et al. [1], Arora et al. [3], Cao and Gu [7]. Compared with these results in the over-parameterized setting, the parameter recovery problem studied in this paper is in the classical setting, and therefore different approaches need to be taken for the theoretical analysis.

This paper studies convolutional neural networks (CNNs). There is not much theoretical literature specifically on CNNs. The expressive power of CNNs is shown in Cohen and Shashua [8]. Nguyen and Hein [30] study the loss landscape of CNNs, and Brutzkus and Globerson [6] show the global convergence of gradient descent on one-hidden-layer CNNs. Du et al. [12] extend the result to non-Gaussian input distributions with ReLU activation. Zhang et al. [39] relax the class of CNN filters to a reproducing kernel Hilbert space and prove a generalization error bound for the relaxation. Gunasekar et al. 
[19] show that there is an implicit bias in gradient descent on training linear CNNs.

3 The One-hidden-layer Convolutional Neural Network

In this section we formalize the one-hidden-layer convolutional neural network model. In a convolutional network with neuron number k and filter size r, a filter w ∈ R^r interacts with the input x at k different locations I_1, ..., I_k, where I_1, ..., I_k ⊆ {1, 2, ..., d} are index sets of cardinality r. Let I_j = {p_{j1}, ..., p_{jr}}, j = 1, ..., k; then the corresponding selection matrices P_1, ..., P_k are defined as P_j = (e_{p_{j1}}, ..., e_{p_{jr}})^T, j = 1, ..., k.

We consider a convolutional neural network of the form

    y = Σ_{j=1}^k v_j σ(w^T P_j x),

where σ(·) is the activation function, and w ∈ R^r, v ∈ R^k are the first- and second-layer parameters respectively. Suppose that we have n samples {(x_i, y_i)}_{i=1}^n, where x_1, ..., x_n ∈ R^d are generated independently from the standard Gaussian distribution, and the corresponding outputs y_1, ..., y_n ∈ R are generated from the teacher network with true parameters w* and v* as follows:

    y_i = Σ_{j=1}^k v*_j σ(w*^T P_j x_i) + ε_i,

where k is the number of hidden neurons, and ε_1, ..., ε_n are independent sub-Gaussian white noises with ψ_2 norm ν. Throughout this paper, we assume that ||w*||_2 = 1.

The choice of the activation function σ(·) determines the landscape of the neural network. In this paper, we assume that σ(·) is a non-trivial, Lipschitz continuous increasing function.

Assumption 3.1. σ is 1-Lipschitz continuous: |σ(z_1) − σ(z_2)| ≤ |z_1 − z_2| for all z_1, z_2 ∈ R.

Assumption 3.2. σ is a non-trivial (not constant) increasing function.

Remark 3.3. Assumptions 3.1 and 3.2 are fairly weak assumptions satisfied by most practically used activation functions, including the rectified linear unit (ReLU) function σ(z) = max(z, 0), the sigmoid function σ(z) = 1/(1 + e^{−z}), the hyperbolic tangent function σ(z) = (e^z − e^{−z})/(e^z + e^{−z}), and the erf function σ(z) = ∫_0^z e^{−t²/2} dt. Since we do not make any assumptions on the second-layer true parameter v*, our assumptions can easily be relaxed to any non-trivial, L-Lipschitz continuous and monotonic functions for an arbitrary fixed positive constant L.

4 Approximate Gradient Descent

In this section we present a new algorithm for the estimation of w* and v*.

4.1 Algorithm Description

Let y = (y_1, ..., y_n)^T, Σ(w) = [σ(w^T P_j x_i)]_{n×k} and ξ = E_{z∼N(0,1)}[σ(z)z]. The algorithm is given in Algorithm 1. We call it approximate gradient descent because it is derived by simply replacing the σ'(·) terms in the gradient of the empirical square loss function by the constant ξ^{−1}. It is easy to see that under Assumptions 3.1 and 3.2 we have ξ > 0. Therefore, replacing σ'(·) > 0 with ξ^{−1} will not drastically change the gradient direction, and gives us an approximate gradient.

Algorithm 1 is also related to, but different from, the Convotron algorithm proposed by Goel et al. [18] and the Double Convotron algorithm proposed by Du and Goel [10]. The approximate gradient descent algorithm can be seen as a generalized version of the Convotron algorithm, which only considers optimizing over the first-layer parameters of the convolutional neural network. 
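Before stating Algorithm 1 formally, the update it performs can be sketched in code. The following is an illustrative NumPy sketch (our own, not the authors' implementation): it assumes non-overlapping filters, so that P_j x is simply the j-th length-r block of x, and takes ξ = E_{z∼N(0,1)}[σ(z)z] as a precomputed constant (for example, ξ = 1 for the identity activation and ξ = 1/2 for ReLU). The initialization scale for v is illustrative.

```python
import numpy as np

def approximate_gradient_descent(X, y, k, r, sigma, xi, T=1000, alpha=0.05, seed=0):
    """Sketch of approximate gradient descent for a non-overlapping CNN,
    where patch j of input x_i is P_j x_i = x_i[j*r:(j+1)*r].

    The derivative sigma'(.) in the gradient with respect to w is replaced
    by the constant 1/xi, where xi = E[sigma(z) z] for z ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(r)
    w /= np.linalg.norm(w)                  # w^0 on the unit sphere S^{r-1}
    v = 0.1 * rng.standard_normal(k)        # illustrative v^0
    n = X.shape[0]
    patches = X.reshape(n, k, r)            # patches[i, j] = P_j x_i
    for _ in range(T):
        H = sigma(patches @ w)              # H[i, j] = sigma(w^T P_j x_i)
        resid = y - H @ v                   # y_i - sum_j v_j sigma(w^T P_j x_i)
        # approximate gradient in w: sigma'(.) replaced by 1/xi
        g_w = -(resid[:, None] * (patches * v[None, :, None]).sum(axis=1) / xi).mean(axis=0)
        # exact gradient in v
        g_v = -(H * resid[:, None]).mean(axis=0)
        u = w - alpha * g_w
        w = u / np.linalg.norm(u)           # iterative weight normalization
        v = v - alpha * g_v
    return w, v
```

For the linear activation σ(z) = z, the replacement of σ'(·) by ξ^{−1} = 1 is exact, so the update reduces to projected gradient descent on the sphere for w and plain gradient descent for v.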
Compared to the Double Convotron, Algorithm 1 implements a simpler update rule based on iterative weight normalization for the first-layer parameters, and uses a different update rule for the second-layer parameters.

Algorithm 1 Approximate Gradient Descent for Non-overlapping CNN
Require: Training data {(x_i, y_i)}_{i=1}^n, number of iterations T, step size α, initialization w^0 ∈ S^{r−1}, v^0.
for t = 0, 1, 2, ..., T − 1 do
    g_w^t = −(1/n) Σ_{i=1}^n [y_i − Σ_{j=1}^k v_j^t σ((w^t)^T P_j x_i)] · Σ_{j'=1}^k ξ^{−1} v_{j'}^t P_{j'} x_i
    g_v^t = −(1/n) Σ(w^t)^T [y − Σ(w^t) v^t]
    u^{t+1} = w^t − α g_w^t,  w^{t+1} = u^{t+1}/||u^{t+1}||_2,  v^{t+1} = v^t − α g_v^t
end for
Ensure: w^T, v^T

4.2 Convergence Analysis of Algorithm 1

In this section we give the main convergence result for Algorithm 1. We first introduce some notation. The following quantities are determined purely by the activation function:

    κ := E_{z∼N(0,1)}[σ(z)],  Δ := Var_{z∼N(0,1)}[σ(z)],  L := 1 + |σ(0)|,  Γ := 1 + |σ(0) − κ|,
    φ(w, w') := Cov_{z∼N(0,I)}[σ(w^T z), σ(w'^T z)].

The following lemma shows that the function φ(w, w') can in fact be written as a function of w^T w', which we denote by ψ(w^T w'). The lemma also reveals that ψ(·) is an increasing function.

Lemma 4.1. 
Under Assumption 3.2, there exists an increasing function ψ(τ) such that ψ(w^T w') = φ(w, w'), and Δ ≥ ψ(τ) > 0 for all τ > 0.

We further define the following quantities:

    M = max{ |κ|(2L|1^T v*| + √k), |κ 1^T(v^0 − v*)| },
    D = max{ ||v^0 − v*||_2, sqrt( [4(1 + 4αΔ)L²||v*||_2² + 4αΔ(κ²M²k + 1) + 2] / [Δ²(1 − 4αΔ)] ) },
    ρ = min{ ψ(w*^T w^0/2)·||v*||_2² / (2(Δ + κ²k)), v*^T v^0 }.

Let D_0 = D + ||v*||_2. Note that in our problem setting the number of filters k can scale with n. However, although k is used in the definitions of M, D, ρ and D_0, it is not difficult to check that all these quantities can be bounded, as stated in the following lemma.

Lemma 4.2. If α ≤ 1/(8Δ), then M, D, D_0 have upper bounds, and ρ has a lower bound, that only depend on the activation function σ(·), the ground truth (w*, v*) and the initialization (w^0, v^0).

We now present our main result, which states that the iterates w^t and v^t in Algorithm 1 converge linearly towards w* and v* respectively, up to statistical accuracy.

Theorem 4.3. Let δ ∈ (0, 1), γ_1 = (1 + αρ)^{−1/2}, γ_2 = sqrt(1 − αΔ + 4α²Δ²), and γ_3 = 1 − α(Δ + κ²k). Suppose that the initialization (w^0, v^0) satisfies

    w*^T w^0 > 0,  v*^T v^0 > 0,  κ²(1^T v*)·1^T(v^0 − v*) ≤ ρ,    (4.1)

and that the step size α is chosen such that

    α ≤ 1/(2(Δ + κ²k)) ∧ 1/(8Δ) ∧ Δ²/[(24L² + 2Δ²)||v*||_2² + 2M²k + 10].

If

    c_1·sqrt((r + k)·log(c_2 nk/δ)/n) ≤ q_0 ∧ [ξ/(D_0(D_0Γ + M + ν))]·(1 ∧ [ρ w*^T w^0/(1 + αρ)]) ∧ [1/((Γ + κ√k)(D_0Γ + M + ν))]·(1 ∧ ρ/||v*||_2)    (4.2)

for some large enough absolute constants c_1 and c_2, where q_0 is shorthand for a minimum of further problem-dependent quantities (involving ρ, |1^T v*|, Γ, L, ξ, ||v^0 − v*||_2 and ||v*||_2) needed only for technical reasons, then there exist absolute constants C and C' such that, with probability at least 1 − δ, we have

    ||w^t − w*||_2 ≤ γ_1^t ||w^0 − w*||_2 + 8ρ^{−1}γ_1^{−2} η_w,    (4.3)
    ||v^t − v*||_2 ≤ R_1 t^{3/2} (γ_1 ∨ γ_2 ∨ γ_3)^t + (R_2 + R_3|κ|√k)(η_w + η_v),    (4.4)

for all t = 0, ..., T, where

    η_w = C·ξ^{−1} D_0 (D_0Γ + M + ν) · sqrt((r + k)·log(120nk/δ)/n),    (4.5)
    η_v = C'·(Γ + κ√k)(D_0Γ + M + ν) · sqrt((r + k)·log(120nk/δ)/n),    (4.6)

and R_1, R_2, R_3 are constants that only depend on the choice of the activation function σ(·), the ground-truth parameters (w*, v*) and the initialization (w^0, v^0).

Condition (4.2) is an assumption on the sample size n. Although it looks complicated, essentially all quantities in this condition except k and r can be treated as constants, and (4.2) can be interpreted as the assumption that n ≥ Ω̃((1 + κ√k)·sqrt(r + k)), which is by no means a strong assumption. The second and third terms on the right-hand side of (4.2) guarantee η_w ≤ 1 ∧ [ρ w*^T w^0/(1 + αρ)] and η_v ≤ 1 ∧ (ρ/||v*||_2) respectively, while q_0 is needed for technical purposes to ensure convergence.

Remark 4.4. Theorem 4.3 shows that, with an initialization satisfying (4.1), Algorithm 1 converges linearly to the true parameters up to statistical error. The initialization condition can easily be satisfied with random initialization. In Section 4.3, we give a detailed initialization algorithm inspired by a random initialization method proposed in Du et al. [13].

Remark 4.5. Compared with the most relevant convergence results in the literature, given by Du et al. [13], our result is based on optimizing the empirical loss function instead of the population loss function. In particular, when κ = 0, Theorem 4.3 shows that Algorithm 1 eventually estimates the parameters with statistical error of order O(sqrt((r + k)·log(120nk/δ)/n)). This rate matches the information-theoretic lower bound for one-hidden-layer convolutional neural networks with linear activation functions. Note that our result holds for general activation functions. Matching the lower bound of the linear case implies the optimality of our algorithm. Compared with two recent results, namely the Convotron algorithm proposed by Goel et al. [18] and the Double Convotron algorithm proposed by Du and Goel [10], which work for ReLU activation and generic symmetric input distributions, our theoretical guarantee for Algorithm 1 gives a tighter sample complexity for more general activation functions, but requires the data inputs to be Gaussian. We remark that, if restricted to the ReLU activation function, our analysis can be extended to generic symmetric input distributions as well, and still provides tight sample complexity.

Remark 4.6. A recent result by Du et al. [13] discussed a speed-up in convergence when training non-overlapping CNNs with the ReLU activation function. This phenomenon also exists for Algorithm 1. To see this, first note that in Theorem 4.3 the convergence rates of w^t and v^t are essentially determined by γ_1 = (1 + αρ)^{−1/2}. For an appropriately chosen α, ρ being too small (i.e., w*^T w^0 or v*^T v^0 being too small, by Lemma 4.1) is the only possible cause of slow convergence. Now, by the iterative nature of Algorithm 1, for any T_1 > 0 we can analyze the convergence behavior after T_1 by treating w^{T_1} and v^{T_1} as a new initialization and applying Theorem 4.3 again. By Theorem 4.3, even if the initialization gives small w*^T w^0 and v*^T v^0, after a certain number of iterations w*^T w^t and v*^T v^t become much larger, and the convergence afterwards is much faster. This phenomenon is comparable to the two-phase convergence result of Du et al. [13]. However, while Du et al. 
[13] only show the convergence of the second-layer parameters in phase II of their algorithm, our result shows that the linear convergence of Algorithm 1 starts at the first iteration.

4.3 Initialization

To complete the theoretical analysis of Algorithm 1, it remains to show that the initialization condition (4.1) can be achieved with practical algorithms. The following theorem is inspired by a similar method proposed by Du et al. [13]. It gives a simple random initialization method that satisfies (4.1).

Theorem 4.7. Let w ∈ R^r and v ∈ R^k be vectors generated by distributions P_w and P_v with supports S^{r−1} and B(0, k^{−1/2}|1^T v*|) respectively. Then there exists (w^0, v^0) ∈ {(w, v), (−w, v), (w, −v), (−w, −v)} that satisfies (4.1).

Remark 4.8. The proof of Theorem 4.7 is fairly straightforward: the vector v generated by the proposed initialization method in fact satisfies κ²(1^T v*)·1^T(v − v*) ≤ 0. Moreover, it is worth noting that for activation functions with κ = E_{z∼N(0,1)}[σ(z)] = 0, the initialization condition κ²(1^T v*)·1^T(v^0 − v*) ≤ ρ is automatically satisfied. Therefore, for any vectors w ∈ S^{r−1} and v ∈ R^k, one of (w, v), (−w, v), (w, −v), (−w, −v) satisfies the initialization condition, making initialization for Algorithm 1 extremely easy.

5 Proof of the Main Theory

In this section we give the proof of Theorem 4.3. 
The proof consists of three steps: (i) prove uniform concentration inequalities for the approximate gradients, (ii) give recursive upper bounds for ||w^t − w*||_2 and ||v^t − v*||_2, and (iii) derive the final convergence results (4.3) and (4.4).

We first analyze how well the approximate gradients g_w^t and g_v^t concentrate around their expectations. Instead of using the classic analysis on g_w^t and g_v^t conditioning on all previous iterations {w^s, v^s}_{s=1}^t, we consider uniform concentration over a parameter set W_0 × V_0 defined as follows:

    W_0 := S^{r−1} = {w : ||w||_2 = 1},  V_0 := {v : ||v − v*||_2 ≤ D, |κ 1^T(v − v*)| ≤ M}.

Define

    g_w(w, v) = −(1/n) Σ_{i=1}^n [ y_i − Σ_{j=1}^k v_j σ(w^T P_j x_i) ] · Σ_{j'=1}^k ξ^{−1} v_{j'} P_{j'} x_i,
    g_v(w, v) = −(1/n) Σ(w)^T [y − Σ(w)v],
    ḡ_w(w, v) = ||v||_2² w − (v*^T v) w*,
    ḡ_v(w, v) = (ΔI + κ²11^T)v − [φ(w, w*)I + κ²11^T]v*.

The following claim follows by direct calculation.

Claim 5.1. For any fixed w, v, it holds that E[g_w(w, v)] = ḡ_w(w, v) and E[g_v(w, v)] = ḡ_v(w, v), where the expectation is taken over the randomness of the data.

Our goal is to bound sup_{(w,v)∈W_0×V_0} ||g_w(w, v) − ḡ_w(w, v)||_2 and sup_{(w,v)∈W_0×V_0} ||g_v(w, v) − ḡ_v(w, v)||_2. A key step in proving such uniform bounds is to show the uniform Lipschitz continuity of g_w(w, v) and g_v(w, v), which is given in the following lemma.

Lemma 5.2. For any δ > 0, if n ≥ (r + k) log(324/δ), then with probability at least 1 − δ, the following inequalities hold uniformly over all w, w' ∈ W_0 and v, v' ∈ V_0:

    ||g_w(w, v) − g_w(w', v)||_2 ≤ C ξ^{−1} D_0² √k · ||w − w'||_2,    (5.1)
    ||g_w(w, v) − g_w(w, v')||_2 ≤ C ξ^{−1} (ν + D_0 L √k) · ||v − v'||_2,    (5.2)
    ||g_v(w, v) − g_v(w', v)||_2 ≤ C (ν + D_0 L √k) √k · ||w − w'||_2,    (5.3)
    ||g_v(w, v) − g_v(w, v')||_2 ≤ C L²k · ||v − v'||_2,    (5.4)

where C is an absolute constant.

If g_w and g_v were gradients of some objective function f, then Lemma 5.2 would essentially prove the uniform smoothness of f. However, in our algorithm g_w is not an exact gradient, and therefore the results are stated in the form of Lipschitz continuity of g_w. Lemma 5.2 enables us to use a covering number argument together with point-wise concentration inequalities to prove uniform concentration, which is given as Lemma 5.3.

Lemma 5.3. 
Assume that n ≥ (r + k) log(972/δ), and

    ξ^{−1} D_0 (D_0Γ + M + ν) sqrt(n(r + k) log(90nk/δ)) ≥ D_0(D_0 + 1) ∨ ξ^{−1}(ν + D_0² + D_0 L),
    (Γ + κ√k)(D_0Γ + M + ν) sqrt(n(r + k) log(90nk/δ)) ≥ (Δ + κ + D_0 L) ∨ (ν + D_0 L + L²).

Then with probability at least 1 − δ we have

    sup_{(w,v)∈W_0×V_0} ||g_w(w, v) − ḡ_w(w, v)||_2 ≤ η_w,    (5.5)
    sup_{(w,v)∈W_0×V_0} ||g_v(w, v) − ḡ_v(w, v)||_2 ≤ η_v,    (5.6)

where η_w and η_v are defined in (4.5) and (4.6) respectively with large enough constants C and C'.

We now proceed to study the recursive properties of {w^t} and {v^t}. Define

    W := {w : ||w||_2 = 1, w*^T w ≥ w*^T w^0/2},
    V := {v : ||v − v*||_2 ≤ D, |κ 1^T(v − v*)| ≤ M, v*^T v ≥ ρ, κ²(1^T v*)·1^T(v − v*) ≤ ρ}.

Then clearly W × V ⊆ W_0 × V_0, and therefore the results of Lemma 5.3 hold for (w, v) ∈ W × V.

Lemma 5.4. Suppose that (5.5) and (5.6) hold. Under the assumptions of Theorem 4.3, if (w^t, v^t) ∈ W × V in Algorithm 1, then

    ||w^{t+1} − w*||_2 − 8ρ^{−1}(1 + αρ)η_w ≤ (1 + αρ)^{−1/2} [||w^t − w*||_2 − 8ρ^{−1}(1 + αρ)η_w],    (5.7)
    |1^T(v^{t+1} − v*)| ≤ [1 − α(Δ + κ²k)]|1^T(v^t − v*)| + αL||w^t − w*||_2 |1^T v*| + α√k η_v,    (5.8)
    ||v^{t+1} − v*||_2² ≤ (1 − αΔ + 4α²Δ²)||v^t − v*||_2² + (αL²/Δ + 4α²L²)||v*||_2² ||w^t − w*||_2² + 4α²κ⁴k[1^T(v^t − v*)]² + (2α/Δ + 4α²)η_v²,    (5.9)
    (w^{t+1}, v^{t+1}) ∈ W × V.    (5.10)

It is not difficult to see that the results of Lemma 5.4 imply convergence of w^t and v^t up to statistical error. To obtain the final convergence result of Theorem 4.3, it suffices to rewrite the recursive bounds into explicit bounds, which is mainly tedious calculation. We therefore summarize the result in the following lemma, and defer the detailed calculation to the appendix.

Lemma 5.5. Suppose that (5.7), (5.8) and (5.9) hold for all t = 0, ..., T. Then

    ||w^t − w*||_2 ≤ γ_1^t ||w^0 − w*||_2 + 8ρ^{−1}γ_1^{−2} η_w,
    ||v^t − v*||_2 ≤ R_1 t^{3/2} (γ_1 ∨ γ_2 ∨ γ_3)^t + (R_2 + R_3|κ|√k)(η_w + η_v)

for all t = 0, ..., T, where γ_1 = (1 + αρ)^{−1/2}, γ_2 = sqrt(1 − αΔ + 4α²Δ²), γ_3 = 1 − α(Δ + κ²k), and R_1, R_2, R_3 are constants that only depend on the choice of the activation function σ(·), the ground-truth parameters (w*, v*) and the initialization (w^0, v^0).

We are now ready to present the final proof of Theorem 4.3, which is a straightforward combination of the results of Lemmas 5.4 and 5.5.

Proof of Theorem 4.3. By Lemma 5.4, as long as (w^0, v^0) ∈ W × V, (5.7), (5.8) and (5.9) hold for all t = 0, ..., T. Therefore, by Lemma 5.5 we have

    ||w^t − w*||_2 ≤ γ_1^t ||w^0 − w*||_2 + 8ρ^{−1}γ_1^{−2} η_w,
    ||v^t − v*||_2 ≤ R_1 t^{3/2} (γ_1 ∨ γ_2 ∨ γ_3)^t + (R_2 + R_3|κ|√k)(η_w + η_v)

for all t = 0, ..., T. This completes the proof of Theorem 4.3.

6 Experiments

We perform numerical experiments to back up our theoretical analysis. We test Algorithm 1 together with the initialization method given in Theorem 4.7 for ReLU, sigmoid and hyperbolic tangent networks, and compare its performance with the Double Convotron algorithm proposed by Du and Goel [10]. To give a reasonable comparison, we use a batch version of Double Convotron without the additional noises on the unit sphere, which gives the best performance for Double Convotron and makes it directly comparable with our algorithm. The detailed parameter choices are as follows:

• For all experiments, we set the number of iterations T = 100 and the sample size n = 1000.

• We tune the step size α to maximize performance. Specifically, we set α = 0.04 for ReLU, α = 0.25 for sigmoid, and α = 0.1 for hyperbolic tangent networks. 
Note that for sigmoid and hyperbolic tangent networks, an inappropriate step size can easily make the error of Double Convotron blow up.

• We generate w^* uniformly from the unit sphere, and generate v^* as a standard Gaussian vector.

• We consider two settings: (i) k = 15, r = 5, ν̃ = 0.08, and (ii) k = 30, r = 9, ν̃ = 0.04, where ν̃ is the standard deviation of the white Gaussian noise.

The random initialization is performed as follows: we generate w uniformly over the unit sphere. We then generate a standard Gaussian vector v. If ‖v‖_2 ≥ k^{-1/2}|1^⊤ v^*|/2, then v is projected onto the ball B(0, k^{-1/2}|1^⊤ v^*|/2). We then run the approximate gradient descent algorithm and the Double Convotron algorithm starting from each of (w, v), (−w, v), (w, −v), (−w, −v), and present the results corresponding to the starting point that gives the smallest ‖w_T − w^*‖_2.

Figure 1 gives the experimental results in semi-log plots. We summarize the results as follows.

1. In all six cases, the approximate gradient descent algorithm eventually enters a stable phase of linear convergence that continues until the error is very small.

2. For ReLU networks, both algorithms converge. Approximate gradient descent converges more slowly than Double Convotron, but it eventually reaches a smaller statistical error, indicating a better sample complexity.

3. For sigmoid and hyperbolic tangent networks, not surprisingly, Double Convotron does not converge.
In contrast, approximate gradient descent still converges at a linear rate.

The experimental results discussed above clearly demonstrate the validity of our theoretical analysis. In Appendix F, we also present some additional experiments on non-Gaussian inputs, and demonstrate that although this setting is not the focus of our theoretical results, approximate gradient descent still has promising performance on symmetric data distributions.

Figure 1: Numerical simulation for Algorithm 1 and the Double Convotron algorithm proposed by Du and Goel [10] with different activation functions, numbers of hidden nodes and filter sizes. The results for the cases k = 15, r = 5 and k = 30, r = 9 are shown in (a)-(c) and (d)-(f) respectively. (a), (d) show the results for ReLU networks; (b), (e) give the results for sigmoid networks; and finally the results for the hyperbolic tangent activation function are in (c) and (f). All plots are semi-log plots.

7 Conclusions and Future Work

We propose a new algorithm, namely approximate gradient descent, for training CNNs, and show that, with high probability, the proposed algorithm with random initialization can recover the ground-truth parameters up to statistical precision at a linear convergence rate. Compared with previous results, our result applies to a class of monotonic and Lipschitz continuous activation functions including ReLU, Leaky ReLU, Sigmoid, Softplus, etc. Moreover, our algorithm achieves better sample complexity in its dependency on the number of hidden nodes and the filter size. In particular, our result matches the information-theoretic lower bound for learning one-hidden-layer CNNs with linear activation functions, suggesting that our sample complexity is tight. Numerical experiments on synthetic data corroborate our theory.
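To make the experimental protocol of Section 6 concrete, the random initialization and sign-flip restart procedure can be sketched as follows. This is a minimal NumPy sketch, not the paper's code: `train` is a hypothetical stand-in for Algorithm 1 (approximate gradient descent), which is not reproduced here.

```python
import numpy as np

def random_init(r, k, v_star, rng):
    # w: drawn uniformly from the unit sphere in R^r
    w = rng.standard_normal(r)
    w /= np.linalg.norm(w)
    # v: standard Gaussian, projected onto the ball
    # B(0, k^{-1/2}|1^T v*|/2) if it falls outside
    v = rng.standard_normal(k)
    radius = np.abs(np.sum(v_star)) / (2.0 * np.sqrt(k))
    if np.linalg.norm(v) >= radius:
        v *= radius / np.linalg.norm(v)
    return w, v

def best_of_sign_flips(train, w, v, w_star):
    # run `train` (stand-in for Algorithm 1) from each of the four
    # sign-flipped starting points and keep the run whose final
    # iterate w_T is closest to w*
    starts = [(w, v), (-w, v), (w, -v), (-w, -v)]
    finals = [train(w0, v0) for w0, v0 in starts]
    return min(finals, key=lambda wv: np.linalg.norm(wv[0] - w_star))
```

The four sign-flipped restarts compensate for the sign ambiguity of the random initialization; only the best of the four runs is reported.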
Our algorithms and theory can be extended to learning one-hidden-layer CNNs with overlapping filters; we leave this as future work. It is also of great importance to extend the current results to deeper CNNs with multiple convolution filters.

Acknowledgement

We thank the anonymous reviewers and area chair for their helpful comments. This research was sponsored in part by the National Science Foundation CAREER Award IIS-1906169, IIS-1903202, and a Salesforce Deep Learning Research Award. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] ALLEN-ZHU, Z., LI, Y. and LIANG, Y. (2018). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.

[2] ALLEN-ZHU, Z., LI, Y. and SONG, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.

[3] ARORA, S., DU, S. S., HU, W., LI, Z. and WANG, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks.
arXiv preprint arXiv:1901.08584.

[4] BAUM, E. B. (1990). A polynomial time algorithm that learns two hidden unit nets. Neural Computation 2 510–522.

[5] BLUM, A. and RIVEST, R. L. (1989). Training a 3-node neural network is NP-complete. In Advances in neural information processing systems.

[6] BRUTZKUS, A. and GLOBERSON, A. (2017). Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966.

[7] CAO, Y. and GU, Q. (2019). A generalization theory of gradient descent for learning over-parameterized deep relu networks. arXiv preprint arXiv:1902.01384.

[8] COHEN, N. and SHASHUA, A. (2016). Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning.

[9] CUADRAS, C. M. (2002). On the covariance between functions.
Journal of Multivariate Analysis 81 19–27.

[10] DU, S. S. and GOEL, S. (2018). Improved learning of one-hidden-layer convolutional neural networks with overlaps. arXiv preprint arXiv:1805.07798.

[11] DU, S. S., LEE, J. D., LI, H., WANG, L. and ZHAI, X. (2018). Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804.

[12] DU, S. S., LEE, J. D. and TIAN, Y. (2017). When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129.

[13] DU, S. S., LEE, J. D., TIAN, Y., POCZOS, B. and SINGH, A. (2017). Gradient descent learns one-hidden-layer cnn: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779.

[14] DU, S. S., WANG, Y., ZHAI, X., BALAKRISHNAN, S., SALAKHUTDINOV, R. and SINGH, A. (2018). How many samples are needed to learn a convolutional neural network? arXiv preprint arXiv:1805.07883.

[15] DU, S. S., ZHAI, X., POCZOS, B. and SINGH, A. (2018). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.

[16] FU, H., CHI, Y. and LIANG, Y. (2018). Local geometry of one-hidden-layer neural networks for logistic regression. arXiv preprint arXiv:1802.06463.

[17] GE, R., LEE, J. D. and MA, T. (2017). Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501.

[18] GOEL, S., KLIVANS, A. and MEKA, R. (2018). Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547.

[19] GUNASEKAR, S., LEE, J., SOUDRY, D. and SREBRO, N. (2018). Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468.

[20] HINTON, G., DENG, L., YU, D., DAHL, G. E., MOHAMED, A.-R., JAITLY, N., SENIOR, A., VANHOUCKE, V., NGUYEN, P., SAINATH, T. N. ET AL. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.
IEEE Signal Processing Magazine 29 82–97.

[21] HOEFFDING, W. (1940). Maßstabinvariante Korrelationstheorie. Schriften des Mathematischen Instituts und des Instituts für Angewandte Mathematik der Universität Berlin 5 181–233. (Translated in Fisher, N. I. and Sen, P. K. (1994). The Collected Works of Wassily Hoeffding. Springer, New York.)

[22] HORNIK, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks 4 251–257.

[23] JANZAMIN, M., SEDGHI, H. and ANANDKUMAR, A. (2015). Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473.

[24] KLIVANS, A. R., LONG, P. M. and TANG, A. K. (2009). Baum's algorithm learns intersections of halfspaces with respect to log-concave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, 588–600.

[25] KRIZHEVSKY, A., SUTSKEVER, I. and HINTON, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.

[26] LI, Y. and LIANG, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204.

[27] LI, Y. and YUAN, Y. (2017). Convergence analysis of two-layer neural networks with relu activation. arXiv preprint arXiv:1705.09886.

[28] MEI, S., BAI, Y. and MONTANARI, A. (2016). The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534.

[29] MEI, S., MONTANARI, A. and NGUYEN, P.-M. (2018). A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561.

[30] NGUYEN, Q. and HEIN, M. (2017). The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928.

[31] SEN, P. K. (1994).
The impact of Wassily Hoeffding's research on nonparametrics. In The Collected Works of Wassily Hoeffding. Springer, 29–55.

[32] SHAMIR, O. (2016). Distribution-specific hardness of learning neural networks. arXiv preprint arXiv:1609.01037.

[33] SILVER, D., HUANG, A., MADDISON, C. J., GUEZ, A., SIFRE, L., VAN DEN DRIESSCHE, G., SCHRITTWIESER, J., ANTONOGLOU, I., PANNEERSHELVAM, V., LANCTOT, M. ET AL. (2016). Mastering the game of go with deep neural networks and tree search. Nature 529 484–489.

[34] SLEPIAN, D. (1962). The one-sided barrier problem for gaussian noise. Bell Labs Technical Journal 41 463–501.

[35] TIAN, Y. (2016). Symmetry-breaking convergence analysis of certain two-layered neural networks with relu nonlinearity.

[36] VERSHYNIN, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

[37] YI, X. and CARAMANIS, C. (2015). Regularized em algorithms: A unified framework and statistical guarantees. In Advances in Neural Information Processing Systems.

[38] ZHANG, X., YU, Y., WANG, L. and GU, Q. (2018). Learning one-hidden-layer relu networks via gradient descent. arXiv preprint arXiv:1806.07808.

[39] ZHANG, Y., LIANG, P. and WAINWRIGHT, M. J. (2016). Convexified convolutional neural networks. arXiv preprint arXiv:1609.01000.

[40] ZHONG, K., SONG, Z. and DHILLON, I. S. (2017). Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440.

[41] ZHONG, K., SONG, Z., JAIN, P., BARTLETT, P. L. and DHILLON, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175.

[42] ZOU, D., CAO, Y., ZHOU, D. and GU, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep relu networks.
arXiv preprint arXiv:1811.08888.