{"title": "On the Power and Limitations of Random Features for Understanding Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6598, "page_last": 6608, "abstract": "Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these \\emph{explicitly} leads to the well-known approach of learning with random features (e.g. \\citep{rahimi2008random,rahimi2009weighted}). In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we formalize the link between existing results and random features, and argue that despite the impressive positive results, random feature approaches are also inherently limited in what they can explain. In particular, we prove that random features cannot be used to learn \\emph{even a single ReLU neuron} (over standard Gaussian inputs in $\\reals^d$ and $\\text{poly}(d)$ weights), unless the network size (or magnitude of its weights) is exponentially large in $d$. Since a single neuron \\emph{is} known to be learnable with gradient-based methods, we conclude that we are still far from a satisfying general explanation for the empirical success of neural networks. 
For completeness we also provide a simple self-contained proof, using a random features technique, that one-hidden-layer neural networks can learn low-degree polynomials.", "full_text": "On the Power and Limitations of Random Features\n\nfor Understanding Neural Networks\n\nGilad Yehudai\n\nOhad Shamir\n\nWeizmann Institute of Science\n\n{gilad.yehudai,ohad.shamir}@weizmann.ac.il\n\nAbstract\n\nRecently, a spate of papers have provided positive theoretical results for training\nover-parameterized neural networks (where the network size is larger than what\nis needed to achieve low error). The key insight is that with suf\ufb01cient over-\nparameterization, gradient-based methods will implicitly leave some components\nof the network relatively unchanged, so the optimization dynamics will behave as if\nthose components are essentially \ufb01xed at their initial random values. In fact, \ufb01xing\nthese explicitly leads to the well-known approach of learning with random features\n(e.g. [27, 29]). In other words, these techniques imply that we can successfully\nlearn with neural networks, whenever we can successfully learn with random\nfeatures. In this paper, we formalize the link between existing results and random\nfeatures, and argue that despite the impressive positive results, random feature\napproaches are also inherently limited in what they can explain. In particular, we\nprove that random features cannot be used to learn even a single ReLU neuron\n(over standard Gaussian inputs in Rd and poly(d) weights), unless the network size\n(or magnitude of its weights) is exponentially large in d. Since a single neuron\nis known to be learnable with gradient-based methods, we conclude that we are\nstill far from a satisfying general explanation for the empirical success of neural\nnetworks. 
For completeness we also provide a simple self-contained proof, using\na random features technique, that one-hidden-layer neural networks can learn\nlow-degree polynomials.\n\n1\n\nIntroduction\n\nDeep learning, in the form of arti\ufb01cial neural networks, has seen a dramatic resurgence in popularity\nin recent years. This is mainly due to impressive performance gains on various dif\ufb01cult learning\nproblems, in \ufb01elds such as computer vision, natural language processing and many others. Despite the\npractical success of neural networks, our theoretical understanding of them is still very incomplete.\nA key aspect of modern networks is that they tend to be very large, usually with many more parameters\nthan the size of the training data: In fact, so many that in principle, they can simply memorize all\nthe training examples (as shown in the in\ufb02uential work of Zhang et al. [40]). The fact that such\nhuge, over-parameterized networks are still able to learn and generalize is one of the big mysteries\nconcerning deep learning. A current leading hypothesis is that over-parameterization makes the\noptimization landscape more benign, and encourages standard gradient-based training methods to \ufb01nd\nweight con\ufb01gurations that \ufb01t the training data as well as generalize (even though there might be many\nother con\ufb01gurations which \ufb01t the training data without any generalization). However, pinpointing the\nexact mechanism by which over-parameterization helps is still an open problem.\nRecently, a spate of papers (such as [4, 11, 14, 2, 23, 15, 9, 3, 1]) provided positive results for training\nand learning with over-parameterized neural networks. 
Although they differ in details, they are all based on the following striking observation: When the networks are sufficiently large, standard gradient-based methods change certain components of the network (such as the weights of a certain layer) very slowly, so that if we run these methods for a bounded number of iterations, they might as well be fixed.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To give a concrete example, consider one-hidden-layer neural networks, which can be written as a linear combination of $r$ neurons

$N(x) = \sum_{i=1}^{r} u_i \sigma(\langle w_i, x\rangle + b_i),$   (1)

using weights $\{u_i, w_i, b_i\}_{i=1}^r$ and an activation function $\sigma$. When $r$ is sufficiently large, and with standard random initializations, it can be shown that gradient descent will leave the weights $w_i, b_i$ in the first layer nearly unchanged (at least initially). As a result, the dynamics of gradient descent will resemble those where $\{w_i, b_i\}$ are fixed at random initial values, namely, where we learn a linear predictor (parameterized by $u_1, \ldots, u_r$) over a set of $r$ random features of the form $x \mapsto \sigma(\langle w_i, x\rangle + b_i)$ (for some random choice of $w_i, b_i$). For such linear predictors, it is not difficult to show that they will converge quickly to an optimal predictor (over the span of the random features). This leads to learning guarantees with respect to hypothesis classes which can be captured well by such random features: For example, most papers focus (explicitly or implicitly) on multivariate polynomials with certain constraints on their degree or the magnitude of their coefficients.
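To make the random-features view above concrete, here is a minimal numpy sketch (ours, not from the paper): the first-layer weights $w_i, b_i$ of Eq. (1) are sampled once and frozen, and only the output weights $u_i$ are trained, which reduces to linear least squares over the features $\sigma(\langle w_i, x\rangle + b_i)$. All names, sizes, and the target function are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 5, 200, 1000

# Random first layer of Eq. (1), sampled once and then frozen.
W = rng.normal(size=(r, d)) / np.sqrt(d)
b = rng.normal(size=r)

def features(X):
    # x -> sigma(<w_i, x> + b_i) with sigma = ReLU
    return np.maximum(X @ W.T + b, 0.0)

# An arbitrary target to fit (a low-degree polynomial, for illustration).
X = rng.normal(size=(n, d))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2]

# Training only the output weights u_1, ..., u_r is linear least squares.
u, *_ = np.linalg.lstsq(features(X), y, rcond=None)
train_mse = np.mean((features(X) @ u - y) ** 2)
```

With enough features, the span of the frozen random features captures the target well; the point of this paper is to quantify when it cannot.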
We discuss these results in more detail (and demonstrate their close connection to random features) in Section 2.

Taken together, these results are a significant and insightful advance in our understanding of neural networks: They rigorously establish that sufficient over-parameterization allows us to learn complicated functions, while solving a non-convex optimization problem. However, it is important to realize that this approach can only explain learnability of hypothesis classes which can already be learned using random features. Considering the one-hidden-layer example above, this corresponds to learning linear predictors over a fixed representation (chosen obliviously and randomly at initialization). Thus, it does not capture any element of representation learning, which appears to lend much of the power of modern neural networks.

In this paper we show that there are inherent limitations on what predictors can be captured with random features, and as a result, on what can be provably learned with neural networks using the techniques described earlier. We consider features of the form $f_i : \mathbb{R}^d \to \mathbb{R}$ which are chosen (randomly or deterministically) and then fixed. The $f_i$'s can be arbitrary functions, including multilayered neural networks, as long as their norm (suitably defined) is not exponential in the input dimension. We show that using $N(x) = \sum_{i=1}^{r} u_i f_i(x)$ we cannot efficiently approximate even a single ReLU neuron:

Theorem 1.1 (Informal version of Theorem 4.8). Let $\mathcal{F}$ be a family of functions on $\mathbb{R}^d$, where for every $f \in \mathcal{F}$, $\mathbb{E}_{x\sim N(0,I)}[f(x)^2]$ is less than exponential in the dimension $d$, and let $\mathcal{D}$ be a distribution over tuples $(f_1, \ldots, f_r)$ of functions from $\mathcal{F}$. Then there exists a weight vector $w^* \in \mathbb{R}^d$ with $\|w^*\| = d^2$ and a bias term $b^* \in \mathbb{R}$ such that w.h.p. over the choice of functions from $\mathcal{F}$, if

$\mathbb{E}_{x\sim N(0,I)}\left[\left(N(x) - [\langle w^*, x\rangle + b^*]_+\right)^2\right] \le \frac{1}{50}$

(where $[z]_+ = \max\{0, z\}$ is the ReLU function and $x$ has a standard Gaussian distribution), then

$r \cdot \max_i |u_i| \ge \exp(\Omega(d)).$

In other words, either the number of neurons $r$ or the magnitude of the weights (or both) must be exponential in the dimension $d$, which generally implies exponential time to learn the target function (see Remark 4.5). Moreover, if the features can be written as functions operating on a random linear transformation of the data, then the same result holds for any choice of $w^*$. In more detail:

Theorem 1.2 (Informal version of Theorem 4.2). Let $\mathcal{F}$ be a family of functions where each $f \in \mathcal{F}$ can be written as $x \mapsto f(Wx)$, where $W$ is a random matrix whose rows are sampled uniformly at random from the unit sphere. Then the result of Theorem 1.1 holds for any $w^* \in \mathbb{R}^d$ with $\|w^*\| = d^2$.

Both theorems apply to a large family of functions; particular examples include two-layer neural networks (Theorem 1.2) and "linearized" neural tangent kernels (Theorem 1.1; also see Jacot et al. [20]). These results imply that the random features approach cannot fully explain polynomial-time learnability of neural networks, even with respect to data generated by an extremely simple neural network, composed of a single neuron. This is despite the fact that single ReLU neurons are easily learnable with gradient-based methods (e.g., [26], [34]; also see Section 4 for further details).
The point we want to make here is that the random feature approach, as a theory for explaining the success of neural networks, cannot explain even the fact that single neurons are learnable.

For completeness we also provide a simple, self-contained analysis, showing how over-parameterized, one-hidden-layer networks can provably learn polynomials with bounded degrees and coefficients, using standard stochastic gradient descent with standard initialization.

We emphasize that there is no contradiction between our positive and negative results: In the positive result on learning polynomials, the required size of the network is exponential in the degree of the polynomial, and low-degree polynomials cannot express even a single ReLU neuron if its weights are large enough.

Overall, we argue that although the random feature approach captures important aspects of training neural networks, it is by no means the whole story, and we are still quite far from a satisfying general explanation for the empirical success of neural networks.

Related Work

The recent literature on the theory of deep learning is too large to be thoroughly described here. Instead, we survey here some of the works most directly relevant to the themes of our paper. In Section 2, we provide a more technical explanation of the connection of recent results to random features.

The Power of Over-Parameterization. The fact that over-parameterized networks are easier to train was empirically observed, for example, in [25], and was used in several contexts to show positive results for learning and training neural networks. For example, it is known that adding more neurons makes the optimization landscape more benign (e.g., [30, 36, 37, 31, 10, 35]), or allows networks to learn in various settings (e.g., besides the papers mentioned in the introduction, [8, 24, 7, 39, 13]).

Random Features. The technique of random features was proposed and formally analyzed in [27, 28, 29], originally as a computationally-efficient alternative to kernel methods (although as a heuristic, it can be traced back to the "random connections" feature of Rosenblatt's Perceptron machine in the 1950's). These involve learning predictors of the form $x \mapsto \sum_{i=1}^{r} u_i \psi_i(x)$, where the $\psi_i$ are random non-linear functions. The training involves only tuning of the $u_i$ weights. Thus, the learning problem is as computationally easy as training linear predictors, but with the advantage that the resulting predictor is non-linear, and in fact, if $r$ is large enough, can capture arbitrarily complex functions. The power of random features to express certain classes of functions has been studied in past years (for example [5, 28, 21, 38]). However, in our paper we also consider negative rather than positive results for such features. [5] also discusses the limitations of approximating functions with a bounded number of such features, but in a different setting than ours (worst-case approximation of a large function class using a fixed set of features, rather than inapproximability of a fixed target function, and not in the context of single neurons). Less directly related, [41] studied learning neural networks using kernel methods, which can be seen as learning a linear predictor over a fixed non-linear mapping. However, the algorithm is not based on training neural networks with standard gradient-based methods. In a very recent work (and following the initial dissemination of our paper), Ghorbani et al.
[17] studied the representation capabilities of random features, and showed that in high dimensions random features are not good at fitting high-degree polynomials.

Notation

Denote by $\mathcal{U}([a,b]^d)$ the $d$-dimensional uniform distribution over the rectangle $[a,b]^d$, and by $N(0,\Sigma)$ the multivariate Gaussian distribution with covariance matrix $\Sigma$. For $T \in \mathbb{N}$ let $[T] = \{1, 2, \ldots, T\}$, and for a vector $w \in \mathbb{R}^d$ we denote by $\|w\|$ the $L_2$ norm. We denote the ReLU function by $[x]_+ = \max\{0, x\}$.

2 Analysis of Neural Networks as Random Features

In many previous works, a key element in the analysis of neural networks is a reduction from neural networks to random features. In this reduction, usually by choosing an appropriate learning rate and initialization, gradient descent close to the initialization optimizes the neural network similarly to the way it would optimize random features. Thus, it is enough to analyze the optimization process of random features, and deduce that if random features can achieve good performance, the same holds for neural networks. Here we survey some of these works and show how they can actually be viewed as random features.

2.1 Optimization with Coupling, Fixing the Output Layer

One approach is to fix the output layer and optimize only the inner layers. Most works that use this method (e.g. [23], [15], [3], [9], [2], [1]) also use the method of "coupling" and the popular ReLU activation. This method uses the following observation: a ReLU neuron can be viewed as a linear predictor multiplied by a threshold function, that is, $[\langle w, x\rangle]_+ = \langle w, x\rangle \mathbb{1}_{\langle w, x\rangle \ge 0}$. The coupling method informally states that after running gradient descent with an appropriate learning rate for a limited number of iterations, the number of neurons for which the sign of $\langle w, x\rangle$ (for $x$ in the data) changes is small.
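As a minimal numerical illustration of the coupling observation (our own sketch, with arbitrary sizes, not code from the paper): at initialization the ReLU network and its "frozen-gate" linearization coincide exactly, since $[z]_+ = z\,\mathbb{1}_{z \ge 0}$, and after a small weight perturbation (standing in for a few gradient steps) only a small fraction of gates flip.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n = 10, 50, 200

W0 = rng.normal(size=(r, d)) / np.sqrt(d)   # w_i^(0): gate directions at init
X = rng.normal(size=(n, d))
gates = X @ W0.T >= 0                       # 1_{<w_i^(0), x> >= 0}

u = rng.normal(size=r) / np.sqrt(r)

def relu_net(W):
    return np.maximum(X @ W.T, 0.0) @ u

def coupled_net(W):
    # sum_i u_i <w_i, x> 1_{<w_i^(0), x> >= 0}: linear in the w_i
    return ((X @ W.T) * gates) @ u

# At W = W0 the two networks coincide exactly, since [z]_+ = z * 1_{z >= 0}.
out_true, out_coupled = relu_net(W0), coupled_net(W0)

# After a small perturbation of the weights, few gates flip sign.
W1 = W0 + 1e-3 * rng.normal(size=(r, d))
flipped = np.mean((X @ W1.T >= 0) != gates)
```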
Thus, it is enough to analyze a linear network over random features of the form $x \mapsto \langle w, x\rangle \mathbb{1}_{\langle w^{(0)}, x\rangle \ge 0}$, where $w^{(0)}$ is randomly chosen.

For example, a one-hidden-layer neural network where the activation $\sigma$ is the ReLU function can be written as

$\sum_{i=1}^{r} u_i^{(t)} \sigma(\langle w_i^{(t)}, x\rangle) = \sum_{i=1}^{r} u_i^{(t)} \langle w_i^{(t)}, x\rangle \mathbb{1}_{\langle w_i^{(t)}, x\rangle \ge 0}.$

Using the coupling method, after running gradient descent, the number of neurons that change sign (i.e., for which the sign of $\langle w_i^{(t)}, x\rangle$ changes) is small. As a result, using the homogeneity of the ReLU function, the following network can actually be analyzed:

$\sum_{i=1}^{r} u_i^{(t)} \langle w_i^{(t)}, x\rangle \mathbb{1}_{\langle w_i^{(0)}, x\rangle \ge 0} = \sum_{i=1}^{r} \langle u_i^{(t)} \cdot w_i^{(t)}, x\rangle \mathbb{1}_{\langle w_i^{(0)}, x\rangle \ge 0},$

where the $w_i^{(0)}$ are randomly chosen. This is just analyzing a linear predictor with random features of the form $x \mapsto x_j \mathbb{1}_{\langle w_i^{(0)}, x\rangle \ge 0}$. Note that the homogeneity of the ReLU function is used in order to show that fixing the output layer does not change the network's expressiveness. This is not true in terms of optimization, as optimizing both the inner layers and the output layer may help the network converge faster, and find a predictor which has better generalization properties. Thus, the challenge in this approach is to find functions or distributions that can be approximated with this kind of random features network, using a polynomial number of features.

2.2 Optimization on all the Layers

A second approach in the literature (e.g. Andoni et al. [4], Daniely et al. [12], Du et al. [14]) is to perform optimization on all the layers of the network, and to choose a "good" learning rate and bound the number of iterations such that the inner layers stay close to their initialization. For example, in the setting of a one-hidden-layer network, for every $\epsilon > 0$, a learning rate $\eta$ and number of iterations $T$ are chosen, such that after running gradient descent with these parameters, there is an iteration $1 \le t \le T$ such that

$\left\|U^{(t)}\sigma(W^{(t)}x) - U^{(t)}\sigma(W^{(0)}x)\right\| \le \epsilon.$

Hence, it is enough to analyze a linear predictor over a set of random features:

$U^{(t)}\sigma\left(W^{(0)}x\right) = \sum_{i=1}^{r} u_i^{(t)} \sigma\left(\langle w_i^{(0)}, x\rangle\right),$

where $\sigma$ is not necessarily the ReLU function. Again, the difficulty here is finding the functions that can be approximated in this form, where $r$ (the number of neurons) is only polynomial in the relevant parameters.

3 Over-Parameterized Neural Networks Learn Polynomials

For completeness we provide a simple, self-contained analysis, showing how over-parameterized, one-hidden-layer networks can provably learn polynomials with bounded degrees and coefficients, using standard stochastic gradient descent with standard initialization. In more detail:

Theorem 3.1 (Informal). Given any distribution $\mathcal{D}$ over labeled data $(x, y)$, where $x \in \mathbb{R}^d$, $y \in \{-1, +1\}$, any $\epsilon > 0$, and almost any multivariate polynomial $P(x)$ with degree at most $k$ and coefficients of magnitude at most $\alpha$, if we take a one-hidden-layer neural network $N(x)$ with analytic, $L$-Lipschitz activation functions, and with at least

$r > \mathrm{poly}\left(\frac{1}{\epsilon}, \log\left(\frac{1}{\delta}\right), d^{k^2}, \alpha^k, L\right)$

neurons, and run $\mathrm{poly}(r)$ many iterations of stochastic gradient descent on i.i.d. examples, then with probability at least $1 - \delta$ over the random initialization, it holds that

$\mathbb{E}[L_{\mathcal{D}}(N(x))] \le L_{\mathcal{D}}(P(x)) + \epsilon,$

where $L_{\mathcal{D}}$ is the expected hinge loss and the expectation is over the random choice of examples.

For a formal statement and full proof see Appendix C. We emphasize that although our analysis improves on previous ones in certain aspects (discussed in more detail in Appendix A), it is not fundamentally novel: Our goal is mostly to present a transparent and self-contained result using the approach developed in previous papers, focusing on clarity rather than generality. In comparison, some of the related results assume that the output layer is fixed (although in practice all layers are optimized); some focus on training error rather than population risk; some do not quantitatively characterize the class of polynomials learned by the network; and some consider different network architectures or optimization methods. For an in-depth review of previous works and how they differ from the above result see Appendix A.

4 Limitations of Random Features

Having discussed and shown positive results for learning using (essentially) random features, we turn to discuss the limitations of this approach.

Concretely, we will consider in this section data $(x, y)$, where $x \in \mathbb{R}^d$ is drawn from a standard Gaussian on $\mathbb{R}^d$, and there exists some single ground-truth neuron which generates the target values $y$: Namely, $y = \sigma(\langle w^*, x\rangle + b^*)$ for some fixed $w^*, b^*$. We also consider the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$, so the expected loss we wish to minimize takes the form

$\mathbb{E}_x\left[\left(\sum_{i=1}^{r} u_i f_i(x) - \sigma(\langle w^*, x\rangle + b^*)\right)^2\right]$   (2)

where the $f_i$ are the random features.
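The Gaussian expectation in Eq. (2) can be estimated by Monte Carlo sampling. The toy sketch below (our own; the dimensions, bias, and the single random feature are illustrative) evaluates the objective for a target ReLU neuron and one obliviously chosen random feature, using the closed-form optimal scalar coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 20, 100_000

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)          # a unit-norm target, for this toy check
b_star = 0.1

X = rng.normal(size=(n, d))               # x ~ N(0, I_d)
target = np.maximum(X @ w_star + b_star, 0.0)

# A single random feature f_1(x) = [<w, x> + b]_+, chosen obliviously of w*.
w = rng.normal(size=d)
w /= np.linalg.norm(w)
f1 = np.maximum(X @ w + 0.1, 0.0)

u1 = (f1 @ target) / (f1 @ f1)            # best scalar coefficient, in closed form
loss = np.mean((u1 * f1 - target) ** 2)   # Monte Carlo estimate of Eq. (2)
```

Since $u_1$ is the least-squares optimal coefficient, the estimated loss is at most the squared norm of the target itself; the theorems below quantify how much worse than learning $w^*$ directly this can be.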
Importantly, when $r = 1$, $\sigma$ is the ReLU function, and $f_1(x) = \sigma(\langle w, x\rangle + b)$ (that is, we train a single neuron to learn a single target neuron), this problem is quite tractable with standard gradient-based methods (see, e.g., [26], [34]). In this section, we ask whether this positive result, that single target neurons can be learned, can be explained by the random features approach. Specifically, we consider the case where the functions $f_i$ are arbitrary functions chosen obliviously of the target neuron (e.g. multilayered neural networks at a standard random initialization), and ask what conditions on $r$ and $u_i$ are required to minimize Eq. (2). Our main results (Theorem 4.2 and Theorem 4.8) show that either one of them has to be exponential in the dimension $d$, as long as the sizes of $w^*, b^*$ are allowed to be polynomial in $d$. Since networks with exponentially-many neurons or exponentially-sized weights are generally not efficiently trainable, we conclude that an approach based on random features cannot explain why learning single neurons is tractable in practice. In Theorem 4.2, we show the result for any choice of $w^*$ with some fixed norm, but require the feature functions to have a certain structure (which is satisfied by neural networks). In Theorem 4.8, we drop this requirement, but then the result only holds for a particular $w^*$.

To simplify the notation in this section, we consider functions of $x$ as elements of the $L^2(\mathbb{R}^d)$ space weighted by a standard Gaussian measure. Specifically, for functions $f, g : \mathbb{R}^d \to \mathbb{R}$ we denote

$\|f(\cdot)\|^2 := \mathbb{E}_{x\sim N(0,I)}\left[f^2(x)\right] = c_d \int_{\mathbb{R}^d} f^2(x)\, e^{-\frac{\|x\|^2}{2}} dx,$

$\langle f(\cdot), g(\cdot)\rangle := \mathbb{E}_{x\sim N(0,I)}[f(x)g(x)] = c_d \int_{\mathbb{R}^d} f(x)g(x)\, e^{-\frac{\|x\|^2}{2}} dx,$

where $c_d = \left(\frac{1}{\sqrt{2\pi}}\right)^d$ is a normalization term. For example, Eq. (2) can also be written as $\left\|\sum_{i=1}^{r} u_i f_i(\cdot) - \sigma(\langle w^*, \cdot\rangle + b^*)\right\|^2$.

4.1 Warm up: Linear predictors

Before stating our main results, let us consider a particularly simple case, where $\sigma$ is the identity, and our goal is to learn a linear predictor $x \mapsto \langle w^*, x\rangle$ with $\|w^*\| = 1$. We will show that already in this case, there is a significant cost to pay for using random features. The main result in the next subsection can be seen as an elaboration of this idea.

In this setting, finding a good linear predictor, namely minimizing $\|\langle w, \cdot\rangle - \langle w^*, \cdot\rangle\|$, is easy: It is a convex optimization problem, and is easily solved using standard gradient-based methods. Suppose now that we are given random features $w_1, \ldots, w_r \sim N\left(0, \frac{1}{d}I_d\right)$ and want to find $u_1, \ldots, u_r \in \mathbb{R}$ such that

$\left\|\sum_{i=1}^{r} u_i\langle w_i, \cdot\rangle - \langle w^*, \cdot\rangle\right\| \le \epsilon.$   (3)

The following proposition shows that with high probability, Eq. (3) cannot hold unless $r = \Omega(d)$. This shows that even for linear predictors, there is a price to pay for using a combination of random features, instead of learning the linear predictor directly.

Proposition 4.1. Let $w^*$ be some unit vector in $\mathbb{R}^d$, and suppose that we pick random features $w_1, \ldots, w_r$ i.i.d. from any spherically symmetric distribution in $\mathbb{R}^d$. If $r \le \frac{d}{2}$, then with probability at least $1 - \exp(-cd)$ (for some universal constant $c > 0$), for any choice of weights $u_1, \ldots$
, $u_r$, it holds that $\left\|\sum_{i=1}^{r} u_i\langle w_i, \cdot\rangle - \langle w^*, \cdot\rangle\right\|^2 \ge \frac{1}{4}$.

The full proof appears in Appendix B, but the intuition is quite simple: With random features, we are forced to learn a linear predictor in the span of $w_1, \ldots, w_r$, which is a random $r$-dimensional subspace of $\mathbb{R}^d$. Since this subspace is chosen obliviously of $w^*$, and $r \le \frac{d}{2}$, it is very likely that $w^*$ is not close to this subspace (namely, the component of $w^*$ orthogonal to this subspace is large), and therefore we cannot approximate this target linear predictor very well.

4.2 Features Based on Random Linear Transformations

Having discussed the linear case, let us return to the case of a non-linear neuron. Specifically, we will show that even a single ReLU neuron cannot be approximated by a very large class of random feature predictors, unless the number of neurons in the network is exponential in the dimension, or the coefficients of the linear combination are exponential in the dimension. In more detail:

Theorem 4.2. There exists a universal constant $c > 0$ such that the following holds. Let $d > 40$, $k \in \mathbb{N}$, and let $\mathcal{F}$ be a family of functions from $\mathbb{R}^k$ to $\mathbb{R}$. Also, let $W \in \mathbb{R}^{k\times d}$ be a random matrix whose rows are sampled uniformly at random from the unit sphere. Suppose that $f_W(x) := f(Wx)$ satisfies $\|f_W(\cdot)\| \le \exp(d/40)$ for any realization of $W$ and for all $f \in \mathcal{F}$. Then for every $w^* \in \mathbb{R}^d$ with $\|w^*\| = d^2$, there exists $b^* \in \mathbb{R}$ with $|b^*| \le 6d^3 + 1$, such that for any $f_1, \ldots, f_r \in \mathcal{F}$, w.p. $> 1 - \exp(-cd)$ over sampling $W$, if

$\mathbb{E}_{x\sim N(0,I)}\left[\left(\sum_{i=1}^{r} u_i f_i(Wx) - [\langle w^*, x\rangle + b^*]_+\right)^2\right] \le \frac{1}{50},$

then

$r \cdot \max_i |u_i| \ge \frac{1}{48d^2}\exp(cd).$

Note that the theorem allows any "random" feature which can be written as a composition of some function $f$ (chosen randomly or not) and a random linear transformation.

Example 4.3. As a special case of Theorem 4.2, consider two-layer neural networks with activation $\sigma : \mathbb{R} \to \mathbb{R}$ and without a bias term. Denote by $W_i$ the $i$-th row of $W$; then two-layer neural networks can be viewed as a linear combination of functions of the form

$f_i(Wx) = \sigma(\langle W_i, x\rangle),$

where if each $W_i$ is bounded and the norm of the activation $\sigma$ is bounded, then the norm of $f_i(Wx)$ is also bounded. This example can be extended to multi-layer networks of any depth, as long as the first layer performs a random linear transformation.

Remark 4.4. The requirement that the rows of $W$ are sampled uniformly from the unit sphere can easily be relaxed so that the rows of $W$ have a spherically symmetric distribution and are bounded w.h.p. This kind of distribution includes, among others, bounded distributions and standard Gaussian distributions. However, it would require a more careful analysis of the bound on the functions $f_W$, as they might only be bounded w.h.p.

Remark 4.5. As opposed to the linear case, here we also have a restriction on $\max_i |u_i|$. We conjecture that it is possible to remove this dependence and leave it to future work. With that said, existing analyses of stochastic gradient descent, even for convex functions, imply that the required number of iterations scales polynomially with the norm of the target solution (e.g., Hazan et al. [19]), which would mean exponentially many iterations in our case. Moreover, practically speaking, such huge coefficients can cause overflow when running SGD on a computer with standard floating-point formats.

To prove the theorem, we will use the following proposition, which implies that functions of the form $x \mapsto \psi(\langle w, x\rangle)$, for a certain sine-like $\psi$ and "random" $w$, are nearly uncorrelated with any fixed function.

Proposition 4.6. Let $d \in \mathbb{N}$, where $d > 40$, and let $a = 6d^2 + 1$. Define the following function:

$\psi(x) = [x + a]_+ + \sum_{n=1}^{a} 2[x + a - 2n]_+(-1)^n - 1.$

Then $\psi : \mathbb{R} \to \mathbb{R}$ satisfies the following:

1. It is a periodic odd function on the interval $[-a, a]$.

2. For every $w^* \in \mathbb{R}^d$ with $\|w^*\| = d$, $\|\psi(\langle w^*, \cdot\rangle)\|^2 \ge d$.

3. For every $f \in L^2(\mathbb{R}^d)$, we have $\mathbb{E}_w\left(\langle f(\cdot), \psi(\langle w, \cdot\rangle)\rangle^2\right) \le 20\|f\|^2 \cdot \exp(-cd)$, where $w$ is sampled uniformly from $\{w : \|w\| = d\}$, and $c > 0$ is a universal constant.

Items 1 and 2 follow by a straightforward calculation, where in item 2 we also use the fact that $x$ has a symmetric distribution. Item 3 relies on a claim from [33], which shows that periodic functions of the form $x \mapsto \psi(\langle w, x\rangle)$, for a random $w$ with sufficiently large norm, have low correlation with any fixed function. The full proof can be found in Appendix B.

At a high level, the proof of Theorem 4.2 proceeds as follows: If we choose and fix $\{f_i\}_{i=1}^{r}$ and $W$, then any linear combination of random features $f_i(Wx)$ with small weights will be nearly uncorrelated with $\psi(\langle w^*, x\rangle)$, in expectation over $w^*$.
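Item 1 of Proposition 4.6 can be checked numerically. The sketch below (our own, with a small illustrative $d$; the grid and tolerances are arbitrary) evaluates $\psi$ as the stated sum of ReLUs and verifies that it is odd and periodic (with period 4) in the interior of $[-a, a]$:

```python
import numpy as np

def psi(x, a):
    # psi(x) = [x + a]_+ + sum_{n=1}^{a} 2 * (-1)^n * [x + a - 2n]_+ - 1
    x = np.asarray(x, dtype=float)
    out = np.maximum(x + a, 0.0) - 1.0
    for n in range(1, a + 1):
        out += 2.0 * (-1) ** n * np.maximum(x + a - 2 * n, 0.0)
    return out

d = 3
a = 6 * d ** 2 + 1                  # a = 6d^2 + 1 (odd); here a = 55
xs = np.linspace(-a + 4, a - 4, 1001)

odd_ok = np.allclose(psi(xs, a), -psi(-xs, a))          # odd on [-a, a]
periodic_ok = np.allclose(psi(xs, a), psi(xs - 4, a))   # period 4 in the interior
```

On $[-a, a]$ the alternating ReLU kinks produce a triangle wave oscillating between $-1$ and $1$, which is exactly the high-frequency behavior that makes $\psi(\langle w, \cdot\rangle)$ nearly uncorrelated with any fixed function.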
But, we know that \u03c8((cid:104)w\u2217, x(cid:105)) can be written\nas a linear combination of ReLU neurons, so there must be some ReLU neuron which will be\nnearly uncorrelated with any linear combination of the random features (and as a result, cannot be\nwell-approximated by them). Finally, by a symmetry argument, we can actually \ufb01x w\u2217 arbitrarily and\nthe result still holds. We now turn to provide the formal proof:\nProof of Theorem 4.2. Take \u03c8(x) from Proposition 4.6 and denote for w \u2208 Rd, \u03c8w(x) = \u03c8((cid:104)w, x(cid:105)) :\nRd \u2192 R. If we sample w\u2217 uniformly from {w : (cid:107)w(cid:107) = d}, then for all f \u2208 F:\nEw\u2217 [|(cid:104)fW , \u03c8w\u2217(cid:105)|] \u2264 20(cid:107)fW(cid:107)2 exp(\u2212c(cid:48)d) \u2264 exp(\u2212cd),\n\nwhere c is a universal constant that depends only on the constant c(cid:48) from Proposition 4.6. Hence also:\n\nEw\u2217 [EW [|(cid:104)fW , \u03c8w\u2217(cid:105)|]] \u2264 exp(\u2212cd)\n\nWe now show that EW [|(cid:104)fW , \u03c8w\u2217(cid:105)|] doesn\u2019t depend on w\u2217. Fix w\u2217, then any w \u2208 Rd with (cid:107)w(cid:107) = d\ncan be written as M w\u2217 for some orthogonal matrix M. Now:\nEW [|(cid:104)fW , \u03c8w\u2217(cid:105)|] EW [Ex [|f (W x) \u00b7 \u03c8((cid:104)w\u2217, x(cid:105))|]] = EW [Ex\n= EW [Ex [|f (W x) \u00b7 \u03c8((cid:104)M w\u2217, x(cid:105))|]] = EW [|(cid:104)fW , \u03c8M w\u2217(cid:105)|]\nwhere we used the fact that both x and the rows of W have a spherically symmetric distribution.\nTherefore, for all w\u2217 with (cid:107)w\u2217(cid:107) = d:\n\n(cid:2)(cid:12)(cid:12)f (W M T M x) \u00b7 \u03c8((cid:104)M w\u2217, M x(cid:105))(cid:12)(cid:12)(cid:3)]\n\nEW [|(cid:104)fW , \u03c8w\u2217(cid:105)|] \u2264 exp(\u2212cd)\n\n(4)\n\n7\n\n\fUsing Markov\u2019s inequality and dividing c by a factor of 2, we get w.p > 1\u2212 exp (\u2212cd) over sampling\nof W , |(cid:104)fW , \u03c8w\u2217(cid:105)| \u2264 exp(\u2212cd). 
Finally, if we pick $f_1, \ldots, f_r \in \mathcal{F}$, then using the union bound we get that w.p. $> 1 - r\exp(-cd)$ over the sampling of $W$:
$$\forall i \in \{1, \ldots, r\}, \quad \left|\langle f_{i,W}, \psi_{w^*}\rangle\right| \leq \exp(-cd).$$
We can write $\psi(x) = \sum_{j=1}^{a} a_j[x + c_j]_+ - 1$, where $a = 6d^2 + 1$ and $|a_j| \leq 2$, $c_j \in [-a, a]$ for $j = 1, \ldots, a$. Let $w^* \in \mathbb{R}^d$ with $\|w^*\| = d$, and denote $f^*_j(x) = [\langle w^*, x\rangle + c_j]_+$. Assume that for every $j$ we can find $u^j \in \mathbb{R}^r$ and $f_1, \ldots, f_r \in \mathcal{F}$ such that $\left\|\sum_{i=1}^{r} u^j_i f_{i,W} - f^*_j\right\|^2 \leq \epsilon$, where $W$ is distributed as above; otherwise, there is a ReLU neuron that cannot be represented by a linear combination of random features, which is what we want to prove. Let $f_0(Wx) = b_0$ be the bias term of the output layer of the network; then also:
$$\mathbb{E}_x\left[\left(\sum_{i=0}^{r}\left(\sum_{j=1}^{a} u^j_i a_j\right) f_i(Wx) - \left(\sum_{j=1}^{a} a_j f^*_j(x) - 1\right)\right)^2\right] = \mathbb{E}_x\left[\left(\sum_{i=0}^{r} \tilde{u}_i f_i(Wx) - \psi(\langle w^*, x\rangle)\right)^2\right] \leq \epsilon\sum_{j=1}^{a} |a_j|^2 \leq 24d^2\epsilon,$$
where $\tilde{u}_i = \sum_{j=1}^{a} u^j_i a_j$. On the other hand, using Eq. (4) and item 2 from Proposition 4.6, we get w.p. $> 1 - r\exp(-cd)$ over the distribution of $W$ that:
$$\mathbb{E}_x\left[\left(\sum_{i=0}^{r} \tilde{u}_i f_i(Wx) - \psi(\langle w^*, x\rangle)\right)^2\right] \geq \left\|\psi_{w^*}\right\|^2 - 2\left|\left\langle \sum_{i=0}^{r} \tilde{u}_i f_{i,W},\, \psi_{w^*}\right\rangle\right| \tag{5}$$
$$\geq d - 2\max_i |\tilde{u}_i| \sum_{i=1}^{r} \left|\langle f_{i,W}, \psi_{w^*}\rangle\right| \geq d - 2\max_i |\tilde{u}_i|\, r\exp(-cd) \geq d - 4a \max_{i,j} |u^j_i|\, r\exp(-cd), \tag{6}$$
where the last inequality is true for all $j = 1, \ldots, a$. Combining Eq. (5) and Eq. (6), for all $w^* \in \mathbb{R}^d$ with $\|w^*\| = d$ there is $b^* \in [-a, a]$ such that w.p. $> 1 - r\exp(-cd)$ over the sampling of $W$: if there is $u \in \mathbb{R}^r$ such that
$$\mathbb{E}_x\left[\left(\sum_{i=1}^{r} u_i f_i(Wx) - [\langle w^*, x\rangle + b^*]_+\right)^2\right] \leq \epsilon, \tag{7}$$
then $r\max_i |u_i| \geq \left(1 - 24d\epsilon\right)\frac{1}{24d}\exp(cd)$.

Lastly, by multiplying both sides of Eq. (7) by $d$, using the homogeneity of the ReLU function and setting $\epsilon = \frac{1}{50d}$, we get that for every $\hat{w}^* \in \mathbb{R}^d$ with $\|\hat{w}^*\| = d^2$ there is $\hat{b}^* \in [-6d^3 - 1, 6d^3 + 1]$ such that w.p. $> 1 - r\exp(-cd)$ over the sampling of $W$: if there is $\hat{u} \in \mathbb{R}^r$ such that
$$\mathbb{E}_x\left[\left(\sum_{i=1}^{r} \hat{u}_i f_i(Wx) - [\langle \hat{w}^*, x\rangle + \hat{b}^*]_+\right)^2\right] \leq \frac{1}{50},$$
then $r\max_i |\hat{u}_i| \geq \frac{1}{48d^2}\exp(cd)$.

Remark 4.7. It is possible to trade off between the norm of $w^*$ and the required error. This can be done by altering the proof of Theorem 4.2: instead of multiplying both sides of Eq. (7) by $d$, we could have multiplied both sides by $\frac{\alpha}{d}$. This way, the following is proved: Under the same assumptions as in Theorem 4.2, for all $w^* \in \mathbb{R}^d$ with $\|w^*\| = \alpha$ there is $b^* \in \mathbb{R}$ with $|b^*| \leq 6\alpha d + 1$ such that for all $\epsilon \in \left(0, \frac{\alpha}{5d^2}\right)$ and all $f_1, \ldots, f_r \in \mathcal{F}$, w.h.p. over the sampling of $W$: if $\mathbb{E}_x\left[\left(\sum_{i=1}^{r} u_i f_i(Wx) - [\langle w^*, x\rangle + b^*]_+\right)^2\right] \leq \epsilon$, then $r \cdot \max_i |u_i| \geq \left(1 - \frac{5d^2}{\alpha}\epsilon\right)\frac{1}{8\alpha}\exp(c_3 d)$ for a universal constant $c_3$.

4.3 General Features

In the previous subsection, we assumed that our features have a structure of the form $x \mapsto f(Wx)$ for a random matrix $W$. We now turn to a more general case, where we are given features of any kind, without any assumptions on their internal structure, as long as they are sampled from some fixed distribution. We show that even in this setting, such features cannot capture single ReLU neurons in the worst case (at the cost of proving this for some target weight vector $w^*$, instead of any $w^*$).

Theorem 4.8. There exists a universal constant $c$ such that the following holds. Let $d > 40$, and let $\mathcal{F}$ be a family of functions from $\mathbb{R}^d$ to $\mathbb{R}$, such that $\|f\| \leq \exp(d/40)$ for all $f \in \mathcal{F}$. Also, for some $r \in \mathbb{N}$, let $\mathcal{D}$ be an arbitrary distribution over tuples $(f_1, \ldots, f_r)$ of functions from $\mathcal{F}$. Then there exists $w^* \in \mathbb{R}^d$ with $\|w^*\| = d^2$, and $b^* \in \mathbb{R}$ with $|b^*| \leq 6d^3 + 1$, such that with probability at least $1 - r\exp(-cd)$ over the sampling of $f_1, \ldots, f_r$: if
$$\mathbb{E}_{x \sim \mathcal{N}(0, I)}\left[\left(\sum_{i=1}^{r} u_i f_i(x) - [\langle w^*, x\rangle + b^*]_+\right)^2\right] \leq \frac{1}{50},$$
then
$$r \cdot \max_i |u_i| \geq \frac{1}{48d^2}\exp(cd).$$
The proof is similar to the proof of Theorem 4.2.
The main difference is that we do not have any assumptions on the distribution of the random features (unlike the assumption on the distribution of $W$ in Theorem 4.2); hence, we can only show that there exists some ReLU neuron that cannot be well-approximated. On the other hand, we have almost no restrictions on the random features: e.g., they can be multi-layered neural networks of any architecture and with any random initialization.

Example 4.9. Besides generalizing the setting of the previous subsection, Theorem 4.8 also captures the setting of the "linearized" neural tangent kernel (see Jacot et al. [20] and Subsection 2.1), where we consider a linear combination of functions of the form
$$\sigma'(\langle w_i, x\rangle)\langle a_i, x\rangle, \tag{8}$$
where the $w_i$ are randomly chosen and the $a_i$ are optimized using gradient-based methods. This is because each such function is a weighted sum of the features $f_{i,j}(x) = \sigma'(\langle w_i, x\rangle)x_j$ for $1 \leq j \leq d$, where the $w_i$ are randomly initialized.

Acknowledgements

This research is supported in part by European Research Council (ERC) grant 754705. We thank Yuanzhi Li for some helpful comments on the previous version of this paper.

References

[1] Z. Allen-Zhu and Y. Li. Can SGD learn recurrent neural networks with provable generalization? arXiv preprint arXiv:1902.01028, 2019.

[2] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.

[3] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[4] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In International Conference on Machine Learning, pages 1908–1916, 2014.

[5] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function.
IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[7] A. Brutzkus and A. Globerson. Over-parameterization improves generalization in the XOR detection problem. arXiv preprint arXiv:1810.03037, 2018.

[8] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.

[9] Y. Cao and Q. Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.

[10] L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3040–3050, 2018.

[11] A. Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.

[12] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.

[13] S. S. Du and J. D. Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.

[14] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[15] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[16] B. R. Gelbaum, J. G. De Lamadrid, et al. Bases of tensor products of Banach spaces.
Pacific Journal of Mathematics, 11(4):1281–1286, 1961.

[17] B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.

[18] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[19] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[20] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[21] J. M. Klusowski and A. R. Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with l1 and l0 controls. IEEE Transactions on Information Theory, 64(12):7649–7656, 2018.

[22] M. Ledoux. The Concentration of Measure Phenomenon. Number 89. American Mathematical Society, 2001.

[23] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8168–8177, 2018.

[24] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. arXiv preprint arXiv:1712.09203, 2017.

[25] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.

[26] S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

[27] A. Rahimi and B. Recht. Random features for large-scale kernel machines.
In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[28] A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561. IEEE, 2008.

[29] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313–1320, 2009.

[30] I. Safran and O. Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.

[31] I. Safran and O. Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

[32] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[33] O. Shamir. Distribution-specific hardness of learning neural networks. The Journal of Machine Learning Research, 19(1):1135–1163, 2018.

[34] M. Soltanolkotabi. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017, 2017.

[35] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019.

[36] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[37] D. Soudry and E. Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

[38] Y. Sun, A. Gilbert, and A. Tewari. Random ReLU features: Universality, approximation, and composition. arXiv preprint arXiv:1810.04374, 2018.

[39] G. Wang, G. B.
Giannakis, and J. Chen. Learning ReLU networks on linearly separable data: Algorithm, optimality, and generalization. arXiv preprint arXiv:1808.04685, 2018.

[40] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

[41] Y. Zhang, J. D. Lee, and M. I. Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001, 2016.