{"title": "The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies", "book": "Advances in Neural Information Processing Systems", "page_first": 4761, "page_last": 4771, "abstract": "We study the relationship between the frequency of a function and the speed at which a neural network learns it.  We build on recent results that show that the dynamics of overparameterized neural networks trained with gradient descent can be well approximated by a linear system.  When normalized training data is uniformly distributed on a hypersphere, the eigenfunctions of this linear system are spherical harmonic functions.  We derive the corresponding eigenvalues for each frequency after introducing a bias term in the model.  This bias term had been omitted from the linear network model without significantly affecting previous theoretical results.  However, we show theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low frequency functions with odd frequencies.  Our results lead to specific predictions of the time it will take a network to learn functions of varying frequency.  These predictions match the empirical behavior of both shallow and deep networks.", "full_text": "The Convergence Rate of Neural Networks for\nLearned Functions of Different Frequencies\n\nRonen Basri1\n\nDavid Jacobs2\n\nYoni Kasten1\n\nShira Kritchman1\n\n1Department of Computer Science, Weizmann Institute of Science, Rehovot, Israel\n\n2Department of Computer Science, University of Maryland, College Park, MD\n\nAbstract\n\nWe study the relationship between the frequency of a function and the speed at\nwhich a neural network learns it. We build on recent results that show that the\ndynamics of overparameterized neural networks trained with gradient descent\ncan be well approximated by a linear system. When normalized training data is\nuniformly distributed on a hypersphere, the eigenfunctions of this linear system\nare spherical harmonic functions. We derive the corresponding eigenvalues for\neach frequency after introducing a bias term in the model. This bias term had been\nomitted from the linear network model without signi\ufb01cantly affecting previous\ntheoretical results. However, we show theoretically and experimentally that a\nshallow neural network without bias cannot represent or learn simple, low frequency\nfunctions with odd frequencies. Our results lead to speci\ufb01c predictions of the time\nit will take a network to learn functions of varying frequency. These predictions\nmatch the empirical behavior of both shallow and deep networks.\n\n1\n\nIntroduction\n\nNeural networks have proven effective even though they often contain a large number of trainable\nparameters that far exceeds the training data size. This de\ufb01es conventional wisdom that such\noverparameterization would lead to over\ufb01tting and poor generalization. The dynamics of neural\nnetworks trained with gradient descent can help explain this phenomenon. If networks explore\nsimpler solutions before complex ones, this would explain why even overparameterized networks\nsettle on simple solutions that do not over\ufb01t. It will also imply that early stopping can select simpler\nsolutions that generalize well, [13]. This is demonstrated in Figure 1-left.\nWe analyze the dynamics of neural networks using a frequency analysis (see also [21, 27, 26, 9],\ndiscussed in Section 2). 
Building on [25, 7, 2] (and under the same assumptions) we show that\nwhen a network is trained with a regression loss to learn a function over data drawn from a uniform\ndistribution, it learns the low frequency components of the function signi\ufb01cantly more rapidly than\nthe high frequency components (see Figure 2).\nSpeci\ufb01cally, [7, 2] show that the time needed to learn a function, f, is determined by the projection\nof f onto the eigenvectors of a matrix H\u221e, and their corresponding eigenvalues. [25] had previously\nnoted that for uniformly distributed training data, the eigenvectors of this matrix are spherical\nharmonic functions (analogs to the Fourier basis on hyperspheres). This work makes a number of\nstrong assumptions. They analyze shallow, massively overparameterized networks with no bias. Data\nis assumed to be normalized.\nBuilding on these results, we compute the eigenvalues of this linear system. Our computation allows\nus to make speci\ufb01c predictions about how quickly each frequency of the target function will be\nlearned. For example, for the case of 1D functions, we show that a function of frequency k can be\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Left: We train a CNN on MNIST data with 50% of the labels randomly changed. As the network\ntrains, accuracy on uncorrupted test data (in blue) \ufb01rst improves dramatically, suggesting that the network \ufb01rst\nsuccessfully \ufb01ts the uncorrupted data. Test accuracy then decreases as the network memorizes the incorrectly\nlabeled data. The green curve shows accuracy on test data with mixed correctly/incorrectly labeled data, while\nthe red curve shows training accuracy. (Other papers also mention this phenomenon, e.g., [18]) Right: Given the\n1D training data points (x1, ..., x32 \u2208 S1) marked in black, a two layer network learns the function represented\nby the orange curve, interpolating the missing data to form an approximate sinusoid of low frequency.\n\nFigure 2: Network prediction (dark blue) for a superposition of two sine waves with frequencies k = 4, 14\n(light blue). The network \ufb01ts the lower frequency component of the function after 50 epochs, while \ufb01tting the\nfull function only after \u223c22K epochs.\n\nlearned in time that scales as k2. We show experimentally that this prediction is quite accurate, not\nonly for the simpli\ufb01ed networks we study analytically, but also for realistic deep networks.\nBias terms in the network may be neglected without affecting previous theoretical results. However,\nwe show that without bias, two-layer neural networks cannot learn or even represent functions with\nodd frequencies. This means that in the limit of large data, the bias-free networks studied by [25, 7, 2]\ncannot learn certain simple, low-frequency functions. We show experimentally that a real shallow\nnetwork with no bias cannot learn such functions in practice. We therefore modify the model to\ninclude bias. We show that with bias added, the eigenvectors remain spherical harmonics, and that\nodd frequencies can be learned at a rate similar to even frequencies.\nOur results show that essentially a network \ufb01rst \ufb01ts the training data with low frequency functions\nand then gradually adds higher and higher frequencies to improve the \ufb01t. Figure 1-right shows a\nrather surprising consequence of this. A deep network is trained on the black data points. 
The\norange curve shows the function the network learns. Notice that where there is data missing, the\nnetwork interpolates with a low frequency function, rather than with a more direct curve. This is\nbecause a more straightforward interpolation of the data, while fairly smooth, would contain some\nhigh frequency components. The function that is actually learned is almost purely low frequency1.\nThis example is rather extreme. In general, our results help to explain why networks generalize well\nand don\u2019t over\ufb01t. Because networks learn low frequency functions faster than high frequency ones, if\nthere is a way to \ufb01t the data with low-frequency, the network will do this instead of over\ufb01tting with a\ncomplex, high-frequency function.\n\n2 Prior Work\n\nSome prior work has examined the way that the dynamics or architecture of neural networks is\nrelated to the frequency of the functions they learn. [21] bound the Fourier transform of the function\ncomputed by a deep network and of each gradient descent (GD) update. Their method makes the\n\n1 [10] show a related \ufb01gure. In the context of meta-learning they show that a network trained to regress to\nsine waves can learn a new sine wave from little training data. Our \ufb01gure shows a different phenomenon, that,\nwhen possible, a generic network will \ufb01t data with low-frequency sine waves.\n\n2\n\n0400800Epochs00.51Accuracy-0-101Epoch = 0Epoch = 50Epoch = 500Epoch = 22452\fstrong assumption that the network produces zeros outside a bounded domain. A related analysis for\nshallow networks is presented in [27, 26]. Neither paper makes an explicit prediction of the speed of\nconvergence. [9] derive bounds that show that for band limited functions two-layer networks converge\nto a generalizable solution. [20, 24, 8] show that deeper networks can learn high frequency functions\nthat cannot be learned by shallow networks with a comparable number of units. [22] analyzes the\nability of networks to learn based on the frequency of functions computed by their components.\nRecent papers study the relationship between the dynamics of gradient descent and the ability to\ngeneralize. [23] shows that in logistic regression gradient descent leads to max margin solutions\nfor linearly separable data. [5] shows that with the hinge loss a two layer network provably \ufb01nds a\ngeneralizeable solution for linearly separable data. [14, 17] provide related results. [16] studies the\neffect of gradient descent on the alignment of the weight matrices for linear neural networks. [2] uses\nthe model discussed in this paper to study generalization.\nIt has been shown that the weights of heavily overparameterized networks change little during training,\nallowing them to be accurately approximated by linear models that capture the nonlinearities caused\nby ReLU at initialization [25, 7, 2]. These papers and others analyze neural networks without an\nexplicit bias term [28, 19, 12, 1]. As [1] points out, bias can be ignored without loss of generality for\nthese results, because a constant value can be appended to the training data after it is normalized. [4],\nbuilding on the work of [3], perform a frequency analysis of the inductive bias of networks, using the\nNeural Tangent Kernel. They produce results related to ours for bias-free networks. 
We also analyze the significant effect that bias has on the eigenvalues of these linear systems.
Some recent work (e.g., [6], [12]) raises questions about the relevance of this lazy training to practical systems. Interestingly, our experiments indicate that our theoretical predictions, based on lazy training, fit the behavior of real, albeit simple, networks. The relevance of results based on lazy training to large-scale real-world systems remains an interesting topic for future research.

3 Background

3.1 A Linear Dynamics Model

We begin with a brief review of [7, 2]'s linear dynamics model. We consider a network with two layers, implementing the function

f(x; W, a) = (1/√m) Σ_{r=1}^m a_r σ(w_r^T x),   (1)

where x ∈ R^{d+1} is the input and ‖x‖ = 1 (denoted x ∈ S^d), W = [w_1, ..., w_m] ∈ R^{(d+1)×m} and a = [a_1, ..., a_m]^T ∈ R^m respectively are the weights of the first and second layers, and σ denotes the ReLU function, σ(x) = max(x, 0). This model does not explicitly include bias. Let the training data consist of n pairs {x_i, y_i}_{i=1}^n, x_i ∈ S^d and y_i ∈ R. Gradient descent (GD) minimizes the L2 loss

Φ(W) = (1/2) Σ_{i=1}^n (y_i − f(x_i; W, a))²,

where we initialize the network with w_r(0) ∼ N(0, κ²I). We further set a_r ∼ Uniform{−1, 1} and maintain it fixed throughout the training.
For the dynamic model we define the (d+1)m × n matrix

Z = (1/√m) [ a_1 I_{11} x_1   a_1 I_{12} x_2   ...   a_1 I_{1n} x_n
             a_2 I_{21} x_1   a_2 I_{22} x_2   ...   a_2 I_{2n} x_n
               ...               ...           ...     ...
             a_m I_{m1} x_1   a_m I_{m2} x_2   ...   a_m I_{mn} x_n ],   (2)

where the indicator I_{ij} = 1 if w_i^T x_j ≥ 0 and zero otherwise. Note that this indicator changes from one GD iteration to the next, and so Z = Z(t). The network output over the training data can be expressed as u(t) = Z^T w ∈ R^n, where w = (w_1^T, ..., w_m^T)^T. We further define the n × n Gram matrix H = H(t) = Z^T Z with

H_{ij} = (1/m) x_i^T x_j Σ_{r=1}^m I_{ri} I_{rj}.   (3)

Next we define the main object of analysis, the n × n matrix H∞, defined as the expectation of H over the possible initializations. Its entries are given by

H∞_{ij} = E_{w∼N(0,κ²I)} H_{ij} = (1/2π) x_i^T x_j (π − arccos(x_i^T x_j)).   (4)

Thm. 4.1 in [2] relates the convergence of training a shallow network with GD to the eigenvalues of H∞. For a network with m = Ω(n^7/(λ_0^4 κ^2 ε^2 δ)) units, κ = O(εδ/√n), and learning rate η = O(λ_0/n^2), where λ_0 denotes the minimal eigenvalue of H∞, with probability 1 − δ over the random initializations

‖y − u(t)‖_2 = ( Σ_{i=1}^n (1 − ηλ_i)^{2t} (v_i^T y)^2 )^{1/2} ± ε,   (5)

where v_1, ..., v_n and λ_1, ..., λ_n respectively are the eigenvectors and eigenvalues of H∞.
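The quantities in (4)-(5) are easy to probe numerically. The following is a minimal sketch (our illustration, independent of the code released with the paper; it assumes only NumPy, and n, η and the target are our own choices): it builds H∞ for uniformly spaced points on S^1, eigendecomposes it, and evaluates the residual predicted by (5) for a pure-frequency target.

import numpy as np

# Minimal sketch (not the released code): H-infinity for points on S^1,
# its eigendecomposition, and the residual predicted by Eq. (5).
n = 256
theta = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)        # x_i uniformly spaced on S^1
G = np.clip(X @ X.T, -1.0, 1.0)                             # x_i^T x_j
H_inf = G * (np.pi - np.arccos(G)) / (2 * np.pi)            # Eq. (4)
lam, V = np.linalg.eigh(H_inf)                              # eigenvalues lambda_i, eigenvectors v_i

y = np.cos(4 * theta)                                       # pure frequency-4 target
eta = 0.5 / lam.max()                                       # small step size
proj2 = (V.T @ y) ** 2                                      # (v_i^T y)^2
for t in [0, 10, 100, 1000, 10000]:
    resid = np.sqrt(np.sum((1.0 - eta * lam) ** (2 * t) * proj2))   # Eq. (5), up to the ±ε term
    print(t, resid)

For y = cos(4θ) the decay rate is set by the frequency-4 eigenvalue; this is the mechanism behind the behavior illustrated in Figure 2.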
3.2 The Eigenvectors of H∞ for Uniform Data

As is noted in [25], when the training data distributes uniformly on a hypersphere the eigenvectors of H∞ are the spherical harmonics. In this case H∞ forms a convolution matrix. A convolution on a hypersphere is defined by

K ∗ f(u) = ∫_{S^d} K(u^T v) f(v) dv,   (6)

where the kernel K(u, v) = K(u^T v) is measurable and absolutely integrable on the hypersphere. It is straightforward to verify that in S^1 this definition is consistent with the standard 1-D convolution with a periodic (and even) kernel, since K depends through the cosine function on the angular difference between u and v. For d > 1 this definition requires the kernel to be rotationally symmetric around the pole. This is essential in order for its rotation on S^d to make sense. We formalize this observation in a theorem.

Theorem 1. Suppose the training data {x_i}_{i=1}^n is distributed uniformly in S^d, then H∞ forms a convolution matrix in S^d.

Proof. Let f : S^d → R be a scalar function, and let f ∈ R^n be a vector whose entries are the function values at the training points, i.e., f_i = f(x_i). Consider the application of H∞ to f, g_i = (A(S^d)/n) Σ_{j=1}^n H∞_{ij} f_j, where A(S^d) denotes the total surface area of S^d. As n → ∞ this sum approaches the integral g(x_i) = ∫_{S^d} K∞(x_i^T x_j) f(x_j) dx_j, where dx_j denotes a surface element of S^d. Let the kernel K∞ be defined as in (4), i.e., K∞(x_i, x_j) = (1/2π) x_i^T x_j (π − arccos(x_i^T x_j)). Clearly, K∞ is rotationally symmetric around x_i, and therefore g = K∞ ∗ f. H∞ moreover forms a discretization of K∞, and its rows are phase-shifted copies of each other.

Theorem 1 implies that for uniformly distributed data the eigenvectors of H∞ are the Fourier series in S^1 or, using the Funk-Hecke Theorem (as we will discuss), the spherical harmonics in S^d, d > 1. We first extend the dynamic model to allow for bias, and then derive the eigenvalues for both cases.

4 Harmonic Analysis of H∞

The results of the previous section imply that we can determine how quickly a network can learn functions of varying frequency by finding the eigenvalues of the eigenvectors that correspond to these frequencies. In this section we address this problem both theoretically and experimentally². Interestingly, as we establish in Theorem 2 below, the bias-free network defined in (1) is not universal, as it cannot represent functions that contain odd frequencies greater than one. As a consequence the odd frequencies lie in the null space of the kernel K∞ and cannot be learned – a significant deficiency in the model of [7, 2]. We have the following:

²Code for experiments shown in this paper can be found at https://github.com/ykasten/Convergence-Rate-NN-Different-Frequencies.

Theorem 2. In the harmonic expansion of f(x) in (1), the coefficients corresponding to odd frequencies k ≥ 3 are zero.

Proof. We show this for d ≥ 2. The theorem also applies to the case d = 1 with a similar proof. Consider the output of one unit, g(x) = σ(w^T x), and assume first that w = (0, ..., 0, 1)^T. In this case g(x) = max{x_{d+1}, 0} and it is a linear combination of just the zonal harmonics. The zonal harmonic coefficients of g(x) are given by

g_k = Vol(S^{d−1}) ∫_{−1}^{1} max{t, 0} P_{k,d}(t) (1 − t^2)^{(d−2)/2} dt,   (7)

where Vol(S^{d−1}) denotes the volume of the hypersphere S^{d−1} and P_{k,d}(t) denotes the Gegenbauer polynomial, given by the formula:

P_{k,d}(t) = ((−1)^k / 2^k) (Γ(d/2) / Γ(k + d/2)) (1 − t^2)^{−(d−2)/2} (d^k/dt^k) (1 − t^2)^{k + (d−2)/2}.   (8)

Γ is Euler's gamma function. Eq. (7) can be written as

g_k = Vol(S^{d−1}) ∫_{0}^{1} t P_{k,d}(t) (1 − t^2)^{(d−2)/2} dt.   (9)

For odd k, P_{k,d}(t) is antisymmetric. Therefore, for such k

g_k = (1/2) Vol(S^{d−1}) ∫_{−1}^{1} t P_{k,d}(t) (1 − t^2)^{(d−2)/2} dt.   (10)

This is nothing but the (scaled) inner product of the first order harmonic t with a harmonic of degree k, and due to the orthogonality of the harmonic functions this integral vanishes for all odd values of k except k = 1. This result remains unchanged if we use a general weight vector for w, as it only rotates g(x), resulting in a phase shift of the first order harmonic. Finally, f is a linear combination of single unit functions, and consequently its harmonic coefficients at odd frequencies k ≥ 3 are zero.
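The vanishing of the odd coefficients can also be checked directly from (7). Below is a small numerical sketch for d = 2 (our illustration, assuming NumPy; the grid size is an arbitrary choice), where P_{k,2} reduces to the Legendre polynomial and the weight (1 − t^2)^{(d−2)/2} is identically 1.

import numpy as np
from numpy.polynomial import legendre

# Sketch: zonal coefficients (7) of a single ReLU unit for d = 2.
# Here P_{k,2} is the Legendre polynomial P_k and Vol(S^1) = 2*pi.
t = np.linspace(-1.0, 1.0, 200001)
relu = np.maximum(t, 0.0)
for k in range(7):
    P_k = legendre.Legendre.basis(k)(t)
    g_k = 2 * np.pi * np.trapz(relu * P_k, t)        # Eq. (7) for d = 2
    print(k, round(g_k, 6))
# Nonzero for k = 0, 1 and for even k; numerically zero for odd k >= 3 (Theorem 2).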
Figure 3: Left: Fitting a bias-free two-layer network (with 2000 hidden units) to training data comprised of 51 points drawn from f(θ) = cos(3θ) (black dots). The orange, solid curve depicts the network output. Consistent with Thm. 2, the network fits the data points perfectly with just even frequencies, yielding poor interpolation between data points. The right panel shows in comparison fitting the network (solid line) to training data points (black dots) drawn from f(θ) = cos(4θ). Fit was achieved by fixing the first layer weights at their random (Gaussian) initialization and optimizing over the second layer weights.

In Figure 3 we use a bias-free, two-layer network to fit data drawn from the function cos(3θ). Indeed, as the network cannot represent odd frequencies k ≥ 3, it fits the data points perfectly with combinations of even frequencies, hence yielding poor generalization.

This can be overcome by extending the model to use homogeneous coordinates, which introduce bias. For a point x ∈ S^d we denote x̄ = (1/√2)(x^T, 1)^T ∈ R^{d+2}, and apply (1) to x̄. Clearly, since ‖x‖ = 1 also ‖x̄‖ = 1. We note that the proofs of [7, 2] directly apply when both the weights and the biases are initialized using a normal distribution with the same variance. It is also straightforward to modify these theorems to account for bias initialized at zero, as is common in many practical applications. We assume bias is initialized at 0, and construct the corresponding H̄∞ matrix. This matrix takes the form

H̄∞_{ij} = E_{w∼N(0,κ²I)} H̄_{ij} = (1/4π) (x_i^T x_j + 1)(π − arccos(x_i^T x_j)).   (11)

Figure 4: The six leading eigenvectors and three least significant eigenvectors of the bias-free H∞ in descending order of eigenvalues. Note that the least significant eigenvectors resemble low odd frequencies.

Figure 5: The nine leading eigenvectors (k = 0, ..., 4) of H̄∞ in descending order of eigenvalues. Note that now the leading eigenvectors include both the low even and odd frequencies.

Finally note that the bias adjusted kernel K̄∞(x_i^T x_j), defined as in (11), also forms a convolution on the original (non-homogeneous) points. Therefore, since we assume that in S^d the data is distributed uniformly, the eigenfunctions of K̄∞ are also the spherical harmonics.

We next analyze the eigenfunctions and eigenvalues of K∞ and K̄∞. We first consider data distributed uniformly over the circle S^1 and subsequently discuss data in arbitrary dimension.
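To see the effect of (11) concretely, the following sketch (ours, not part of the paper's code release; n and the test frequency are arbitrary choices) compares how the bias-free kernel (4) and the bias-adjusted kernel (11) treat an odd-frequency function on uniformly spaced points of S^1, in the spirit of Figures 4 and 5.

import numpy as np

# Sketch: the eigenvalue carrying an odd-frequency target under Eq. (4) vs. Eq. (11).
n = 512
theta = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
G = np.cos(theta[:, None] - theta[None, :])                 # x_i^T x_j on the circle
A = np.pi - np.arccos(np.clip(G, -1.0, 1.0))
H_free = G * A / (2 * np.pi)                                # Eq. (4), no bias
H_bias = (G + 1.0) * A / (4 * np.pi)                        # Eq. (11), bias initialized at zero

y_odd = np.cos(3 * theta)                                   # odd frequency k = 3
for name, H in [("no bias", H_free), ("with bias", H_bias)]:
    lam, V = np.linalg.eigh(H)
    i = np.argmax(np.abs(V.T @ y_odd))                      # eigenvector most aligned with cos(3θ)
    print(name, lam[i])
# Without bias this eigenvalue is numerically zero; with bias it is well separated from zero.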
4.1 Eigenvalues in S^1

Since both K∞ and K̄∞ form convolution kernels on the circle, their eigenfunctions include the Fourier series. For the bias-free kernel, K∞, the eigenvalues for frequencies k ≥ 0 are derived using a^1_k = (1/z_k) ∫_{−π}^{π} K∞(θ) cos(kθ) dθ, where z_0 = 2π and z_k = π for k > 0. (Note that since K∞ is an even function its integral with sin(kθ) vanishes.) This yields

a^1_k = 1/π^2                           k = 0
        1/4                             k = 1
        2(k^2 + 1)/(π^2 (k^2 − 1)^2)    k ≥ 2 even
        0                               k ≥ 2 odd.   (12)

H∞ is a discrete matrix that represents convolution with K∞. It is circulant symmetric (when constructed with points sampled with uniform spacing) and its eigenvectors are real. Each frequency except the DC is represented by two eigenvectors, one for sin(kθ) and the other for cos(kθ).

(12) allows us to make two predictions. First, the eigenvalues for the even frequencies k shrink at the asymptotic rate of 1/k^2. This suggests, as we show below, that high frequency components are quadratically slower to learn than low frequency components. Secondly, the eigenvalues for the odd frequencies (for k ≥ 3) vanish. A network without bias cannot learn or even represent these odd frequencies. Du et al.'s convergence results critically depend on the fact that for a finite discretization H∞ is positive definite. In fact, H∞ does contain eigenvectors with small eigenvalues that match the odd frequencies on the training data, as shown in Figure 4, which shows the numerically computed eigenvectors of H∞. The leading eigenvectors include k = 1 followed by the low even frequencies, whereas the eigenvectors with smallest eigenvalues include the low odd frequencies. However, a bias-free network can only represent those functions as a combination of even frequencies. These match the odd frequencies on the training data, but have wild behavior off the training data (see Fig. 3). In fact, our experiments show that a network cannot even learn to fit the training data when labeled with odd frequency functions with k ≥ 3.

With bias, the kernel K̄∞ passes all frequencies, and the odd frequencies no longer belong to its null space. The Fourier coefficients for this kernel are

c^1_k = 1/(2π^2) + 1/8                  k = 0
        1/π^2 + 1/8                     k = 1
        (k^2 + 1)/(π^2 (k^2 − 1)^2)     k ≥ 2 even
        1/(π^2 k^2)                     k ≥ 2 odd.   (13)

Figure 5 shows that with bias, the leading eigenvectors include even and odd frequencies.
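As a sanity check on (12) and (13), the kernels can be integrated directly. Writing θ for the angle between x_i and x_j, so that arccos(x_i^T x_j) = |θ|, the two kernels are K∞(θ) = cos θ (π − |θ|)/(2π) and K̄∞(θ) = (cos θ + 1)(π − |θ|)/(4π). The sketch below (our own illustration, assuming NumPy) reproduces the closed-form values numerically.

import numpy as np

# Sketch: Fourier coefficients of the kernels, to compare against Eqs. (12)-(13).
theta = np.linspace(-np.pi, np.pi, 400001)
K  = np.cos(theta) * (np.pi - np.abs(theta)) / (2 * np.pi)          # no bias
Kb = (np.cos(theta) + 1.0) * (np.pi - np.abs(theta)) / (4 * np.pi)  # with bias

def fourier(Kvals, k):
    z = 2 * np.pi if k == 0 else np.pi
    return np.trapz(Kvals * np.cos(k * theta), theta) / z

for k in range(6):
    print(k, round(fourier(K, k), 5), round(fourier(Kb, k), 5))
# e.g. k = 0: 1/π² ≈ 0.10132 and 1/(2π²)+1/8 ≈ 0.17566;  k = 3: 0 and 1/(9π²) ≈ 0.01126.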
Thm. 4.1 in [2] tells us how fast a network learning each Fourier component should converge, as a function of the eigenvalues computed in (13). Let y_i be an eigenvector of H̄∞ with eigenvalue λ̄_i and denote by t_i the number of iterations needed to achieve an accuracy δ̄. Then, according to (5), (1 − ηλ̄_i)^{t_i} < δ̄ + ε. Noting that since η is small, log(1 − ηλ̄_i) ≈ −ηλ̄_i, we obtain that t_i > −log(δ̄ + ε)/(ηλ̄_i). Combined with (13) we get that asymptotically in k the convergence time should grow quadratically for all frequencies.

Figure 6: Convergence times as a function of frequency. Left: S^1 no bias (m = 4000, n = 1001, κ = 1, η = 0.01; training odd frequencies was stopped after 1800 iterations with no significant effect on error). Left-center: S^1 with bias (m = 4000, n = 1001, κ = 2.5, η = 0.01). Right-center: deep net (5 hidden layers with bias, m = 256, n = 1001, η = 0.05, weights initialized as in [15], bias - uniform). Right: deep residual network (10 hidden layers with same parameters except η = 0.01). The data lies on a 1D circle embedded in R^30 at a random rotation. We estimate the growth in these graphs, from left, as O(k^2.15), O(k^1.93), O(k^1.94), O(k^2.11). Theoretical predictions (in orange) were scaled by a multiplicative constant to fit the measurements. This constant reflects the length of each gradient step (e.g., due to the learning rate and size of training set). Convergence is declared when a 5% fitting error is obtained.

We perform experiments to compare theoretical predictions to empirical behavior. We generate uniformly distributed, normalized training data, and assign labels from a single harmonic function. We then train a neural network until the error is reduced to 5% of its original value, and count the number of epochs needed. For odd frequencies and bias-free 2-layer networks we halt training when the network fails to significantly reduce the error in a large number of epochs. We run experiments with shallow networks, with deep fully connected networks, and with deep networks with skip connections. We primarily use an L2 loss, but in supplementary material we show results with a cross-entropy loss. Quadratic behavior is observed in all these cases, see Figure 6. The actual convergence times may vary with the details of the architecture and initialization. For very low frequencies the run time is affected more strongly by the initialization, yielding slightly slower convergence times than predicted.

Thm. 5.1 in [2] further allows us to bound the generalization error incurred when learning band limited functions. Suppose y = Σ_{k=0}^{k̄} α_k e^{2πikx}. According to this theorem, and noting that the eigenvalues of (H̄∞)^{−1} ≈ πk^2, with sufficiently many iterations the population loss L_D computed over the entire data distribution is bounded by

L_D ≲ √(2 y^T (H̄∞)^{−1} y / n) ≈ √(2π Σ_{k=1}^{k̄} α_k^2 k^2 / n).   (14)

As expected, the lower the frequency is, the lower the generalization bound is. For a pure sine wave the bound increases linearly with frequency k.
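A short sketch ties the eigenvalues (13), the iteration bound above, and the generalization bound (14) together. It is our own illustration: the constant in front of t_k is arbitrary (as noted for Figure 6, the theoretical predictions are only defined up to a multiplicative constant reflecting the gradient step), and δ̄ + ε is taken to be the 5% error threshold used in the experiments.

import numpy as np

# Sketch: predicted iterations t_k ∝ -log(δ̄+ε)/(η λ̄_k) using the eigenvalues of Eq. (13),
# and the bound of Eq. (14) for a pure sine wave of frequency k (α_k = 1).
eta, err, n = 0.01, 0.05, 1001

def c_bar(k):                                   # Eq. (13)
    if k == 0:
        return 1 / (2 * np.pi**2) + 1 / 8
    if k == 1:
        return 1 / np.pi**2 + 1 / 8
    if k % 2 == 0:
        return (k**2 + 1) / (np.pi**2 * (k**2 - 1)**2)
    return 1 / (np.pi**2 * k**2)

for k in [1, 2, 4, 8, 16, 32]:
    t_k = -np.log(err) / (eta * c_bar(k))       # grows roughly as k^2, up to a constant factor
    bound = np.sqrt(2 * np.pi * k**2 / n)       # Eq. (14): linear in k for a pure wave
    print(k, int(t_k), round(bound, 3))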
4.2 Eigenvalues in S^d, d ≥ 2

To analyze the eigenvectors of H∞ when the input is higher dimensional, we must make use of generalizations of the Fourier basis and convolution to functions on a high dimensional hypersphere. Spherical harmonics provide an appropriate generalization of the Fourier basis (see [11] as a reference for the following discussion). As with the Fourier basis, we can express functions on the hypersphere as linear combinations of spherical harmonics. Since the kernel is rotationally symmetric, and therefore a function of one variable, it can be written as a linear combination of the zonal harmonics. For every frequency, there is a single zonal harmonic which is also a function of one variable. The zonal harmonic is given by the Gegenbauer polynomial P_{k,d}, where k denotes the frequency and d denotes the dimension of the hypersphere.

We have already defined convolution in (6) in a way that is general for convolution on the hypersphere. The Funk-Hecke theorem provides a generalization of the convolution theorem for spherical harmonics, allowing us to perform a frequency analysis of the convolution kernel. It states:

Theorem 3. (Funk-Hecke) Given any measurable function K on [−1, 1] such that the integral ∫_{−1}^{1} |K(t)| (1 − t^2)^{(d−2)/2} dt is finite, for every spherical harmonic H(σ) of frequency k we have:

∫_{S^d} K(σ · ξ) H(ξ) dξ = ( Vol(S^{d−1}) ∫_{−1}^{1} K(t) P_{k,d}(t) (1 − t^2)^{(d−2)/2} dt ) H(σ).

Here Vol(S^{d−1}) denotes the volume of S^{d−1} and P_{k,d}(t) denotes the Gegenbauer polynomial defined in (8). This tells us that the spherical harmonics are the eigenfunctions of convolution. The eigenvalues can be found by taking an inner product between K and the zonal harmonic of frequency k. Consequently, we see that for uniformly distributed input, in the limit for n → ∞, the eigenvectors of H∞ are the spherical harmonics in S^d.
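Theorem 3 reduces the eigenvalue computation in S^d to a one-dimensional integral. As an illustration (our own sketch, not part of the paper's code), for d = 2 the Gegenbauer polynomial P_{k,2} is the Legendre polynomial and Vol(S^1) = 2π, so the eigenvalues of K∞ and of its bias-adjusted counterpart K̄∞ can be evaluated as follows.

import numpy as np
from numpy.polynomial import legendre

# Sketch: Funk-Hecke eigenvalues of K-infinity (Eq. 4) and its bias-adjusted
# version (Eq. 11) for d = 2, where P_{k,2} is the Legendre polynomial.
t = np.linspace(-1.0, 1.0, 400001)
K_free = t * (np.pi - np.arccos(t)) / (2 * np.pi)
K_bias = (t + 1.0) * (np.pi - np.arccos(t)) / (4 * np.pi)
for k in range(7):
    P_k = legendre.Legendre.basis(k)(t)
    lam_free = 2 * np.pi * np.trapz(K_free * P_k, t)        # Theorem 3 with Vol(S^1) = 2π
    lam_bias = 2 * np.pi * np.trapz(K_bias * P_k, t)
    print(k, round(lam_free, 5), round(lam_bias, 5))
# The bias-free eigenvalues vanish for odd k >= 3 (Theorem 4 below); with bias they do not.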
Similar to the case of S^1, in the bias free case the odd harmonics with k ≥ 3 lie in the null space of K∞. This is proved in the following theorem.

Theorem 4. The eigenvalues of convolution with K∞ vanish when they correspond to odd harmonics with k ≥ 3.

Proof. Consider the vector function z(w, x) = I(w^T x > 0) x and note that K∞(x_i, x_j) = ∫_{S^d} z^T(w, x_i) z(w, x_j) dw. Let y(x) be an odd order harmonic of frequency k > 1. The application of z to y takes the form

∫_{S^d} z(w, x) y(x) dx = ∫_{S^d} I(w^T x > 0) g(x) dx,   (15)

where g(x) = y(x)x. g(x) is a (d+1)-vector whose l-th coordinate is g_l(x) = x_l y(x). We first note that g_l(x) has no DC component. This is because g_l is the product of two harmonics, the scaled first order harmonic, x_l, and the odd harmonic y(x) (with k > 1), so their inner product vanishes.
Next we will show that the kernel I(w^T x > 0) annihilates the even harmonics, for k > 1. Note that the odd/even harmonics can be written as a sum of monomials of odd/even degrees. Since g is the sum of even harmonics (the product of x_l and an odd harmonic) this will imply that (15) vanishes. Using the Funk-Hecke theorem, the even coefficients of the kernel (with k > 1) are

r^d_k = Vol(S^{d−1}) ∫_{−1}^{1} I(t > 0) P_{k,d}(t) (1 − t^2)^{(d−2)/2} dt = Vol(S^{d−1}) ∫_{0}^{1} P_{k,d}(t) (1 − t^2)^{(d−2)/2} dt
      = (Vol(S^{d−1})/2) ∫_{−1}^{1} P_{k,d}(t) (1 − t^2)^{(d−2)/2} dt = 0.   (16)

Here we align the kernel with the zonal harmonic, so that w^T x = t, justifying the second equality. The third equality is due to the symmetry of the even harmonics, and the last equality is because the harmonics of k > 0 are zero mean.

Next we compute the eigenvalues of both K∞ and K̄∞ (for simplicity we show only the case of even d, see supplementary material for the calculations). We find for networks without bias:

a^d_k = (1/(d·2^{d+1})) C1(d, 0) binom(d, d/2)                                                                           k = 0
        C1(d, 1) Σ_{q=1}^{d/2} C2(q, d, 1) · 1/(2(2q+1))                                                                 k = 1
        C1(d, k) Σ_{q=⌈k/2⌉}^{k+(d−2)/2} C2(q, d, k) · (1/(2(2q−k+2))) (1 − 2^{−(2q−k+2)} binom(2q−k+2, (2q−k+2)/2))     k ≥ 2 even
        0                                                                                                                 k ≥ 2 odd,   (17)

with

C1(d, k) = ((−1)^k/2^k) · 2π^{d/2}/Γ(k + d/2),   C2(q, d, k) = (−1)^q binom(k + (d−2)/2, q) · (2q)!/(2q − k)!.

Figure 7: Convergence times as a function of frequency for data in S^2. Left: no bias (m = 16000, n = 1001, κ = 1, and η = 0.01; training odd frequencies was stopped after 40K iterations with no significant reduction of error). Left-center: with bias (same parameters). Right-center: deep residual network (10 hidden layers with m = 256, n = 5000, η = 0.001, weight initialization as in [15], bias - uniform). The data lies on a 2D sphere embedded in R^30 at a random rotation. Growth estimates from left, O(k^2.74), O(k^2.87), O(k^3.13). Right: Convergence exponent as a function of dimension. 
g(d) = limk\u2192\u221e \u2212 log cd\nlog k estimated by calculating the\ncoef\ufb01cients up to k = 1000, indicating that the coef\ufb01cients decay roughly as 1/kd.\n\nk\n\nAdding bias to the network, the eigenvalues for \u00afK\u221e are:\n\n(cid:1) + 2d\u22121\n\nd(d\u22121\n)\n\nd\n2\n\n\u2212 1\n\n2\n\n\uf8f1\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f2\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f3\n\n(cid:32)\n\n1\n\nd\n2\n\nd2d+1\n\n1\n2 C1(d, 0)\n\n(cid:0) d\n2 C1(d, 1)(cid:80)k+ d\u22122\n2 C1(d, k)(cid:80)k+ d\u22122\n2 C1(d, k)(cid:80)k+ d\u22122\n\nq=(cid:100) k\n\nq=(cid:100) k\n\nq=(cid:100) k\n\n1\n\n1\n\n1\n\n2\n\n2\n\n2\n\n2 (cid:101) C2(q, d, 1)\n2 (cid:101) C2(q, d, k)\n2 (cid:101) C2(q, d, k)\n\ncd\nk =\n\n2\n\n2\nq\n\n(cid:16)\n\nq=0 (\u22121)q(cid:0) d\u22122\n(cid:80) d\u22122\n(cid:16)\n\n1 \u2212 1\n\n2(2q\u2212k+2)\n1 \u2212\n\n22q\n\n4q\n\n1\n\n1\n\n1\n\n\u22121\n\n1\n\n2(2q+1) + 1\n\n2(2q\u2212k+1) +\n\n2(2q\u2212k+1)\n\n22q\u2212k+1\n\n(cid:16)\n(cid:16)\n(cid:16)\n\n2q+1\n\n(cid:33)\n(cid:1) 1\n(cid:1)(cid:17)(cid:17)\n(cid:0)2q\n(cid:16)\n(cid:0)2q\u2212k+1\n\n1 \u2212\n\n2q\u2212k+1\n\nq\n\n2\n\n22q\u2212k+2\n\n1\n\n(cid:1)(cid:17)(cid:17)\n\n(cid:1)(cid:17)(cid:17)\n\n(cid:0)2q\u2212k+2\n\n2q\u2212k+2\n\n2\n\nk = 0\n\nk = 1\nk \u2265 2 even\nk \u2265 2 odd.\n(18)\n\nWe trained two layer networks with and without bias, as well as a deeper network, on data representing\npure spherical harmonics in S2. Convergence times are plotted in Figure 7. These times increase\nroughly as k3, matching our predictions in (17) and (18). We further estimated numerically the\nanticipated convergence times for data of higher dimension. As the \ufb01gure shows (right panel),\nconvergence times are expected to grow roughly as kd. We note that this is similar to the bound\nderived in [21] under quite different assumptions.\n\n5 Discussion\n\nWe have developed a quantitative understanding of the speed at which neural networks learn functions\nof different frequencies. This shows that they learn high frequency functions much more slowly than\nlow frequency functions. Our analysis addresses networks that are heavily overparameterized, but\nour experiments suggest that these results apply to real neural networks.\nThis analysis allows us to understand gradient descent as a frequency based regularization. Essentially,\nnetworks \ufb01rst \ufb01t low frequency components of a target function, then they \ufb01t high frequency\ncomponents. This suggests that early stopping regularizes by selecting smoother functions. It also\nsuggests that when a network can represent many functions that would \ufb01t the training data, gradient\ndescent causes the network to \ufb01t the smoothest function, as measured by the power spectrum of the\nfunction. In signal processing, it is commonly the case that the noise contains much larger high\nfrequency components than the signal. Hence smoothing reduces the noise while preserving most of\nthe signal. Gradient descent may perform a similar type of smoothing in neural networks.\nAcknowledgments. The authors thank Adam Klivans, Boaz Nadler, and Uri Shaham for helpful discussions.\nThis material is based upon work supported by the National Science Foundation under Grant No. DMS1439786\nwhile the authors were in residence at the Institute for Computational and Experimental Research in Mathematics\nin Providence, RI, during the Computer Vision program. 
This research is supported by the National Science\nFoundation under grant no. IIS-1526234.\n\n9\n\n0102030024Iterations104Theoretical fitCubic fitMeasurements01020300246Iterations104Theoretical fitCubic fitMeasurements01020300k50k100k150kIterationsCubic FitMeasurements25010035095.8\fReferences\n[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural\n\nnetworks. arXiv preprint arXiv:1810.12065, 2018.\n\n[2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization\nand generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584,\n2019.\n\n[3] Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine\n\nLearning Research, 18:1\u201353, 2017.\n\n[4] Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. arXiv preprint\n\narXiv:1905.12173, 2019.\n\n[5] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized\nnetworks that provably generalize on linearly separable data. In International Conference on Learning\nRepresentations, 2018.\n\n[6] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In\n\nAdvances in Neural Information Processing Systems, 2019.\n\n[7] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-\n\nparameterized neural networks. International Conference on Learning Representations (ICLR), 2019.\n\n[8] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on\n\nlearning theory, pages 907\u2013940, 2016.\n\n[9] Farzan Farnia, Jesse Zhang, and David Tse. A spectral approach to generalization and optimization in\n\nneural networks. 2018.\n\n[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep\nnetworks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages\n1126\u20131135. JMLR. org, 2017.\n\n[11] Jean Gallier. Notes on spherical harmonics and linear representations of lie groups. preprint, 2009.\n\n[12] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural\n\nnetworks in high dimension. arXiv preprint arXiv:1904.12191, 2019.\n\n[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.\n\n[14] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear\nconvolutional networks. In Advances in Neural Information Processing Systems, pages 9461\u20139471, 2018.\n\n[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into recti\ufb01ers: Surpassing\nhuman-level performance on imagenet classi\ufb01cation. In International Conference on Computer Vision\n(ICCV), pages 1026\u20131034, 2015.\n\n[16] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. arXiv preprint\n\narXiv:1810.02032, 2018.\n\n[17] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint\n\narXiv:1803.07300, 2018.\n\n[18] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably\n\nrobust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680, 2019.\n\n[19] Yuanzhi Li and Yingyu Liang. 
Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.

[20] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

[21] Nasim Rahaman, Devansh Arpit, Aristide Baratin, Felix Draxler, Min Lin, Fred A Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of deep neural networks. arXiv preprint arXiv:1806.08734, 2018.

[22] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Weight sharing is crucial to succesful optimization. arXiv preprint arXiv:1706.00687, 2017.

[23] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

[24] Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517–1539, 2016.

[25] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, Florida, pages 1216–1224, 2017.

[26] Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. CoRR, abs/1901.06523, 2019.

[27] Zhiqin John Xu. Understanding training and generalization in deep learning by Fourier analysis. CoRR, abs/1808.04295, 2018.

[28] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.", "award": [], "sourceid": 2668, "authors": [{"given_name": "Ronen", "family_name": "Basri", "institution": "Weizmann Inst."}, {"given_name": "David", "family_name": "Jacobs", "institution": "University of Maryland, USA"}, {"given_name": "Yoni", "family_name": "Kasten", "institution": "Weizmann Institute"}, {"given_name": "Shira", "family_name": "Kritchman", "institution": "Weizmann Institute"}]}