{"title": "Gradient Dynamics of Shallow Univariate ReLU Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8378, "page_last": 8387, "abstract": "We present a theoretical and empirical study of the gradient dynamics of overparameterized shallow ReLU networks with one-dimensional input, solving least-squares interpolation. We show that the gradient dynamics of such networks are determined by the gradient flow in a non-redundant parameterization of the network function. We examine the principal qualitative features of this gradient flow. In particular, we determine conditions for two learning regimes: \\emph{kernel} and \\emph{adaptive}, which depend both on the relative magnitude of initialization of weights in different layers and the asymptotic behavior of initialization coefficients in the limit of large network widths. We show that learning in the kernel regime yields smooth interpolants, minimizing curvature, and reduces to \\emph{cubic splines} for uniform initializations. Learning in the adaptive regime favors instead \\emph{linear splines}, where knots cluster adaptively at the sample points.", "full_text": "Gradient Dynamics of Shallow Univariate\n\nReLU Networks\n\nFrancis Williams\u2217\n\nMatthew Trager\u2217\n\nClaudio Silva\n\nDaniele Panozzo\n\nDenis Zorin\n\nJoan Bruna\n\nNew York University\n\nAbstract\n\nWe present a theoretical and empirical study of the gradient dynamics of overparam-\neterized shallow ReLU networks with one-dimensional input, solving least-squares\ninterpolation. We show that the gradient dynamics of such networks are determined\nby the gradient \ufb02ow in a non-redundant parameterization of the network function.\nWe examine the principal qualitative features of this gradient \ufb02ow. 
In particular, we determine conditions for two learning regimes: kernel and adaptive, which depend both on the relative magnitude of the initialization of weights in different layers and on the asymptotic behavior of the initialization coefficients in the limit of large network widths. We show that learning in the kernel regime yields smooth interpolants, minimizing curvature, and reduces to cubic splines for uniform initializations. Learning in the adaptive regime instead favors linear splines, where knots cluster adaptively at the sample points.\n\n1 Introduction\n\nAn important open problem in the theoretical study of neural networks is to describe the dynamical behavior of the parameters during training and, in particular, the influence of the dynamics on the generalization error. To make progress on these issues, a number of studies have focused on a tractable class of architectures, namely single-hidden-layer neural networks. For a fixed number of neurons, negative results establish that, even with random initialization, gradient descent may be trapped in arbitrarily bad local minima [27, 31], which motivates an asymptotic analysis that studies the optimization and generalization properties of these models as the number of neurons m grows. Recently, several works [13, 2, 6, 12, 23] explained the success of gradient descent at optimizing the loss in the over-parameterized regime, i.e., when the number of neurons is significantly higher than the number of training samples. In parallel, another line of work established global convergence of gradient descent using tools from optimal transport and mean-field theory [8, 26, 22, 29]. 
The essential difference between these two approaches was pointed out in [7], and is related to the use of a different scaling parameter as the number of neurons tends to infinity: in one case, the neural network behaves asymptotically as a kernel machine [19], which in turn implies that as over-parameterization increases, the parameters stay close to their initial value; in contrast, in the mean-field setting, the parameters asymptotically evolve following a PDE based on a continuity equation.\nAlthough both scaling regimes explain the success of gradient descent optimization on over-parameterized networks, they paint a different picture when it comes to generalization. The generalization properties in the kernel regime borrow from the well-established theory of kernel regression in Reproducing Kernel Hilbert Spaces (RKHS), which has been applied to kernels arising from neural networks in [17, 14, 20, 24, 10], and provide a somewhat underwhelming answer regarding the benefit of neural networks over kernel methods. However, in practice, large neural networks do not exhibit the traits of kernel/lazy learning, since filter weights significantly deviate from their initialization despite the over-parameterization. Also, empirically, active learning provides better generalization than kernel learning [7], although the theoretical reasons for this are still poorly understood.\n\n*Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we consider a simplified setting, and study wide, single-hidden-layer ReLU networks with one-dimensional inputs. We show how the kernel and active dynamics define fundamentally different function estimation models. For a fixed number of neurons, the network may follow either of these dynamics, depending on the initialization. 
Specifically, we show that kernel dynamics correspond to interpolation with cubic splines, whereas active dynamics yield adaptive linear splines, where neurons accumulate at the discontinuities and yield piecewise-linear approximations.\nFurther related work. Our work lies at the intersection of two lines of research: the works, described above, that study optimization and generalization for shallow neural networks, and the works that attempt to shed light on these properties for low-dimensional inputs. In the latter category, we mention [4] for their study of the ability of ReLU networks to approximate low-dimensional manifolds, and [32] for their empirical study of 3D surface reconstruction using precisely this intrinsic bias of SGD in overparameterized ReLU networks. Another remarkable recent work is [11], where the approximation power of deep ReLU networks is studied in the context of univariate functions. Our analysis in the active regime (Sec. 3.1) is closely related to [21], in which the authors establish convergence of gradient descent to piecewise-linear functions under initializations sufficiently close to zero. We provide an Eulerian perspective using Wasserstein gradient flows that simplifies the analysis, and is consistent with their conclusions. The implicit bias of SGD dynamics appears in several works, such as [30, 15], and, closest to our setup, in [28], where the authors observe a link between gradient dynamics and linear splines. They do not, however, observe the connection with cubic splines, although they note experimentally that the function returned by a network is often smooth and not piecewise linear. Finally, we mention related work that studies the tessellation of the input space induced by ReLU networks [16].\nMain contributions. The goal of this paper is to describe the qualitative behavior of the gradient dynamics of 1D shallow ReLU networks. 
Our main contributions can be summarized as follows.\n• We investigate the gradient dynamics of shallow 1D ReLU networks using a "canonical" parameterization (Sec. 3.1). We show that the dynamics in this case are completely determined by the evolution of the residuals. Furthermore, neurons will always accumulate at certain sample points where the residual is large and of opposite sign compared to neighboring samples. This means that the dynamics in the reduced parameterization are biased towards functions that are piecewise linear.\n• We observe that the dynamics in the full parameters are related to the dynamics in the canonical parameters by a change of metric that depends only on the network at initialization. This change of metric is expressible in terms of invariants δi associated with each neuron. When δi ≫ 0 the dynamics in the full parameters (locally) agree with the dynamics in the reduced parameters; when δi ≪ 0, the dynamics in the full parameters (locally) follow kernel dynamics, in which only the outer-layer weights change.\n• We consider the idealized kernel dynamics in the limit of infinitely many neurons, and we show that the RKHS norm of a function f corresponds to a weighted L2 norm of the second derivative f'', i.e., a form of linearized curvature. For appropriate initial distributions of neurons, the solution to kernel learning is a smooth cubic spline (Theorem 5). This illustrates the qualitative difference between the "reduced" and "kernel" regimes, which depend on the parameter lift at initialization. Arbitrary initializations will locally interpolate between these two regimes.\n• We also discuss the effect of applying a scaling parameter α(m) to the network function (where m is the number of neurons), which becomes important as the number of neurons tends to infinity. 
As argued in [7], when α(m) = o(m), the variation of each neuron will asymptotically go to zero (lazy regime), so our local analysis translates into a global one.\n\n2 Preliminaries\n\nWe consider the problem of training a two-layer ReLU neural network with m hidden neurons, a single scalar input, and a single scalar output, using the least-squares loss:\n\n$\\min_z L(z) = \\frac{1}{2} \\sum_{j=1}^{s} |f_z(x_j) - y_j|^2, \\quad \\text{where} \\quad f_z(x) = \\frac{1}{\\alpha(m)} \\sum_{i=1}^{m} c_i [a_i x - b_i]_+, \\quad z = (a \\in \\mathbb{R}^m, b \\in \\mathbb{R}^m, c \\in \\mathbb{R}^m). \\quad (1)\n\nFigure 1: Left: A network function f_z(x) interpolating input samples (blue x's). The knots of f_z(x) as a piecewise-linear function are plotted as green circles. Right: The canonical parameters of the network visualized as in (6). Each particle represents a neuron, and its color indicates the sign εi. The samples x_j correspond to the lines u x_j + v = 0. The colored regions correspond to different activation patterns of the neurons on the training data.\n\nHere (x_j, y_j) ∈ R², j = 1, . . . , s is a given set of samples, z is a vector of parameters, and α(m) is a normalization factor that will be important as we consider the limit m → ∞. We are interested in the minimization of (1) performed by gradient descent over the parameters z. This scheme may be analyzed through its continuous-time counterpart, the gradient flow\n\n$z'(t) = -\\nabla L(z(t)), \\quad z(0) = z_0. \\quad (2)$\n\nWhile (2) describes the dynamics of z(t) in parameter space, we are ultimately interested in the trajectories of the function f_{z(t)}. Let F := {f : R → R} denote the space of square-integrable scalar functions, and let φ be the function-valued mapping φ(z) := f_z. 
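As a concrete reference for the objective above, here is a minimal NumPy sketch of the network function and the least-squares loss in (1); the width, sample data, and the choice α = 1 are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def f_z(x, a, b, c, alpha=1.0):
    """Network function f_z(x) = (1/alpha) * sum_i c_i * relu(a_i * x - b_i)."""
    pre = np.outer(x, a) - b             # pre-activations, shape (n_points, m)
    return np.maximum(pre, 0.0) @ c / alpha

def loss(x, y, a, b, c, alpha=1.0):
    """Least-squares loss L(z) = 0.5 * sum_j |f_z(x_j) - y_j|^2."""
    return 0.5 * np.sum((f_z(x, a, b, c, alpha) - y) ** 2)

rng = np.random.default_rng(0)
m = 50                                   # illustrative network width
a, b, c = rng.normal(size=(3, m))
xs = np.linspace(-1.0, 1.0, 10)
ys = np.sin(np.pi * xs)
print(loss(xs, ys, a, b, c))
```

The resulting function of x is piecewise linear, with a potential knot at each e_i = b_i / a_i.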
Since L(z) = R(φ(z)) with $R(f) = \\frac{1}{2} \\sum_{j \\le s} |f(x_j) - y_j|^2$, we have by the chain rule that the dynamics of g(t) := φ(z(t)) = f_{z(t)} are described by\n\n$g'(t) = -\\nabla \\varphi(z(t))^{\\top} \\nabla \\varphi(z(t)) \\nabla R(g(t)), \\quad g(0) = f_{z_0}. \\quad (3)$\n\nThe dynamics in function space are thus controlled by a time-varying tangent kernel $K_t := \\nabla \\varphi(z(t))^{\\top} \\nabla \\varphi(z(t))$. It was shown in [19] that under certain assumptions the kernel K_t remains nearly constant throughout training.\nIt is immediate to see that the parameters z can be continuously rescaled without affecting the function f_z, according to (a_i, b_i, c_i) ↦ (a_i k_i, b_i k_i, c_i/k_i) with k_i > 0. In order to eliminate this ambiguity, we introduce the following canonical parameterization of the network's functional space:\n\n$\\tilde{f}_w(x) = \\frac{1}{m} \\sum_{i=1}^{m} r_i \\langle \\tilde{x}, d(\\theta_i) \\rangle_+, \\quad w = (r \\in \\mathbb{R}^m, \\theta \\in [0, 2\\pi)^m), \\quad \\tilde{x} = (x, 1), \\quad (4)$\n\nwhere d(θi) = (cos θi, sin θi) ∈ S¹. The natural mapping into canonical parameters is given by\n\n$\\pi(a_i, b_i, c_i) = \\left( \\frac{m}{\\alpha(m)} c_i \\sqrt{a_i^2 + b_i^2}, \\; \\arctan(-b_i/a_i) \\right) = (r_i, \\theta_i). \\quad (5)$\n\nThis mapping clearly satisfies $\\tilde{f}_{\\pi(z)} = f_z$. We can also define the loss with respect to this parameterization as $\\tilde L(w) = L(z)$, where w = π(z). We will compare the dynamics of L(z) with those of $\\tilde L(w)$ to study the impact on training of different choices of parameterization and initialization, as well as the asymptotic behavior of (3) as m increases.\nVisualizing a network function. We can visualize a network function f_z(x) in two ways. First, we can plot f_z(x) as a scalar function (Figure 1, left). 
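The change of parameterization (5) can be checked numerically. The sketch below is a minimal illustration under assumed random weights and α(m) = 1; it uses atan2(−b, a) to place the angle in the correct quadrant (what arctan(−b_i/a_i) denotes, up to a 2π shift), so the identity holds for either sign of a_i:

```python
import numpy as np

def to_canonical(a, b, c, alpha=1.0):
    """pi(a_i, b_i, c_i) = ((m/alpha) c_i sqrt(a_i^2 + b_i^2), angle of (a_i, -b_i))."""
    m = len(a)
    r = (m / alpha) * c * np.sqrt(a**2 + b**2)
    theta = np.arctan2(-b, a)            # quadrant-aware version of arctan(-b/a)
    return r, theta

def f_canonical(x, r, theta):
    """f~_w(x) = (1/m) sum_i r_i * <(x, 1), (cos th_i, sin th_i)>_+."""
    m = len(r)
    pre = np.outer(x, np.cos(theta)) + np.sin(theta)   # <x~, d(theta_i)>
    return np.maximum(pre, 0.0) @ r / m

def f_z(x, a, b, c, alpha=1.0):
    return np.maximum(np.outer(x, a) - b, 0.0) @ c / alpha

rng = np.random.default_rng(1)
a, b, c = rng.normal(size=(3, 20))
xs = np.linspace(-2.0, 2.0, 7)
r, theta = to_canonical(a, b, c)
# Maximum deviation between the two parameterizations: zero up to float rounding.
print(np.max(np.abs(f_canonical(xs, r, theta) - f_z(xs, a, b, c))))
```

This works because ⟨(x, 1), d(θ_i)⟩ = (a_i x − b_i)/√(a_i² + b_i²), so the normalization in r_i cancels the one in the feature.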
Note that f_z(x) is a continuous piecewise-linear function of x whose knots are the points where the operand inside a ReLU activation changes sign, namely e_i = b_i/a_i, a_i ≠ 0, i = 1, . . . , m. Alternatively, we can visualize the canonical parameters w = π(z) in R², by plotting a neuron (r_i, θ_i) as a particle with coordinates\n\n$(u_i, v_i) = (|r_i| \\cos(\\theta_i), |r_i| \\sin(\\theta_i)), \\quad (6)$\n\nand coloring each particle according to εi = sign(r_i) (Figure 1, right). In this visualization, each training sample point x_j can be represented as the line u x_j + v = 0, which identifies the half-plane of neurons that are active at x_j. The collection of such lines for all samples partitions the plane into activation regions, where neurons have a fixed activation pattern on the training data.\n\n3 Training Dynamics\n\nOur goal is to solve (1) using the gradient flow (2) of the loss L(z).¹ We begin in Section 3.1 by investigating the gradient dynamics in the canonical parameterization:\n\n$w'(t) = -\\nabla \\tilde L(w(t)), \\quad w(0) = w_0. \\quad (7)$\n\nWhile the relationship between the flows (2) and (7) is nonlinear, we argue in Section 3.2 that they are related by a simple change of metric.\n\n3.1 Dynamics in the Canonical Parameters\n\nWe assume that the canonical parameters (r_i, θ_i) are initialized i.i.d. from some base distribution μ(r, θ). The function $\\tilde f_w$ is well-defined pointwise as m → ∞, by the law of large numbers. Following the mean-field formulation of single-hidden-layer networks [22, 8, 26], we express the function as an expectation with respect to the probability measure over the cylinder D = R × S¹:\n\n$\\tilde f_w(x) = \\int_D \\varphi(w; x) \\, \\mu^{(m)}(dw),$\n\nwhere $\\varphi(w; x) := r \\langle \\tilde x, d(\\theta) \\rangle_+$ and $\\mu^{(m)}(w) = \\frac{1}{m} \\sum_{i=1}^{m} \\delta_{w_i}(w)$ is the empirical measure determined by the m particles w_i, i = 1, . . . , m. The least-squares loss in this case becomes\n\n$\\tilde L(w) = \\frac{1}{2} \\| \\tilde f_w - y \\|_X^2 = C_y - \\frac{1}{m} \\sum_{i=1}^{m} \\langle \\varphi_{w_i}, y \\rangle_X + \\frac{1}{2m^2} \\sum_{i,i'=1}^{m} \\langle \\varphi_{w_i}, \\varphi_{w_{i'}} \\rangle_X,$\n\nwhere $\\langle f, g \\rangle_X := \\sum_{j=1}^{s} f(x_j) g(x_j)$ is the empirical dot product. This loss may be interpreted as the Hamiltonian of a system of m interacting particles, under an external field F and interaction kernel K defined respectively by $F(w) := \\langle \\varphi_w, y \\rangle_X$ and $K(w, w') := \\langle \\varphi_w, \\varphi_{w'} \\rangle_X$. We may also express this Hamiltonian in terms of the empirical measure, by abusing notation:\n\n$\\tilde L(\\mu^{(m)}) = C_y - \\int_D F(w) \\, \\mu^{(m)}(dw) + \\frac{1}{2} \\iint_{D^2} K(w, w') \\, \\mu^{(m)}(dw) \\, \\mu^{(m)}(dw').$\n\nA direct calculation shows that the gradient $\\nabla_{w_i} \\tilde L(w)$ can be written as\n\n$m \\nabla_{w_i} \\tilde L(w) = \\nabla_w V(w_i; \\mu^{(m)}),$\n\nwhere V is the potential function $V(w; \\mu) := -F(w) + \\int_D K(w, w') \\mu(dw')$. The gradient flow in the space of parameters w can now be interpreted in Eulerian terms as a gradient flow in the space of measures over D, using the notion of Wasserstein gradient flows [22, 8, 26]. Indeed, particles evolve in D by "feeling" a velocity field ∇V defined on D. This formalism allows us to describe the dynamics independently of the number of neurons m, by replacing the empirical measure μ(m) with any generic probability measure μ on D. 
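The particle form of the loss above can be verified directly. The following sketch (with illustrative random particles and samples, and writing the constant term as C_y = ½‖y‖²_X, which is what it must be for the expansion of the square) checks that the Hamiltonian decomposition agrees with the direct squared error:

```python
import numpy as np

rng = np.random.default_rng(2)
m, s = 30, 8
r = rng.normal(size=m)
theta = rng.uniform(0.0, 2.0 * np.pi, size=m)
xj = np.sort(rng.uniform(-1.0, 1.0, size=s))
y = rng.normal(size=s)

# phi[i, j] = phi_{w_i}(x_j) = r_i * <(x_j, 1), d(theta_i)>_+
phi = r[:, None] * np.maximum(
    np.outer(np.cos(theta), xj) + np.sin(theta)[:, None], 0.0)

f = phi.mean(axis=0)                     # f~_w(x_j) = (1/m) sum_i phi_{w_i}(x_j)
direct = 0.5 * np.sum((f - y) ** 2)      # L~(w) = 0.5 * ||f~_w - y||_X^2

G = phi @ phi.T                          # Gram matrix <phi_{w_i}, phi_{w_i'}>_X
hamiltonian = 0.5 * y @ y - (phi @ y).sum() / m + G.sum() / (2 * m**2)
print(direct, hamiltonian)               # the two expressions coincide
```

The agreement is just the expansion ½‖f − y‖² = ½‖y‖² − ⟨f, y⟩ + ½‖f‖² applied to the empirical dot product.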
The evolution of a measure under a generic time-varying vector field is given by the so-called continuity equation:²\n\n$\\partial_t \\mu_t = \\mathrm{div}(\\nabla V \\mu_t). \\quad (8)$\n\nThe global convergence of this PDE for interaction kernels arising from single-hidden-layer neural networks has been established under mild assumptions in [22, 8, 25]. Although the conditions for global convergence hold in the mean-field limit m → ∞, a propagation-of-chaos argument from statistical mechanics gives Central Limit Theorems for the behavior of finite-particle systems as fluctuations of order $1/\\sqrt{m}$ around the mean-field solution; see [26, 25] for further details.\nThe dynamics in D are thus described by the velocity field ∇V(w; μt), which depends on the current state of the system through the measure μt(w), describing the probability of encountering a particle at position w at time t.\n\n¹To be precise, we should replace the gradient ∇L(z) with the Clarke subdifferential ∂L(z) [9], since L(z) is only piecewise smooth. At generic smooth points z, the subdifferential coincides with the gradient: ∂L(z(t)) = {∇L(z)}.\n²Understood in the weak sense, i.e., $\\partial_t \\left( \\int_D \\phi(w) \\mu_t(dw) \\right) = -\\int \\langle \\nabla \\phi(w), \\nabla V(w; \\mu_t) \\rangle \\mu_t(dw)$ for all $\\phi \\in C^1_c(D)$, continuously differentiable and with compact support.\n\nWe emphasize that equation (8) is valid for any measure, including the empirical measure μ(m), and is therefore an exact model for the dynamics in both the finite-particle and infinite-particle regimes. Let us now describe its specific form in the case of the empirical loss given above.\nAssume without loss of generality that the data points x_j ∈ R, j ≤ s, satisfy x_j ≤ x_{j'} whenever j < j'. Denote\n\n$C_j := \\{ j' ; j' \\le j \\} \\text{ for } j = 1, \\ldots, s, \\qquad C_{s+j} := \\{ j' ; j' > j \\} \\text{ for } j = 1, \\ldots, s-1.$\n\nWe observe that for each j, the two angles $\\alpha_j^{\\pm} = \\arctan(x_j) \\pm \\pi/2$ partition the circle S¹ into 2s − 1 regions A_k (visualized as the colored regions in Figure 1), which are in one-to-one correspondence with the sets C_k, in the sense that θ ∈ A_k if and only if $\\{ j ; \\langle \\tilde x_j, d(\\theta) \\rangle \\ge 0 \\} = C_k$. We also denote by B_j the half-circle where $\\langle \\tilde x_j, d(\\theta) \\rangle \\ge 0$. Let t(θ) be the tangent vector of S¹ at θ (so t(θ) = d(θ)^⊥) and w = (r, θ), where we suppose θ ∈ A_k. A straightforward calculation (see Appendix B) shows that the radial and angular components of ∇V(w; μt) are given by\n\n$\\nabla_r V(w; \\mu_t) = \\Big\\langle \\sum_{j \\in C_k} \\rho_j(t) \\tilde x_j, \\, d(\\theta) \\Big\\rangle, \\qquad \\nabla_\\theta V(w; \\mu_t) = r \\Big\\langle \\sum_{j \\in C_k} \\rho_j(t) \\tilde x_j, \\, t(\\theta) \\Big\\rangle, \\quad (9)$\n\nwhere $\\rho_j(t) = \\int_{\\mathbb{R} \\times B_j} r \\langle \\tilde x_j, d(\\theta) \\rangle \\, \\mu_t(dr, d\\theta) - y_j$ is the residual at point x_j at time t. These expressions show that the dynamics are entirely controlled by the s-dimensional vector of residuals ρ(t) = (ρ1(t), . . . , ρs(t)), and that the velocity field is piecewise linear on each cylindrical region R × A_k (e.g., Figure 9 in Appendix D). Under the assumptions that ensure global convergence of (8), we have $\\lim_{t \\to \\infty} \\tilde L(\\mu_t) = 0$, and therefore ‖ρ(t)‖ → 0. The oscillations of ρ(t) as it converges to zero determine the relative orientation of the flow within each region. The exact dynamics for the vector of residuals are given by the following proposition, proved in Appendix B:\nProposition 1. 
For each j,\n\n$\\dot{\\rho}_j(t) = -\\tilde x_j^{\\top} \\sum_{k : A_k \\subset B_j} \\Sigma_k(t) \\sum_{j' \\in C_k} \\rho_{j'}(t) \\tilde x_{j'}, \\quad (10)$\n\nwhere $\\Sigma_k(t) = \\int_{\\mathbb{R} \\times A_k} \\left( r^2 t(\\theta) t(\\theta)^{\\top} + d(\\theta) d(\\theta)^{\\top} \\right) \\mu_t(dr, d\\theta)$ tracks the covariance of the measure along each cylindrical region.\nEquation (10) defines a system of ODEs for the residuals ρ(t), but its coefficients are time-varying, and behave roughly as quadratic terms in ρ(t) (since they are second-order moments of the measure, whereas the residuals are first-order moments). It may be possible to obtain asymptotic control of the oscillations of ρ(t) by applying Duhamel's principle, but this is left for future work.\nNow let w = (r, θ) with θ at a boundary of two regions A_k, A_{k+1}. The velocity field is modified at the transition by\n\n$\\nabla V(w)|_{A_k} - \\nabla V(w)|_{A_{k+1}} = \\rho_{j^*}(t) \\begin{pmatrix} r \\langle \\tilde x_{j^*}, t(\\theta) \\rangle \\\\ \\langle \\tilde x_{j^*}, d(\\theta) \\rangle \\end{pmatrix},$\n\nwhere j* is such that $\\langle \\tilde x_{j^*}, d(\\theta) \\rangle = 0$, since θ is at the boundary of A_k. It follows that the only discontinuity is in the angular direction, of magnitude $|r \\rho_{j^*}(t)| \\, \\|\\tilde x_{j^*}\\|$. In particular, an interesting phenomenon arises when the angular components of ∇V(w)|A_k and ∇V(w)|A_{k+1} have opposite signs, corresponding to an "attractor" or "repulsor" that attracts/repels particles along the direction given by $\\tilde x_{j^*}$ (see Figure 9 in Appendix D). Writing $s_k = \\big\\langle \\sum_{j \\in C_k} \\rho_j(t) \\tilde x_j, \\, t(\\theta) \\big\\rangle$, we deduce from (9) that this occurs when $|s_k| < |\\rho_{j^*}(t)| \\, \\|\\tilde x_{j^*}\\|$ and $\\mathrm{sign}(s_k) \\ne \\mathrm{sign}(\\rho_{j^*}(t))$. We expand this condition in the following lemma.\nLemma 2. A data point x_k is an attractor/repulsor if and only if\n\n$\\sum_{i=1}^{k-1} \\rho_i \\rho_k \\langle \\tilde x_i, \\tilde x_k \\rangle > -\\rho_k^2 \\|\\tilde x_k\\|^2, \\quad \\text{or} \\quad \\sum_{i=k+1}^{s} \\rho_i \\rho_k \\langle \\tilde x_i, \\tilde x_k \\rangle > -\\rho_k^2 \\|\\tilde x_k\\|^2.$\n\nIn words, mass will concentrate towards input points where the residual is currently large and of opposite sign from a weighted average of the neighboring residuals. This is in stark contrast with the kernel dynamics (Section 3.3), where there is no adaptation to the input data points. We point out that this qualitative behavior has been established in [21] under appropriate initial conditions, sufficiently close to zero, in line with our mean-field analysis. We also refer to Section B.2 of the Appendix, where we describe the adaptive regime when the objective is augmented with TV regularization.\n\nFigure 2 (panels δ = −100, δ = 0, δ = 100): The value of δ interpolates between different kinds of local trajectories of neurons. The plots are in the coordinate frame $(\\nabla \\tilde L, \\nabla \\tilde L^{\\perp})$. Left: the neurons move radially towards and away from the origin. Middle: the trajectories contain both radial and tangential components. Right: the trajectories are parallel to the gradient $\\nabla \\tilde L$.\n\n3.2 Dynamics in the Full Parameters\n\nThe dynamics of the gradient flow (2) are different from the dynamics of the gradient flow (7). For the gradient flow in canonical parameters, we have observed adaptive learning behavior under the assumption of an i.i.d. distribution of the parameter initialization. 
The full set of parameters z = (a, b, c) may exhibit both kernel and adaptive behavior depending on the initialization. To characterize this behavior we rely on the following lemma.\nLemma 3. If z(t) = (a(t), b(t), c(t)) is a solution of the gradient flow (2), then the quantities\n\n$\\delta_i = c_i(t)^2 - a_i(t)^2 - b_i(t)^2, \\quad i = 1, \\ldots, m, \\quad (11)$\n\nremain constant for all t. In particular, given a reduced neuron (r_i, θ_i), we can uniquely recover the original neuron parameters (a_i, b_i, c_i) from the invariant δ_i computed at initialization.\nLemma 3 allows us to analyze how the canonical parameters evolve under the full gradient flow in (a, b, c). Overall, the behavior is qualitatively the same, except that it additionally depends on the relative scale of the redundant parameters.\nProposition 4. Let z(t) be a solution of the gradient flow (2) of L(z), and let δ = (δ_i) ∈ R^m be the vector of invariants (11), which depend only on the initialization z(0). If w(t) = (r(t), θ(t)) is the curve of canonical parameters corresponding to z(t), then we have\n\n$\\dot{w}_i(t) = -P_i \\nabla_{w_i} \\tilde L(w), \\quad i = 1, \\ldots, m, \\quad (12)$\n\nwhere\n\n$P_i = \\begin{bmatrix} \\frac{m^2}{\\alpha(m)^2} (a_i^2 + b_i^2 + c_i^2) & 0 \\\\ 0 & \\frac{1}{a_i^2 + b_i^2} \\end{bmatrix}. \\quad (13)$\n\nWith respect to the rescaled differentials dτ = r dθ, corresponding to representing the flow locally in a Cartesian system aligned with the radial direction (pointing away from z = 0) and its perpendicular, the flow can be written as\n\n$\\begin{bmatrix} dr_i \\\\ d\\tau_i \\end{bmatrix} = -\\frac{m^2}{\\alpha(m)^2} \\begin{bmatrix} a_i^2 + b_i^2 + c_i^2 & 0 \\\\ 0 & c_i^2 \\end{bmatrix} \\begin{bmatrix} \\nabla_{r_i} \\tilde L(w) \\\\ \\nabla_{\\tau_i} \\tilde L(w) \\end{bmatrix} dt, \\quad i = 1, \\ldots, m. \\quad (14)$\n\nFrom these equations, one can see that if $c_i^2 \\ll a_i^2 + b_i^2$ for all i (i.e., δ_i ≪ 0), then radial motion will dominate. In other words, initializing the first layer with significantly larger values than the second leads to kernel learning. On the other hand, if $c_i^2 \\gg a_i^2 + b_i^2$, then a solution of the gradient flow (2) will follow the same trajectory as the reduced gradient flow (7). Also, if α(m) = o(m), the radial component will dominate as m increases. Figure 2 shows the trajectories corresponding to different values of δ_i for each neuron, with α(m) = m. The extreme cases δ = −∞ and δ = +∞ correspond to the "kernel" and "adaptive" regimes, respectively. Note that as δ increases, the neurons cluster at the sample points, as explained in our analysis in Section 3.1, and in accordance with [21].\n\n3.3 Kernel Dynamics\n\nWe now consider the dynamics in the special case where δ ≪ 0, and we consider m → ∞. To obtain the kernel regime in this case, it is sufficient to consider a normalization α(m) = o(m). 
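The conserved quantities of Lemma 3 can be observed numerically: with a small enough step, discrete gradient descent on (1) approximates the flow (2), and the drift of each δ_i is of second order in the step size. A minimal sketch, in which the data, width, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def grads(x, y, a, b, c, alpha=1.0):
    """Gradients of L(z) = 0.5 * sum_j (f_z(x_j) - y_j)^2 for f_z as in (1)."""
    pre = np.outer(x, a) - b                 # shape (s, m)
    act = (pre > 0).astype(float)            # ReLU activation pattern
    res = np.maximum(pre, 0.0) @ c / alpha - y   # residuals f_z(x_j) - y_j
    ga = ((res[:, None] * act * x[:, None]) * c).sum(0) / alpha
    gb = -((res[:, None] * act) * c).sum(0) / alpha
    gc = (res[:, None] * np.maximum(pre, 0.0)).sum(0) / alpha
    return ga, gb, gc

rng = np.random.default_rng(3)
m = 40
a, b, c = rng.normal(size=(3, m))
x = np.linspace(-1.0, 1.0, 6)
y = np.sin(np.pi * x)

delta0 = c**2 - a**2 - b**2                  # invariants (11) at initialization
lr = 1e-4                                    # small step, approximating the flow
for _ in range(300):
    ga, gb, gc = grads(x, y, a, b, c)
    a, b, c = a - lr * ga, b - lr * gb, c - lr * gc

drift = np.max(np.abs(c**2 - a**2 - b**2 - delta0))
print(drift)                                 # small: the delta_i are conserved
```

Along the exact flow the drift vanishes identically, since $c_i \dot c_i - a_i \dot a_i - b_i \dot b_i = 0$ term by term; the residual drift here is only the Euler discretization error.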
In particular, when α(m) = 1, as shown in the previous section, the parameters a and b remain mostly fixed while the parameters c change throughout training, corresponding to the so-called random-features (RF) kernel of Rahimi and Recht [24].\nIn the limiting case where a and b are completely fixed at their initial values, if we choose c close to the zero vector, then the least-squares problem (1) solved using gradient flow is equivalent to the minimal-norm constrained problem:\n\n$\\text{minimize } \\|c\\|^2 \\quad \\text{subject to } f_z(x_i) = y_i, \\quad i = 1, \\ldots, s. \\quad (15)$\n\nGiven an initial distribution μ0 over the domain D_a × D_b of the parameters a and b, the random-features (RF) kernel associated with (15) is given by\n\n$K_{\\mathrm{RF}}(x, x') = \\int_{D_a \\times D_b} [xa - b]_+ \\cdot [x'a - b]_+ \\, \\mu_0(da, db). \\quad (16)$\n\nThe solution of (15) can now be written in terms of this RF kernel using the representer theorem: $\\tilde f_z(x) = \\sum_{j=1}^{s} \\alpha_j K_{\\mathrm{RF}}(x_j, x)$, where α is a vector of minimal RKHS norm that fulfills the interpolation constraints. Under appropriate assumptions, the solution to (15) is a cubic spline.\nTheorem 5. Assume the measure μ0 has finite second moment $\\sigma^2_{\\mu_0} := \\mathbb{E}_{(a,b) \\sim \\mu_0}(a^2 + b^2) < \\infty$. Let μ0(a, b) = q(a) μ_a(b) be the decomposition in terms of marginal and conditional, and assume μ_a is bounded for each a. Define $\\nu(u) = \\int |a|^3 \\mu_a(au) \\, dq(a)$. Then the solution of (15) solves\n\n$\\min_f \\|f\\|^2_{\\mathrm{RF}} := \\int_{\\Omega} \\frac{|f''(u)|^2}{\\nu(u)} \\, du \\quad \\text{s.t.} \\quad f(x_i) = y_i, \\; i = 1, \\ldots, s, \\quad (17)$\n\nwhere Ω := supp(ν). 
Moreover, if μ0 is such that μ0(a, b) = q(a) 1(b ∈ I_a), where I_a ⊂ R is an arbitrary interval, then the solution of (15) is a cubic spline.\nNotice that the assumptions on μ0 needed to obtain an exact cubic-spline kernel impose that if (A, B) is a random vector distributed according to μ0, then B|A is uniform over an arbitrary interval I_A that may depend on A. The proof shows that one may generalize the interval I_A to any countable union of intervals. In particular, independent uniform initialization yields cubic splines, but radial distributions, such as (A, B) jointly Gaussian, do not (see Section A.3 in the Appendix). We remark that machine learning packages such as PyTorch use a uniform distribution for linear-layer parameter initialization by default. We verify in Figure 3 that solutions to (1) indeed converge to cubic splines as m grows. We also point out that in kernel learning, early termination of the gradient flow acts as a regularizer favoring smooth, non-interpolatory solutions (see [19]).\nThe analysis and comparison of these kernels have recently been addressed in [5, 14] in the general high-dimensional setting, by describing their spectral decay in terms of spherical harmonics. Our results complement them in the particular one-dimensional setting thanks to the explicit functional form of the resulting RKHS norms. Additionally, Savarese et al. [28] study the functional form of the minimization in the variation norm, leading to a penalty of the form $\\int |f''(u)| \\, du$. We obtain instead weighted L2 norms (RKHS) in the kernel regime. The L2 norms do not provide any adaptivity, as opposed to the L1 norm [3]. An interesting question is to precisely describe the transition between these two regimes as a function of the initialization.\nNumerical Experiments. 
For our numerical experiments, we use gradient descent with the parameterization (1) and α(m) = √m, appropriately scaling the weights a, b, c to achieve the different dynamical behaviors. We also refer to Section D in the Appendix for additional experiments.\nCubic Splines. We show in Figure 3 that when −δ ≫ r² (i.e., in the kernel regime), and as the number of neurons grows, the network function f_z converges to a cubic spline. For this experiment, we used 10 points sampled from a square wave, and trained only the parameters c (i.e., δ_i = −∞).\nNetwork Dynamics as a Function of δ. We show in Figure 4 that as we vary δ, the network function goes from smooth and non-adaptive in the kernel regime (δ = −∞, i.e., training only the parameters c) to very adaptive (δ = ∞, i.e., training only the parameters a, b). Note that as δ increases, clusters of knots emerge at the sample positions (collinear points in the uv-diagrams).\n\nFigure 3 (panels m = 10², 10³, 10⁴): A cubic spline with vanishing second derivative at its endpoints (blue line) is approximated by a neural network (δ = −100) while varying the number m of neurons.\n\nFigure 4 (panels δ = −∞, −1, 0, 1, ∞): Comparison of fitting the network function to a sinusoid as δ varies (10000 epochs).\n\n4 Concluding Remarks\n\nWe have studied the implicit bias of gradient descent in the approximation of univariate functions with single-hidden-layer ReLU networks. Despite being an extremely simplified learning setup, it provides a clear illustration that this implicit bias can be drastically different depending on how the neural architecture is parameterized, normalized, or even initialized. 
Building on recent theoretical work that studies neural networks in the overparameterized regime, we show how the model can behave either as a 'classic' cubic spline interpolation kernel, or as an adaptive interpolation method, where neurons concentrate on sample points where the approximation most needs them. Moreover, in the one-dimensional case, we complement existing works [29] to reveal a transition between these two extreme training regimes, which roughly correspond to the W^{1,2} and W^{2,2} Sobolev spaces, respectively. Although in our univariate setup there is no clear advantage of one functional space over the other, our full description of the dynamics may prove useful in the high-dimensional regime, where the curse of dimensionality affects Hilbert spaces defined by kernels [3]. We believe that the analysis of the PDE resulting from the mean-field regime (where adaptation occurs) in the low-dimensional setting will be useful to embark on the analysis of the high-dimensional counterpart. We note, however, that naively extending our analysis to high dimensions would result in an exponential increase in the number of regions that define our piecewise linear flow; we therefore anticipate that new tools might be needed. Moreover, the interpretation of ReLU features in terms of Green's functions (as first pointed out in [29]) does not directly carry over to higher dimensions. Lastly, another important limitation of the mean-field analysis is that it cannot be easily adapted to deep neural network architectures, since neurons are no longer exchangeable as in the many-particle system described above.
Acknowledgements: This work was partially supported by the Alfred P.
Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, Samsung Electronics, the NSF CAREER award 1652515, the NSF grant IIS-1320635, the NSF grant DMS-1436591, the NSF grant DMS-1835712, the SNSF grant P2TIP2_175859, the Moore-Sloan Data Science Environment, the DARPA D3M program, NVIDIA, Labex DigiCosme, DOA W911NF-17-1-0438, a gift from Adobe Research, and a gift from nTopology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

References

[1] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

[3] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.

[4] Ronen Basri and David Jacobs. Efficient representation of low-dimensional manifolds using deep networks. arXiv preprint arXiv:1602.04723, 2016.

[5] Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. arXiv preprint arXiv:1905.12173, 2019.

[6] Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning overparameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.

[7] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.

[8] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport.
In Advances in Neural Information Processing Systems, pages 3036–3046, 2018.

[9] Frank H Clarke. Generalized gradients and applications. Transactions of the American Mathematical Society, 205:247–262, 1975.

[10] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.

[11] I Daubechies, R DeVore, S Foucart, B Hanin, and G Petrova. Nonlinear approximation and (deep) ReLU networks. arXiv preprint arXiv:1905.02199, 2019.

[12] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[13] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[14] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.

[15] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.

[16] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. arXiv preprint arXiv:1901.09021, 2019.

[17] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

[18] Thomas Hotz and Fabian JE Telschow. Representation by integrating reproducing kernels. arXiv preprint arXiv:1202.4443, 2012.

[19] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[20] Chao Ma, Lei Wu, et al.
A comparative analysis of the optimization and generalization property of two-layer neural network and random feature models under gradient descent dynamics. arXiv preprint arXiv:1904.04326, 2019.

[21] Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367, 2018.

[22] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, August 2018.

[23] Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004, 2018.

[24] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[25] Grant Rotskoff, Samy Jelassi, Joan Bruna, and Eric Vanden-Eijnden. Global convergence of neuron birth-death dynamics. arXiv preprint arXiv:1902.01843, 2019.

[26] Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.

[27] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

[28] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? arXiv preprint arXiv:1902.05040, 2019.

[29] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. arXiv preprint arXiv:1808.09372, 2018.

[30] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data.
The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

[31] Luca Venturi, Afonso S Bandeira, and Joan Bruna. Spurious valleys in two-layer neural network optimization landscapes. arXiv preprint arXiv:1802.06384, 2018.

[32] Francis Williams, Teseo Schneider, Claudio Silva, Denis Zorin, Joan Bruna, and Daniele Panozzo. Deep geometric prior for surface reconstruction. arXiv preprint arXiv:1811.10943, 2018.