{"title": "Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7146, "page_last": 7155, "abstract": "The performance of neural networks on high-dimensional data\n  distributions suggests that it may be possible to parameterize a\n  representation of a given high-dimensional function with\n  controllably small errors, potentially outperforming standard\n  interpolation methods.  We demonstrate, both theoretically and\n  numerically, that this is indeed the case.  We map the parameters of\n  a neural network to a system of particles relaxing with an\n  interaction potential determined by the loss function.  We show that\n  in the limit that the number of parameters $n$ is large, the\n  landscape of the mean-squared error becomes convex and the\n  representation error in the function scales as $O(n^{-1})$.\n  In this limit, we prove a dynamical variant of the universal\n  approximation theorem showing that the optimal\n  representation can be attained by stochastic gradient\n  descent, the algorithm ubiquitously used for parameter optimization\n  in machine learning.  In the asymptotic regime, we study the\n  fluctuations around the optimal representation and show that they\n  arise at a scale $O(n^{-1})$.  These fluctuations in the landscape\n  identify the natural scale for the noise in stochastic gradient\n  descent.  Our results apply to both single and multi-layer neural\n  networks, as well as standard kernel methods like radial basis\n  functions.", "full_text": "Parameters as interacting particles: long time\n\nconvergence and asymptotic error scaling of neural\n\nnetworks\n\nCourant Institute of Mathematical Sciences\n\nCourant Institute of Mathematical Sciences\n\nEric Vanden-Eijnden\n\nNew York University\neve2@cims.nyu.edu\n\nGrant M. 
Rotskoff\n\nNew York University\n\nrotskoff@cims.nyu.edu\n\nAbstract\n\nThe performance of neural networks on high-dimensional data distributions sug-\ngests that it may be possible to parameterize a representation of a given high-\ndimensional function with controllably small errors, potentially outperforming\nstandard interpolation methods. We demonstrate, both theoretically and numer-\nically, that this is indeed the case. We map the parameters of a neural network\nto a system of particles relaxing with an interaction potential determined by the\nloss function. We show that in the limit that the number of parameters n is large,\nthe landscape of the mean-squared error becomes convex and the representation\nerror in the function scales as O(n\u22121). In this limit, we prove a dynamical variant\nof the universal approximation theorem showing that the optimal representation\ncan be attained by stochastic gradient descent, the algorithm ubiquitously used\nfor parameter optimization in machine learning. In the asymptotic regime, we\nstudy the \ufb02uctuations around the optimal representation and show that they arise\nat a scale O(n\u22121). These \ufb02uctuations in the landscape identify the natural scale\nfor the noise in stochastic gradient descent. Our results apply to both single and\nmulti-layer neural networks, as well as standard kernel methods like radial basis\nfunctions.\n\n1\n\nIntroduction\n\nThe methods and models of machine learning are rapidly becoming de facto tools for the analysis and\ninterpretation of large data sets. The ability to synthesize and simplify high-dimensional data raises\nthe possibility that neural networks may also \ufb01nd applications as ef\ufb01cient representations of known\nhigh-dimensional functions. 
In fact, these techniques have already been explored in the context of\nfree energy calculations [1], partial differential equations [2, 3], and force\ufb01eld parameterization [4].\nYet determining the optimal set of parameters or \u201ctraining\u201d a given neural network remains one of\nthe central challenges in applications due to the slow dynamics of training [5] and the complexity\nof the objective function [6, 7]. Parameter optimization in machine learning typically relies on the\nstochastic gradient descent algorithm (SGD), which makes an empirical estimate of the gradient of\nthe objective function over a small number of sample points [5]. SGD has been analyzed in some\ncases\u2014for example, when the problem is known to be convex, as in the over-parameterized limit or\nother idealized settings [8, 9, 10, 11], there are rigorous guarantees of convergence and estimates of\nconvergence rates [12].\nWhile \ufb01nding the best set of parameters is computationally challenging, we have strong theoretical\nguarantees that neural networks can represent a large class of functions. The universal approximation\ntheorems [13, 14, 15] ensure the existence of a (possibly large) set of parameters that bring a neural\nnetwork arbitrarily close to a given function over a compact domain. A similar statement has been\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fproved for radial basis functions [16]. However, the proofs of the universal approximation theorems\ndo not ensure that any particular optimization technique can locate the ideal set of parameters.\nParameters as particles\u2014In order to study the properties of stochastic gradient descent for neural\nnetwork optimization, we recast the standard training procedure in terms of a system of interacting\nparticles [17]. 
In doing so, we give an exact rewriting of stochastic gradient descent as a stochastic differential equation with multiplicative noise, which has been studied previously [18, 19]. We interpret the limiting behavior of the parameter optimization via a nonlinear Liouville equation for the time evolution of a parameter distribution [20]. This framework provides analytical tools to determine a Law of Large Numbers for the convergence of the optimization and to derive scaling results for the error term as time and the number of parameters grow large. A similar perspective has been adopted concurrently by Mei et al. [21], Chizat and Bach [22], and Sirignano and Spiliopoulos [23], who study the \u201cmean field limit\u201d, similar to our Law of Large Numbers, but not the asymptotic fluctuations or error scaling.\nConvergence and asymptotic dynamics of stochastic gradient descent\u2014We demonstrate that the optimization problem becomes convex in the limit n \u2192 \u221e and we show that both gradient descent and SGD converge to the global minimum [24, 21]. This argument shows that the universal approximation theorem can be obtained as the limit of a stochastic gradient based optimization procedure under an appropriate choice of hyper-parameters. In the scaling limit, our analysis gives bounds on the error of a representation and characterizes the asymptotic fluctuations in that error. Convergence to the optimum to first order occurs rapidly, i.e., on O(1) timescales. 
Diminishing the error at next order requires quenching the noise in the dynamics on O(log n) time scales.\nImplications of noise in descent dynamics\u2014Our results give an explicit theoretical explanation for the observation that additional noise can lead to better generalization for neural networks [25, 26]; local minima of depth O(n^{\u22121}) are washed out by the noise of SGD.\nNumerical experiments\u2014We verify the scaling predicted by our asymptotic arguments for single layer neural networks. Because it is impossible to determine the exact interaction potential in general, we carry out numerical experiments using stochastic gradient descent for ReLU neural networks. We use the p-spin energy function [27, 28] as the target function due to its complexity as the dimension grows large.\nKey assumptions\u2014In order to derive the stochastic partial differential equation for SGD, we effectively assume the large data limit. Because we are focusing on function approximation, we can always generate new training data by sampling random points in the domain of the function and evaluating the target function at those points. The partial differential equation for gradient descent represents the evolution of the parameters on the true loss landscape, i.e., the large data limit. In this limit, the dynamics is similar to online algorithms for stochastic gradient descent [5].\n\n2 Parameters as particles\n\nGiven a function f : \u2126 \u2192 R defined on a compact set \u2126 \u2282 R^d, consider its approximation by\n\nfn(x) = (1/n) \u2211_{i=1}^{n} ci \u03d5(x, yi),\n\n(1)\n\nwhere n \u2208 N, \u03d5 : \u2126 \u00d7 D \u2192 R is some kernel and (ci, yi) \u2208 R \u00d7 D with D \u2282 R^N. The ci and yi are parameters to be learned for i = 1, . . . , n. 
We place the following assumption on the kernel: for any test function h,\n\n\u2200y \u2208 D : \u222b_\u2126 h(x)\u03d5(x, y) d\u00b5(x) = 0 \u21d2 h(x) = 0 \u2200x \u2208 \u2126,\n\n(2)\n\nwhere \u00b5 is some positive measure on \u2126 (for example the Lebesgue measure, d\u00b5(x) = dx). This condition is satisfied for nonlinearities typically encountered in machine learning; a neural network with any number of layers using a positive nonlinear activation function (e.g., ReLU, sigmoid) will clearly satisfy this property if the linear coefficients are non-zero. The property above is similar to the discriminatory kernel condition in Cybenko [13]. Our results apply to radial basis functions, single layer neural networks, and multilayer neural networks in which the final layer is scaled with n. In particular, the statements we make require a \u201cwide\u201d final layer but are still applicable to networks with multiple layers.\n\nBy \u201ctraining\u201d the representation, we mean that we seek to optimize the parameters so as to minimize the mean-squared error loss function,\n\n\u2113(f, fn) = (1/2) \u222b_\u2126 |f(x) \u2212 fn(x)|^2 d\u00b5(x).\n\n(3)\n\nIn this case we have chosen to employ the mean-squared error, and we can view \u2113(f, fn) as an \u201cenergy\u201d function for the parameters {(ci, yi)}_{i=1}^{n},\n\nE(c1, y1, . . . , cn, yn) := n (\u2113(f, fn) \u2212 Cf) = \u2212\u2211_{i=1}^{n} ci F(yi) + (1/2n) \u2211_{i,j=1}^{n} ci cj K(yi, yj),\n\n(4)\n\nwhere Cf = (1/2) \u222b_\u2126 |f(x)|^2 d\u00b5(x) is a constant unaffected by the optimization and we have defined\n\nF(y) = \u222b_\u2126 f(x)\u03d5(x, y) d\u00b5(x),   K(y, z) = \u222b_\u2126 \u03d5(x, y)\u03d5(x, z) d\u00b5(x).\n\n(5)\n\nDirectly optimizing the coefficients to minimize the loss function \u2113 is challenging in general because we do not have any guarantee of convexity. However, these difficulties can be conceptually alleviated by instead writing the objective function in terms of a weighted distribution\n\nGn : D \u2192 R,   Gn(y) = (1/n) \u2211_{i=1}^{n} ci \u03b4(y \u2212 yi),\n\n(6)\n\nwhich converges weakly to some G(y) as n \u2192 \u221e, a fact which we describe in detail below. Convolution with this weighted distribution provides a convenient expression for the function representation\n\nfn(x) = (1/n) \u2211_{i=1}^{n} \u222b_D ci \u03d5(x, y)\u03b4(y \u2212 yi) dy \u2261 (\u03d5 \u22c6 Gn)(x).\n\n(7)\n\nInterestingly, in the limit that n \u2192 \u221e the objective function for the optimization becomes convex in terms of the signed distribution,\n\n\u2113(f, \u03d5 \u22c6 G) = (1/2) \u222b_\u2126 |f(x) \u2212 (\u03d5 \u22c6 G)(x)|^2 d\u00b5(x),\n\n(8)\n\nmeaning that a unique minimum value of the loss function can be attained for a not necessarily unique minimizer G\u2217 for which \u2113(f, \u03d5 \u22c6 G\u2217) = 0. This observation formalizes the statements made by Bengio et al. in Ref. [24]. While the objective function is convex, it is by no means trivial to optimize the weighted distribution. 
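As a concrete, purely illustrative sketch of the representation (1) and a Monte Carlo estimate of the loss (3): the ReLU-type kernel \u03d5(x, y) = max(0, x \u00b7 y), the toy target, and all names below are our own assumptions, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, y):
    """Assumed ReLU kernel phi(x, y) = max(0, x . y); one 'particle' per unit."""
    return np.maximum(0.0, x @ y.T)          # shape (num_x, n)

def f_n(x, c, y):
    """Particle representation f_n(x) = (1/n) sum_i c_i phi(x, y_i), cf. Eq. (1)."""
    n = c.shape[0]
    return phi(x, y) @ c / n                 # shape (num_x,)

def loss(f, x_samples, c, y):
    """Monte Carlo estimate of the loss (3); mu is the sampling distribution."""
    r = f(x_samples) - f_n(x_samples, c, y)
    return 0.5 * np.mean(r ** 2)

# hypothetical toy target in d = 3
d, n = 3, 200
c = rng.normal(size=n)
y = rng.normal(size=(n, d))
target = lambda x: np.sin(x[:, 0])

x_samples = rng.normal(size=(1000, d))
print(loss(target, x_samples, c, y))
```

Sampling fresh points of x at each evaluation mirrors the large-data limit assumed above, in which new training data can always be generated.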
Writing the loss function in this language gives us a perspective that can be exploited to derive the scaling of the error in arbitrary neural networks trained with stochastic gradient descent.\n\n3 Gradient descent\n\nWe first discuss the case of gradient descent, for which we provide derivations of a law of large numbers (LLN) and central limit theorem (CLT) for the optimization dynamics. These statements reveal the scaling in the representation error, and the analysis has synergies which are useful in deriving the LLN and CLT for stochastic gradient descent. Detailed arguments for the propositions stated here are provided in the supplementary material.\nThe gradient descent dynamics is given by coupled ordinary differential equations for the weights and the parameters of the kernel,\n\ndYi/dt = Ci \u2207F(Yi) \u2212 (1/n) \u2211_{j=1}^{n} Ci Cj \u2207K(Yi, Yj),\ndCi/dt = F(Yi) \u2212 (1/n) \u2211_{j=1}^{n} Cj K(Yi, Yj),\n\n(9)\n\nwith initial conditions sampled independently from a probability distribution \u03c1in(y, c) with full support in the domain D \u00d7 R. We analyze the evolution of the parameters by studying the \u201cparticle\u201d distribution\n\n\u03c1n(t, y, c) = (1/n) \u2211_{i=1}^{n} \u03b4(c \u2212 Ci(t)) \u03b4(y \u2212 Yi(t)),\n\n(10)\n\nthe first moment of which is the weighted distribution (6),\n\nGn(t, y) = \u222b_R c \u03c1n(t, y, c) dc = (1/n) \u2211_{i=1}^{n} Ci(t) \u03b4(y \u2212 Yi(t)).\n\n(11)\n\nWe can express the function representation in terms of the distribution as fn(t, x) = \u222b_D \u03d5(x, y) Gn(t, y) dy. Taking the limit n \u2192 \u221e, we see that the zeroth order term of the distribution has smooth initial data \u03c10(0) = \u03c1in by the Law of Large Numbers. In Sec. S1.1 we derive a nonlinear partial differential equation satisfied by \u03c10, essentially by applying the chain rule:\n\n\u2202t\u03c10 = \u2207 \u00b7 (c\u2207U([\u03c10], y)\u03c10) + \u2202c (U([\u03c10], y)\u03c10),\n\n(12)\n\nwhere\n\nU([\u03c1], y) = \u2212F(y) + \u222b_{D\u00d7R} c\u2032 K(y, y\u2032)\u03c1(y\u2032, c\u2032) dy\u2032 dc\u2032.\n\n(13)\n\nThe PDE (12) is gradient descent in the Wasserstein metric on a convex energy functional of the density (cf. Sec. S1.2.1); we refer to this type of equation as a nonlinear Liouville equation.\n\n3.1 Law of large numbers\n\nThe limiting equation (12) is a well-posed and deterministic nonlinear partial integro-differential equation. We can express it in terms of the target function f(x) by denoting\n\nf0(t, x) = \u222b_{D\u00d7R} c \u03d5(x, y)\u03c10(t, y, c) dy dc,\n\n(14)\n\nand we see that\n\n\u2202t f0(t, x) = \u2212\u222b_\u2126 M([\u03c10(t)], x, x\u2032) (f0(t, x\u2032) \u2212 f(x\u2032)) d\u00b5(x\u2032),\n\n(15)\n\nwhere the symmetric kernel function M is given by\n\nM([\u03c1], x, x\u2032) = \u222b_{D\u00d7R} (c^2 \u2207y\u03d5(x, y) \u00b7 \u2207y\u03d5(x\u2032, y) + \u03d5(x, y)\u03d5(x\u2032, y)) \u03c1(y, c) dy dc.\n\n(16)\n\nThis kernel is positive definite and symmetric, implying that the only stable fixed point is f0 = f if \u03c10(t = 0) = \u03c1in > 0, as discussed in Sec. S1.2. Fixed points of the gradient flow that are not energy minimizers exist, but they are not dynamically accessible from the initial density that we use (cf. [22] and Sec. S1.2).\n\nProposition 3.1 (LLN for gradient descent) Let fn(t) = fn(t, x) = (1/n) \u2211_{i=1}^{n} Ci(t)\u03d5(x, Yi(t)), where {Yi(t), Ci(t)}_{i=1}^{n} are the solution of (9) for the initial condition where each pair (Yi(0), Ci(0)) is sampled independently from \u03c1in > 0. Then\n\nlim_{n\u2192\u221e} fn(t) = f0(t)   Pin-almost surely,\n\n(17)\n\nwhere f0(t) solves (15) and satisfies\n\nlim_{t\u2192\u221e} f0(t) = f   a.e. in \u2126.\n\n(18)\n\nIn addition, the limits in n and t commute, i.e. we also have lim_{n\u2192\u221e} lim_{t\u2192\u221e} fn(t) = f.\nA detailed derivation of the LLN for gradient descent can be found in Sec. S1.2. The LLN should be understood as a guarantee that gradient descent reaches the optimal representation for initial conditions sampled iid from a smooth distribution with full support on D \u00d7 R.\n\n3.2 Central Limit Theorem and asymptotic fluctuations and error\n\nTo study the fluctuations around the optimal representation we look at the discrepancy between fn(t, x) and f0(t, x). These fluctuations are on the scale O(n^{\u22121/2}) initially and diminish as the optimization progresses to reach scale O(n^{\u22121}) or below, as summarized in the next two propositions.\n\nProposition 3.2 (CLT for GD) Let fn(t) be as in Proposition 3.1. Then for any t < \u221e as n \u2192 \u221e, we have\n\nlim_{n\u2192\u221e} n^{1/2} (fn(t) \u2212 f0(t)) = f_{1/2}(t)   in distribution,\n\n(19)\n\nwhere f0(t) solves (15) and f_{1/2}(t) is a Gaussian process with mean zero and some given covariance that satisfies f_{1/2}(t) \u2192 0 almost surely as t \u2192 \u221e.\nThis result is derived in Sec. S2, where the covariance of f_{1/2}(t) is also given (S46). 
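To make the gradient flow (9) concrete, the sketch below integrates it for an assumed one-dimensional setup: a Gaussian bump kernel \u03d5(x, y) = exp(\u2212(x \u2212 y)^2/2), target f(x) = sin x, and \u00b5 approximated by a fixed sample of \u2126 = [\u2212\u03c0, \u03c0]. Every choice here (kernel, target, step size) is an illustrative assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed sample of Omega = [-pi, pi] as a surrogate for the measure mu.
xs = rng.uniform(-np.pi, np.pi, size=2000)
f = np.sin(xs)

def phi(y):                       # phi(xs, y_i) for all sample points, shape (m, n)
    return np.exp(-0.5 * (xs[:, None] - y[None, :]) ** 2)

def loss(c, y):
    fn = phi(y) @ c / len(c)      # f_n(x) = (1/n) sum_i c_i phi(x, y_i)
    return 0.5 * np.mean((f - fn) ** 2)

def gd_step(c, y, dt):
    """Euler step of the particle ODEs (9); F, K and their y-gradients appear
    implicitly as Monte Carlo averages of (f - f_n) against phi over the sample."""
    p = phi(y)
    res = f - p @ c / len(c)                      # f - f_n at the sample points
    grad_c = res @ p / len(xs)                    # = F(y_i) - (1/n) sum_j c_j K(y_i, y_j)
    dp = (xs[:, None] - y[None, :]) * p           # d phi / d y_i for the Gaussian kernel
    grad_y = c * (res @ dp) / len(xs)             # = C_i grad F - (1/n) sum_j C_i C_j grad K
    return c + dt * grad_c, y + dt * grad_y

n = 50
c = 0.1 * rng.normal(size=n)
y = rng.uniform(-np.pi, np.pi, size=n)
l0 = loss(c, y)
for _ in range(1000):
    c, y = gd_step(c, y, dt=0.2)
print(l0, loss(c, y))             # the loss decreases along the flow
```

Because the sample of \u2126 is fixed, this is the idealized large-data gradient descent analyzed in this section rather than SGD.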
Since f_{1/2}(t) converges to zero as t \u2192 \u221e, it is useful to quantify the scale at which the fluctuations settle on long time scales:\n\nProposition 3.3 (Asymptotic error for GD) Under the same conditions as those in Proposition 3.2, on any sequence an > 0 such that an/log n \u2192 \u221e as n \u2192 \u221e, we have\n\nlim_{n\u2192\u221e} n^{\u03be} (fn(an) \u2212 f) = 0   almost surely for any \u03be < 1.\n\n(20)\n\nThis proposition characterizes the asymptotic error of the neural network, showing that it goes as fn = f + C n^{\u22121} for some constant C \u2265 0. This scaling is more favorable than might be expected from the initial condition because the order of the error \u201cheals\u201d from 1/2 to 1 in the long time limit. That is, the error from the initial, non-optimal parameter selection decays during the optimization dynamics, becoming much more favorable at late times.\n\n4 Stochastic gradient descent\n\nWe cannot typically evaluate the integrals required to compute F(y) and K(y, y\u2032). Instead, at each time step we estimate these functions using a small set of sample points {xi}_{i=1}^{P}, which we refer to as a batch of size P. Consequently, we introduce noise by sampling random data to make imperfect estimates of the gradient of the objective function. To estimate the gradient of the loss we use an unbiased estimator which is simply the sample mean over a collection or \u201cbatch\u201d of P points,\n\nEP(z) = (n/2P) \u2211_{i=1}^{P} |fn(xi, z) \u2212 f(xi)|^2,\n\n(21)\n\nwhere, for simplicity, we write the parameters as a single vector z = (c1, y1, . . . , cn, yn) \u2208 (D \u00d7 R)^n. Note that we have scaled the loss function by n so that \u2207EP is O(1) because our function representation is scaled by n^{\u22121}. 
The evolution equation of the corresponding dynamical variable Z(t) is\n\nZ(t + \u2206t) = Z(t) \u2212 \u2206t \u2207EP(Z(t)).\n\n(22)\n\nThe dynamics can be analyzed as a stochastic differential equation with a multiplicative noise term arising from the approximate evaluation of the gradient of the loss function. To derive this dynamical equation, we first need the covariance of the estimated gradient, which we can write explicitly as (1/P) R(z), with\n\nR(z) = n^2 \u222b_\u2126 (fn \u2212 f)^2 \u2207fn \u2297 \u2207fn d\u00b5 \u2212 n^2 \u2207\u2113(f, fn) \u2297 \u2207\u2113(f, fn),\n\n(23)\n\nwhere fn = fn(x, z) and f = f(x). The discretized dynamics (22) is statistically equivalent to the stochastic differential equation\n\ndZ = \u2212\u2207z E(Z) dt + \u221a\u03b8 dB(t, Z),\n\n(24)\n\nwhere E(z) is the energy (4) based on the exact loss, \u03b8 = \u2206t/P, and the quadratic variation of the noise is \u27e8dB(t, z), dB(t, z)\u27e9 = R(z) dt. The SDE (24) is not Langevin dynamics in the classical sense because the noise has spatiotemporal correlations. In our case, because new data is sampled at every time step, there are no temporal correlations, which are a consequence of revisiting samples in a training set. Written in terms of F and K, the parameters satisfy a collection of coupled SDEs that we can use to study the evolution of \u03c1n:\n\ndYi = Ci(t)\u2207F(Yi(t))\u2206t \u2212 (1/n) \u2211_{j=1}^{n} Ci(t)Cj(t)\u2207K(Yi(t), Yj(t))\u2206t + dBi,\ndCi = F(Yi(t))\u2206t \u2212 (1/n) \u2211_{j=1}^{n} Cj(t)K(Yi(t), Yj(t))\u2206t + dB\u2032i,\n\n(25)\n\nwhere \u2206t > 0 is the time step. The time evolution of the parameter distribution can be derived by using the It\u00f4 formula, which in turn gives rise to a stochastic partial differential equation for the time-evolution of \u03c1n(t, c, y). This SPDE is\n\n\u2202t\u03c1n = \u2207 \u00b7 (c\u2207U([\u03c1n], y)\u03c1n) + \u2202c (U([\u03c1n], y)\u03c1n) + \u03b8 D[\u03c1n, y, y] + \u221a\u03b8 (\u03b7(t, y) + \u03b7\u2032(t, c)),\n\n(26)\n\nwhere D is a diffusive term given explicitly in Sec. S4.1, which we do not reproduce here because it does not contribute in the subsequent scaling. This equation can be viewed as an extension of Dean\u2019s equation [20] to a setting with multiplicative noise. The noise terms \u03b7 and \u03b7\u2032 (defined in Eq. S69) have a quadratic variation that diminishes as fn becomes close to f.\n\n4.1 Law of large numbers\n\nAt first, it may appear that we could choose an arbitrary expansion in powers of n^{\u2212\u03b1} for some \u03b1 > 0. However, as explained in Sec. S5, the expansion of \u03c1n\u03c1\u2032n contains terms of order n^{\u22121}, which constrains the choice of \u03b1. To perform an expansion, we take \u03b8 \u221d n^{\u22122\u03b1} so that, in the limit n \u2192 \u221e, \u03c10 satisfies the same deterministic equation as in the case of gradient descent. This means that an analogous statement to Proposition 3.1 holds:\n\nProposition 4.1 (LLN for SGD) Let fn(t) = fn(t, x) = (1/n) \u2211_{i=1}^{n} Ci(t)\u03d5(x, Yi(t)), with {Yi(t), Ci(t)}_{i=1}^{n} the solution to (24) with \u03b8 = a n^{\u22122\u03b1}, a > 0, \u03b1 \u2208 (0, 1], and initial condition where each pair (Yi(0), Ci(0)) is sampled independently from \u03c1in > 0. Then\n\nlim_{n\u2192\u221e} fn(t) = f0(t)\n\n(27)\n\nalmost surely, where f0(t) solves (15). Furthermore,\n\nlim_{t\u2192\u221e} f0(t) = f   a.e. in \u2126.\n\n(28)\n\nIn addition the limits commute, i.e. lim_{n\u2192\u221e} lim_{t\u2192\u221e} fn(t) = f.\nThe Law of Large Numbers implies the universal approximation theorem, but notable additional information has emerged from our analysis. First, we emphasize that here we have obtained the representation as the limit of a stochastic gradient descent optimization procedure. 
Secondly, the PDE describing the time evolution of f0 is independent of n, meaning the rate of convergence in time of fn does not depend on the number of parameters to leading order.\n\n4.2 Asymptotic fluctuations and error\n\nA remarkable feature of stochastic gradient descent is that the scale of fluctuations is controlled by the accuracy of the representation. Roughly, the closer fn is to f, the smaller the discrepancy in their gradients, meaning that the variance of the noise term is also small. We make use of this property to assess the asymptotic error for stochastic gradient descent:\n\nProposition 4.2 (Asymptotic error for SGD) Let fn(t) = fn(t, x) be as in Proposition 4.1. Then for any an > 0 such that an/log n \u2192 \u221e as n \u2192 \u221e, we have\n\nlim_{n\u2192\u221e} n^{\u03b1} (fn(an) \u2212 f) = 0   almost surely.\n\n(29)\n\nThe discrepancy converges to zero almost surely with respect to the initial data as well as the statistics of the noise terms in (24). 
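The batch estimator (21) and update (22) can be sketched in the same assumed toy setting used throughout this illustration (one-dimensional Gaussian kernel, target sin x); drawing a fresh batch at every step mirrors the online, large-data regime in which the noise has no temporal correlations. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(x, y):
    # Assumed Gaussian bump kernel phi(x, y) = exp(-(x - y)^2 / 2)
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2)

def sgd_step(c, y, P, dt):
    """One update Z <- Z - dt * grad E_P(Z), with E_P the batch estimator (21)."""
    x = rng.uniform(-np.pi, np.pi, size=P)        # fresh data each step
    p = phi(x, y)
    res = np.sin(x) - p @ c / len(c)              # f - f_n on the batch
    c_new = c + dt * res @ p / P
    dp = (x[:, None] - y[None, :]) * p
    y_new = y + dt * c * (res @ dp) / P
    return c_new, y_new

def loss(c, y, m=4000):
    # Monte Carlo estimate of the exact loss (3)
    x = rng.uniform(-np.pi, np.pi, size=m)
    r = np.sin(x) - phi(x, y) @ c / len(c)
    return 0.5 * np.mean(r ** 2)

n = 50
c = 0.1 * rng.normal(size=n)
y = rng.uniform(-np.pi, np.pi, size=n)
l0 = loss(c, y)
for step in range(2000):
    c, y = sgd_step(c, y, P=50, dt=0.2)
print(l0, loss(c, y))
```

Increasing the batch size P (or shrinking dt) decreases \u03b8 = \u2206t/P and hence the noise scale, which is the quenching mechanism discussed around Proposition 4.3.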
In terms of the loss function, we have\n\n\u2113(f, fn(an)) = (1/2)\u2016f \u2212 f0(an)\u2016^2 \u2212 n^{\u2212\u03b1} \u27e8f \u2212 f0(an), f\u03b1(an)\u27e9 + (1/2) n^{\u22122\u03b1} \u2016f\u03b1(an)\u2016^2 + o(n^{\u2212\u03b1}),\n\n(30)\n\nso that the following proposition holds:\n\nProposition 4.3 Under the same conditions as those in Proposition 4.2, the loss function satisfies\n\nlim_{n\u2192\u221e} n^{\u03b1} \u2113(f, fn(an)) = 0   almost surely.\n\n(31)\n\nThis means that the error at order n^{\u22121} can be quenched by increasing the batch size or decreasing the time step as a function of the optimization time, e.g., setting \u03b1 = 1 by taking a batch of size n^2.\n\n5 Numerical experiments\n\nTo test our results, we will use a function known for its complex features in high dimensions: the spherical 3-spin model, which is a map from the (d \u2212 1)-sphere of radius \u221ad to the reals, f : S^{d\u22121}(\u221ad) \u2192 R, given by\n\nf(x) = (1/d) \u2211_{p,q,r=1}^{d} a_{p,q,r} x_p x_q x_r,   x \u2208 S^{d\u22121}(\u221ad) \u2282 R^d,\n\n(32)\n\nwhere the coefficients {a_{p,q,r}}_{p,q,r=1}^{d} are independent Gaussian random variables with mean zero and variance one. The function (32) is known to have a number of critical points that grows exponentially with the dimensionality d [27, 6, 28]. We note that previous works have sought to draw a parallel between the glassy 3-spin function and generic loss functions [7], but we are not exploring such an analogy here. Rather, we simply use the function (32) as a difficult target for approximation by neural networks. That is, throughout this section, we train networks to learn f with a particular realization of a_{p,q,r} and study the accuracy of that representation as a function of the number of particles n. In Fig. 
1 we show the representation error by computing the loss as well as the discrepancy between the target function and the neural network representation averaged over the points at which the function is positive (or negative), i.e., (1/P) \u2211_{i=1}^{P} (fn(xi) \u2212 f(xi)) \u0398(f(xi)), where \u0398 is the Heaviside function.\n\nSingle layer sigmoid / ReLU neural network\u2014We consider the case that the nonlinear function h(x) is max(0, x), the rectified linear unit (ReLU) activation function frequently used in large scale applications of machine learning. In these experiments, we test the scaling in d = 50, prohibitively high dimensional for any grid based method. We trained the networks with batch size P = 50 using stochastic gradient descent with n = i \u00d7 10^4 for i = 1, . . . , 6. For the two smallest networks, we ran for 2 \u00d7 10^6 time steps with \u2206t = 10^{\u22123} and then quenched with P = 2500 for 2 \u00d7 10^5 steps. For the largest networks, we used \u2206t = 5 \u00d7 10^{\u22124} to ensure stability and therefore doubled the number of steps so that the total training time remained fixed. Scaling data for the loss and the signed discrepancy are shown in Fig. 1. We also looked at sigmoid nonlinearities in d = 10, 25. These networks were trained as above but with P = \u230an/5\u230b and a quench with P^2.\n\n6 Conclusions and outlook\n\nWe have introduced a perspective based on particle distribution functions that enables asymptotic analysis of the optimization dynamics of neural networks. We have focused on the limit where the number of parameters n \u2192 \u221e, in which the objective function becomes convex and a stochastic partial differential equation describes the time evolution of the parameters. 
Our results emphasize that the optimal parameters in this limit are accessible via stochastic gradient descent (Proposition 4.1) and that fluctuations around the optimum can be controlled by modulating the batch size (Proposition 4.2). Surprisingly, the dynamical evolution does not depend on n, suggesting that the rate of convergence should be asymptotically independent of the number of parameters.\nOur results do not address many features of neural network parameterization that merit further study exploiting the mathematical tools that have been developed for particle systems. In particular, the statements we have derived are insensitive to the details of network architecture, which is among the most important considerations when designing or using a neural network. It would also be beneficial to explore the ways in which regularizing processes such as drop-out affect the convergence of the PDE. \n\nFigure 1: Large ReLU networks in high dimension (d = 50), and sigmoid neural networks in intermediate dimensions (bottom two rows). In all cases, we see linear scaling of the empirical loss averaged with P = 10^6. For the sigmoid neural networks, we also plot a measure of the discrepancy between the functions, which also scales as O(n^{\u22121}). In each plot, the error scaling as a function of the width of the network is plotted for 10 distinct random realizations of the function defined in (32) with different colored stars for each realization.\n\n
Developing a rigorous understanding of which kernels and which architectures are optimal for different types of target functions remains a compelling goal that appears within reach using the tools outlined here.\n\nAcknowledgments\n\nWe would like to thank Andrea Montanari and Matthieu Wyart for useful discussions regarding the fixed points of gradient flows in the Wasserstein metric. GMR was supported by the James S. McDonnell Foundation. EVE was supported by National Science Foundation (NSF) Materials Research Science and Engineering Center Program Award DMR-1420073; and by NSF Award DMS-1522767.\n\nReferences\n\n[1] Elia Schneider, Luke Dai, Robert Q Topper, Christof Drechsel-Grau, and Mark E Tuckerman. Stochastic Neural Network Approach for Learning High-Dimensional Free Energy Surfaces. Physical Review Letters, 119(15):150601, October 2017.\n\n[2] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving for high dimensional committor functions using artificial neural networks. arXiv:1802.10275, February 2018.\n\n[3] Jens Berg and Kaj Nystr\u00f6m. A unified deep artificial neural network approach to partial differential equations in complex geometries. arXiv:1711.06464, November 2017.\n\n[4] J\u00f6rg Behler and Michele Parrinello. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Physical Review Letters, 98(14):583, April 2007.\n\n[5] L\u00e9on Bottou and Yann L. Cun. Large Scale Online Learning. In S. Thrun, L. K. Saul, and B. 
Sch\u00f6lkopf, editors, Advances in Neural Information Processing Systems 16, pages 217\u2013224.\nMIT Press, 2004.\n\n[6] Levent Sagun, V Ugur Guney, G\u00e9rard Ben Arous, and Yann LeCun. Explorations on high\n\ndimensional landscapes. arXiv:1412.6615, December 2014.\n\n[7] Anna Choromanska, Mikael Henaff, Michael Mathieu, G\u00e9rard Ben Arous, and Yann LeCun.\n\nThe Loss Surfaces of Multilayer Networks. arXiv:1412.0233, November 2014.\n\n[8] C Daniel Freeman and Joan Bruna. Topology and Geometry of Half-Recti\ufb01ed Network Opti-\n\nmization. arXiv:1611.01540, November 2016.\n\n[9] Luca Venturi, Afonso S Bandeira, and Joan Bruna. Neural Networks with Finite Intrinsic\n\nDimension have no Spurious Valleys. arXiv:1802.06384, February 2018.\n\n[10] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error\n\nguarantees for multilayer neural networks. arXiv:1605.08361, May 2016.\n\n[11] K Fukumizu and S Amari. Local minima and plateaus in hierarchical structures of multilayer\n\nperceptrons. Neural Networks, 13(3):317\u2013327, 2000.\n\n[12] L\u00e9on Bottou, Frank E Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale\n\nMachine Learning. arXiv:1606.04838, June 2016.\n\n[13] G Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control,\n\nSignals and Systems, 2(4):303\u2013314, December 1989.\n\n[14] A R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE\n\nTransactions on Information Theory, 39(3):930\u2013945, May 1993.\n\n[15] Francis Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. Journal of\n\nMachine Learning Research, 18(19):1\u201353, 2017.\n\n[16] J Park and I W Sandberg. Universal Approximation Using Radial-Basis-Function Networks.\n\nNeural Computation, 3(2):246\u2013257, June 1991.\n\n[17] Sylvia Serfaty. Systems of Points with Coulomb Interactions. 
arXiv:1712.04095, December\n\n2017.\n\n9\n\n\f[18] Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of\n\nnonconvex stochastic gradient descent. arXiv:1705.07562, May 2017.\n\n[19] Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modi\ufb01ed equations and adaptive stochastic\ngradient algorithms. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th\nInternational Conference on Machine Learning, volume 70 of Proceedings of Machine Learning\nResearch, pages 2101\u20132110, International Convention Centre, Sydney, Australia, 06\u201311 Aug\n2017. PMLR.\n\n[20] David S Dean. Langevin equation for the density of a system of interacting Langevin processes.\n\nJournal of Physics A: Mathematical and Theoretical, 29(24):L613\u2013L617, January 1999.\n\n[21] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean \ufb01eld view of the landscape of\ntwo-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665\u2013\nE7671, August 2018.\n\n[22] L\u00e9na\u00efc Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Over-\n\nparameterized Models using Optimal Transport. arXiv:1805.09545, May 2018.\n\n[23] Justin Sirignano and Konstantinos Spiliopoulos. Mean Field Analysis of Neural Networks.\n\narXiv:1805.01053, May 2018.\n\n[24] Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte.\nConvex neural networks. In Y. Weiss, B. Sch\u00f6lkopf, and J. C. Platt, editors, Advances in Neural\nInformation Processing Systems 18, pages 123\u2013130. MIT Press, 2006.\n\n[25] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping\nTak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp\nMinima. arXiv:1609.04836, September 2016.\n\n[26] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the\n\ngeneralization gap in large batch training of neural networks. 
arXiv:1705.08741, May 2017.\n\n[27] Antonio Auf\ufb01nger and G\u00e9rard Ben Arous. Complexity of random smooth functions on the\n\nhigh-dimensional sphere. The Annals of Probability, 41(6):4214\u20134247, November 2013.\n\n[28] Antonio Auf\ufb01nger, G\u00e9rard Ben Arous, and Ji\u02c7r\u00ed \u02c7Cern\u00fd. Random Matrices and Complexity of\n\nSpin Glasses. Communications on Pure and Applied Mathematics, 66(2):165\u2013201, 2012.\n\n10\n\n\f", "award": [], "sourceid": 3543, "authors": [{"given_name": "Grant", "family_name": "Rotskoff", "institution": "New York University"}, {"given_name": "Eric", "family_name": "Vanden-Eijnden", "institution": "New York University"}]}