{"title": "On Lazy Training in Differentiable Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 2937, "page_last": 2947, "abstract": "In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this ``lazy training'' phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that ``lazy training'' is behind the many successes of neural networks in difficult high dimensional tasks.", "full_text": "On Lazy Training in Differentiable Programming\n\nL\u00e9na\u00efc Chizat\n\nCNRS, Universit\u00e9 Paris-Sud\n\nOrsay, France\n\nlenaic.chizat@u-psud.fr\n\nEdouard Oyallon\n\nCentraleSupelec, INRIA\nGif-sur-Yvette, France\n\nedouard.oyallon@centralesupelec.fr\n\nFrancis Bach\n\nINRIA, ENS, PSL Research University\n\nParis, France\n\nfrancis.bach@inria.fr\n\nAbstract\n\ntheoretical works,\n\nit was shown that strongly over-\nIn a series of recent\nparameterized neural networks trained with gradient-based methods could converge\nexponentially fast to zero training loss, with their parameters hardly varying. 
In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.

1 Introduction

Differentiable programming is becoming an important paradigm in signal processing and machine learning that consists in building parameterized models, sometimes with a complex architecture and a large number of parameters, and adjusting these parameters in order to minimize a loss function using gradient-based optimization methods. The resulting problem is in general highly non-convex. It has been observed empirically that, for fixed loss and model class, changes in the parameterization, optimization procedure, or initialization could lead to a selection of models with very different properties [36]. This paper is about one such implicit bias phenomenon, that we call lazy training, which corresponds to the model behaving like its linearization around the initialization.

This work is motivated by a series of recent articles [11, 22, 10, 2, 37] where it is shown that over-parameterized neural networks could converge linearly to zero training loss with their parameters hardly varying.
With a slightly different approach, it was shown in [17] that infinitely wide neural networks behave like the linearization of the neural network around its initialization. In the present work, we argue that this behavior is not specific to neural networks, and is not so much due to over-parameterization as to an implicit choice of scaling. By introducing an explicit scale factor, we show that essentially any parametric model can be trained in this lazy regime if its output is close to zero at initialization. This shows that guaranteed fast training is indeed often possible, but at the cost of recovering a linear method¹. Our experiments on two-layer neural networks and deep convolutional neural networks (CNNs) suggest that this behavior is undesirable in practice.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Training a two-layer ReLU neural network initialized with normal random weights of variance τ²: lazy training occurs when τ is large. (a) Non-lazy training (τ = 0.1) and (b) lazy training (τ = 2): trajectory of weights during gradient descent in 2-D (color shows the sign of the output layer). (c) Generalization properties in 100-D: generalization worsens as τ increases. The ground truth is generated with 3 neurons (arrows in (a)-(b)). Details in Section 3.

1.1 Presentation of lazy training

We consider a parameter space² R^p, a Hilbert space F, a smooth model h : R^p → F (such as a neural network) and a smooth loss R : F → R+.
We aim to minimize, with gradient-based methods, the objective function F : R^p → R+ defined as

F(w) := R(h(w)).

With an initialization w0 ∈ R^p, we define the linearized model h̄(w) = h(w0) + Dh(w0)(w − w0) around w0, and the corresponding objective F̄ : R^p → R+ as

F̄(w) := R(h̄(w)).

It is a general fact that the optimization paths of F and F̄ starting from w0 are close at the beginning of training. We call lazy training the less expected situation where these two paths remain close until the algorithm is stopped.

Showing that a certain non-convex optimization problem is in the lazy regime opens the way for surprisingly precise results, because linear models are rather well understood. For instance, when R is strongly convex, gradient descent on F̄ with an appropriate step-size converges linearly to a global minimizer [4]. For two-layer neural networks, we show in Appendix A.2 that the linearized model is a random feature model [26], which lends itself nicely to statistical analysis [6]. Yet, while advantageous from a theoretical perspective, it is not clear a priori whether this lazy regime is desirable in practice.

This phenomenon is illustrated in Figure 1, where lazy training for a two-layer neural network with rectified linear units (ReLU) is achieved by increasing the variance τ² at initialization (see next section). While in panel (a) the ground truth features are identified, this is not the case for lazy training in panel (b), which manages to interpolate the observations with just a small displacement in parameter space (in both cases, near zero training loss was achieved). As seen in panel (c), this behavior hinders good generalization in the teacher-student setting [30]. The plateau reached for large τ corresponds exactly to the performance of the linearized model; see Section 3.1 for details.

1.2 When does lazy training occur?

A general criterion.
Let us start with a formal computation. We assume that w0 is not a minimizer, so that F(w0) > 0, and not a critical point, so that ∇F(w0) ≠ 0. Consider a gradient descent step w1 := w0 − η∇F(w0), with a small stepsize η > 0. On the one hand, the relative change of the objective is

Δ(F) := |F(w1) − F(w0)| / F(w0) ≈ η ‖∇F(w0)‖² / F(w0).

On the other hand, the relative change of the differential of h, measured in operator norm, is

Δ(Dh) := ‖Dh(w1) − Dh(w0)‖ / ‖Dh(w0)‖ ≤ η ‖∇F(w0)‖ ‖D²h(w0)‖ / ‖Dh(w0)‖.

Lazy training refers to the case where the differential of h does not sensibly change while the loss enjoys a significant decrease, i.e., Δ(F) ≫ Δ(Dh). Using the above estimates, this is guaranteed when

‖∇F(w0)‖ / F(w0) ≫ ‖D²h(w0)‖ / ‖Dh(w0)‖.

For the square loss R(y) = ½‖y − y*‖² for some y* ∈ F, this leads to the simpler criterion

κ_h(w0) := ‖h(w0) − y*‖ ‖D²h(w0)‖ / ‖Dh(w0)‖² ≪ 1,   (1)

using the approximation ‖∇F(w0)‖ = ‖Dh(w0)ᵀ(h(w0) − y*)‖ ≈ ‖Dh(w0)‖ · ‖h(w0) − y*‖. This quantity κ_h(w0) could be called the inverse relative scale of the model h at w0.

¹ Here we mean a prediction function linearly parameterized by a potentially infinite-dimensional vector.
² Our arguments could be generalized to the case where the parameter space is a Riemannian manifold.
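As a concrete illustration (our own toy example, not from the paper), κ_h(w0) can be computed in closed form for the 2-homogeneous model h(w) = w1² − w2², whose second differential is constant:

```python
import numpy as np

def kappa(w0, y_star):
    """Inverse relative scale kappa_h(w0) = ||h(w0) - y*|| ||D^2h(w0)|| / ||Dh(w0)||^2
    for the toy model h(w) = w1^2 - w2^2, with all norms in closed form."""
    h0 = w0[0]**2 - w0[1]**2
    dh = np.linalg.norm([2.0 * w0[0], -2.0 * w0[1]])   # ||Dh(w0)||
    d2h = 2.0   # operator norm of the constant bilinear map diag(2, -2)
    return abs(h0 - y_star) * d2h / dh**2

w0 = np.array([1.0, 1.0])       # symmetric initialization: h(w0) = 0
print(kappa(w0, 1.0))           # 2/8 = 0.25: criterion (1) not yet satisfied
print(kappa(10.0 * w0, 1.0))    # scaling up the initialization shrinks kappa
```

Since h(w0) = 0 here, enlarging the initialization only grows ‖Dh(w0)‖², so κ drops and the model enters the lazy regime, in line with the criterion above.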
We prove in Theorem 2.3 that it indeed controls how much the training dynamics differ from the linearized training dynamics when R is the square loss³. For now, let us explore situations in which lazy training can be shown to occur, by investigating the behavior of κ_h(w0).

Rescaled models. Considering a scaling factor α > 0, it holds

κ_{αh}(w0) = (1/α) ‖αh(w0) − y*‖ ‖D²h(w0)‖ / ‖Dh(w0)‖².

Thus, κ_{αh}(w0) simply decreases as 1/α when α grows and ‖αh(w0) − y*‖ is bounded, leading to lazy training for large α. Training dynamics for such rescaled models are studied in depth in Section 2. For neural networks, there are various ways to ensure h(w0) = 0; see Section 3.

Homogeneous models. If h is q-positively homogeneous⁴, then multiplying the initialization by λ is equivalent to multiplying the scale factor α by λ^q. In equation,

κ_h(λw0) = (1/λ^q) ‖λ^q h(w0) − y*‖ ‖D²h(w0)‖ / ‖Dh(w0)‖².

This formula applies for instance to q-layer neural networks consisting of a cascade of homogeneous non-linearities and linear, but not affine, operators. Such networks thus enter the lazy regime as the variance of initialization increases, if one makes sure that the initial output has bounded norm (see Figures 1 and 2(b) for 2-homogeneous examples).

Two-layer neural networks. For m, d ∈ N, consider functions h_m : (R^d)^m → F of the form

h_m(w) = α(m) Σ_{i=1}^m φ(θ_i),

where α(m) > 0 is a normalization, w = (θ_1, ..., θ_m) and φ : R^d → F is a smooth function. This setting covers the case of two-layer neural networks (see Appendix A.2).
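As a quick numerical companion (a toy of our own, not an experiment from the paper), one can evaluate κ in closed form for a scalar instance of this family, taking φ = tanh so that Eφ(θ_i) = 0 for θ_i ~ N(0, 1):

```python
import numpy as np

def kappa_two_layer(m, alpha, y_star=1.0, seed=0):
    """kappa_{h_m}(w0) for the toy model h_m(w) = alpha * sum_i tanh(theta_i),
    theta_i ~ N(0, 1) i.i.d.  Because h_m is a sum of scalar features, Dh and
    D^2h are diagonal and all norms have closed forms."""
    rng = np.random.default_rng(seed)
    t = np.tanh(rng.standard_normal(m))
    dphi = 1.0 - t**2                       # tanh'
    d2phi = -2.0 * t * (1.0 - t**2)         # tanh''
    h0 = alpha * t.sum()
    dh = alpha * np.linalg.norm(dphi)       # ||Dh(w0)||
    d2h = alpha * np.max(np.abs(d2phi))     # ||D^2h(w0)|| (diagonal bilinear map)
    return abs(h0 - y_star) * d2h / dh**2

def avg(m, alpha, n_seeds=20):
    return float(np.mean([kappa_two_layer(m, alpha, seed=s) for s in range(n_seeds)]))

# alpha(m) = 1/sqrt(m): m * alpha(m) -> infinity, kappa shrinks with m (lazy)
# alpha(m) = 1/m: critical scaling, kappa stays of order 1 (non-lazy)
```

For this toy, averaging over seeds, κ decays roughly like m^{-1/2} under the 1/√m normalization while remaining of order 1 under the 1/m normalization, matching the bound discussed next.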
When initializing with independent and identically distributed variables (θ_i)_{i=1}^m satisfying Eφ(θ_i) = 0, and under the assumption that Dφ is not identically 0 on the support of the initialization, we prove in Appendix A.2 that for large m it holds

E[κ_{h_m}(w0)] ≲ m^{-1/2} + (mα(m))^{-1}.

As a consequence, as long as mα(m) → ∞ when m → ∞, such models are bound to reach the lazy regime. In this case, the norm of the initial output becomes negligible in front of the scale as m grows, due to the statistical cancellations that follow from the assumption Eφ(θ_i) = 0. In contrast, the critical scaling α(m) = 1/m allows convergence as m → ∞ to a non-degenerate dynamic described by a partial differential equation and referred to as the mean-field limit [24, 7, 29, 33].

³ Note that lazy training could occur even when κ_h(w0) is large, i.e., Eq. (1) only gives a sufficient condition.
⁴ That is, for q ≥ 1, it holds h(λw) = λ^q h(w) for all λ > 0 and w ∈ R^p.

1.3 Content and contributions

The goal of this paper is twofold: (i) understanding in a general optimization setting when lazy training occurs, and (ii) investigating the practical usefulness of models in the lazy regime. It is organized as follows:

• in Section 2, we study the gradient flows for rescaled models αh and prove in various situations that for large α, they are close to gradient flows of the linearized model. When the loss is strongly convex, we also prove that lazy gradient flows converge linearly, either to a global minimizer for over-parameterized models, or to a local minimizer for under-parameterized models;

• in Section 3, we use numerical experiments on synthetic cases to illustrate how lazy training differs from other regimes of training (see also Figure 1).
Most importantly, we show empirically that CNNs used in practice could be far from the lazy regime, with their performance not exceeding that of some classical linear methods as they become lazy. Our focus is on general principles and qualitative description.

Updates of the paper. This article is an expanded version of "A Note on Lazy Training in Supervised Differentiable Programming" that appeared online in December 2018. Compared to the first version, it has been complemented with finite horizon bounds in Section 2.2 and numerical experiments on CNNs in Section 3.2, while the rest of the material has been slightly reorganized.

2 Analysis of Lazy Training Dynamics

2.1 Theoretical setting

Our goal in this section is to show that lazy training dynamics for the scaled objective

F_α(w) := (1/α²) R(αh(w))   (2)

are close, when the scaling factor α is large, to those of the scaled objective for the linearized model

F̄_α(w) := (1/α²) R(αh̄(w)),   (3)

where h̄(w) := h(w0) + Dh(w0)(w − w0) and w0 ∈ R^p is a fixed initialization. Multiplying the objective by 1/α² does not change the minimizers, and corresponds to the proper time parameterization of the dynamics for large α. Our basic assumptions are the following:

Assumption 2.1. The parametric model h : R^p → F is differentiable with a locally Lipschitz differential⁵ Dh. Moreover, R is differentiable with a Lipschitz gradient.

This setting is mostly motivated by supervised learning problems, where one considers a probability distribution ρ ∈ P(R^d × R^k) and defines F as the space L²(ρ_x; R^k) of square-integrable functions with respect to ρ_x, the marginal of ρ on R^d. The risk R is then built from a smooth loss function ℓ : (R^k)² → R+ as R(g) = E_{(X,Y)∼ρ} ℓ(g(X), Y).
This corresponds to empirical risk minimization when ρ is a finite discrete measure, and to population risk minimization otherwise (in which case only stochastic gradients are available to algorithms). Finally, one defines h(w) = f(w, ·), where f : R^p × R^d → R^k is a parametric model, such as a neural network, whose outputs in R^k depend on parameters in R^p and input data in R^d.

Gradient flows. In the rest of this section, we study the gradient flow of the objective function F_α, which is an approximation of (accelerated) gradient descent [12, 31] and stochastic gradient descent [19, Thm. 2.1] with small enough step sizes. With an initialization w0 ∈ R^p, the gradient flow of F_α is the path (w_α(t))_{t≥0} in the space of parameters R^p that satisfies w_α(0) = w0 and solves the ordinary differential equation

w_α'(t) = −∇F_α(w_α(t)) = −(1/α) Dh(w_α(t))ᵀ ∇R(αh(w_α(t))),   (4)

where Dhᵀ denotes the adjoint of the differential Dh. We will study this dynamic for itself, and will also compare it to the gradient flow (w̄_α(t))_{t≥0} of F̄_α that satisfies w̄_α(0) = w0 and solves

w̄_α'(t) = −∇F̄_α(w̄_α(t)) = −(1/α) Dh(w0)ᵀ ∇R(αh̄(w̄_α(t))).   (5)

Note that when h(w0) = 0, the renormalized dynamic w0 + α(w̄_α(t) − w0) does not depend on α, as it simply follows the gradient flow of w ↦ R(Dh(w0)(w − w0)) starting from w0.

⁵ Dh(w) is a continuous linear map from R^p to F. The Lipschitz constant of Dh : w ↦ Dh(w) is defined with respect to the operator norm. When F has finite dimension, Dh(w) can be identified with the Jacobian matrix of h at w.

2.2 Bounds with a finite time horizon

We start with a general result that confirms that when h(w0) = 0, taking large α leads to lazy training. We do not assume convexity of R.

Theorem 2.2 (General lazy training). Assume that h(w0) = 0. Given a fixed time horizon T > 0, it holds

sup_{t∈[0,T]} ‖w_α(t) − w0‖ = O(1/α),  sup_{t∈[0,T]} ‖w_α(t) − w̄_α(t)‖ = O(1/α²)  and  sup_{t∈[0,T]} ‖αh(w_α(t)) − αh̄(w̄_α(t))‖ = O(1/α).

For supervised machine learning problems, the bound on ‖w_α(t) − w̄_α(t)‖ implies that αh(w_α(T)) also generalizes like αh̄(w̄_α(T)) outside of the training set for large α; see Appendix A.3. Note that the generalization behavior of linear models has been widely studied, and is particularly well understood for random feature models [26], which are recovered when linearizing two-layer neural networks; see Appendix A.2. It is possible to track the constants in Theorem 2.2, but they would depend exponentially on the time horizon T.
This exponential dependence can however be discarded for the specific case of the square loss, where we recover the scale criterion informally derived in Section 1.2.

Theorem 2.3 (Square loss, quantitative). Consider the square loss R(y) = ½‖y − y*‖² for some y* ∈ F, and assume that for some (potentially small) r > 0, h is Lip(h)-Lipschitz and Dh is Lip(Dh)-Lipschitz on the ball of radius r around w0. Then for an iteration number K > 0 and corresponding time T := K/Lip(h)², it holds

‖αh(w_α(T)) − αh̄(w̄_α(T))‖ / ‖αh(w0) − y*‖ ≤ (K²/α) · Lip(Dh) ‖αh(w0) − y*‖ / Lip(h)²,

as long as α ≥ K‖αh(w0) − y*‖/(r Lip(h)).

We can make the following observations:

• For the sake of interpretability, we have introduced a quantity K, analogous to an iteration number, that accounts for the fact that the gradient flow needs to be integrated with a step-size of order 1/Lip(∇F_α) = 1/Lip(h)². For instance, with this step-size, gradient descent at iteration K approximates the gradient flow at time T = K/Lip(h)²; see, e.g., [12, 31].

• Laziness only depends on the local properties of h around w0. These properties may vary a lot over the parameter space, as is the case for the homogeneous functions seen in Section 1.2.

For completeness, similar bounds on ‖w_α(T) − w0‖ and ‖w_α(T) − w̄_α(T)‖ are also provided in Appendix B.2. The drawback of the bounds in this section is the increasing dependency on time, which is removed in the next section. Yet, the relevance of Theorem 2.2 remains, because it does not depend on the conditioning of the problem.
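The finite-horizon picture can be probed numerically. The sketch below (our own illustration, not an experiment from the paper) integrates the flows (4) and (5) with an explicit Euler scheme for the hypothetical toy model h(w) = w1² − w2², initialized at w0 = (1, 1) so that h(w0) = 0, under the square loss with target y* = 1; Theorem 2.2 predicts a parameter gap of order O(1/α²):

```python
import numpy as np

def flow_gap(alpha, T=2.0, dt=1e-3):
    """Explicit Euler integration of the rescaled gradient flow (4) and of its
    linearization (5) for h(w) = w1^2 - w2^2, square loss with target y* = 1,
    w0 = (1, 1).  Returns sup_t ||w_alpha(t) - wbar_alpha(t)||."""
    y = 1.0
    w = np.array([1.0, 1.0])    # nonlinear flow
    wl = w.copy()               # linearized flow
    gap = 0.0
    for _ in range(int(T / dt)):
        h = w[0]**2 - w[1]**2
        w = w - dt / alpha * (alpha * h - y) * np.array([2.0 * w[0], -2.0 * w[1]])
        hl = 2.0 * (wl[0] - 1.0) - 2.0 * (wl[1] - 1.0)   # h(w0) + Dh(w0)(w - w0)
        wl = wl - dt / alpha * (alpha * hl - y) * np.array([2.0, -2.0])
        gap = max(gap, float(np.linalg.norm(w - wl)))
    return gap

gaps = [flow_gap(a) for a in (2.0, 8.0, 32.0)]
# each 4x increase of alpha should shrink the gap by roughly 16x, i.e. O(1/alpha^2)
```

For this toy model the measured ratios are indeed close to 16, consistent with the O(1/α²) bound of Theorem 2.2.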
Although the bound grows as K², it gives an informative estimate for large or ill-conditioned problems, where training is typically stopped much before convergence.

2.3 Uniform bounds and convergence in the lazy regime

This section is devoted to uniform bounds in time and convergence results under the assumption that R is strongly convex. In this setting, the function F̄_α is strictly convex on the affine hyperspace w0 + ker Dh(w0)^⊥, which contains the linearized gradient flow (w̄_α(t))_{t≥0}, so the latter converges linearly to the unique global minimizer of F̄_α. In particular, if h(w0) = 0 then this global minimizer does not depend on α and sup_{t≥0} ‖w̄_α(t) − w0‖ = O(1/α). We will see in this part how these properties reflect on the lazy gradient flow w_α(t).

Over-parameterized case. The following proposition shows global convergence of lazy training under the condition that Dh(w0) is surjective. As rank Dh(w0) gives the number of effective parameters or degrees of freedom of the model around w0, this over-parameterization assumption guarantees that any model around h(w0) can be fitted. Of course, this can only happen if F is finite-dimensional.

Theorem 2.4 (Over-parameterized lazy training). Consider an M-smooth and m-strongly convex loss R with minimizer y* and condition number κ := M/m. Assume that σ_min, the smallest singular value of Dh(w0)ᵀ, is positive and that the initialization satisfies ‖h(w0)‖ ≤ C0 := σ_min³/(32 κ^{3/2} ‖Dh(w0)‖ Lip(Dh)), where Lip(Dh) is the Lipschitz constant of Dh. If α > ‖y*‖/C0, then for t ≥ 0, it holds

‖αh(w_α(t)) − y*‖ ≤ √κ ‖αh(w0) − y*‖ exp(−m σ_min² t/4).

If moreover h(w0) = 0, it holds as α → ∞,

sup_{t≥0} ‖w_α(t) − w0‖ = O(1/α),  sup_{t≥0} ‖αh(w_α(t)) − αh̄(w̄_α(t))‖ = O(1/α)  and  sup_{t≥0} ‖w_α(t) − w̄_α(t)‖ = O(log α/α²).

The proof of this result relies on the fact that αh(w_α(t)) follows the gradient flow of R in a time-dependent and non-degenerate metric: the pushforward metric [21] induced by h on F. For the first part, we do not claim improvements over [11, 22, 10, 2, 37], where a lot of effort is also put into dealing with the non-smoothness of h, which we do not study here. As for the uniform-in-time comparison with the tangent gradient flow, it is new and follows mostly from Lemma B.2 in Appendix B, where the constants are given and depend polynomially on the characteristics of the problem.

Under-parameterized case. We now remove the over-parameterization assumption and show again linear convergence for large values of α. This covers in particular the case of population loss minimization, where F is infinite-dimensional. For this setting, we limit ourselves to a qualitative statement⁶.

Theorem 2.5 (Under-parameterized lazy training). Assume that F is separable, R is strongly convex, h(w0) = 0 and rank Dh(w) is constant on a neighborhood of w0.
Then there exists α0 > 0 such that for all α ≥ α0, the gradient flow (4) converges at a geometric rate (asymptotically independent of α) to a local minimum of F_α.

Thanks to the lower-semicontinuity of the rank function, the assumption that the rank is locally constant holds generically, in the sense that it is satisfied on an open dense subset of R^p. In this under-parameterized case, the limit lim_{t→∞} w_α(t) is, for α large enough, a strict local minimizer, but in general not a global minimizer of F_α, because the image of Dh(w0) does not a priori contain the global minimizer of R. Thus it cannot be excluded that there exist parameters w farther from w0 with a smaller loss. This fact is clearly observed experimentally in Section 3, Figure 2-(b). Finally, a comparison with the linearized gradient flow as in Theorem 2.4 could be shown along the same lines, but would be technically slightly more involved because differential geometry comes into play.

Relationship to the global convergence result in [7]. A consequence of Theorem 2.5 is that in the lazy regime, the gradient flow of the population risk for a two-layer neural network might get stuck in a local minimum. In contrast, it is shown in [7] that such gradient flows converge to global optimality in the infinite over-parameterization limit p → ∞ if initialized with enough diversity in the weights. This is not a contradiction, since Theorem 2.5 assumes a finite number p of parameters. In the lazy regime, the population loss might also converge to its minimum when p increases: this is guaranteed if the tangent kernel Dh(w0)Dh(w0)ᵀ [17] converges (after normalization) to a universal kernel as p → ∞. However, this convergence might be unreasonably slow in high dimension, as Figure 1-(c) suggests.
As a side note, we stress that the global convergence result in [7] is not limited to lazy dynamics but also covers non-linear dynamics, such as those seen in Figure 1, where neurons move.

⁶ In contrast to the finite horizon bound of Theorem 2.3, quantitative statements would here involve the smallest positive singular value of Dh(w0), which is anyways hard to control.

Figure 2: (a) Test loss at convergence for gradient descent, when α depends on m as α = 1/m or α = 1/√m, the latter leading to lazy training for large m (not symmetrized). (b) Population loss at convergence versus τ for SGD with a random N(0, τ²) initialization (symmetrized). In the hatched area the loss was still slowly decreasing.

3 Numerical Experiments

We realized two sets of experiments, the first with two-layer neural networks conducted on synthetic data and the second with convolutional neural networks (CNNs) conducted on the CIFAR-10 dataset [18]. The code to reproduce these experiments is available online⁷.

3.1 Two-layer neural networks in the teacher-student setting

We consider the following two-layer student neural network h_m(w) = f_m(w, ·) with

f_m(w, x) = Σ_{j=1}^m a_j max(b_j · x, 0),

where a_j ∈ R and b_j ∈ R^d for j = 1, ..., m. It is trained to minimize the square loss with respect to the output of a two-layer teacher neural network with the same architecture and m0 = 3 hidden neurons, with random weights normalized so that ‖a_j b_j‖ = 1 for j ∈ {1, 2, 3}. For the student network, we use random Gaussian weights, except when symmetrized initialization is mentioned, in which case we use random Gaussian weights for j ≤ m/2 and set, for j > m/2, b_j = b_{j−m/2} and a_j = −a_{j−m/2}. This amounts to training a model of the form h(w_a, w_b) = h_{m/2}(w_a) − h_{m/2}(w_b) with w_a(0) = w_b(0), and guarantees zero output at initialization.
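A minimal re-implementation of this setup (our own sketch, with an arbitrary target instead of the teacher network and hypothetical sizes) illustrates the lazy regime of Figure 1: with symmetrized initialization of standard deviation τ, the relative parameter movement during training collapses as τ grows:

```python
import numpy as np

def relative_movement(tau, m=50, n=20, d=5, steps=500, seed=0):
    """Full-batch gradient descent on the square loss for a symmetrized
    two-layer ReLU net f(x) = sum_j a_j max(b_j . x, 0) with N(0, tau^2)
    weights; returns the relative movement ||w_T - w_0|| / ||w_0||."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)        # inputs on the sphere
    y = np.maximum(X[:, 0], 0) - np.maximum(X[:, 1], 0)  # arbitrary target
    half = m // 2
    Bh = tau * rng.standard_normal((half, d))
    ah = tau * rng.standard_normal(half)
    B, a = np.vstack([Bh, Bh]), np.concatenate([ah, -ah])  # f = 0 at init
    B0, a0 = B.copy(), a.copy()
    lr = 0.1 / (m * max(tau**2, 1.0))     # step size adapted to the model scale
    for _ in range(steps):
        pre = X @ B.T                     # (n, m) pre-activations
        act = np.maximum(pre, 0.0)
        r = act @ a - y                   # residuals
        ga = (act.T @ r) / n
        gB = (((r[:, None] * (pre > 0)) * a).T @ X) / n
        a -= lr * ga
        B -= lr * gB
    num = np.sqrt(np.sum((a - a0)**2) + np.sum((B - B0)**2))
    return num / np.sqrt(np.sum(a0**2) + np.sum(B0**2))
```

For small τ the parameters move substantially (features adapt), while for large τ the network fits the data with an almost negligible displacement, as in panels (a)-(b) of Figure 1.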
The training data are n input points uniformly sampled on the unit sphere in R^d, and we minimize the empirical risk, except for Figure 2-(b) where we directly minimize the population risk with Stochastic Gradient Descent (SGD).

Cover illustration. Let us detail the setting of Figure 1 in Section 1. Panels (a)-(b) show gradient descent dynamics with n = 15 and m = 20 with symmetrized initialization (illustrations with more neurons can be found in Appendix C). To obtain a 2-D representation, we plot |a_j(t)| b_j(t) throughout training (lines) and at convergence (dots) for j ∈ {1, ..., m}. The blue or red colors stand for the signs of a_j(t), and the unit circle is displayed to help visualize the change of scale. In panel (c), we set n = 1000, m = 50 with symmetrized initialization and report the average and standard deviation of the test loss over 10 experiments. To ensure that the bad performance corresponding to large τ is not due to a lack of regularization, we also display the best test error throughout training (for kernel methods, early stopping is a form of regularization [34]).

Increasing number of parameters. Figure 2-(a) shows the evolution of the test error when increasing m, as discussed in Section 1.2, without symmetrized initialization. We report the results for two choices of scaling functions α(m), averaged over 5 experiments with d = 100. The scaling 1/√m leads to lazy training, with poor generalization as m increases, in contrast to the scaling 1/m, for which the test error remains relatively close to 0 for large m (more experiments with this scaling can be found in [7, 29, 24]).

⁷ https://github.com/edouardoyallon/lazy-training-CNN

Under-parameterized case. Finally, Figure 2-(b) illustrates the under-parameterized case, with d = 100 and m = 50 with symmetrized initialization.
We used SGD with batch-size 200 to minimize the population square loss, and display the average and standard deviation of the final population loss (estimated with 2000 samples) over 5 experiments. As shown in Theorem 2.5, SGD converges to an a priori local minimum in the lazy regime (i.e., here, for large τ). In contrast, it behaves well when τ is small, as in Figure 1. There is also an intermediate regime (hatched area) where convergence is very slow and the loss was still decreasing when the algorithm was stopped.

3.2 Deep CNNs experiments

We now study whether lazy training is relevant to understand the good performance of convolutional neural networks (CNNs).

Interpolating from standard to lazy training. We first study the effect of increasing the scale factor α on a standard pipeline for image classification on the CIFAR-10 dataset. We consider the VGG-11 model [32], which is a widely used model on CIFAR-10. We trained it via mini-batch SGD with a momentum parameter of 0.9. For the sake of interpretability, no extra regularization (e.g., BatchNorm) is incorporated, since a simple framework that outperforms linear-method baselines with some margin is sufficient for our purpose (see Figure 3(b)). An initial learning rate η0 is decayed at each epoch t, following η_t = η0/(1 + βt). The biases are initialized with 0 and all other weights are initialized with normal Xavier initialization [13]. In order to set the initial output to 0, we use the centered model h, which consists in replacing the VGG model h̃ by h(w) := h̃(w) − h̃(w0). Notice that this does not modify the differential at initialization.

The model h is trained with the square loss multiplied by 1/α² (as in Section 2), with standard data augmentation, a batch-size of 128 [35], and η0 = 1, which gives the best test accuracies over the grid 10^k, k ∈ {−3, ..., 3}, for all α.
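The centering trick and the learning-rate schedule just described can be sketched generically (here `model` stands for any map from parameters to outputs; the quadratic toy model below is our own placeholder, not the VGG network):

```python
import numpy as np

def centered(model, w0):
    """Centering trick: replace h~ by h(w) = h~(w) - h~(w0).  The output at
    initialization becomes 0 while the differential at w0 (and hence the
    linearization) is unchanged, since only a constant is subtracted."""
    ref = model(w0)
    return lambda w: model(w) - ref

def lr_schedule(eta0, beta, epoch):
    """Learning rate decayed at each epoch as eta_t = eta0 / (1 + beta * t)."""
    return eta0 / (1.0 + beta * epoch)

# toy check on a placeholder model
model = lambda w: np.array([w[0]**2 + 1.0, np.sin(w[1])])
w0 = np.array([1.0, 0.5])
h = centered(model, w0)
```

Because the subtracted term is constant in w, finite-difference derivatives of `h` and `model` at any point coincide exactly, which is the property used above.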
The total number of epochs is 70, adjusted so that the performance reaches a plateau for α = 1. Figure 3(a) reports the accuracy after training αh for increasing values of α = 10^k, k ∈ {0, 1, ..., 7} (α = 1 being the standard setting). For α < 1, the training loss diverges with η0 = 1. We also report the stability of activations, which is the share of neurons over ReLU layers that, after training, are activated for the same inputs as at initialization; see Appendix C. Values close to 100% are strong indicators of an effective linearization.

We observe a significant drop in performance as α grows, until the accuracy reaches a plateau, suggesting that the CNN progressively reaches the lazy regime. This demonstrates that the linearized model (large α) is not sufficient to explain the good performance of the model for α = 1. For large α, we obtain a low limit training accuracy and do not observe overfitting, a surprising fact since this amounts to solving an over-parameterized linear system. This behavior is due to a poorly conditioned linearized model; see Appendix C.

Performance of linearized CNNs. In this second set of experiments, we investigate whether variations of the models trained above in the lazy regime could increase the performance and, in particular, could outperform other linear methods which also do not involve learning a representation [26, 25]. To this end, we train widened CNNs in the lazy regime, as widening is a well-known strategy to boost the performance of a given architecture [35]. We multiply the number of channels of each layer by 8 for the VGG model and by 7 for the ResNet model [16] (these values are limited by hardware constraints). We choose α = 10⁷ to train the linearized models, a batch-size of 8 and, after cross-validation, η0 = 0.01 and 1.0 for, respectively, the standard and linearized models.
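The "stability of activations" statistic can be computed from pre-activations recorded at initialization and after training; a minimal sketch (our own formulation of the statistic described above):

```python
import numpy as np

def activation_stability(pre_init, pre_trained):
    """Share of (input, ReLU unit) pairs whose activation pattern, i.e. the
    sign of the pre-activation, is identical before and after training.
    Values close to 1 indicate an effectively linearized network."""
    return float(np.mean((pre_init > 0) == (pre_trained > 0)))

# toy pre-activations, shape (n_inputs, n_units); one of four patterns flips
p0 = np.array([[1.0, -2.0], [0.5, 3.0]])
p1 = np.array([[2.0, -1.0], [-0.5, 4.0]])
print(activation_stability(p0, p1))   # 0.75
```

In the lazy regime the ReLU activation patterns are frozen, so the network computes a fixed (random-feature-like) linear model, which is what a stability near 100% certifies.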
We also multiply the initial weights by 1.2 and 1.3 for the ResNet-18 and VGG-11, respectively, as we found that this slightly boosts the training accuracies. Each model is trained with the cross-entropy loss divided by α² until the test accuracy stabilizes or increases, and we check that the average stability of activations (see Appendix C) was 100%.
As seen in Figure 3(b), widening the VGG model slightly improves the performance of the linearized model compared to the previous experiment, but there is still a substantial performance gap with other non-learned representation methods [28, 25], not to mention the even wider gap with their non-lazy counterparts. This behavior is also observed for the state-of-the-art ResNet architecture. Note that [3] reports a test accuracy of 77.4% without data augmentation for a linearized CNN with a specially designed architecture which in particular solves the issue of ill-conditioning. Whether variations of standard architectures and pipelines can lead to competitive performance with linearized CNNs remains an open question.

Model                      Train acc.  Test acc.
ResNet wide, linearized    55.0        56.7
VGG-11 wide, linearized    61.0        61.7
Prior features [25]        -           82.3
Random features [28]       -           84.2
VGG-11 wide, standard      99.9        89.7
ResNet wide, standard      99.4        91.0

Figure 3: (a) Accuracies on CIFAR10 as a function of the scaling α. The stability of activations suggests a linearized regime when high. (b) Accuracies on CIFAR10 obtained for α = 1 (standard, non-linear) and α = 10^7 (linearized), compared to those reported for some linear methods without data augmentation: random features and prior features based on the scattering transform.

Remark on wide NNs. It was proved [17] that neural networks with standard initialization (random independent weights with zero mean and variance O(1/nℓ) at layer ℓ, where nℓ is the size of the previous layer) are bound to reach the lazy regime as the sizes of all layers grow unbounded. Moreover, for very large neural networks of more than 2 layers, this choice of initialization is essentially mandatory to avoid exploding or vanishing initial gradients [15, 14] if the weights are independent with zero mean. Thus we stress that we do not claim that wide neural networks do not show a lazy behavior, but rather that those which exhibit good performance are far from this asymptotic behavior.

4 Discussion

Lazy training is an implicit bias phenomenon that refers to the situation when a non-linear parametric model behaves like a linear one. This arises when the scale of the model becomes large, which happens implicitly under some choices of hyper-parameters. While the lazy training regime provides some of the first optimization-related theoretical insights for deeper models [10, 2, 37, 17], we believe it does not yet explain the many successes of neural networks that have been observed in various challenging, high-dimensional tasks in machine learning. This is corroborated by numerical experiments where it is seen that the performance of networks trained in the lazy regime degrades and in particular does not exceed that of some classical linear methods.
Instead, the intriguing phenomenon that still defies theoretical understanding is the one displayed in Figure 1(c) for small τ and in Figure 3(a) for α = 1: neural networks trained with gradient-based methods (and neurons that move) have the ability to perform high-dimensional feature selection through highly non-linear dynamics.

Acknowledgments

We acknowledge support from grants from Région Ile-de-France and the European Research Council (grant SEQUOIA 724063). Edouard Oyallon was supported by a GPU donation from NVIDIA. We thank Alberto Bietti for interesting discussions and Brett Bernstein for noticing an error in a previous version of this paper.

References

[1] Ralph Abraham, Jerrold E. Marsden, and Tudor Ratiu. Manifolds, Tensor Analysis, and Applications, volume 75. Springer Science & Business Media, 2012.

[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 242–252, 2019.

[3] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

[4] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[5] Youness Boutaib. On Lipschitz maps and their flows. arXiv preprint arXiv:1510.07614, 2015.

[6] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and random features. In Advances in Neural Information Processing Systems, pages 10192–10203, 2018.

[7] Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport.
In Advances in Neural Information Processing Systems, 2018.

[8] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.

[9] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.

[10] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), 2019.

[11] Simon S. Du, Xiyu Zhai, Barnabás Póczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.

[12] Walter Gautschi. Numerical Analysis. Springer Science & Business Media, 1997.

[13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[14] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In Advances in Neural Information Processing Systems, pages 571–581, 2018.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] Arthur Jacot, Franck Gabriel, and Clément Hongler.
Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.

[18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[19] Harold Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.

[20] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018.

[21] John M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.

[22] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8167–8176, 2018.

[23] Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.

[24] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

[25] Edouard Oyallon and Stéphane Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2865–2873, 2015.

[26] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[27] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning.
In Advances in Neural Information Processing Systems, pages 1313–1320, 2009.

[28] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning, pages 5389–5400, 2019.

[29] Grant M. Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. In Advances in Neural Information Processing Systems, 2018.

[30] David Saad and Sara A. Solla. On-line learning in soft committee machines. Physical Review E, 52(4):4225, 1995.

[31] Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d'Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, pages 1109–1118, 2017.

[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[33] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 2019.

[34] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

[35] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12, 2016.

[36] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

[37] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks.
Machine Learning Journal, 2019.