{"title": "Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3202, "page_last": 3211, "abstract": "When optimizing over-parameterized models, such as deep neural networks, a large set of parameters can achieve zero training error. In such cases, the choice of the optimization algorithm and its respective hyper-parameters introduces biases that will lead to convergence to specific minimizers of the objective. Consequently, this choice can be considered as an implicit regularization for the training of over-parametrized models. In this work, we push this idea further by studying the discrete gradient dynamics of the training of a two-layer linear network with the least-squares loss. Using a time rescaling, we show that, with a vanishing initialization and a small enough step size, this dynamics sequentially learns the solutions of a reduced-rank regression with a gradually increasing rank.", "full_text": "Implicit Regularization of Discrete Gradient\n\nDynamics in Linear Neural Networks\n\nGauthier Gidel\nMila & DIRO\nUniversité de Montréal\n\nFrancis Bach\nINRIA & École Normale Supérieure\nPSL Research University, Paris\n\nSimon Lacoste-Julien∗\nMila & DIRO\nUniversité de Montréal\n\nAbstract\n\nWhen optimizing over-parameterized models, such as deep neural networks, a large set of parameters can achieve zero training error. In such cases, the choice of the optimization algorithm and its respective hyper-parameters introduces biases that will lead to convergence to specific minimizers of the objective. Consequently, this choice can be considered as an implicit regularization for the training of over-parametrized models. In this work, we push this idea further by studying the discrete gradient dynamics of the training of a two-layer linear network with the least-squares loss. 
Using a time rescaling, we show that, with a vanishing initialization and a small enough step size, this dynamics sequentially learns the solutions of a reduced-rank regression with a gradually increasing rank.\n\n1 Introduction\n\nWhen optimizing over-parameterized models, such as deep neural networks, a large set of parameters leads to a zero training error. However, these parameters lead to different values of the test error and thus have distinct generalization properties. More specifically, Neyshabur [2017, Part II] argues that the choice of the optimization algorithm (and its respective hyperparameters) provides an implicit regularization with respect to its geometry: it biases the training, finding a particular minimizer of the objective.\nIn this work, we use the same setting as Saxe et al. [2018]: a regression problem with least-squares loss on a multi-dimensional output. Our prediction is made either by a linear model or by a two-layer linear neural network [Saxe et al., 2018]. We extend their work, which covered the continuous gradient dynamics, to weaker assumptions, and we also analyze the behavior of the discrete gradient updates. We show that, with a vanishing initialization and a small enough step-size, the gradient dynamics of a two-layer linear neural network sequentially learns components that can be ranked according to a hierarchical structure, whereas the gradient dynamics induced by the same regression problem with a linear prediction model instead learns these components simultaneously, missing this notion of hierarchy between components. The path followed by the two-layer formulation actually corresponds to successively solving the initial regression problem with a growing low rank constraint, which is also known as reduced-rank regression [Izenman, 1975]. 
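To make the reduced-rank regression target concrete: it admits a classical closed-form solution [Izenman, 1975, Reinsel and Velu, 1998], obtained by projecting the ordinary least-squares fit onto the top singular directions of the fitted values. A minimal numpy sketch (the function name, the random data, and the rank choice are ours, purely for illustration):

```python
import numpy as np

def reduced_rank_regression(X, Y, k):
    # Minimum-norm solution of min_W ||Y - X W||^2 s.t. rank(W) <= k:
    # take the least-squares fit and project it onto the top-k right
    # singular directions of the fitted values X @ W_ols.
    W_ols = np.linalg.pinv(X) @ Y
    _, _, Vt = np.linalg.svd(X @ W_ols, full_matrices=False)
    P = Vt[:k].T @ Vt[:k]          # projector onto the top-k output directions
    return W_ols @ P               # a d x p matrix of rank at most k

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
Y = rng.standard_normal((50, 5))
W2 = reduced_rank_regression(X, Y, 2)
print(np.linalg.matrix_rank(W2))   # rank at most 2
```

With k equal to min(d, p) the projector is the identity and the ordinary least-squares solution is recovered, which is the over-parameterized regime studied in this paper.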
Note that this notion of path followed by the dynamics of a whole network is different from the notion of path introduced by Neyshabur et al. [2015a], which corresponds to a path followed inside a fixed network; i.e., one corresponds to training dynamics whereas the other corresponds to the propagation of information inside a network.\nTo sum up, in our framework, the path followed by the gradient dynamics of a two-layer linear network provides an implicit regularization that may lead to potentially better generalization properties. Our contributions are the following:\n\n∗CIFAR fellow, Canada CIFAR AI chair\nCorrespondence to the first author: .@umontreal.ca\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f• Under some assumptions (see Assumption 1), we prove that both the discrete and continuous gradient dynamics sequentially learn the solutions of a gradually less regularized version of reduced-rank regression (Corollary 2 and 3). Among closely related work, such a result on implicit regularization for discrete dynamics is novel. 
For the continuous case, we weaken the standard commutativity assumption using perturbation analysis.\n\n• We experimentally verify the reasonableness of our assumption and observe improvements in terms of generalization (matrix reconstruction in our case) using the gradient dynamics of the two-layer linear network when compared against the linear model.\n\n1.1 Related Work\n\nThe implicit regularization provided by the choice of the optimization algorithm has recently become an active area of research in machine learning, with much interest in the behavior of gradient descent on deep over-parametrized models [Neyshabur et al., 2015b, 2017, Zhang et al., 2017]. Several works show that gradient descent on unregularized problems actually finds a minimum norm solution with respect to a particular norm that drastically depends on the problem of interest. Soudry et al. [2018] look at a logistic regression problem and show that the predictor does converge to the max-margin solution. A similar idea has been developed in the context of matrix factorization [Gunasekar et al., 2017]. Under the assumption that the observation matrices commute, they prove that gradient descent on this non-convex problem finds the minimum nuclear norm solution of the reconstruction problem; they also conjecture that this result would still hold without the commutativity assumption. This conjecture was later partially resolved by Li et al. [2018] under mild assumptions (namely, the restricted isometry property). This work has some similarities with ours, since both focus on a least-squares regression problem over matrices with a form of matrix factorization that induces a non-convex landscape. 
Their problem is more general than ours (see Uschmajew and Vandereycken [2018] for an even more general setting), but they show a result of a different kind from ours: they focus on the properties of the limit solution of the continuous dynamics, whereas we show properties of the whole dynamics (continuous and discrete), proving that it actually visits points during the optimization that may provide good generalization. Interestingly, both results share common assumptions, such as a commutativity assumption (which is less restrictive in our case since it always holds in some realistic settings such as linear autoencoders), a vanishing initialization, and a small enough step size.\nNar and Sastry [2018] also analyzed the gradient descent algorithm on a least-squares linear network model as a discrete time dynamical system, and derived certain necessary (but not sufficient) properties of the local optima that the algorithm can converge to with a non-vanishing step size. In this work, instead of looking at the properties of the limit solutions, we focus on the path followed by the gradient dynamics and precisely characterize the weights learned along this path.\nCombes et al. [2018] studied the continuous dynamics of some non-linear networks under relatively strong assumptions, such as the linear separability of the data. Conversely, in this work, we do not make such a separability assumption on the data but focus on linear networks.\nFinally, Gunasekar et al. [2018] compared the implicit regularization provided by gradient descent in deep linear convolutional and fully connected networks. They show that the solution found by gradient descent is a minimum norm solution for both networks, but with respect to a different norm for each. In this work, the fact that gradient descent finds the minimum norm solution is almost straightforward using standard results on least-squares. 
But the path followed by the gradient dynamics reveals interesting properties for generalization. As developed earlier, instead of focusing on the properties of the solution found by gradient descent, our goal is to study the path followed by the discrete gradient dynamics in the case of a two-layer linear network.\nPrior work [Saxe et al., 2013, 2014, Advani and Saxe, 2017, Saxe et al., 2018, Lampinen and Ganguli, 2019] studied the gradient dynamics of two-layer linear networks and proved a result similar to our Thm. 2. We consider Saxe et al. [2018] the closest related work; we re-use their notion of simple deep linear neural network, which we call a two-layer neural network, and use some elements of their proofs to extend their results. However, note that their work comes from a different perspective: through a mathematical analysis of a simple non-linear dynamics, they intend to highlight continuous dynamics of learning where one observes the sequential emergence of hierarchically structured notions, in order to explain the regularities in the representation of human semantic knowledge. In this work, we are also considering a two-layer neural network, but with an optimization perspective. We are able to extend Saxe et al. [2018, Eq. 6 and 7], weakening the commutativity assumption considered in Saxe et al. [2018] using perturbation analysis. In §4.1, we test to what extent our weaker assumption holds. Our main contribution is to show a similar result on the discrete gradient dynamics, which is important in our perspective since we aim to study the dynamics of gradient descent. This result cannot be trivially extended from the result on the continuous dynamics. 
We provide details on the difficulties of the proof in §3.2.\n\n2 A Simple Deep Linear Model\n\nIn this work, we are interested in analyzing a least-squares model with multi-dimensional outputs. Given a finite number n of inputs xi ∈ ℝ^d , 1 ≤ i ≤ n, we want to predict multi-dimensional outputs yi ∈ ℝ^p , 1 ≤ i ≤ n, with a deep linear network [Saxe et al., 2018, Gunasekar et al., 2018],\n\nDeep linear model: ŷd(x) := WL^⊤ · · · W1^⊤ x , (1)\n\nwhere W1, . . . , WL are learned through an MSE formulation with the least-squares loss f,\n\n(W1^∗, . . . , WL^∗) ∈ arg min_{Wl ∈ ℝ^{r_{l−1}×r_l}, 1≤l≤L} (1/2n) ‖Y − XW1 · · · WL‖2^2 =: f(W1, . . . , WL) , (2)\n\nwhere r0 = d, rl ∈ ℕ for 1 ≤ l ≤ L − 1, rL = p, and X ∈ ℝ^{n×d} and Y ∈ ℝ^{n×p} are such that\n\nX^⊤ := (x1 · · · xn) and Y^⊤ := (y1 · · · yn) (3)\n\nare the design matrices of (xi)_{1≤i≤n} and (yi)_{1≤i≤n}. The deep linear model (1) is an L-layer deep linear neural network where we see hl := Wl^⊤ · · · W1^⊤ x for 1 ≤ l ≤ L − 1 as the lth hidden layer. At first, since this deep linear network cannot represent more than a linear transformation, we could think that there is no reason to use a representation deeper than L = 1. 
However, in terms of learning flow, we will see in §3 that for L = 2 this model has a completely different dynamics from L = 1. Increasing L may induce a low rank constraint when r := min{rl : 1 ≤ l ≤ L − 1} < min(d, p). In that case, (2) is equivalent to a reduced-rank regression,\n\nW^{k,∗} ∈ arg min_{W ∈ ℝ^{d×p}, rank(W) ≤ k} (1/2n) ‖Y − XW‖2^2 . (4)\n\nThese problems have explicit solutions depending on X and Y [Reinsel and Velu, 1998, Thm. 2.2]. Note that, in this work, we are interested in the implicit regularization provided in the context of over-parametrized models, i.e., when r > min(p, d). In that case,\n\n{W1 · · · WL : Wl ∈ ℝ^{r_{l−1}×r_l} , 1 ≤ l ≤ L} = ℝ^{d×p} .\n\n3 Gradient Dynamics as a Regularizer\n\nIn this section we would like to study the discrete dynamics of the gradient flow of (2), i.e.,\n\nWl^{(t+1)} = Wl^{(t)} − η∇_{Wl} f(W[L]^{(t)}) , Wl^{(0)} ∈ ℝ^{r_{l−1}×r_l} , 1 ≤ l ≤ L , (5)\n\nwhere we use the notation W[L]^{(t)} := (W1^{(t)}, . . . , WL^{(t)}). The quantity η is usually called the step-size. In order to get intuitions on the discrete dynamics, we also consider its continuous counterpart,\n\nẆl(t) = −∇_{Wl} f(W[L](t)) , Wl(0) ∈ ℝ^{r_{l−1}×r_l} , 1 ≤ l ≤ L , (6)\n\nwhere, for 1 ≤ l ≤ L, Ẇl(t) is the temporal derivative of Wl(t). Note that there is no step-size in the continuous time dynamics since it actually corresponds to the limit of (5) when η → 0. The continuous dynamics may be more convenient to study because such differential equations may have closed form solutions. In §3.1, we will see that under reasonable assumptions this is the case for (6).\n\n\f3.1 Continuous dynamics\n\nLinear model: L = 1. 
We start with the study of the continuous linear model; its gradient is\n\n∇f(W) = ΣxW − Σxy , (7)\n\nwhere Σxy := (1/n) X^⊤Y and Σx := (1/n) X^⊤X. Thus, W(t) is the solution of the differential equation\n\nẆ(t) = Σxy − ΣxW(t) , W(0) = W0 . (8)\n\nProposition 1. For any W0 ∈ ℝ^{d×p}, the solution to the linear differential equation (8) is\n\nW(t) = e^{−tΣx}(W0 − Σx^†Σxy) + Σx^†Σxy , (9)\n\nwhere Σx^† is the pseudoinverse of Σx.\nThis standard result on ODEs is provided in §B.1. Note that when W0 → 0 we have\n\nW(t) → (Id − e^{−tΣx})Σx^†Σxy as W0 → 0 . (10)\n\nDeep linear network: L ≥ 2. The study of the deep linear model is more challenging since, for L ≥ 2, the landscape of the objective function f is non-convex. The gradient of (2) is\n\n∇_{Wl} f(W[L]) = W_{1:l−1}^⊤(ΣxW − Σxy)W_{l+1:L}^⊤ , where W_{i:j} := Wi · · · Wj , 1 ≤ l ≤ L , (11)\n\nwhere we used the convention that W_{1:0} = Id and W_{L+1:L} = Ip. Thus (6) becomes\n\nẆl(t) = W_{1:l−1}(t)^⊤(Σxy − ΣxW(t))W_{l+1:L}(t)^⊤ , Wl(0) ∈ ℝ^{r_{l−1}×r_l} , 1 ≤ l ≤ L . (12)\n\nWe obtain a coupled differential equation (12) that is harder to solve than the previous linear differential equation (8) due, at the same time, to its non-linear components and to the coupling between the Wl , 1 ≤ l ≤ L. However, in the case L = 2, Saxe et al. [2018] managed to find an explicit solution to this coupled differential equation under the assumption that “perceptual correlation is minimal” (Σx = Id).² In this work we extend Saxe et al. [2018, Eq. 7] (for L = 2) under weaker assumptions. 
More precisely, we do not require the covariance matrix Σx to be the identity matrix. Let (U, V, D) be the SVD of Σxy; our assumption is the following:\nAssumption 1. There exist two orthogonal matrices U, V such that we have the joint decomposition\n\nΣx = U(Dx + B)U^⊤ and Σxy = U Dxy V^⊤ , (13)\n\nwhere B is such that ‖B‖2 ≤ ε, and Dx, Dxy are matrices with only diagonal coefficients. We note σ1 ≥ · · · ≥ σrxy > 0 the singular values of Σxy and λ1, . . . , λrx the diagonal entries of Dx.\nSince two matrices commute if and only if they are co-diagonalizable [Horn et al., 1985, Thm. 1.3.21], the quantity ε represents to what extent Σx and ΣxyΣxy^⊤ do not commute. Before solving (12) under Assump. 1, we describe some motivating examples where the quantity ε is small or zero:\n\n• Linear autoencoder: If Y is set to X and L = 2, we recover a linear autoencoder: x̂(x) = W2^⊤W1^⊤x, where h := W1^⊤x is the encoded representation of x. In that case,\n\nΣxyΣxy^⊤ = ((1/n) X^⊤X)^2 = Σx^2 . Thus, B = 0 . (14)\n\nNote that this linear autoencoder can also be interpreted as a form of principal component analysis. Actually, if we initialize with W1 = W2^⊤, the gradient dynamics exactly recovers the PCA of X, which is closely related to the matrix factorization problem of Gunasekar et al. [2017]. See §A where this derivation is detailed.\n• Deep linear multi-class prediction: In that case, p is the number of classes and yi is a one-hot encoding of the class with, in practice, p ≪ d. 
The intuition on why we may expect ‖B‖2 to be small is that rank(Y) ≪ rank(X), and thus the matrices of interest only have to almost commute on a small subspace in comparison to the whole space, so B should be close to 0. We verify this intuition by computing ‖B‖2 for several classification datasets in Table 1.\n• Minimal influence of perceptual correlation: Σx ≈ Id. This is the setting discussed by Saxe et al. [2018]. We compare this assumption with our Assump. 1 on some classification datasets in §4.1.\n\n²By a rescaling of the data, their proof is valid for any matrix Σx proportional to the identity matrix.\n\nAn explicit solution for L = 2. Under Assump. 1 and specifying the initialization, one can solve the matrix differential equation for ε = 0 and then use perturbation analysis to assess how close the solution of (12) is to the closed form solution derived for ε = 0. This result is summarized in the following theorem, proved in §B.2.\nTheorem 1. When L = 2, under Assump. 1, if we initialize with W1(0) = U diag(e^{−δ1}, . . . , e^{−δp})Q and W2(0) = Q^{−1} diag(e^{−δ1}, . . . , e^{−δd})V^⊤, where Q is an arbitrary invertible matrix, then the solution of (12) can be decomposed as the sum of the solution for ε = 0 and a perturbation term,\n\nW1(t) = W1^0(t) + W1^ε(t) where W1^0(t) := U diag(√w1(t), . . . , √wp(t))Q\nW2(t) = W2^0(t) + W2^ε(t) where W2^0(t) := Q^{−1} diag(√w1(t), . . .
 , √wd(t))V^⊤ , (15)\n\nand where we have c > 0 such that ‖Wi^ε(t)‖ ≤ ε · e^{ct²}, and\n\nwi(t) = σi e^{2σit−2δi} / (λi(e^{2σit−2δi} − e^{−2δi}) + σi) , 1 ≤ i ≤ rxy , wi(t) = e^{−2δi} / (1 + 2e^{−2δi}λit) , rxy < i ≤ rx , (16)\n\nwhere (σi) and (λi) are defined in Assump. 1. Note that rank(Σxy) := rxy ≤ rank(Σx) := rx.\nThe main difficulty in this result is the perturbation analysis, for which we use a consequence of Grönwall’s inequality [Gronwall, 1919] (Lemma 4). The proof can be sketched in three parts: first showing the result for ε = 0, then showing that in the case ε > 0 the matrices W1(t)/t and W2(t)/t are bounded, and finally using Lemma 4 to get the perturbation bound.\nThis result is more general than the one provided by Saxe et al. [2018] because it requires a weaker assumption than Σx = Id and ε = 0. In doing so, we obtain a result that takes into account the influence of correlations of the input samples. Note that Thm. 1 is only valid if the initialization W1(0)W2(0) has the same singular vectors as Σxy. However, making such assumptions on the initialization is standard in the literature and, in practice, we can set the initialization of the optimization algorithm in order to also ensure that property. For instance, in the case of the linear autoencoder, one can set W1(0) = W2(0) = e^{−δ}Id.\nIn the following subsection we will use Thm. 1 to show that the components [U]i , 1 ≤ i ≤ rxy, in the order defined by the decreasing singular values of Σxy, are learned sequentially by the gradient dynamics.\n\nSequential learning of components. 
The sequential learning of the left singular vectors of Σxy (sorted by the magnitude of the singular values) by the continuous gradient dynamics of deep linear networks has been highlighted by Saxe et al. [2018]. They note in their Eq. (10) that the ith phase transition happens approximately after a time Ti defined as (using our notation)\n\nTi := (δi/σi) ln(σi) where Σxy = Σ_{i=1}^{rxy} σi ui vi^⊤ . (17)\n\nThey argue that as δi → ∞, the time Ti is roughly O(1/σi). The intuition is that a vanishing initialization increases the gap between the phase transition times Ti and thus tends to separate the learning of each component. However, a vanishing initialization formally just leads to Ti → ∞.\nIn this work, we introduce a notion of time rescaling in order to formalize this notion of phase transition, and we show that, after this time rescaling, the point visited between two phase transitions is the solution of a low rank regularized version (4) of the initial problem (2), with a low rank constraint that loosens sequentially.\nThe intuition behind the time rescaling is that it counterbalances the vanishing initialization in (17): since Ti grows as fast as δi, we need to multiply the time by δi in order to grow at the same pace as Ti. Using this rescaling we can present our theorem, proved in §B.3, which says that a vanishing initialization tends to force the sequential learning of the components of X associated with the largest singular values of Σxy. Note that we need to rescale the time uniformly for each component; that is why in the following we set δi = δ , 1 ≤ i ≤ max(p, d).\nTheorem 2. Let wi(t) denote the values defined in (16). 
If wi(0) = e^{−δ} , 1 ≤ i ≤ r, and ε = e^{−δ²} ln(δ), then wi(δt) converges to a step function as δ → ∞:\n\nwi(δt) → σi/(λi + σi) · 1{t = Ti} + (σi/λi) · 1{t > Ti} as δ → ∞ , (18)\n\nwhere Ti := 1/σi, and 1{t ∈ A} = 1 if t ∈ A and 0 otherwise.\n\nNotice how the ith components of W1 and W2 are inactive, i.e., wi(t) is zero, for small t, and are suddenly learned when t reaches the phase transition time Ti := 1/σi. As shown in Prop. 1 and illustrated in Fig. 1, this sequential learning behavior does not occur for the non-factorized formulation. Gunasekar et al. [2017] observed similar differences between their factorized and non-factorized formulations of matrix regression. Note that the time rescaling we introduced is t → δt, in order to compensate for the vanishing initialization; rescaling the time and taking the limit this way for (8) would lead to a constant function.\nGunasekar et al. [2017] also had to consider a vanishing initialization in order to show that, on a simple matrix factorization problem, the continuous dynamics of gradient descent does converge to the minimum nuclear norm solution. This assumption is necessary in such proofs in order to avoid initializing with the wrong components. However, one cannot initialize with the null matrix since it is a stationary point of the dynamics; that is why this notion of double limit (vanishing initialization and t → ∞) is used.\nFrom Thm. 2, two corollaries follow directly. The first one regards the nuclear norm of the product W1(δt)W2(δt). This corollary says that ‖W1(δt)W2(δt)‖∗ is a step function and that each increment of this integer value corresponds to the learning of a new component of X. 
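The double limit in Thm. 2 can be probed numerically from the ε = 0 closed form (16): for a fixed rescaled time t, the mode value wi(δt) collapses to 0 before Ti = 1/σi and to σi/λi after it as δ grows. A small numpy sketch with a single illustrative mode (σi = 0.5, λi = 1, so Ti = 2; all numerical choices are ours, and the formula follows our reading of (16)):

```python
import numpy as np

def w_closed_form(t, sigma, lam, delta):
    # epsilon = 0 mode amplitude of Thm. 1 with uniform initialization exp(-delta)
    x = np.exp(2.0 * sigma * t - 2.0 * delta)
    return sigma * x / (lam * (x - np.exp(-2.0 * delta)) + sigma)

sigma, lam = 0.5, 1.0                 # phase transition at T_i = 1/sigma = 2
for delta in (5.0, 20.0, 80.0):
    before = w_closed_form(delta * 1.0, sigma, lam, delta)  # rescaled t = 1 < T_i
    after = w_closed_form(delta * 3.0, sigma, lam, delta)   # rescaled t = 3 > T_i
    print(f"delta={delta}: w(delta*1)={before:.2e}, w(delta*3)={after:.4f}")
```

As δ increases, the value before the transition shrinks to 0 while the value after approaches σi/λi = 0.5, the step function of (18).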
These components are learned in order of relevance, i.e., in order of magnitude of their respective singular values, and the learning of a new component can be easily noticed by an incremental gap in the nuclear norm of the matrix product W1(δt)W2(δt).\nCorollary 1. Let W1(t) and W2(t) be the matrix solutions of (12) defined in (15). The limit of the squared euclidean norm of W1(δt)W2(δt) when δ → ∞ is a step function defined as\n\n‖W1(δt)W2(δt)‖2^2 → Σ_{i=1}^{rxy} (σi²/λi²) · 1{Ti < t} + (σi²/(λi + σi)²) · 1{Ti = t} as δ → ∞ , (19)\n\nwhere Ti := 1/σi and σ1 > · · · > σrxy > 0 are the positive singular values of Σxy.\nIt is natural to look at the norm of the product W1(δt)W2(δt) since, in Thm. 2, the (wi(t)) are its singular values. However, since the rank of W1(δt)W2(δt) increases discontinuously after each phase transition, any norm would jump with the rank increments. We illustrate these jumps in Fig. 1, where we plot the closed form of the squared ℓ2 norms of t ↦ W(δt) and t ↦ W1(δt)W2(δt) for vanishing initializations with Σyx = diag(10^{−1}, 10^{−2}, 10^{−3}) and Σx = Id.\nFrom Thm. 2, we can notice that, between times Tk and Tk+1, the rank of the limit matrix W1W2 is actually equal to k, meaning that at each phase transition the rank of W1W2 increases by 1. Moreover, this matrix product contains the k components of X corresponding to the k largest singular values of Σxy. Thus, we can show that this matrix product is the solution of the rank-k constrained version (4) of the initial problem (2).\nCorollary 2. Let W1(t) and W2(t) be the matrix solutions of (12) defined in (15). 
We have that,\n\n1/σk < t < 1/σk+1 ⇒ W1(δt)W2(δt) → W^{k,∗} as δ → ∞ , 1 ≤ k ≤ rxy , (20)\n\nwhere W^{k,∗} is the minimum ℓ2 norm solution of the reduced-rank-k regression problem (4).\n\nFigure 1: Closed form solution of the squared ℓ2 norm of W(δt) and W1(δt)W2(δt), respectively for a linear model and a two-layer linear autoencoder, depending on W(0) = W1(0)W2(0) = e^{−δ}Id. Note that for an autoencoder λi = σi and thus the trace norm has integer values. According to Thm. 2, each integer trace-norm increment represents the learning of a new component.\n\n\f3.2 Discrete dynamics\n\nWe are interested in the behavior of optimization methods; thus, the gradient dynamics of interest is the discrete one (5). A major contribution of our work is thus contained in this section. The continuous case studied in §3.1 provided good intuitions and insights on the behavior of the discrete dynamics that we can use for our analysis.\n\nWhy the discrete analysis is challenging. Previous related work [Gunasekar et al., 2017, Saxe et al., 2018] only provides results on the continuous dynamics. Their proofs use the fact that their respective continuous dynamics of interest have a closed form solution (e.g., Thm. 1). To our knowledge, no closed form solution is known for the discrete dynamics (5); thus its analysis requires a new proof technique. Moreover, using Euler integration methods, one can show that both dynamics are close, but only for a vanishing step size depending on a finite horizon. Such a dependence on the horizon is problematic since the time rescaling used in Thm. 2 would make any finite horizon go to infinity. 
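The sequential behavior also survives discretization and can simply be observed by running the discrete updates (5) for L = 2: with a vanishing initialization, the singular values of W1W2 switch on one after the other, largest first. A self-contained numpy sketch (the synthetic data, hidden width, and hyper-parameters are our illustrative choices, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)
# Target linear map with well-separated singular values 1.0 > 0.3 > 0.1.
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
S = np.zeros((6, 4))
S[0, 0], S[1, 1], S[2, 2] = 1.0, 0.3, 0.1
W_target = U @ S @ V.T

n = 2000
X = rng.standard_normal((n, 6))
Y = X @ W_target
Sx, Sxy = X.T @ X / n, X.T @ Y / n     # covariances of the loss (2)

eta, delta, r = 0.05, 8.0, 5           # step size, init scale, hidden width
W1 = np.exp(-delta) * rng.standard_normal((6, r))
W2 = np.exp(-delta) * rng.standard_normal((r, 4))

mid_svals = None
for t in range(6000):
    E = Sx @ W1 @ W2 - Sxy             # gradient factor of (2)
    W1, W2 = W1 - eta * E @ W2.T, W2 - eta * W1.T @ E   # simultaneous step (5)
    if t == 800:                       # between the 2nd and 3rd transitions
        mid_svals = np.linalg.svd(W1 @ W2, compute_uv=False)

print(np.round(mid_svals, 4))          # first two modes on, third still small
print(np.round(np.linalg.svd(W1 @ W2, compute_uv=False), 4))
```

Between the second and third phase transitions (here around iteration 800), the product is numerically a rank-2 matrix, i.e., a reduced-rank-2 regression solution, in line with Corollary 2; at the end, all three modes are learned.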
In this section, we consider realistic conditions on the step-size (namely, it has to be smaller than the inverse of the Lipschitz constant and than some notion of eigen-gap) without any dependence on the horizon. Such an assumption is relevant since we want to study the dynamics of practical optimization algorithms (i.e., with a step size as large as possible and without knowing the horizon in advance).\n\nSingle layer linear model. In this paragraph, we consider the discrete update for the linear model. Since L = 1, for notational compactness, we call Wt the matrix updated according to (5). Using the gradient derivation (7), the discrete update scheme for the linear model is\n\nWt+1 = Wt − η(ΣxWt − Σxy) = (Id − ηΣx)Wt + ηΣxy .\n\nNoticing that for 1/λmax(Σx) > η > 0 the matrix Id − ηΣx is invertible, this recursion (see §B.4) leads to\n\nWt = (W0 − Σx^†Σxy)(Id − ηΣx)^t + Σx^†Σxy . (21)\n\nWe obtain a result similar to the solution of the differential equation given in Prop. 1: with a vanishing initialization we reach a function that does not sequentially learn the components.\n\nTwo-layer linear model. The discrete update scheme for the two-layer linear network (2) is\n\nW1^{(t+1)} = W1^{(t)} − η(ΣxW^{(t)} − Σxy)(W2^{(t)})^⊤ , W2^{(t+1)} = W2^{(t)} − η(W1^{(t)})^⊤(ΣxW^{(t)} − Σxy) .\n\nWhen ε = 0, by a change of basis and a proper initialization, we can reduce the study of this matrix equation to r independent dynamics (see §B.5 for more details): for 1 ≤ i ≤ r,\n\nwi^{(t+1)} = wi^{(t)} + η wi^{(t)}(σi − λi wi^{(t)} wi^{(t)}) . (22)\n\nThus we can derive a bound on the iterates wi^{(t)}, leading to the following theorem.\nTheorem 3. Under the same assumptions as Thm. 
1 and ε = 0, we have\n\nW1^{(t)} = U diag(√w1^{(t)}, . . . , √wp^{(t)})Q and W2^{(t)} = Q^{−1} diag(√w1^{(t)}, . . . , √wd^{(t)})V^⊤ .\n\nMoreover, for any 1 ≤ i ≤ rxy, if 1 > wi^{(0)} > 0 and 2ησi < 1, then for all t ≥ 0 and 1 ≤ i ≤ rx we have\n\nσi wi^{(0)} / ((σi − λi wi^{(0)}) e^{(−2ησi+4η²σi²)t} + wi^{(0)}λi) ≤ wi^{(t)} ≤ σi wi^{(0)} / ((σi − λi wi^{(0)}) e^{(−2ησi−η²σi²)t} + wi^{(0)}λi) , (23)\n\nand wi^{(t)} ≤ wi^{(0)} / (1 + wi^{(0)}λiηt) for rxy < i ≤ rx. The differences with the continuous case (16) are in red.\n\nProof sketch. The solution of the continuous dynamics suggests that directly studying the sequence wi^{(t)} might be quite challenging, since the solution of the continuous dynamics wi(t)^{−1} has a non-linear and non-convex behavior. The main insight of this proof is that one can treat the discrete case using the right transformation, to show that some sequence does converge linearly.\n\n\fDataset Δxy Δx\nMNIST 2.8 × 10^{−2} .70\nCIFAR-10 3.0 × 10^{−2} .68\nImageNet 1.7 × 10^{−1} .70\nTable 1: Value of the quantities Δxy and Δx defined in (27).\n\nFigure 2: Trace norm and reconstruction errors of W(t) for L = 1 and 2 as a function of t.\n\nThm. 2 indicates that the quantity wi(t)^{−1} − λi/σi is the good candidate to show linear convergence to 0:\n\nwi(t)^{−1} − λi/σi = (wi(0)^{−1} − λi/σi) e^{−2ησit} . (24)\n\nWhat we can expect is thus to show that the sequence (wi^{(t)}) has similar properties. The first step of the proof is to show that (wi^{(t)}) is an increasing sequence smaller than one. 
The second step is then to use (22) to get

    (w_i^{(t+1)})^{−1} = (w_i^{(t)})^{−1} · 1 / (1 + 2η(σ_i − λ_i w_i^{(t)}) + η²(σ_i − λ_i w_i^{(t)})²) .

Then, using that 1 − x ≤ 1/(1 + x) ≤ 1 − x + x² for any 0 ≤ x ≤ 1, we can derive the upper and lower bounds on the linear convergence rate. See §B.5 for the full proof.

In order to get a similar interpretation of Thm. 3 in terms of implicit regularization, we use the intuitions from Thm. 2. The analogy between continuous and discrete time is that the discrete dynamics performs t time-steps of size η, meaning that W(ηt) ≈ W_t; the time rescaling, which in continuous time consists in multiplying the time by δ, thus yields the analog phase transition time

    ηT_i := 1/σ_i  ⟹  T_i := 1/(ησ_i) .    (25)

Recall that we assumed that m_i^{(0)} = n_i^{(0)} = e^{−δ}. Thus, we show that the i-th component is learned around time T_i, and consequently that the components are learned sequentially.

Corollary 3. If η < 1/(2σ_1), η < (σ_i − σ_{i+1})/(2σ_i²) and η < 2(σ_i − σ_{i+1})/σ_{i+1}² for 1 ≤ i ≤ r_xy − 1, then for 1 ≤ i < r_x,

    w_i^{(δT_j)}  →_{δ→∞}  { 0          if j < i or i > r_xy ,
                             σ_i/λ_i    if i ≤ r_xy and j > i ,    (26)

where T_0 := 0, T_j := 1/(σ_j η) for 1 ≤ j ≤ r_xy, and T_j := +∞ if j > r_xy.

This result is proved in §B.5. The quantities (σ_i − σ_{i+1})/σ_i² and (σ_i − σ_{i+1})/σ_{i+1}² can be interpreted as relative eigen-gaps.
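The phase-transition picture of Cor. 3 can be illustrated with a short simulation; the sketch below uses our own assumptions (a hypothetical spectrum σ = (4, 2, 1), λ_i = 1 as when Σ_x ≈ I_d, and the squared form of the per-component recursion, initialized at w_i^{(0)} = e^{−2δ} so that each factor starts at e^{−δ}):

```python
import numpy as np

# Sketch of the sequential-learning claim of Cor. 3 (hypothetical values):
# component i crosses from ~0 to ~sigma_i/lam_i around iteration
# T_i = 1/(eta*sigma_i); larger delta makes the transitions sharper.
sigmas = np.array([4.0, 2.0, 1.0])    # sigma_1 > sigma_2 > sigma_3
lam = 1.0                             # Sigma_x ~ Id, so lambda_i ~ 1
eta, delta = 0.01, 10.0
T = 1.0 / (eta * sigmas)              # predicted transition times T_i
w = np.full(3, np.exp(-2.0 * delta))  # vanishing initialization e**(-2*delta)

snap = {}
for t in range(1, int(delta * T[-1]) + 1):
    # squared-factor recursion for each component (an assumption of ours)
    w = w * (1.0 + eta * (sigmas - lam * w)) ** 2
    for j, Tj in enumerate(T):
        if t == int(delta * Tj):
            snap[j] = w.copy()        # state at time delta * T_j

# At delta*T_2 the first component is learned (w_1 ~ sigma_1/lam_1) while
# the third is still ~0; the second sits at its own transition.
print(np.round(snap[1], 4))
```

The printout exhibits exactly the hierarchy of (26): components with j < i are still vanishing at time δT_j, components with j > i have reached their fixed point σ_i/λ_i.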
Note that they are well defined since we assumed that the eigenspaces are unidimensional. The intuition behind this condition is that the step-size cannot be larger than the relative eigen-gaps: otherwise, the discrete optimization algorithm would not be able to distinguish some components.

4 Experiments

4.1 Assump. 1 for Classification Datasets

In this section we verify to what extent Assump. 1 holds on standard classification datasets. For this, we compute the normalized quantities Δ_xy and Δ_x, which measure how much Assump. 1 and the assumption that Σ_x ≈ I_d are respectively violated. We compute B by taking U, the left singular vectors of Σ_xy, and looking at the off-diagonal coefficients of U^⊤ Σ_x U:

    Δ_xy := ||B||_2 / ||Σ_x||_2 ,    Δ_x := (1/2) ||Σ̂_x − Î_d||_2 ,    (27)

where ||·|| is the Frobenius norm and the hatted expressions are normalized, X̂ := X/||X|| and Î_d := I_d/||I_d||. These normalized quantities lie between 0 and 1: the closer to 1, the less the assumption holds, and conversely, the closer to 0, the better the assumption approximately holds. We present the results on three standard classification datasets: MNIST [LeCun et al., 2010], CIFAR-10 [Krizhevsky et al., 2014] and ImageNet [Deng et al., 2009] (a down-sampled version of ImageNet with images of size 64 × 64). In Table 1, we can see that the quantities Δ_x and Δ_xy do not vary much across the datasets, and that the Δ associated with our Assump.
1 is two orders of magnitude smaller than the Δ associated with Saxe et al. [2018]'s assumption, indicating the relevance of our assumption.

4.2 Linear Autoencoder

For an autoencoder we have Y = X. We want to compare the reconstruction properties of W^{(t)}, computed through (21), and of the matrix product W_1^{(t)} W_2^{(t)}, where W_1^{(t)} and W_2^{(t)} are computed through (22). In this experiment, we have p = d = 20, n = 1000, r = 5, and we generated synthetic data. First we generate a fixed matrix B ∈ R^{d×r} such that B_{kl} ∼ U([0, 1]) for 1 ≤ k ≤ d and 1 ≤ l ≤ r. Then, for 1 ≤ i ≤ n, we sample x_i = B z_i + ε_i where z_i ∼ N(0, D := diag(4, 2, 1, 1/2, 1/4)) and ε_i ∼ 10^{−3} N(0, I_d). In Fig. 2, we plot the trace norm of W^{(t)} and W_1^{(t)} W_2^{(t)}, as well as their respective reconstruction errors, as a function of the number of iterations t,

    ||W^{(t)} − B D B^⊤||_2 .    (28)

We can see that the experimental results are very close to the theoretical behavior predicted by the continuous dynamics in Figure 1. Contrary to the dynamics induced by the linear model formulation (L = 1), the dynamics induced by the two-layer linear network (L = 2) is very close to a step function: each step corresponds to the learning of a new component; they are learned sequentially.

5 Discussion

There is a growing body of empirical and theoretical evidence that the implicit regularization induced by gradient methods is key in the training of deep neural networks. Yet, as noted by Zhang et al. [2017], even for linear models, our understanding of the origin of generalization is limited. In this work, we focus on a simple non-convex objective parametrized by a two-layer linear network. In the case of linear regression, we show that the discrete gradient dynamics also visits points that are implicitly regularized solutions of the initial optimization problem.
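The autoencoder experiment of §4.2 can be reproduced in a few lines. In the sketch below the step size, initialization scale, seed and horizon are our own choices (not taken from the paper), and we additionally check the L = 1 iterate against the closed-form solution of the linear recursion (21):

```python
import numpy as np

# Sketch of the linear autoencoder experiment (Sec. 4.2).  The data model
# follows the text (x_i = B z_i + eps_i); eta, the init scale, the seed and
# the number of steps are assumptions of ours.
rng = np.random.default_rng(0)
d, r, n = 20, 5, 1000
B = rng.uniform(0.0, 1.0, (d, r))
Z = rng.standard_normal((n, r)) * np.sqrt([4.0, 2.0, 1.0, 0.5, 0.25])
X = Z @ B.T + 1e-3 * rng.standard_normal((n, d))
Sx = X.T @ X / n           # Sigma_x; for an autoencoder Sigma_xy = Sigma_x
Sxy = Sx

eta, scale, steps = 1e-3, 1e-4, 20000
W = np.zeros((d, d))                               # linear model (L = 1)
W1, W2 = scale * np.eye(d), scale * np.eye(d)      # two layers (L = 2)
nuc = {}
for t in range(1, steps + 1):
    W = W - eta * (Sx @ W - Sxy)                   # linear model update
    G = Sx @ (W1 @ W2) - Sxy                       # shared residual
    W1, W2 = W1 - eta * G @ W2.T, W2 - eta * W1.T @ G
    if t in (600, steps):
        nuc[t] = np.linalg.norm(W1 @ W2, 'nuc')    # trace-norm snapshots

# The L = 1 iterate matches the closed-form solution of the recursion,
# W_t = W* + (I - eta*Sx)^t (W_0 - W*) with Sx W* = Sxy and W_0 = 0.
Wstar = np.linalg.solve(Sx, Sxy)
Wt = Wstar - np.linalg.matrix_power(np.eye(d) - eta * Sx, steps) @ Wstar
print(np.allclose(W, Wt, atol=1e-5), nuc[600] < nuc[steps])
```

Tracking the trace norm of W_1^{(t)} W_2^{(t)} over t reproduces the plateaus of Fig. 2 (one per learned component), while the trace norm of W^{(t)} grows smoothly.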
In that sense, in the context of machine learning, applying gradient descent to the over-parametrized model of interest provides a form of implicit regularization: it sequentially learns the hierarchical components of our problem, which could help generalization. Our setting does not pretend to solve generalization in deep neural networks; many major components of standard neural network training are omitted, such as the non-linearities, large values of L and the stochasticity of the learning procedure (SGD). Nevertheless, it provides useful insights about the source of generalization in deep learning.

Acknowledgments

This research was partially supported by the Canada CIFAR AI Chair Program, the Canada Excellence Research Chair in "Data Science for Realtime Decision-making", by the NSERC Discovery Grant RGPIN-2017-06936, by a graduate Borealis AI fellowship and by a Google Focused Research award.

References

M. S. Advani and A. M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

N. Berglund. Perturbation theory of dynamical systems. arXiv preprint math/0111178, 2001.

E. A. Coddington and N. Levinson. Theory of Ordinary Differential Equations. Tata McGraw-Hill Education, 1955.

R. T. d. Combes, M. Pezeshki, S. Shabanian, A. Courville, and Y. Bengio. On the learning dynamics of deep neural networks. arXiv preprint arXiv:1809.06848, 2018.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

T. H. Gronwall. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Annals of Mathematics, 1919.

S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In NIPS, 2017.

S. Gunasekar, J. Lee, D. Soudry, and N. Srebro.
Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 1975.

A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.

A. K. Lampinen and S. Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. In ICLR, 2019.

Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.

Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference on Learning Theory, pages 2–47, 2018.

K. Nar and S. Sastry. Step size matters in deep learning. In NeurIPS, 2018.

B. Neyshabur. Implicit Regularization in Deep Learning. PhD thesis, TTIC, 2017.

B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In NIPS, 2015a.

B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR, 2015b.

B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.

G. C. Reinsel and R. Velu. Multivariate Reduced-Rank Regression: Theory and Applications. Springer Science & Business Media, 1998.

A. M. Saxe, J. L. McClelland, and S. Ganguli. Learning hierarchical categories in deep neural networks. In Proceedings of the Annual Meeting of the Cognitive Science Society, 2013.

A. M. Saxe, J. L. McClelland, and S. Ganguli.
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.

A. M. Saxe, J. L. McClelland, and S. Ganguli. A mathematical theory of semantic development in deep neural networks. arXiv preprint arXiv:1810.10531, 2018.

D. Soudry, E. Hoffer, and N. Srebro. The implicit bias of gradient descent on separable data. In ICLR, 2018.

A. Uschmajew and B. Vandereycken. On critical points of quadratic low-rank matrix optimization problems. Tech. report (submitted), July 2018.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.