{"title": "Implicit Regularization in Deep Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 7413, "page_last": 7424, "abstract": "Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low \"complexity.\" We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.", "full_text": "Implicit Regularization in Deep Matrix Factorization\n\nSanjeev Arora\n\nPrinceton University and Institute for Advanced Study\n\narora@cs.princeton.edu\n\nWei Hu\n\nPrinceton University\n\nhuwei@cs.princeton.edu\n\nNadav Cohen\n\nTel Aviv University\n\ncohennadav@cs.tau.ac.il\n\nYuping Luo\n\nPrinceton University\n\nyupingl@cs.princeton.edu\n\nAbstract\n\nEfforts to understand the generalization mystery in deep learning have led to the\nbelief that gradient-based optimization induces a form of implicit regularization, a\nbias towards models of low \u201ccomplexity.\u201d We study the implicit regularization of\ngradient descent over deep linear neural networks for matrix completion and sens-\ning, a model referred to as deep matrix factorization. 
Our first finding, supported by
theory and experiments, is that adding depth to a matrix factorization enhances an
implicit tendency towards low-rank solutions, oftentimes leading to more accurate
recovery. Secondly, we present theoretical and empirical arguments questioning
a nascent view by which implicit regularization in matrix factorization can be
captured using simple mathematical norms. Our results point to the possibility that
the language of standard regularizers may not be rich enough to fully encompass
the implicit regularization brought forth by gradient-based optimization.

1 Introduction

It is a mystery how deep neural networks generalize despite having far more learnable parameters than
training examples. Explicit regularization techniques alone cannot account for this generalization,
as they do not prevent the networks from being able to fit random data (see [52]). A view by which
gradient-based optimization induces an implicit regularization has thus arisen. Of course, this view
would be uninsightful if "implicit regularization" were treated as synonymous with "promoting gen-
eralization" — the question is whether we can characterize the implicit regularization independently
of any validation data. Notably, the mere use of the term "regularization" already predisposes us
towards characterizations based on known explicit regularizers (e.g. a constraint on some norm of the
parameters), but one must also be open to the possibility that something else is afoot.
An old argument (cf. [25, 29]) traces implicit regularization in deep learning to beneficial effects of
noise introduced by small-batch stochastic optimization. The feeling is that solutions that do not
generalize correspond to "sharp minima," and added noise prevents convergence to such solutions.
However, recent evidence (e.g. 
[26, 51]) suggests that deterministic (or near-deterministic) gradient-
based algorithms can also generalize, and thus a different explanation is in order.
A major hurdle in this study is that implicit regularization in deep learning seems to kick in only
with certain types of data (not with random data for example), and we lack mathematical tools for
reasoning about real-life data. Thus one needs a simple test-bed for the investigation, where data
admits a crisp mathematical formulation. Following earlier works, we focus on the problem of
matrix completion: given a randomly chosen subset of entries from an unknown matrix W∗, the
task is to recover the unseen entries. To cast this as a prediction problem, we may view each entry
in W∗ as a data point: observed entries constitute the training set, and the average reconstruction
error over the unobserved entries is the test error, quantifying generalization. Fitting the observed
entries is obviously an underdetermined problem with multiple solutions. However, an extensive
body of work (see [11] for a survey) has shown that if W∗ is low-rank, certain technical assumptions
(e.g. "incoherence") are satisfied and sufficiently many entries are observed, then various algorithms
can achieve approximate or even exact recovery. Of these, a well-known method based upon convex
optimization finds the minimal nuclear norm matrix among those fitting all observed entries (see [9]).1

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Matrix completion via gradient descent over deep matrix factorizations. Left (respectively, right) plot
shows reconstruction errors for matrix factorizations of depths 2, 3 and 4, when applied to the completion of a
random rank-5 (respectively, rank-10) matrix with size 100 × 100. x-axis stands for the number of observed
entries (randomly chosen), y-axis represents reconstruction error, and error bars (indiscernible) mark standard
deviations of the results over multiple trials. All matrix factorizations are full-dimensional, i.e. have hidden
dimensions 100. Both learning rate and standard deviation of (random, zero-centered) initialization for gradient
descent were set to the small value 10^{-3}. Notice, with few observed entries factorizations of depths 3 and 4
significantly outperform that of depth 2, whereas with more entries all factorizations perform well. For further
details, and a similar experiment on matrix sensing tasks, see Appendix E.

One may try to solve matrix completion using shallow neural networks. A natural approach, matrix
factorization, boils down to parameterizing the solution as a product of two matrices — W =
W2W1 — and optimizing the resulting (non-convex) objective for fitting observed entries. Formally,
this can be viewed as training a depth-2 linear neural network. It is possible to explicitly constrain the
rank of the produced solution by limiting the shared dimension of W1 and W2. However, in practice,
even when the rank is unconstrained, running gradient descent with small learning rate (step size)
and initialization close to zero tends to produce low-rank solutions, and thus allows accurate recovery
if W∗ is low-rank. This empirical observation led Gunasekar et al. to conjecture in [20] that gradient
descent over a matrix factorization induces an implicit regularization minimizing nuclear norm:
Conjecture 1 (from [20], informally stated). 
With small enough learning rate and initialization close
enough to the origin, gradient descent on a full-dimensional matrix factorization converges to the
minimum nuclear norm solution.

Deep matrix factorization  Since standard matrix factorization can be viewed as a two-layer neural
network (with linear activations), a natural extension is to consider deeper models. A deep matrix
factorization2 of W ∈ R^{d,d′}, with hidden dimensions d_1, ..., d_{N−1} ∈ N, is the parameterization:

W = W_N W_{N−1} ··· W_1 ,    (1)

where W_j ∈ R^{d_j,d_{j−1}}, j = 1, ..., N, with d_N := d, d_0 := d′. N is referred to as the depth of the
factorization, the matrices W_1, ..., W_N as its factors, and the resulting W as the product matrix.
Could the implicit regularization of deep matrix factorizations be stronger than that of their shallow
counterpart (which Conjecture 1 equates with nuclear norm minimization)? Experiments reported
in Figure 1 suggest that this is indeed the case — depth leads to more accurate completion of a
low-rank matrix when the number of observed entries is small. Our purpose in the current paper is to
mathematically analyze this stronger form of implicit regularization. Can it be described by a matrix
norm (or quasi-norm) continuing the line of Conjecture 1, or is a paradigm shift required?

1.1 Paper overview
Review of related work is given in Appendix A (deferred to supplementary material per lack of space).
In Section 2 we investigate the potential of norms for capturing the implicit regularization in deep
matrix factorization. Surprisingly, we find that the main theoretical evidence connecting nuclear norm
and shallow (depth-2) matrix factorization — proof given in [20] for Conjecture 1 in a particular
restricted setting — extends to arbitrarily deep factorizations as well. 
This result disqualifies the
natural hypothesis by which Schatten quasi-norms replace nuclear norm as the implicit regularization
when one adds depth to a shallow matrix factorization. Instead, when interpreted through the lens
of [20], it brings forth a conjecture by which the implicit regularization is captured by nuclear norm
for any depth. Since our experiments (Figure 1) show that depth changes (enhances) the implicit
regularization, we are led to question the theoretical direction proposed in [20], and accordingly
conduct additional experiments to evaluate the validity of Conjecture 1.

1Recall that the nuclear norm (also known as trace norm) of a matrix is the sum of its singular values,
regarded as a convex relaxation of rank.
2Note that the literature includes various usages of this term — some in line with ours (e.g. [47, 53, 33]),
while others less so (e.g. [50, 16, 49]).

Typically, when the number of observed entries is sufficiently large with respect to the rank of the
matrix to recover, nuclear norm minimization yields exact recovery, and thus it is impossible to
distinguish between that and a different implicit regularization which also perfectly recovers. The
regime most interesting to evaluate is therefore that in which the number of observed entries is too
small for exact recovery by nuclear norm minimization — here there is room for different implicit
regularizations to manifest themselves by providing higher quality solutions. Our empirical results
show that in this regime, matrix factorizations consistently outperform nuclear norm minimization,
suggesting that their implicit regularization admits stronger bias towards low-rank, in contrast to
Conjecture 1. 
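As a concrete reference for the experiments described above, here is a minimal numpy sketch of matrix completion by gradient descent over a depth-N factorization. The dimensions, initialization scale, learning rate and iteration budget below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, depth = 20, 2, 3

# Ground-truth low-rank matrix (rescaled to unit spectral norm) and a random
# mask of observed entries -- the "training set" of the completion task.
W_star = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d))
W_star /= np.linalg.norm(W_star, 2)
mask = rng.random((d, d)) < 0.5

def product(fs):
    """Product matrix W = W_N ... W_1 for factors listed as [W_1, ..., W_N]."""
    P = fs[0]
    for f in fs[1:]:
        P = f @ P
    return P

# Full-dimensional depth-N factorization, small zero-centered initialization.
factors = [0.1 * rng.standard_normal((d, d)) for _ in range(depth)]
init_err = np.linalg.norm(mask * (product(factors) - W_star))

lr = 0.03
for _ in range(20000):
    G = mask * (product(factors) - W_star)  # grad of 0.5*||mask*(W - W*)||_F^2
    grads = []
    for j in range(depth):
        # d(loss)/d(W_{j+1}) = (W_N ... W_{j+2})^T G (W_j ... W_1)^T
        left = product(factors[j + 1:]) if j + 1 < depth else np.eye(d)
        right = product(factors[:j]) if j > 0 else np.eye(d)
        grads.append(left.T @ G @ right.T)
    for j in range(depth):
        factors[j] -= lr * grads[j]

train_err = np.linalg.norm(mask * (product(factors) - W_star))
```

With these (illustrative) settings the observed entries are fit to near-zero error; the question studied in the text is what happens on the unobserved ones.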
Together, our theory and experiments lead us to suspect that the implicit regularization
in matrix factorization (shallow or deep) may not be amenable to description by a simple mathematical
norm, and a detailed analysis of the dynamics in optimization may be necessary.
Section 3 carries out such an analysis, characterizing how the singular value decomposition of the
learned solution evolves during gradient descent. Evolution rates of singular values turn out to be
proportional to their size exponentiated by 2 − 2/N, where N is the depth of the factorization. This
establishes a tendency towards low rank solutions, which intensifies with depth. Experiments validate
the findings, demonstrating the dynamic nature of implicit regularization in deep matrix factorization.
We believe the trajectories traversed in optimization may be key to understanding generalization in
deep learning, and hope that our work will inspire further progress along this line.

2 Can the implicit regularization be captured by norms?
In this section we investigate the possibility of extending Conjecture 1 for explaining implicit
regularization in deep matrix factorization. Given the experimental evidence in Figure 1, one may
hypothesize that gradient descent on a depth-N matrix factorization implicitly minimizes some
norm (or quasi-norm) that approximates rank, with the approximation being more accurate the
larger N is. For example, a natural candidate would be Schatten-p quasi-norm to the power of p
(0 < p ≤ 1), which for a matrix W ∈ R^{d,d′} is defined as:

‖W‖_{S_p}^p := Σ_{r=1}^{min{d,d′}} σ_r^p(W) ,

where σ_1(W), ..., σ_{min{d,d′}}(W) are the singular values of W. For p = 1 this reduces to nuclear norm,
which by Conjecture 1 corresponds to a depth-2 factorization. 
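For reference, the Schatten-p quasi-norm (to the power p) just defined is straightforward to compute from a singular value decomposition; the diagonal matrix below is an arbitrary illustration:

```python
import numpy as np

def schatten_p(W, p):
    """||W||_{S_p}^p = sum_r sigma_r(W)^p; p = 1 gives the nuclear norm."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(sigma ** p))

W = np.diag([9.0, 4.0])               # arbitrary full-rank 2x2 example
nuclear = schatten_p(W, 1.0)          # 9 + 4 = 13
sqrt_schatten = schatten_p(W, 0.5)    # 3 + 2 = 5
near_rank = schatten_p(W, 0.01)       # ~2.04, approaching rank(W) = 2
```

As p shrinks, the quantity approaches the rank, which is the sense in which smaller p gives a tighter rank surrogate.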
As p approaches zero we obtain a
closer approximation of rank(W), which could be suitable for factorizations of higher depths.
We will focus in this section on matrix sensing — a more general problem than matrix completion.
Here, we are given m measurement matrices A_1, ..., A_m, with corresponding labels y_1, ..., y_m
generated by y_i = ⟨A_i, W∗⟩, and our goal is to reconstruct the unknown matrix W∗. As in the
case of matrix completion, well-known methods, and in particular nuclear norm minimization, can
recover W∗ if it is low-rank, certain technical conditions are met, and sufficiently many observa-
tions are given (see [42]).

2.1 Current theory does not distinguish depth-N from depth-2
Our first result is that the theory developed by [20] to support Conjecture 1 can be generalized to
suggest that nuclear norm captures the implicit regularization in matrix factorization not just for
depth 2, but for arbitrary depth. This is of course inconsistent with the experimental findings reported
in Figure 1. We will first recall the existing theory, and then show how to extend it.
[20] studied implicit regularization in shallow (depth-2) matrix factorization by considering recovery
of a positive semidefinite matrix from sensing via symmetric measurements, namely:

min_{W∈S^d_+} ℓ(W) := (1/2) Σ_{i=1}^m (y_i − ⟨A_i, W⟩)² ,    (2)

where A_1, ..., A_m ∈ R^{d,d} are symmetric and linearly independent, and S^d_+ stands for the set of (sym-
metric and) positive semidefinite matrices in R^{d,d}. Focusing on the underdetermined regime m ≪ d²,
they investigated the implicit bias brought forth by running gradient flow (gradient descent with
infinitesimally small learning rate) on a symmetric full-rank matrix factorization, i.e. 
on the objective:

ψ : R^{d,d} → R_{≥0} , ψ(Z) := ℓ(ZZ⊤) = (1/2) Σ_{i=1}^m (y_i − ⟨A_i, ZZ⊤⟩)² .

For α > 0, denote by W_{sha,∞}(α) (sha here stands for "shallow") the final solution ZZ⊤ obtained
from running gradient flow on ψ(·) with initialization αI (α times identity). Formally, W_{sha,∞}(α) :=
lim_{t→∞} Z(t)Z(t)⊤ where Z(0) = αI and Ż(t) = −(dψ/dZ)(Z(t)) for t ∈ R_{≥0} (t here is a continuous
time index, and Ż(t) stands for the derivative of Z(t) with respect to time). Letting ‖·‖_* represent
matrix nuclear norm, the following result was proven by [20]:
Theorem 1 (adaptation of Theorem 1 in [20]). Assume the measurement matrices A_1, ..., A_m
commute. Then, if W̄_sha := lim_{α→0} W_{sha,∞}(α) exists and is a global optimum for Equation (2)
with ℓ(W̄_sha) = 0, it holds that W̄_sha ∈ argmin_{W∈S^d_+, ℓ(W)=0} ‖W‖_* , i.e. W̄_sha is a global optimum
with minimal nuclear norm.3

Motivated by Theorem 1 and empirical evidence they provided, [20] raised Conjecture 1, which, for-
mally stated, hypothesizes that the condition in Theorem 1 of {A_i}_{i=1}^m commuting is unnecessary, and
an identical statement holds for arbitrary (symmetric linearly independent) measurement matrices.4
While the analysis of [20] covers only symmetric matrix factorizations of the form ZZ⊤, they noted
that it can be extended to also account for asymmetric factorizations of the type considered in the
current paper. 
Specifically, running gradient flow on the objective:

φ(W_1, W_2) := ℓ(W_2W_1) = (1/2) Σ_{i=1}^m (y_i − ⟨A_i, W_2W_1⟩)² ,

with W_1, W_2 ∈ R^{d,d} initialized to αI, α > 0, and denoting by W_{sha,∞}(α) the product matrix
obtained at the end of optimization (i.e. W_{sha,∞}(α) := lim_{t→∞} W_2(t)W_1(t) where W_j(0) = αI and
Ẇ_j(t) = −(∂φ/∂W_j)(W_1(t), W_2(t)) for t ∈ R_{≥0}), Theorem 1 holds exactly as stated. For completeness,
we provide a proof of this fact in Appendix D.

Next, we show that Theorem 1 — the main theoretical justification for the connection between nuclear
norm and shallow matrix factorization — extends to arbitrarily deep factorizations as well. Consider
gradient flow over the objective:

φ(W_1, ..., W_N) := ℓ(W_N W_{N−1} ··· W_1) = (1/2) Σ_{i=1}^m (y_i − ⟨A_i, W_N W_{N−1} ··· W_1⟩)² ,

with W_1, ..., W_N ∈ R^{d,d} initialized to αI, α > 0. Using W_{deep,∞}(α) to denote the product
matrix obtained at the end of optimization (i.e. W_{deep,∞}(α) := lim_{t→∞} W_N(t)W_{N−1}(t) ··· W_1(t)
where W_j(0) = αI and Ẇ_j(t) = −(∂φ/∂W_j)(W_1(t), ..., W_N(t)) for t ∈ R_{≥0}), a result analogous to
Theorem 1 holds:
Theorem 2. Suppose N ≥ 3, and that the matrices A_1, ..., A_m commute. Then, if W̄_deep :=
lim_{α→0} W_{deep,∞}(α) exists and is a global optimum for Equation (2) with ℓ(W̄_deep) = 0, it holds that
W̄_deep ∈ argmin_{W∈S^d_+, ℓ(W)=0} ‖W‖_* , i.e. W̄_deep is a global optimum with minimal nuclear norm.5
Proof sketch (for complete proof see Appendix C.1). 
Our proof is inspired by that of Theorem 1
given in [20]. Using the expression for Ẇ(t) derived in [3] (Lemma 3 in Appendix B), it can
be shown that W(t) commutes with {A_i}_{i=1}^m, and takes on a particular form. Taking limits t → ∞
and α → 0, optimality (minimality) of nuclear norm is then established using a duality argument.
Theorem 2 provides a particular setting where the implicit regularization in deep matrix factorizations
boils down to nuclear norm minimization. By Proposition 1 below, there exist instances of this setting
for which the minimization of nuclear norm contradicts minimization (even locally) of Schatten-p
quasi-norm for any 0 < p < 1. Therefore, one cannot hope to capture the implicit regularization
in deep matrix factorizations through Schatten quasi-norms.

3The result of [20] is slightly more general — it allows gradient flow to be initialized by αO, where O is an
arbitrary orthogonal matrix, and it is shown that this leads to the exact same W_{sha,∞}(α) as one would obtain
from initializing at αI. For simplicity, we limit our discussion to the latter initialization.
4Their conjecture also relaxes the requirement from the initialization of gradient flow — an initial value
of αZ_0 is believed to suffice, where Z_0 is an arbitrary full-rank matrix (that does not depend on α).
5By Appendix C.1: W_N(t)W_{N−1}(t) ··· W_1(t) ⪰ 0 ∀t. Therefore, even though the theorem treats opti-
mization over S^d_+ using an unconstrained asymmetric factorization, gradient flow implicitly constrains the search
to S^d_+, so the assumption of W̄_deep being a global optimum for Equation (2) with ℓ(W̄_deep) = 0 is no stronger
than the analogous assumption in Theorem 1 from [20]. The implicit constraining to S^d_+ also holds when N = 2
(see Appendix D), so the asymmetric extension of Theorem 1 does not involve strengthening assumptions either.

Figure 2: Evaluation of nuclear norm as the implicit regularization in deep matrix factorization. Each plot
compares gradient descent over matrix factorizations of depths 2 and 3 (results for depth 4 were indistinguishable
from those of depth 3; we omit them to reduce clutter) against minimum nuclear norm solution and ground
truth in matrix completion tasks. Top (respectively, bottom) row corresponds to completion of a random
rank-5 (respectively, rank-10) matrix with size 100 × 100. Left, middle and right columns display (in y-axis)
reconstruction error, nuclear norm and effective rank (cf. [43]) respectively. In each plot, x-axis stands for the
number of observed entries (randomly chosen), and error bars (indiscernible) mark standard deviations of the
results over multiple trials. All matrix factorizations are full-dimensional, i.e. have hidden dimensions 100.
Both learning rate and standard deviation of (random, zero-centered) initialization for gradient descent were
initially set to 10^{-3}. Running with smaller learning rate did not yield a noticeable change in terms of final
results. Initializing with smaller standard deviation had no observable effect on results of depth 3 (and 4), but did
impact those of depth 2 — the outcomes of dividing standard deviation by 2 and by 4 are included in the plots.7
Notice, with many observed entries minimum nuclear norm solution coincides with ground truth (minimum rank
solution), and matrix factorizations of all depths converge to these. On the other hand, when there are fewer
observed entries minimum nuclear norm solution does not coincide with ground truth, and matrix factorizations
prefer to lower the effective rank at the expense of higher nuclear norm, in a manner that is more potent for
deeper factorizations. For further details, and a similar experiment on matrix sensing tasks, see Appendix E.

Instead, if we interpret Theorem 2
Instead, if we interpret Theorem 2\nthrough the lens of [20], we arrive at a conjecture by which the implicit regularization is captured\nby nuclear norm for any depth.\nProposition 1. For any dimension d \u2265 3, there exist linearly independent symmetric and commutable\nmeasurement matrices A1, . . . , Am \u2208 Rd,d, and corresponding labels y1, . . . , ym \u2208 R, such that\nthe limit solution de\ufb01ned in Theorem 2 \u2014 \u00afWdeep \u2014 which has been shown to satisfy \u00afWdeep \u2208\n+, (cid:96)(W )=0 (cid:107)W(cid:107)\u2217, is not a local minimum of the following program for any 0 < p < 1:6\nargminW\u2208S d\n\nminW\u2208S d\n\n+, (cid:96)(W )=0 (cid:107)W(cid:107)Sp\n\n.\n\nProof sketch (for complete proof see Appendix C.2). We choose A1, . . . , Am and y1, . . . , ym such\nthat: (i) \u00afWdeep = diag(1, 1, 0, . . . , 0); and (ii) adding \u0001 \u2208 (0, 1) to entries (1, 2) and (2, 1) of \u00afWdeep\nmaintains optimality. The result then follows from the fact that the addition of \u0001 decreases Schatten-p\nquasi-norm for any 0 < p < 1.\n\n2.2 Experiments challenging Conjecture 1\nSubsection 2.1 suggests that from the perspective of current theory, it is natural to apply Conjecture 1\nto matrix factorizations of arbitrary depth. On the other hand, the experiment reported in Figure 1\nimplies that depth changes (enhances) the implicit regularization. To resolve this tension we conduct\na more re\ufb01ned experiment, which ultimately puts in question the validity of Conjecture 1.\n\n6Following [20], we take for granted existence of \u00afWdeep and it being a global optimum for Equation (2)\nwith (cid:96)( \u00afWdeep) = 0. If this is not the case then Theorem 2 does not apply, and hence it obviously does not\ndisqualify minimization of Schatten quasi-norms as the implicit regularization.\n\n7As can be seen, using smaller initialization enhanced the implicit tendency of depth-2 matrix factorization\ntowards low rank. 
It is possible that this tendency can eventually match that of depth-3 (and -4), but only if
initialization size goes far below what is customary in deep learning.

Our experimental protocol is as follows. For different matrix completion tasks with varying number
of observed entries, we compare minimum nuclear norm solution to those brought forth by running
gradient descent on matrix factorizations of different depths. For each depth, we apply gradient
descent with different choices of learning rate and standard deviation for (random, zero-centered)
initialization, observing the trends as these become smaller. The outcome of the experiment is
presented in Figure 2. As can be seen, when the number of observed entries is sufficiently large
with respect to the rank of the matrix to recover, factorizations of all depths indeed admit solutions
that tend to minimum nuclear norm. However, when there are fewer entries observed — precisely the
data-poor setting where implicit regularization matters most — neither shallow (depth-2) nor deep
(depth-N with N ≥ 3) factorizations minimize nuclear norm. Instead, they put more emphasis on
lowering the effective rank (cf. [43]), in a manner which is stronger for deeper factorizations.
A close look at the experiments of [20] reveals that there too, in situations where the number of
observed entries (or sensing measurements) was small (less than required for reliable recovery), a
discernible gap appeared between the minimal nuclear norm and that returned by (gradient descent on)
a matrix factorization. In light of Figure 2, we believe that if [20] had included in its plots an accurate
surrogate for rank (e.g. effective rank or Schatten-p quasi-norm with small p), scenarios where matrix
factorization produced sub-optimal (higher than minimum) nuclear norm would have manifested
superior (lower) rank. 
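Effective rank (cf. [43]) is the rank surrogate used in these experiments; the sketch below assumes the usual entropy-based definition, namely the exponential of the Shannon entropy of the normalized singular value distribution:

```python
import numpy as np

def effective_rank(W, tol=1e-12):
    """exp(entropy of sigma / sum(sigma)): a smooth surrogate for rank.

    Equals k exactly when W has k equal nonzero singular values, and varies
    continuously in between, which is what makes it usable in plots.
    """
    sigma = np.linalg.svd(W, compute_uv=False)
    sigma = sigma[sigma > tol * sigma.max()]  # drop (numerically) zero values
    p = sigma / sigma.sum()
    return float(np.exp(-np.sum(p * np.log(p))))
```

For example, the identity matrix of size 10 has effective rank 10, while any rank-1 matrix has effective rank 1.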
More broadly, our experiments suggest that the implicit regularization in
(shallow or deep) matrix factorization is somehow geared towards low rank, and just so happens to
minimize nuclear norm in cases with sufficiently many observations, where minimum nuclear norm
and minimum rank are known to coincide (cf. [9, 42]). We note that the theoretical analysis of [32]
supporting Conjecture 1 is limited to such cases, and thus cannot truly distinguish between nuclear
norm minimization and some other form of implicit regularization favoring low rank.
Given that Conjecture 1 seems to hold in some settings (Theorems 1 and 2; [32]) but not in others
(Figure 2), we hypothesize that capturing implicit regularization in (shallow or deep) matrix factoriza-
tion through a single mathematical norm (or quasi-norm) may not be possible, and a detailed account
for the optimization process might be necessary. This is carried out in Section 3.

3 Dynamical analysis
This section characterizes trajectories of gradient flow (gradient descent with infinitesimally small
learning rate) on deep matrix factorizations. The characterization significantly extends past analyses
for linear neural networks (e.g. [44, 3]) — we derive differential equations governing the dynamics
of singular values and singular vectors for the product matrix W (Equation (1)). Evolution rates of
singular values turn out to be proportional to their size exponentiated by 2 − 2/N, where N is the
depth of the factorization. For singular vectors, we show that lack of movement implies a particular
form of alignment with the gradient, and by this strengthen past results which have only established
the converse. Via theoretical and empirical demonstrations, we explain how our findings imply a
tendency towards low-rank solutions, which intensifies with depth.
Our derivation treats a setting which includes matrix completion and sensing as special cases. 
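The rate law just stated can be illustrated by integrating it for a single singular value, with the alignment term −⟨∇ℓ(W(t)), u_r(t)v_r⊤(t)⟩ frozen at a constant positive "pull"; this freezing, and all numeric choices below, are purely illustrative assumptions:

```python
import numpy as np

def evolve(sigma0, N, pull=1.0, dt=1e-3, steps=1000):
    """Forward-Euler integration of the rate sigma' = N*(sigma^2)^(1-1/N)*pull
    for one singular value, with the gradient-alignment term held constant."""
    sigma = sigma0
    for _ in range(steps):
        sigma += dt * N * (sigma ** 2) ** (1.0 - 1.0 / N) * pull
    return sigma

# A small singular value barely moves under a deep factorization...
small_shallow = evolve(0.01, N=1)   # N = 1: rate is constant, grows linearly
small_deep = evolve(0.01, N=3)      # attenuated by (sigma^2)^(2/3)

# ...while a large one moves faster the deeper the factorization.
big_shallow = evolve(2.0, N=1, steps=100)
big_deep = evolve(2.0, N=3, steps=100)
```

This reproduces, in miniature, the "slow when small, fast when large" behavior that the section attributes to depth.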
We
assume minimization of a general analytic loss ℓ(·),8 overparameterized by a deep matrix factorization:

φ(W_1, ..., W_N) := ℓ(W_N W_{N−1} ··· W_1) .    (3)

We study gradient flow over the factorization:

Ẇ_j(t) := (d/dt) W_j(t) = −(∂/∂W_j) φ(W_1(t), ..., W_N(t)) , t ≥ 0 , j = 1, ..., N ,    (4)

and in accordance with past work, assume that factors are balanced at initialization, i.e.:

W_{j+1}⊤(0) W_{j+1}(0) = W_j(0) W_j⊤(0) , j = 1, ..., N−1 .    (5)

Equation (5) is satisfied approximately in the common setting of near-zero initialization (it holds
exactly in the "residual" setting of identity initialization — cf. [23, 5]). The condition played an
important role in the analysis of [3], facilitating derivation of a differential equation governing
the product matrix of a linear neural network (see Lemma 3 in Appendix B). It was shown in [3]
empirically that there is an excellent match between the theoretical predictions of gradient flow with
balanced initialization, and its practical realization via gradient descent with small learning rate and
near-zero initialization. Other works (e.g. [4, 28]) later supported this match theoretically.

8A function f(·) is analytic on a domain D if at every x ∈ D: it is infinitely differentiable; and its Taylor
series converges to it on some neighborhood of x (see [30] for further details).

We note that by Section 6 in [3], for depth N ≥ 3, the dynamics of the product matrix W (Equa-
tion (1)) cannot be exactly equivalent to gradient descent on the loss ℓ(·) regularized by a penalty
term. 
This preliminary observation already hints to the possibility that the effect of depth is different
from those of standard regularization techniques.
Employing results of [3], we will characterize the evolution of singular values and singular vec-
tors for W. As a first step, we show that W admits an analytic singular value decomposition ([7, 12]):
Lemma 1. The product matrix W(t) can be expressed as:

W(t) = U(t) S(t) V⊤(t) ,    (6)

where: U(t) ∈ R^{d,min{d,d′}}, S(t) ∈ R^{min{d,d′},min{d,d′}} and V(t) ∈ R^{d′,min{d,d′}} are analytic
functions of t; and for every t, the matrices U(t) and V(t) have orthonormal columns, while S(t) is
diagonal (elements on its diagonal may be negative and may appear in any order).
Proof sketch (for complete proof see Appendix C.3). We show that W(t) is an analytic function of t
and then invoke Theorem 1 in [7].
The diagonal elements of S(t), which we denote by σ_1(t), ..., σ_{min{d,d′}}(t), are signed sin-
gular values of W(t); the columns of U(t) and V(t), denoted u_1(t), ..., u_{min{d,d′}}(t) and
v_1(t), ..., v_{min{d,d′}}(t), are the corresponding left and right singular vectors (respectively).
With Lemma 1 in place, we are ready to characterize the evolution of singular values:
Theorem 3. The signed singular values of the product matrix W(t) evolve by:

σ̇_r(t) = −N · (σ_r²(t))^{1−1/N} · ⟨∇ℓ(W(t)), u_r(t) v_r⊤(t)⟩ , r = 1, ..., min{d,d′} .    (7)

If the matrix factorization is non-degenerate, i.e. has depth N ≥ 2, the singular values need not be
signed (we may assume σ_r(t) ≥ 0 for all t).
Proof sketch (for complete proof see Appendix C.4). 
Differentiating the analytic singular value de-
composition (Equation (6)) with respect to time, multiplying from the left by U⊤(t) and from
the right by V(t), and using the fact that U(t) and V(t) have orthonormal columns, we obtain
σ̇_r(t) = u_r⊤(t) Ẇ(t) v_r(t). Equation (7) then follows from plugging in the expression for Ẇ(t)
developed by [3] (Lemma 3 in Appendix B).
Strikingly, given a value for W(t), the evolution of singular values depends on N — depth of the
matrix factorization — only through the multiplicative factors N · (σ_r²(t))^{1−1/N} (see Equation (7)). In
the degenerate case N = 1, i.e. when the product matrix W(t) is simply driven by gradient flow over
the loss ℓ(·) (no matrix factorization), the multiplicative factors reduce to 1, and the singular values
evolve by: σ̇_r(t) = −⟨∇ℓ(W(t)), u_r(t) v_r⊤(t)⟩. With N ≥ 2, i.e. when depth is added to the factor-
ization, the multiplicative factors become non-trivial, and while the constant N does not differentiate
between singular values, the terms (σ_r²(t))^{1−1/N} do — they enhance movement of large singular
values, and on the other hand attenuate that of small ones. Moreover, the enhancement/attenuation
becomes more significant as N (depth of the factorization) grows.
Next, we turn to the evolution of singular vectors:
Lemma 2. 
Assume that at initialization, the singular values of the product matrix W(t) are distinct and different from zero.9 Then, its singular vectors evolve by:

$$\dot{U}(t) = -U(t) \Big( F(t) \odot \big[ U^\top(t) \nabla\ell(W(t)) V(t) S(t) + S(t) V^\top(t) \nabla\ell^\top(W(t)) U(t) \big] \Big) - \big( I_d - U(t) U^\top(t) \big) \nabla\ell(W(t)) V(t) \big( S^2(t) \big)^{\frac{1}{2} - \frac{1}{N}} \,, \qquad (8)$$

$$\dot{V}(t) = -V(t) \Big( F(t) \odot \big[ S(t) U^\top(t) \nabla\ell(W(t)) V(t) + V^\top(t) \nabla\ell^\top(W(t)) U(t) S(t) \big] \Big) - \big( I_{d'} - V(t) V^\top(t) \big) \nabla\ell^\top(W(t)) U(t) \big( S^2(t) \big)^{\frac{1}{2} - \frac{1}{N}} \,, \qquad (9)$$

where $I_d$ and $I_{d'}$ are the identity matrices of sizes d × d and d' × d' respectively, $\odot$ stands for the Hadamard (element-wise) product, and the matrix $F(t) \in \mathbb{R}^{\min\{d,d'\} \times \min\{d,d'\}}$ is skew-symmetric with $\big( (\sigma_{r'}^2(t))^{1/N} - (\sigma_r^2(t))^{1/N} \big)^{-1}$ in its (r, r')'th entry, r ≠ r'.10

9 This assumption can be relaxed significantly: all that is needed is that no singular value be identically zero ($\forall r \, \exists t$ s.t. $\sigma_r(t) \neq 0$), and no pair of singular values be identical through time ($\forall r, r' \, \exists t$ s.t. $\sigma_r(t) \neq \sigma_{r'}(t)$).

10 Equations (8) and (9) are well-defined when t is such that $\sigma_1(t), \ldots, \sigma_{\min\{d,d'\}}(t)$ are distinct and different from zero. By analyticity, this is either the case for every t besides a set of isolated points, or it is not the case for any t. Our assumption on initialization disqualifies the latter option, so any t for which Equations (8) or (9) are ill-defined is isolated.
The derivatives of U and V for such t may thus be inferred by continuity.

Figure 3: Dynamics of gradient descent over deep matrix factorizations, specifically, evolution of singular values and singular vectors of the product matrix during training for matrix completion. Top row corresponds to the task of completing a random rank-5 matrix of size 100 × 100 based on 2000 randomly chosen observed entries; bottom row corresponds to training on 10000 entries chosen randomly from the MovieLens 100K dataset (completion of a 943 × 1682 matrix, cf. [24]).11 First (left) three columns show top singular values for, respectively, depths 1 (no matrix factorization), 2 (shallow matrix factorization) and 3 (deep matrix factorization). Last (right) column shows singular vectors for a depth-2 factorization, by comparing on- vs. off-diagonal entries in the matrix $U^\top(t) \nabla\ell(W(t)) V(t)$ (see Corollary 1): for each group of entries, the mean of absolute values is plotted, along with a shaded area marking the standard deviation. All matrix factorizations are full-dimensional (hidden dimensions 100 in top row plots, 943 in bottom row plots). Notice that increasing depth makes singular values move slower when small and faster when large (in accordance with Theorem 3), which results in solutions with effectively lower rank. Notice also that $U^\top(t) \nabla\ell(W(t)) V(t)$ is diagonally dominant so long as there is movement, showing that singular vectors of the product matrix align with those of the gradient (in accordance with Corollary 1). For further details, and a similar experiment on matrix sensing, see Appendix E.

Proof sketch (for complete proof see Appendix C.5). We follow a series of steps adopted from [46] to obtain expressions for $\dot{U}(t)$ and $\dot{V}(t)$ in terms of U(t), V(t), S(t) and $\dot{W}(t)$.
Plugging in the expression for $\dot{W}(t)$ developed by [3] (Lemma 3 in Appendix B) then yields Equations (8) and (9).

Corollary 1. Assume the conditions of Lemma 2, and that the matrix factorization is non-degenerate, i.e. has depth N ≥ 2. Then, for any time t such that the singular vectors of the product matrix W(t) are stationary, i.e. $\dot{U}(t) = 0$ and $\dot{V}(t) = 0$, it holds that $U^\top(t) \nabla\ell(W(t)) V(t)$ is diagonal, meaning the singular vectors of W(t) align with those of $\nabla\ell(W(t))$.

Proof sketch (for complete proof see Appendix C.6). By Equations (8) and (9), $U^\top(t) \dot{U}(t) S(t) - S(t) V^\top(t) \dot{V}(t)$ is equal to the Hadamard product between $U^\top(t) \nabla\ell(W(t)) V(t)$ and a (time-dependent) square matrix with zeros on its diagonal and non-zeros elsewhere. When $\dot{U}(t) = 0$ and $\dot{V}(t) = 0$, obviously $U^\top(t) \dot{U}(t) S(t) - S(t) V^\top(t) \dot{V}(t) = 0$, and so the Hadamard product is zero. This implies that $U^\top(t) \nabla\ell(W(t)) V(t)$ is diagonal.

Earlier papers studying gradient flow for linear neural networks (e.g. [44, 1, 31]) could show that singular vectors are stationary if they align with the singular vectors of the gradient. Corollary 1 is significantly stronger and implies a converse: if singular vectors are stationary, they must be aligned with the gradient. Qualitatively, this suggests that a "goal" of gradient flow on a deep matrix factorization is to align the singular vectors of the product matrix with those of the gradient.

3.1 Implicit regularization towards low rank

Figure 3 presents empirical demonstrations of our conclusions from Theorem 3 and Corollary 1. It shows that for a non-degenerate deep matrix factorization, i.e.
one with depth N ≥ 2, under gradient descent with small learning rate and near-zero initialization, the singular values of the product matrix are subject to an enhancement/attenuation effect as described above: they progress very slowly after initialization, when close to zero; then, upon reaching a certain threshold, the movement of a singular value becomes rapid, with the transition from slow to rapid movement being sharper with a deeper factorization (larger N). In terms of singular vectors, the figure shows that those of the product matrix indeed align with those of the gradient. Overall, the dynamics promote solutions that have a few large singular values and many small ones, with a gap that is more extreme the deeper the matrix factorization is. This is an implicit regularization towards low rank, which intensifies with depth.

11 Observations of MovieLens 100K were subsampled solely to reduce run-time.

Theoretical illustration   Consider the simple case of square matrix sensing with a single measurement fit via $\ell_2$ loss: $\ell(W) = \frac{1}{2} (\langle A, W \rangle - y)^2$, where $A \in \mathbb{R}^{d \times d}$ is the measurement matrix, and $y \in \mathbb{R}$ the corresponding label. Suppose we learn by running gradient flow over a depth-N matrix factorization, i.e. over the objective $\phi(\cdot)$ defined in Equation (3). Corollary 1 states that the singular vectors of the product matrix, $\{u_r(t)\}_r$ and $\{v_r(t)\}_r$, are stationary only when they diagonalize the gradient, meaning $\big\{ |u_r^\top(t) \nabla\ell(W(t)) v_r(t)| : r = 1, \ldots, d \big\}$ coincides with the set of singular values of $\nabla\ell(W(t))$. In our case $\nabla\ell(W) = (\langle A, W \rangle - y) A$, so stationarity of singular vectors implies $|u_r^\top(t) \nabla\ell(W(t)) v_r(t)| = |\delta(t)| \cdot \rho_r$, where $\delta(t) := \langle A, W(t) \rangle - y$ and $\rho_1, \ldots, \rho_d$ are the singular values of A (in no particular order).
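As a numerical sanity check (not part of the analysis above), one can discretize gradient flow over a depth-N factorization of this single-measurement loss and verify that the top singular value of the product matrix moves according to Equation (7). The sketch below assumes the balanced initialization under which the product-matrix dynamics of [3] hold (here obtained by setting every layer to one small symmetric matrix); the dimensions, step size, seed, and random measurement are illustrative choices, not taken from the paper.

```python
import numpy as np

def sensing_grad(W, A, y):
    """Gradient of l(W) = 0.5 * (<A, W> - y)^2 with respect to W."""
    return (np.sum(A * W) - y) * A

def deep_flow_products(A, y, N, W0, eta=1e-3, steps=301):
    """Small-step gradient descent (a discretization of gradient flow) on the
    depth-N objective phi(W_1, ..., W_N) = l(W_N ... W_1).  Initializing every
    layer to the same symmetric matrix W0 makes the factorization balanced
    (W_{j+1}^T W_{j+1} = W_j W_j^T), the setting in which Theorem 3 applies.
    Returns the product matrix at every step."""
    d = W0.shape[0]
    Ws = [W0.copy() for _ in range(N)]   # Ws[0] is W_1, ..., Ws[N-1] is W_N
    products = []
    for _ in range(steps):
        W = np.eye(d)
        for M in Ws:                     # product matrix W = W_N ... W_1
            W = M @ W
        products.append(W)
        G = sensing_grad(W, A, y)
        updated = []
        for j in range(N):
            # d l / d W_{j+1} = (W_N ... W_{j+2})^T  G  (W_j ... W_1)^T
            right = np.eye(d)
            for M in Ws[:j]:
                right = M @ right
            left = np.eye(d)
            for M in Ws[j + 1:]:
                left = M @ left
            updated.append(Ws[j] - eta * left.T @ G @ right.T)
        Ws = updated
    return products

# single random measurement and a balanced near-zero initialization (illustrative)
rng = np.random.default_rng(0)
d, N, eta = 4, 3, 1e-3
A = rng.standard_normal((d, d))
A /= np.linalg.norm(A)
y = 1.0
M = rng.standard_normal((d, d))
W0 = 0.2 * (M + M.T) / 2                 # symmetric => balanced

prods = deep_flow_products(A, y, N, W0, eta=eta, steps=301)

# compare the finite-difference velocity of the top singular value with the
# right-hand side of Equation (7) at an interior step
k = 150
U, S, Vt = np.linalg.svd(prods[k])
G = sensing_grad(prods[k], A, y)
numeric = (np.linalg.svd(prods[k + 1], compute_uv=False)[0]
           - np.linalg.svd(prods[k - 1], compute_uv=False)[0]) / (2 * eta)
predicted = -N * S[0] ** (2 - 2.0 / N) * (U[:, 0] @ G @ Vt[0])
```

With depth N = 3, the finite-difference velocity of $\sigma_1$ matches the prediction $-N (\sigma_1^2)^{1-1/N} u_1^\top \nabla\ell(W) v_1$ up to discretization error; repeating with N = 1 or N = 2 changes only the multiplicative factor, as Theorem 3 asserts.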
We will assume that starting from some time $t_0$ the singular vectors are stationary, and accordingly $u_r^\top(t) \nabla\ell(W(t)) v_r(t) = \delta(t) \cdot e_r \cdot \rho_r$ for $r = 1, \ldots, d$, where $e_1, \ldots, e_d \in \{-1, 1\}$. Theorem 3 then implies that the (signed) singular values of the product matrix evolve by:

$$\dot{\sigma}_r(t) = -N \cdot \big( \sigma_r^2(t) \big)^{1 - 1/N} \cdot \delta(t) \cdot e_r \cdot \rho_r \,. \qquad (10)$$

Let $r_1, r_2 \in \{1, \ldots, d\}$. By Equation (10):

$$\int_{t'=t_0}^{t} \big( \sigma_{r_1}^2(t') \big)^{-1 + 1/N} \dot{\sigma}_{r_1}(t') \, dt' = \frac{e_{r_1} \rho_{r_1}}{e_{r_2} \rho_{r_2}} \cdot \int_{t'=t_0}^{t} \big( \sigma_{r_2}^2(t') \big)^{-1 + 1/N} \dot{\sigma}_{r_2}(t') \, dt' \,.$$

Computing the integrals, we may express $\sigma_{r_1}(t)$ as a function of $\sigma_{r_2}(t)$:12

$$\sigma_{r_1}(t) = \begin{cases} \alpha_{r_1,r_2} \cdot \sigma_{r_2}(t) + \mathrm{const} \,, & N = 1 \\ \big( \sigma_{r_2}(t) \big)^{\alpha_{r_1,r_2}} \cdot \mathrm{const} \,, & N = 2 \\ \big( \alpha_{r_1,r_2} \cdot (\sigma_{r_2}(t))^{-\frac{N-2}{N}} + \mathrm{const} \big)^{-\frac{N}{N-2}} \,, & N \geq 3 \end{cases} \,, \quad \forall t \geq t_0 \,, \qquad (11)$$

where $\alpha_{r_1,r_2} := e_{r_1} \rho_{r_1} (e_{r_2} \rho_{r_2})^{-1}$, and const stands for a value that does not depend on t. Equation (11) reveals a gap between $\sigma_{r_1}(t)$ and $\sigma_{r_2}(t)$ that widens with depth. For example, consider the case where $0 < \alpha_{r_1,r_2} < 1$. If the depth N is one, i.e. the matrix factorization is degenerate, $\sigma_{r_1}(t)$ will grow linearly with $\sigma_{r_2}(t)$. If N = 2 (shallow matrix factorization), $\sigma_{r_1}(t)$ will grow polynomially more slowly than $\sigma_{r_2}(t)$ (const here is positive).
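The case analysis behind Equation (11) can be checked numerically: for N ≥ 3, whatever the residual $\delta(t)$ is, integrating Equation (10) for two singular values must keep $\sigma_{r_1}^{-(N-2)/N} - \alpha_{r_1,r_2} \cdot \sigma_{r_2}^{-(N-2)/N}$ constant in time. Below is a minimal sketch for N = 3, with an arbitrary illustrative residual $\delta(t) = -e^{-t}$ and hypothetical values of $\rho_r$, $e_r$ and the initialization (none of these specific choices come from the text above).

```python
import numpy as np

def integrate_signed_svals(rho, e, sigma0, N, delta, t_max=2.0, dt=1e-4):
    """Forward-Euler integration of Equation (10):
         d sigma_r / dt = -N * (sigma_r^2)^(1 - 1/N) * delta(t) * e_r * rho_r.
    Returns the trajectory of all sigma_r sampled at every step."""
    sigma = np.array(sigma0, dtype=float)
    traj = [sigma.copy()]
    t = 0.0
    while t < t_max:
        d_sigma = -N * (sigma ** 2) ** (1 - 1.0 / N) * delta(t) * e * rho
        sigma = sigma + dt * d_sigma
        t += dt
        traj.append(sigma.copy())
    return np.array(traj)

# two singular values driven by the same (illustrative) residual delta(t) = -exp(-t)
rho = np.array([1.0, 0.5])
e = np.array([1.0, 1.0])
traj = integrate_signed_svals(rho, e, [0.1, 0.1], N=3,
                              delta=lambda t: -np.exp(-t))

# Equation (11) for N = 3 says sigma_1 = (alpha * sigma_2^(-1/3) + const)^(-3),
# i.e. sigma_1^(-1/3) - alpha * sigma_2^(-1/3) is conserved along the flow
alpha = (e[0] * rho[0]) / (e[1] * rho[1])
invariant = traj[:, 0] ** (-1 / 3) - alpha * traj[:, 1] ** (-1 / 3)
```

Because the conserved quantity is exact for the continuous flow, the Euler discretization only drifts at second order, and the invariant stays constant to high accuracy; an analogous check with logarithms for N = 2 confirms the power-law relation $\sigma_{r_1} \propto \sigma_{r_2}^{\alpha_{r_1,r_2}}$.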
Increasing depth further will lead $\sigma_{r_1}(t)$ to saturate as $\sigma_{r_2}(t)$ grows, at a value which can be shown to be lower the larger N is. Overall, adding depth to the matrix factorization leads to more significant gaps between singular values of the product matrix, i.e. to a stronger implicit bias towards low rank.

4 Conclusion

The implicit regularization of gradient-based optimization is key to generalization in deep learning. As a stepping stone towards understanding this phenomenon, we studied deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Through theory and experiments, we questioned prevalent norm-based explanations for implicit regularization in matrix factorization (cf. [20]), and offered an alternative, dynamical approach. Our characterization of the dynamics induced by gradient flow on the singular value decomposition of the learned matrix significantly extends prior work on linear neural networks. It reveals an implicit tendency towards low rank which intensifies with depth, supporting the empirical superiority of deeper matrix factorizations.

An emerging view is that understanding optimization in deep learning necessitates a detailed account of the trajectories traversed in training (cf. [4]). Our work adds another dimension to the potential importance of trajectories: we believe they are necessary for understanding generalization as well, and in particular, may be key to analyzing implicit regularization for non-linear neural networks.

12 In accordance with Theorem 3, if N ≥ 2, we assume without loss of generality that $\sigma_{r_1}(t), \sigma_{r_2}(t) \geq 0$, while disregarding the trivial case of equality to zero.

Acknowledgments

This work was supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC.
Nadav Cohen was a member at the Institute for Advanced Study, and was additionally supported by the Zuckerman Israeli Postdoctoral Scholars Program. The authors thank Nathan Srebro for illuminating discussions which helped improve the paper.

References

[1] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

[2] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.

[3] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, pages 244–253, 2018.

[4] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. International Conference on Learning Representations, 2019.

[5] Peter Bartlett, Dave Helmbold, and Phil Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations. In International Conference on Machine Learning, pages 520–529, 2018.

[6] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

[7] Angelika Bunse-Gerstner, Ralph Byers, Volker Mehrmann, and Nancy K Nichols. Numerical computation of an analytic singular value decomposition of a matrix valued function. Numerische Mathematik, 60(1):1–39, 1991.

[8] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

[9] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization.
Foundations of Computational Mathematics, 9(6):717, 2009.

[10] Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. arXiv preprint arXiv:1809.09573, 2018.

[11] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.

[12] B De Moor and S Boyd. Analytic properties of singular values and vectors. Katholic Univ. Leuven, Belgium, Tech. Rep., 28:1989, 1989.

[13] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[14] Simon S Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. arXiv preprint arXiv:1901.08572, 2019.

[15] Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018.

[16] Jicong Fan and Jieyu Cheng. Matrix completion by deep matrix factorization. Neural Networks, 98:34–41, 2018.

[17] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[18] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1233–1242. JMLR.org, 2017.

[19] Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in deep linear neural networks. arXiv preprint arXiv:1904.13262, 2019.

[20] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization.
In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.

[21] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1832–1841, 2018.

[22] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.

[23] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. International Conference on Learning Representations, 2016.

[24] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

[25] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[26] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.

[27] Yulij Ilyashenko and Sergei Yakovenko. Lectures on analytic differential equations, volume 86. American Mathematical Society, 2008.

[28] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. International Conference on Learning Representations, 2019.

[29] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations, 2017.

[30] Steven G Krantz and Harold R Parks. A primer of real analytic functions. Springer Science & Business Media, 2002.

[31] Andrew K Lampinen and Surya Ganguli.
An analytic theory of generalization dynamics and transfer learning in deep linear networks. International Conference on Learning Representations, 2019.

[32] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Proceedings of the 31st Conference On Learning Theory, pages 2–47, 2018.

[33] Zechao Li and Jinhui Tang. Deep matrix factorization for social image tag refinement and assignment. In 2015 IEEE 17th International Workshop on Multimedia Signal Processing (MMSP), pages 1–6. IEEE, 2015.

[34] Junhong Lin, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization for multiple passes SGM. In International Conference on Machine Learning, pages 2340–2348, 2016.

[35] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pages 3351–3360, 2018.

[36] Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In Proceedings of Machine Learning Research, volume 89, pages 3420–3428, 2019.

[37] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

[38] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

[39] Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach.
In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 65–74, 2017.

[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[41] Nasim Rahaman, Devansh Arpit, Aristide Baratin, Felix Draxler, Min Lin, Fred A Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of deep neural networks. arXiv preprint arXiv:1806.08734, 2018.

[42] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[43] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pages 606–610. IEEE, 2007.

[44] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.

[45] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

[46] James Townsend. Differentiating the singular value decomposition. Technical report, 2016.

[47] George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou, and Björn W Schuller. A deep matrix factorization method for learning attribute representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):417–429, 2017.

[48] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via Procrustes flow.
In International Conference on Machine Learning, pages 964–973, 2016.

[49] Qi Wang, Mengying Sun, Liang Zhan, Paul Thompson, Shuiwang Ji, and Jiayu Zhou. Multi-modality disease modeling via collective deep matrix factorization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1155–1164. ACM, 2017.

[50] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. Deep matrix factorization models for recommender systems. In IJCAI, pages 3203–3209, 2017.

[51] Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017.

[52] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations, 2017.

[53] Handong Zhao, Zhengming Ding, and Yun Fu. Multi-view clustering via deep matrix factorization. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.