Paper ID: 1800

Title: Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks

The authors theoretically analyze the dynamics of continuous and discrete optimization for one- and two-layer linear neural networks under the square loss. For the continuous case, they improve the results of Saxe et al. by proving implicit regularization under more reasonable assumptions. They also analyze the behavior of the discrete dynamics for the same networks. To the best of my knowledge, the analysis of the discrete dynamics is novel. In Theorem 1, I understand that the specific initialization is required for the proof to go through. I wonder to what extent this initialization is necessary: if one initializes the matrices to a matrix with vanishing nuclear norm but different singular vectors, I would guess we should still observe the same phenomenon of sequential learning. Can the authors clarify to what extent this form of initialization is necessary? The fact that the authors were able to relax the assumption that the two matrices commute is very interesting. A similar question applies to this assumption: is the requirement that the two matrices approximately commute merely needed for the proof, or do the authors believe it is necessary for the phenomenon itself?
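The question about generic small initializations can be probed numerically. Below is a minimal sketch (not the authors' code) that trains a two-layer linear network by gradient descent on synthetic data from a generic small random initialization and tracks the top singular values of the product; all dimensions, scales, and step sizes are illustrative assumptions.

```python
import numpy as np

# Sketch: does sequential mode-by-mode learning survive a generic small
# initialization (random singular vectors, tiny norm), not just the specific
# one assumed in Theorem 1? Synthetic teacher with well-separated singular
# values 5 > 3 > 1; everything below is a made-up toy setup.
rng = np.random.default_rng(0)
d, n = 20, 500
X = rng.standard_normal((n, d))
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
M_true = U[:, :3] @ np.diag([5.0, 3.0, 1.0]) @ V[:, :3].T
Y = X @ M_true

scale = 1e-3                      # small but otherwise generic initialization
W1 = scale * rng.standard_normal((d, d))
W2 = scale * rng.standard_normal((d, d))
lr = 1e-2

sv_history = []
for t in range(6000):
    R = (X @ W1.T @ W2.T - Y) / n          # residual of the square loss
    g2 = R.T @ (X @ W1.T)                  # dL/dW2
    g1 = W2.T @ (R.T @ X)                  # dL/dW1
    W2 -= lr * g2
    W1 -= lr * g1
    if t % 1000 == 0:
        sv_history.append(np.linalg.svd(W2 @ W1, compute_uv=False)[:3].round(2))

sv = np.linalg.svd(W2 @ W1, compute_uv=False)[:3]
print(sv_history)        # the three modes should rise one after another
print(np.round(sv, 2))   # final values should be close to 5, 3, 1
```

In toy runs of this kind one typically sees the leading mode learned first and the weakest mode last, consistent with the sequential-learning phenomenon; of course, this is only numerical evidence, not a substitute for the theoretical question posed above.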

The paper studies the gradient dynamics of the SGD learning algorithm in the context of two-layer linear networks. The authors consider two scenarios: continuous and discrete gradient dynamics. The former was studied in Saxe et al. 2018, but the authors manage to loosen the assumptions and provide a more general result. In particular, under the assumption that the covariance matrix X'X and the correlation matrix X'Y share a joint decomposition, the SGD updates can be derived in closed form. Additionally, the paper shows an interesting asymptotic phenomenon: as the initialization of the weights tends to zero, the weight product at any SGD iterate converges to the solution of the reduced-rank regression problem. A more novel result is the formulation of the discrete gradient dynamics, where the authors show that, under a somewhat more stringent assumption, the SGD updates again have a closed form. Although the paper presents interesting and novel theoretical results, my main concern is that Assumption 1 is too strong; it is not clear that this assumption is met in practice. Furthermore, the theoretical analysis might not generalize to networks of more than three layers. For these reasons, the paper is more suitably viewed as addressing a low-rank regression problem rather than a deep linear neural network.
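For reference, the reduced-rank regression solution that the weight product is claimed to converge to has a well-known closed form: project the ordinary-least-squares fit onto its top principal response directions. A minimal sketch, with hypothetical data shapes:

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Closed-form solution of min ||Y - X @ B||_F^2 subject to rank(B) <= r."""
    B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]   # unconstrained least squares
    Y_hat = X @ B_ols                              # fitted values
    _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    proj = Vt[:r].T @ Vt[:r]                       # projector onto top-r fitted directions
    return B_ols @ proj

# Toy check with made-up shapes: the solution has rank r, and taking r equal
# to the number of targets recovers ordinary least squares.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 6))
Y = X @ rng.standard_normal((6, 4)) + 0.01 * rng.standard_normal((100, 4))
B2 = reduced_rank_regression(X, Y, 2)
print(np.linalg.matrix_rank(B2))  # 2
```

This is the standard closed form (decompose the objective into the OLS residual plus a best rank-r approximation of the fitted values), included only to make concrete what "the solution of the reduced-rank regression problem" refers to above.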

Summary: This paper studies the implicit regularization of the discrete gradient descent algorithm for over-parameterized two-layer neural networks. Under mild assumptions on the data, the paper shows that the discrete gradient dynamics learn the reduced-rank regression solutions sequentially. The experimental results support the theoretical results. Pros: - The theoretical results on the reduced-rank regression problem, i.e., that the gradient dynamics learn the solutions sequentially, are novel and interesting. The results can shed light on representation learning in deep neural networks. - The experiments on both synthetic and real datasets are well-designed; they support the theoretical results and also validate the assumptions. Limitations & Questions: - The analysis seems specific to two-layer linear neural networks. Could it possibly be extended to deep neural networks? Typo: - L459-L460, \bm{W}_{1}(t)^{0}. Other Comments: In the code 'M_0 = 10**-3 * 1/np.sqrt(d) * np.random.rand(r,d)' in [18], 'r' is not defined. After replacing r with a number, the code reproduces the results in Figure 2. There is an omission in the related work on implicit regularization: https://arxiv.org/abs/1712.09203
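A self-contained version of the flagged cell, with the missing 'r' defined; the values of d and r and the seed below are placeholders (my assumptions), not values taken from the authors' notebook:

```python
import numpy as np

# The point is only that defining r before the quoted line makes the cell run;
# d = 50 and r = 3 are placeholder choices, not the paper's settings.
d = 50                    # input dimension (placeholder)
r = 3                     # the missing definition: the target rank
np.random.seed(0)
M_0 = 10**-3 * 1/np.sqrt(d) * np.random.rand(r, d)
print(M_0.shape)  # (3, 50)
```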

The paper addresses an important topic (implicit regularization in deep learning), is well-written, and although I did not verify the proofs in the appendixes, I believe it is mathematically solid. Nonetheless, I have several concerns with regard to originality, significance, and clarity (see below), which ultimately lead me to vote against its acceptance. *Originality:* The setup and results proven here are very similar to those of previous works (in particular [2], which was not cited!). This is true for the proof techniques as well. The only part I view as potentially novel is the perturbation analysis, but that is unfortunately not discussed at all in the body of the paper. I recommend that the authors put much more focus on this aspect, as without it the paper is merely a straightforward extension of prior work. *Significance:* In my opinion, the significance of contribution (A) --- relaxation of the assumption in (i) --- lies in the perturbation analysis, i.e., the fact that only approximate alignment is required. Contribution (B) --- the transition from continuous (gradient flow) to discrete (gradient descent) optimization --- is also relatively simple without the perturbation analysis. Since that analysis is fully deferred to the appendixes, I think the paper body itself falls short of the significance threshold required for publication at NeurIPS. *Clarity:* The paper is, in my opinion, clear and well-written. However, I would like to point out that its presentation may mislead readers into overestimating the significance and implications of the results proven. Specifically, the paper repeatedly uses the term "deep" (including in its title), where in fact its analyses are all limited to shallow (one- or two-layer) models.
Also, it refers to "implicit regularization" (including in the title), and talks about gradient descent over different architectures biasing towards certain solutions, but the setting it treats is one in which the final solution is known, and the only implication of the architecture is on the path taken to get there. I am not claiming that the latter question is uninteresting, but it should be contrasted against the kind of implicit regularization we typically encounter in deep learning (where early stopping is not needed). I am aware that there are prior works that make the same inaccuracies, but in my opinion that does not justify doing so. *Quality:* I did not go over the proofs in the appendixes, but as far as I could tell the paper is technically solid. *Additional Comments:* * In Figure 2 (experiment), I believe there should be a discussion of the one-layer model overfitting after some point while the two-layer model does not. * It is not clear to me whether the discrete result allows epsilon > 0 (i.e., approximate alignment). The text suggests that it does, but the statement doesn't appear to deliver. If there is no accounting for approximation in the discrete case, then my concern about significance is even more potent. -------- Update after reading other reviews and author response: At a high level, my critique of this paper is that it is very similar to prior work (in particular Lampinen and Ganguli 2019, which was not cited) in its formulation and assumptions (and also in many of its proof techniques), but nonetheless does not clearly articulate its distinction. I believe an uninformed reader could easily misinterpret the text and associate with this work contributions that have already been made. A more transparent presentation, in my view, would start off with a detailed technical recap of the existing framework on which this paper relies, and then move on to discuss the incremental improvements this paper introduces.
Since the improvements are ultimately incremental, I think their technical aspects (e.g., proof ideas) must be included in the body of the paper, not just their motivation. For example, it is important to clarify that the discrete result does not include perturbations; thus the assumptions ensure that the multidimensional problem decomposes into scalar problems, immensely simplifying the analysis. The author response did not really address my concerns. For example, while I recommend that technical details behind the perturbation and discrete analyses be added to the paper body, the response refers to text discussing motivation and challenges. The comment I gave on the misuse of the term "implicit regularization" was also not recognized --- in practical deep learning the term refers to the final solution being regularized, not just the optimization path (again, I am not saying paths are uninteresting, just that the text should be transparent about this). After reading the author response, I reviewed the proofs in the appendix. I believe they do carry some novelty and, if added to the paper body, would constitute a contribution worthy of acceptance. I encourage the authors to rewrite the paper in a way that is more transparent on the one hand and, on the other, does more justice to their technical novelty. With regard to the current form of the paper, I think it is marginally below the acceptance threshold (I am updating my score accordingly).