Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
I think this paper is a very good read. It is both pedagogical and brings interesting food for thought on a very active topic. English usage is good and references are adequate. It would be interesting to indicate how far the ideas do or do not carry over to more general nonlinear settings. That said, the methodology is interesting, and I particularly liked the core Section 3 on the dynamical analysis of the model output across iterations. The theoretical findings are supported by experiments.
This paper studies the implicit regularization of gradient descent over deep neural networks, in the setting of deep matrix factorization models. The paper begins with a review of prior work on how running gradient descent on a shallow matrix factorization model, with small learning rate and initialization close to zero, tends to converge to solutions that minimize the nuclear norm (Conjecture 1). This discussion is then extended to deep matrix factorization, where predictive performance improves with depth when the number of observed entries is small. Experimental results (Figure 2) that challenge Conjecture 1 are then presented. They indicate that, when few entries are observed, implicit regularization in both shallow and deep matrix factorization converges to low-rank solutions rather than minimizing the nuclear norm. Finally, a theoretical and experimental analysis of the dynamics of gradient flow for deep matrix factorization is presented, which shows how the singular values and singular vectors of the product matrix evolve during training, and how this leads to implicit regularization that induces low-rank solutions. The organization of the paper presents a narrative that evolves nicely as the paper progresses, which makes the paper relatively easy to follow. Most of the theoretical and experimental results are fairly convincing. However, some of the plots are difficult to read, particularly in Figures 1 and 3, because they are missing axis labels. Also, it is not clear how the plots in Fig. 3 can be interpreted to show that, as depth is increased, the singular values move slower when small and faster when large, as indicated by the authors.
In Sec. 2.2, it is noted that when sufficiently many entries are observed, minimum nuclear norm and minimum rank are known to coincide. It would be helpful to provide further discussion or analysis regarding this point, perhaps by showing a plot of how the reconstruction errors of the depth 2, 3, and 4 solutions compare to that of the minimum nuclear norm solution as a function of the number of observed entries.
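For readers who want to try the kind of comparison suggested above, here is a minimal NumPy sketch of gradient descent on a depth-N matrix factorization for matrix completion. It is not the authors' code: the function names (`chain_product`, `deep_mf_complete`), the matrix size, learning rate, step count, initialization scale, and the 50% observation rate are all assumptions chosen for illustration, and the initialization here is only moderately small. The script records the training loss, the error on unobserved entries, and the leading singular values of the learned product matrix for depths 2, 3, and 4.

```python
import numpy as np


def chain_product(mats, size):
    """Multiply matrices left to right; an empty list gives the identity."""
    out = np.eye(size)
    for m in mats:
        out = out @ m
    return out


def deep_mf_complete(target, mask, depth=3, lr=0.05, steps=3000,
                     init_scale=0.1, seed=0):
    """Fit W = F_0 F_1 ... F_{depth-1} to the observed entries of `target`.

    All hyperparameter defaults are illustrative assumptions.
    Returns the learned product matrix and the per-step training loss.
    """
    rng = np.random.default_rng(seed)
    n = target.shape[0]  # square n x n matrices assumed for simplicity
    factors = [init_scale * rng.standard_normal((n, n)) for _ in range(depth)]
    losses = []
    for _ in range(steps):
        product = chain_product(factors, n)
        residual = mask * (product - target)  # error on observed entries only
        losses.append(0.5 * np.sum(residual ** 2))
        grads = []
        for j in range(depth):
            left = chain_product(factors[:j], n)
            right = chain_product(factors[j + 1:], n)
            # Gradient of the squared loss with respect to factor j.
            grads.append(left.T @ residual @ right.T)
        factors = [f - lr * g for f, g in zip(factors, grads)]
    return chain_product(factors, n), losses


# Hypothetical test problem: a rank-1 target with spectral norm 1,
# with roughly half of its entries observed.
rng = np.random.default_rng(1)
n = 10
u, v = rng.standard_normal(n), rng.standard_normal(n)
target = np.outer(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
mask = (rng.random((n, n)) < 0.5).astype(float)

results = {}
for depth in (2, 3, 4):
    estimate, losses = deep_mf_complete(target, mask, depth=depth)
    held_out_error = np.linalg.norm((1 - mask) * (estimate - target))
    top_singular_values = np.linalg.svd(estimate, compute_uv=False)[:3]
    results[depth] = (losses[0], losses[-1], held_out_error)
    print(depth, losses[-1], held_out_error, top_singular_values)
```

Sweeping the observation rate in `mask` and adding a minimum nuclear norm baseline (which would need a convex solver and is not included here) would give the plot requested above.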
Overall, I found this an extremely interesting paper. The paper is well written (though most proofs are in the Supplementary section), provides good background (although this is also relegated to the Supplementary section), and makes an interesting contribution to our understanding of deep learning. I also appreciated the approach, which, at least to me, was novel. The paper is original, technically sound (as far as I checked, though I cannot vouch for all the proofs), well written, and significant.