Paper ID: | 7344 |
---|---|

Title: | Global Convergence of Gradient Descent for Deep Linear Residual Networks |

Response to authors' feedback: I thank the authors for the rebuttal. My score remains the same. ----------------------------------- The work proposes an initialization for deep linear networks in which the last layer is initialized to 0, and the rest to the identity Id. With this initialization, the networks are shown to converge linearly to zero loss, under conditions (for discrete-time GD) that are different from and perhaps conceptually simpler than previous works. For instance, compared to reference [2] (Arora et al “A convergence analysis of gradient descent for deep linear neural networks”, ICLR 2019), this work removes completely the delta-balanced condition in [2] by showing that this condition actually holds, for most layers, on the GD trajectory (Lemma 4.2 and Eq. (4.6)). While certain elements have already been seen in previous works (e.g. the property in Lemma 4.2 is similar to the delta-balanced condition in [2], or the requirement of zero initialization for the last layer’s weight has been seen in “fixup initialization” of reference [21] in the context of residual networks), I think the proposed initialization as well as the convergence analysis here deserve credits for novelty. In particular, I appreciate the insight that at the region where symmetry breaks, the velocity is bounded away from zero (line 146 and Eq (4.4)), and hence one ought to initialize accordingly. Interestingly, from my understanding, some further extensions of this analysis might say more about the landscape on which GD travels. For example: - Firstly, by symmetry, one can actually choose any one weight matrix to be the zero-initialized one, with the rest initialized to Id. - Secondly, one can modify the ZAS and initialize the last layer as non-zero appropriately (say, 0.1*Id), such that the two identities after line 145 hold. These points further attest some degree of distinction from “fixup initialization” of [21] for this work. - Thirdly, I think instead of initializing to Id, one can initialize to alpha*Id, for alpha>1, and get an improvement on the convergence rate. In particular — if my mental calculation is correct — that should yield, for continuous-time GD: R(t) \leq \exp(-2\alpha^{2(L-1)}t)R(0). Note that this alpha*Id initialization doesn’t change R(0) as compared to Id initialization, since one of the matrices is 0 at initialization. This improvement is significant for larger L. Now we can see that the convergence rate can be made arbitrarily large! The above three points, if true, should show that there might be hidden insights from the analysis of this paper.

I've read the author feedback and appreciate the added experiment, which shows the benefit of the proposed ZAS initialization in training very deep ResNets compared with standard init schemes (the authors tried the Xavier init.) For the moment I am raising my score to 6. ------ This paper proves a simple theoretical result for optimizing a deep linear network: when all layers are d-by-d, if we initialize the top layer as 0 and all but the top layer by the identity matrix, then the gradient flow (as well as gradient descent with a polynomially small stepsize) enjoys linear convergence to zero risk on the matrix regression problem. This paper is well presented and the convergence result is particularly neat. However the result is a rather direct consequence of a known alignment property on deep linear networks, and thus might be incremental in its contribution and novelty. Indeed, the proof follows straightforwardly from the known fact (as the authors noted) that the matrices W_{l+1}^T * W_{l+1} - W_l * W_l^T remains constant on the gradient flow path. This paper comes up with a specific design which utilizes the above fact and implies that W_{(L-1):1} stays well-conditioned, and hence a gradient domination condition. Compared with existing alignment-type results on deep linear nets (e.g. Ji and Telgarsky 2019), the present result is a bit lacking in interesting messages for real problems, e.g. when nonlinearity is present.

# Positive aspects * The toy model of Figure 1 is insightful and effective at providing an intuition behind the potential advantages of the method. # Post-rebuttal Authors have responded to some of my criticisms, increasing my score to 6. * I checked the proofs in Page 5 and they seem correct to me. # Main criticism The main limitation I see in this paper is that experiments are exclusively done on linear networks. While linear networks might provide a useful tool for the theoretical analysis of initialization and to derive convergence rates, they are not useful machine learning models. In other words, I do not think the development of a new initialization method is useful unless its shown to be competitive on real problems (i.e., beyond deep linear networks). Furthermore, I have concerns with the experiments the authors show even in this unrealistic scenario. The authors only show _one_ run of their experiments where ZAS outperforms the near-identity initialization. It is hence unclear whether this is a reproducible effect or an exception one that only arises in this very specific example that the authors hand-picked. It would be much more convincing in any case to show these curves for multiple runs of the same dataset and not just one hand-picked. # Other comments The paper is rather poorly written. Take for example the description of related work in L17-23. The authors write "the random orthogonal initialization proposed in analyzing the gradient descent ..." are the authors citing a paper named "analyzing the gradient descent ..." ? are they describing the paper?, or in the next line "how the knowledge transfer across tasks in transfer learning" [what is knowledge? transfer-> transfers? furthermore, how is this related to the current paper? Almost every paragraph has some example of poor writing like this, making the paper hard to read. Poor writing is not limited to prose. The mathematical notation is equally sloppy. For example, R is defined in Eq. (2.3) as a function of W_1, ..., W_L. It then appears right below in (2.4), where its instead a function of the iterations t. I guess that R(t) is overloaded to mean R(W_1, ..., W_L), where the W's are taken at the t-th iterate, but I shouldn't have to guess. # Suggestions It seems to be that A : B = tr(A B^T), why not use this more standard notation?