NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5096
Title:Lookahead Optimizer: k steps forward, 1 step back

Reviewer 1

UPDATE: After reading the rebuttal and discussing with the other reviewers, I'd like to increase my score to 5. Particularly I think that the theoretical part is weak, and it needs significant improvement. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% This paper presents a Lookahead algorithm for optimizing deep neural nets. The algorithm consists of an inner loop, which updates fast weights using conventional optimizers like SGD, and an outer loop, which updates slow weights by linear interpolation. Also some theoretical analysis on optimal learning rate and convergence is provided as well. The paper is written clearly, easy to follow. My main concerns about the paper are as follows: 1. Novelty: The proposed idea has been explored in Zhang et al. 2018 ( where they proposed an algorithm called Time-Delay momentum, equivalent to fast/slow weights, though I think there may be some errors in their theoretical analysis. Similar performance such as training and testing behavior on image classification was reported as well. 2. I do not see any usage about Sec. 2.1, as stated in L91 that in practice the authors are still using fixed learning rate. So what is the benefit of providing such a result? 3. In Yanet al., Unified analysis of stochastic momentum methods for deep learning, IJCAI 2018, the convergence of momentum in expectation was proved for (convex) deterministic loss functions. Sec. 3.2 verifies such a result. 4. The empirical results are weak, and I do not see the benefit of using the proposed method over traditional ones like SGD and Adam.

Reviewer 2

Update: I have read the author's response and have kept my score. Please note that in DeVries and Taylor'17, 'ResNet-18' is not truly the ResNet-18 model (it consists of 4 stages and has more than an order of magnitude more parameters than the original ResNet-18 due to wider channels). This should be made clear in the paper in order not to cause more confusion in the community. Originality: Medium/High The proposed algorithm is considerably different than recently proposed methods for deep learning, which gravitate towards adaptive gradient methods. It has some similarities to variance reduction algorithms with inner and outer loops, however Lookahead has a very simple outer loop structure and and is easy to implement. I consider it a significantly original work. Quality: Medium The experimental evaluation is very extensive, including many challenging tasks, and improvement on language modelling is significant. The analysis, although helps motivate the proposed algorithm, is restricted to a very simplistic setting -- any analysis for the smooth non-convex case would be very helpful since the method is aimed at training neural networks, even if it provided no advantage over SGD. The CIFAR experiments, however, look a bit off. To achieve 4.7-4.8% error on CIFAR-10 a very deep ResNet is typically requried (a ResNet-1001 achieves similar performance, as reported in [1]), and I find it unlikely that the authors achieved such performance with a ResNet-18. I would guess that a Wide ResNet was used instead of a standard ResNet model, with a considerably widening factor, but I expect this to be addressed in the rebuttal. Clarity: High The paper is clear and well written. Significance: Medium/High The paper manages to successfully convince that variance reduction-type methods with simple inner/outer loops can provide significant advantages to deep learning applications, and might increase interest in such kind of research. The idea is refreshing in face of how much focus is aimed on adaptive gradient methods and Adam-type variants currently, and the paper shows through extensive experiments that Lookahead can provide performance boosts when compared to other methods -- especially in tasks where adaptive methods outperform SGD (in the paper, language modelling and machine translation).

Reviewer 3

*****************After Author Response*************** Dear authors, thank you very much for your response. After much consideration, I find that the response of the authors does address my initial concerns. Though after much deliberation, I believe the numerics are convincing and the contribution is clear. Thus I have increased the grade to 6. I kindly request that the authors make a clear comparison with SWA, both on the high level and experimentally. In particular, I was not satisfied with the authors response because: 1) (major) I pointed out in my review that LA and SWA are similar augmentations by writing SWA in the same format as LA. Though I agree they are not the same, they are at least competing methods. The authors responded stating that SWA and LA "serve different purposes and are complementary". The justification for this is that LA uses and exponentially moving average and SWA was is typically initiated towards the end. This for me is a justification of why they are not the same method (which I already agreed), but it does clarify why they are not directly competing augmentations (note that SWA could be applied throughout). Though I am pleased the authors offered more experiments applying LA in conjunction with SWA, I think if makes more sense to directly compare the performance of SGD + LA and SGD + SWA to best verify what is the contribution of each method. 2) (minor) In my review I pointed out that the theory in Section 3.1 is very restrictive since optimizing a diagonal quadratic is equivalent to minimizing several decoupled 1-dimensional second order polynomials. The authors responded stating that it "is equivalent to general stochastic convex quadratics" pointing to Sutskever et al. 2013, Proposition 6.1 where the authors show that the analysis of *full* gradient descent with/without momentum depends only on the eigenvalues of the quadratic. But this is not true for stochastic gradient methods. Because for quadratic SGD can be cast as a method that samples rows or columns of the quadratic matrix. But what is a row or column depends on the coordinate basis, and as a result, the analysis of SGD does depend on eigenvalues and eigenvectors. Thus the authors response is in accurate. Though I do concede that, it sheds light on the convergence of full batch gradient methods combined with the lookahead scheme. ********************************************************** *Issues* Similarity to Stochastic weight averaging: The proposed lookahead algorithm is similar to the Stochastic Weight Averaging (SWA) [1], so much so that SWA could be considered as a variant of Lookahead, and vice-versa. To see this, I have re-written the SWA Algorithm in the same notation used in this paper: SWA Algorithm: 1. phi_{0} = O_{1,0} = initial parameter 2. for t = 1,2, … do 3. for t = 1,2, …, k do 4. O_{t,i} = O_{t,i-1} + A(L, O_{t,i-1}, d) 5. end for 6. alpha_t = 1/(t+1) 7. phi_t = phi_{t-1} + alpha_t (O_{t,k} - phi_{t-1}) 8. end for Note that I re-arranged the notation by using two loops (as done here) as opposed to using the mod function (as used in [1]). The difference between the SWA algorithm and the Lookahead is that the inner iterations are not reset using the outer iterations (there is no O_{t,0} = phi_{t-1} step as there is in Lookahead) and the parameter alpha depends on the outer iteration counter (see line 6 of SWA Algorithm). Excluding these two difference, the methods are essentially the same. Furthermore, both rely on the same intuition and both paper make use of the same projected level set plots (Figure 1 in this paper and Figure 1 in the SWA paper [1]). Finally both have the same computation overhead, are easy to implement, and are directly competing augmentations (it doesn’t make sense to apply both independently). This is why I feel the Lookahead procedure should have been compared to SWA and these similarities drawn out. Though I agree the other augmentations mentioned in Section 4 are not directly competing since they can be applied in conjunction with Lookahead (variance reduction, acceleration, Anderson averaging ...etc ). Indexing mistake eq (1) and (2): There is a small indexing mistake here, instead of Equation (1) according to Algorithm 1 we should have \phi_{t+1} = \phi_t + \alpha (\theta_{t+1,k} - \phi_t) Thus unrolling this recurrence gives \phi_{t+1} = (1-\alpha)^{t+1} \phi_0 + \alpha\sum_{i=0}^{t+1}(1-\alpha)^i\theta_{t+1-i,k} Line 72: … operations that amortized … Proposition 1: This proposition has a few issues. The formula for \hat{\alpha}^* has not been explained anywhere (nor in the appendix). In particular, is \nabla L(O_{t,k}) the full gradient or a stochastic gradient (there is no mini-batch notation). What is the motivation behind using O_{t,k} - \hat{A}^{-1} \nabla L(O_{t,k}) in place of O^*? Also you should explicitly define that \hat{A} is this diagonal of an approximate empirical Fisher matrix mentioned on line 84. Finally, it seems this entire Section 2.1 could be removed since on lines 91 – 94 you explain that this approximation does not offer any significant gain to using a fixed alpha. Theory in Section 3.1. The model problem is very restrictive. Indeed, having a diagonal quadratic is equivalent to minimizing several decoupled 1-dimensional second order polynomials. Thus Proposition 2 essentially shows how the lookahead affects the asymptotic fixed point of the variance on 1-dimensional 2nd order polynomials. It is often misleading to draw conclusions of how a method behaves by analyzing it in 1-dimension (e.g. efficient methods for solving 1-dimensional optimization problems are very different as compared to the efficient methods for solving n-dimensional problems). If instead the authors had analyzed a convex quadratic (not necessarily diagonal), this would have been much more informative. Line 129: Why measure the rate of contraction r:= 1.0 - ||O_t ||/ ||O_{t-1} || in the plots? If r_t → 0 this does not show that the methods converges. Why not report the decrease in the expected quadratic loss function instead? Line 153: The description of the SWA “… (SWA), which averages the weights of different neural network obtained during training” seems to be incorrect. Indeed, as I detailed before, SWA is very much the same type of augmentation as Lookahead. There is only one neural network. Line 177: … their method requires the order of …. Section B.1: The notation for the matrix A is overloaded, making it harder to follow this section. Note that A is at the same time the matrix in the quadratic x^T A x and it is the matrix that represents the lookahead interpolation. Please introduce different symbols for these two matrices. Line 439: “we set … learning rate of {0.1, 0.2}”. This seems to be a mistake. What was the value chosen for the learning rate? [1] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry P. Vetrov, Andrew Gordon Wilson: Averaging Weights Leads to Wider Optima and Better Generalization. UAI 2018: 876-885