NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5096
Title: Lookahead Optimizer: k steps forward, 1 step back

The theoretical analysis received substantial criticism during the discussions, and parts of that discussion have been reflected in the updated reviews. Overall, the theoretical contributions are weak and distract from the main paper. We suggest either strengthening the section or moving it to a less prominent position. Accordingly, we have not weighed the theoretical contributions in the evaluation. A comment from the senior area chair on the theoretical analysis is quoted below.

"The 'analysis' is very disappointing, limited, and I don't think particularly insightful. It's only for restricted types of quadratic objectives, which is very limited. And for stochastic objectives the claim is that for a fixed step size the radius of convergence is smaller (recall that with a fixed step size SGD will not converge to the optimum, but only to a radius around the optimum). But this is not really what we care about --- we care about how quickly it gets there. You can easily change this radius by changing the step size."

While the majority of the reviewers found the experimental contributions convincing, some concerns linger.

Major concerns:

1. Tuning step size for baselines: The authors provide the tuning parameters used in the experiments in the appendix, but do not state which final values were chosen for the results reported in Section 5. In particular, the extensiveness of the grid search for the SGD step size is crucial here. Please include experiments addressing the following:
(a) The grid search should include values significantly larger, as well as significantly smaller, than the optimal parameter chosen for the final numbers.
(b) At least for the smaller datasets, a finer grid search should be used.

2. Comparison to SWA: Even though the original paper proposes SWA as a fine-tuning method (applied over the last few epochs), Reviewer #5 clearly laid out in the initial review the similarities to the current method and how SWA can be used throughout training rather than only for fine-tuning on the last few epochs. The authors seem to have missed this point in the response and continue to treat SWA as a fine-tuning tool. We recommend revisiting the algorithm and adding an empirical comparison to SGD+SWA where SWA is used *throughout* training; this should look like skipped Polyak averaging, in which every k-th iterate is averaged. See Reviewer #5's updated comment.

3. Test accuracies across epochs: Reviewer #3 rightly points out that Figures 5-7 show faster convergence only in the training loss. Although Tables 2-3 show improvements in test scores, it would be quite useful to plot test accuracies across epochs (analogues of Figures 5-7 with test accuracies).

Minor concerns:

1. Figure 3: Although this figure is not the main contribution, it is not parseable. Please provide appropriate details on what the "range of values of alpha" is and how those values correspond to the plot. What do the dark blue versus light blue lines represent?

2. Figure 2: What precisely is plotted on the y-axis? If it is the average training cross-entropy loss, it is surprising that the loss is so low at initialization (or even after one epoch).

Also, the title of Figure 5 is wrong. Please fix this and the other typos in the paper carefully.
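For concreteness, one reading of the requested SGD+SWA baseline is sketched below. This is our illustration of "skipped Polyak averaging," not the authors' method: a plain SGD loop that also maintains a running average of every k-th iterate throughout training. The function name, hyperparameters, and toy quadratic objective are assumptions made for the example.

```python
import numpy as np

def sgd_with_skipped_averaging(grad_fn, w0, lr=0.1, k=5, n_steps=100):
    """SGD that also maintains a running average of every k-th iterate
    (skipped Polyak averaging), i.e. SWA applied throughout training
    rather than only as a fine-tuning phase on the last few epochs."""
    w = np.asarray(w0, dtype=float).copy()
    w_avg = w.copy()   # averaged iterate, seeded with the initial point
    n_avg = 1          # number of iterates folded into the average
    for t in range(1, n_steps + 1):
        w -= lr * grad_fn(w)              # ordinary SGD step
        if t % k == 0:                    # average only every k-th iterate
            n_avg += 1
            w_avg += (w - w_avg) / n_avg  # incremental running mean
    return w, w_avg

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w_last, w_swa = sgd_with_skipped_averaging(lambda w: w, w0=[2.0, -1.0])
```

On this deterministic toy problem the last iterate converges toward the optimum while the averaged iterate lags behind it (since it folds in early points); the interesting comparison in the stochastic setting is whether the averaged iterate generalizes better, which is what the requested experiment would measure.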