NeurIPS 2020

Parabolic Approximation Line Search for DNNs


Meta Review

There was a ton of discussion about this paper between reviewers and area chairs, multiple reviewers improved their view of the paper based on the author response, and I read through the paper in detail myself. I was still conflicted after reading it, but I am leaning towards recommending acceptance. However, I *implore* the authors to carefully consider the issues brought up by R2 and R3 as well as the issues that I bring up below. I believe that every single one of these issues can be fixed, and I will be extremely disappointed if they are not addressed in the final version (it would make me regret recommending acceptance, and probably make me harsher on empirical papers in the future, especially by the same authors).

Some specific comments from my read-through of the paper:

- I agree with the authors that optimization is a mix of theory and empirical work, and it is completely ok for works to be purely empirical if the experiments are done well. I appreciate that the reviewers were flexible on this issue.

- This step-size is well-known in optimization. For example, it is used to propose trial step-sizes as part of the famous line search of Moré and Thuente (which I believe is used as the line search in standard line-search L-BFGS implementations, and is explicitly discussed in the line-search chapter of the book by Nocedal and Wright). Those works do not suggest blindly accepting the step-size, and they use it in the context of a deterministic line search, but this prior related work should be cited where the method is discussed. (The first sketch at the end of this review shows the interpolation step I have in mind.)

- In my opinion the theoretical results are uninteresting/irrelevant, and including them in the main paper arguably borders on being misleading:
  -- Under the assumptions of Lemma 1, the algorithm is equivalent to gradient descent with exact line search on a quadratic function (or to CG if you use the standard non-linear CG variants). You could simply state this relationship and then cite a source giving convergence in this well-known setting (in fact, we also know convergence *rates* in this setting). See the second sketch at the end of this review for the sense in which I mean "equivalent".
  -- After Lemma 1 it says that the method has "no convergence guarantee" with noise. This is misleading: if you add noise to Lemma 1 (even with bounded variance), then the given algorithm *does not converge*. Please fix this so we don't need to correct people over the next 'x' number of years, the way we now correct people regarding the convergence of Adam.
  -- Similarly, even for noise-free objectives I suspect that this method does not converge in general. You really need to emphasize that Lemma 1 is quite a special case because of this equivalence.
  -- The assumptions in Proposition 2 are possibly the strongest that I have ever seen for a finite-sum problem (I would argue that they are even stronger than the "interpolation" assumptions in recent related work), and as pointed out by one of the reviewers they make no sense in a machine learning context. I still think it can be worthwhile to include this result, but please add a paragraph saying that these assumptions are unrealistic and would not hold in machine learning applications. You should also consider moving this result to the appendix as suggested by one of the reviewers, since naive readers might be misled.

- I'm ok with including the word "conjugate" in Section 4.3. However, please take the reviewer's suggestion that it should be described as conjugate-like. (It would only be conjugate if you had a quadratic and no noise, or if you did an exact line search with no noise.)
- Relatedly, you may also want to consider discussing the Polak-Ribière+ (PR+) method and other approaches that guarantee the non-linear CG methods actually generate a descent direction (I assume you needed to do something like this in the experiments, which is why beta=0 ends up showing up; see the third sketch at the end of this review).

- I tend to disagree with the reviewer who brings up the issue that the function may not be parabolic when the momentum direction is included. It would be good to add empirical evidence of this, but I see no reason to believe that it wouldn't be locally parabolic in such directions.

- As mentioned by one of the reviewers, the paper/experiments need to tease apart whether the gain comes from the step-size or from the non-linear CG selection of beta. I suspect that both of these help, and that they might work better together, but an ablation study where you test each of them on its own should be included (slightly annoying because you need to choose the step-size to test the non-linear CG method on its own, but I'm sure you can come up with something reasonable).

- I was surprised that when SLS was discussed, "over-parameterization" (or "interpolation", or whatever you want to call it) was not discussed at all. SLS was *only* shown to work in this over-parameterized setting, and indeed it *does not* work in under-parameterized settings. This is significant in several ways:
  -- If SLS is viewed as the main method to compare to, then it should be stated *whether each model is over-parameterized for each dataset*. If there are cases where this is not true, then SLS does not seem to be the right method to compare to.
  -- It might be the case that the *reason the proposed method works* is also related to over-parameterization. For example, if we added noise to Lemma 1 but assumed over-parameterization, then I expect you could prove that the proposed method does converge (essentially by using the arguments in the SLS paper). I think you should either (a) try to prove this result or (b) do experiments on a very under-parameterized problem (as in the SLS paper) to test this hypothesis about why the method works at all. The last sketch at the end of this review spells out the interpolation condition and the SLS step I have in mind.
  -- I would like to see comparisons against other ways to set the step-size that do not rely on the over-parameterization assumption. I am giving the authors the benefit of the doubt here, as the empirical work does seem to be carefully done, but in the camera-ready version I expect to see *at least 3* alternate ways to set the step-size among the experiments (some suggestions include coin betting, "no more pesky learning rates", and maybe one of those "do gradient descent on the learning rate" papers like Baydin et al., which often come out but don't work too well when the experiments aren't being run by the author of the paper).
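For concreteness, here are the sketches referred to above. All of them are my own minimal illustrations, in my own notation, and are not meant to reproduce the authors' implementation. The first is the one-dimensional parabolic interpolation step that classical line searches use to propose trial step-sizes:

    def parabolic_trial_step(f0, df0, mu, f_mu):
        """Minimizer of the parabola through (0, f0) with slope df0 at 0 and
        value f_mu at the measuring step mu; this is the standard interpolation
        step used to propose trial step-sizes in classical line searches such
        as More-Thuente."""
        curvature = 2.0 * (f_mu - f0 - df0 * mu) / mu**2  # second derivative of the fit
        if curvature <= 0.0:
            return None  # fitted parabola is concave/flat: no interior minimizer
        return -df0 / curvature  # vertex of the fitted parabola

    # Example: loss 1.0, directional derivative -2.0, and loss 0.85 at step 0.1
    # give a proposed step of about 0.2.
    print(parabolic_trial_step(1.0, -2.0, 0.1, 0.85))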
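The second sketch is the sense in which I mean that Lemma 1 reduces to a well-known setting: on a noise-free quadratic the parabolic fit along the negative gradient is exact, so accepting its vertex is exactly steepest descent with an exact line search (toy problem and names are mine):

    import numpy as np

    # Quadratic f(z) = 0.5 z^T A z - b^T z with A positive definite.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    x = np.array([0.5, 0.5])
    f = lambda z: 0.5 * z @ A @ z - b @ z

    g = A @ x - b                                   # gradient at x
    alpha_exact = (g @ g) / (g @ (A @ g))           # exact line-search step along -g

    # Parabolic fit along -g from f(x), the directional derivative, and one
    # extra function value at a measuring step mu.
    mu = 0.1
    df0 = -(g @ g)
    curvature = 2.0 * (f(x - mu * g) - f(x) - df0 * mu) / mu**2
    alpha_parabola = -df0 / curvature

    print(np.isclose(alpha_exact, alpha_parabola))  # True: the fit is exact here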
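The third sketch is the standard Polak-Ribière+ safeguard for the conjugate-like direction: the coefficient is clipped at zero, which resets the direction towards steepest descent and (together with a suitable line search) is the usual way to keep non-linear CG directions descent directions. I suspect something like this is why beta=0 shows up in the experiments:

    def pr_plus_direction(g_new, g_old, d_old):
        """Polak-Ribiere+ update for non-linear CG: the PR coefficient is
        clipped at zero, so a negative coefficient resets the search direction
        to -g_new (steepest descent)."""
        beta = max(0.0, float(g_new @ (g_new - g_old)) / float(g_old @ g_old))
        return -g_new + beta * d_old, beta

    # The first iteration just uses d = -g; later iterations apply this update.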
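The last sketch concerns SLS and over-parameterization. By "interpolation" I mean that a single parameter vector (nearly) minimizes every individual training loss, so every mini-batch gradient (nearly) vanishes at the solution; the SLS guarantees are stated in that regime. The step SLS takes is, roughly, a stochastic Armijo backtracking step along the mini-batch gradient, as in the simplification below (I am omitting the step-size reset heuristic and other details of the actual method):

    def stochastic_armijo_step(f_batch, grad_batch, x, alpha0=1.0, c=0.1,
                               shrink=0.9, max_backtracks=50):
        """One SLS-style step: backtrack on the *same* mini-batch until the
        stochastic Armijo (sufficient decrease) condition holds."""
        g = grad_batch(x)                # mini-batch gradient at x
        gnorm2 = float(g @ g)
        f0 = f_batch(x)                  # mini-batch loss at x
        alpha = alpha0
        for _ in range(max_backtracks):
            if f_batch(x - alpha * g) <= f0 - c * alpha * gnorm2:
                break                    # sufficient decrease on this mini-batch
            alpha *= shrink              # otherwise shrink the step-size
        return x - alpha * g, alpha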