Reviews: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

UPDATE: I've read the other reviews and the rebuttal. I am keeping my score - this is a good paper. --------------------------------------------------- Originality. The study of Stochastic Gradient Descent in overparametrized setting is a popular and important trend in a recent development of huge-scale optimization for deep-learning. The authors propose a very basic and classical method, consisting from the well-known algorithmical blocks (SGD + Armijo-type line search) together with its first theoretical justification under "interpolation assumption". The proof of convergence (for example, Theorem 2) mainly consists from the standard arguments (which are used for the proof of the classical non-stochastic Gradient Method under Lipschitz-continuous gradients). Usually, if we try to repeat this analysis with SGD, the main struggle is how to take the expectation. Under the interpolation assumption, the authors are able to take this expectation easily and therefore propagate non-stochastic GD rates to the stochastic case. Despite the fact, that presented analysis is simple, up to my knowledge, these theoretical results and corresponding assumption are very novel. The main concern about this would be to make this trick more transparent, probably, to provide some insight about the proof in the main paper. It would make the presentation more attractive, if to declare additionally the limitations of this proof, with a discussion of possible relaxation of the assumption. From now it is also not very clear, how restrictive is it. In my opinion, more examples when interpolation assumption is satisfied, stated precisely and in the paper would help. Quality. The submission contains a number of proven theorems with results about convergence of the proposed method. The paper is self-contained. Numerical experiments are supported with the code. One more concern would be a lack of comparison of the results with that one, given in works [5] and [77] (which are only briefly mentioned in the paper). Clarity. The paper is written in a well-mannered way, and it is easy to follow. It informs the reader adequately about the field and gives a comprehensive reference. It would be interesting to see more examples and possible generalization of the presented approach. For clarity of the text, I would also prefer to have the assumption ("interpolation" and "SGC") stated in a kind of separated form, not between the main text. Significance. Provided fast convergence rates appears to be the first strong theoretical result about adaptive SGD (to the best of my knowledge), making the results of this work significant.

EDIT: The author feedback contained rigorous corrections to proofs that were incorrect, as well as experiments showing that their method does not lead to more function evaluations. These were exactly my two biggest issues, and so I have changed my score from a 4 to a 7. The previous score of 4 was not due to the significance, impact or novelty, but just because I highly value rigor. With the corrections, I believe this is a very solid paper. -------------------------- The paper's strongest contribution is its novelty. The authors do a good job of taing a classical method (the Armijo line search), and deriving a version that is not just applicable, but useful in modern day machine learning. Moreover, the authors do a good job of motivating the use of this method in contrast to popular methods with difficult hyperparameter tuning, such as Adam. The authors do a great job in citing relevant work, and telling a comprehensive story about the Armijo line search, up to their version today, as well as tying it in to modern research themes such as overparameterization and variance reduction. The quality of the paper unfortunately varies between different parts. The paper is generally well-written, well-motivated, and very clear in its presentation of its theoretical and experimental contributions. Unfortunately, some of the proofs are either wrong or are not rigorous. Theorem 1 contains multiple steps that are flawed. In particular, at one point the authors seem to say that the expected value of a minimum is equal to the minimum of expected values. Even simple settings of the \mu_i, L_i, and \eta_max can be used to disprove some of the equations in this proof where there are interactions between mins, maxes, and expected values. Even assuming these issues are fixed, the authors then seem to use (without proof) the notion that if f_1, ..., f_n are each \mu_i-strongly convex and L_i-smooth, and the average f of these functions is \mu-strongly convex and L-smooth, then the average of \mu_i/L_i is less than or equal to \mu/L. It is not clear to me that this is true, and even if it were, there would need to be a proof of this fact. As it stands, the proof of Theorem 1 is incorrect, and may even need the statement of the theorem to be reworked (to be more similar to Theorem 4) in order to hold. Issues with taking the expected value of a minimum also appear in E.2.2, though it seems the authors take some steps there to better ensure that the equations hold. However, this section could use more clarity on this, due to the similar issues above. Theorem 3 is more problematic in general. In the proof, the authors take 1st order Taylor expansions, and then simply remove the 2nd order error terms by stating that the step-size is sufficiently small. While they acknowledge that they do this, this does not change the fact that this still isn't a rigorous proof. The authors never characterize how small the step-size must be in order to discount such terms. Moreover, if the step-size is too small, then it will be impossible to derive fast convergence rates. Thus, as stated, Theorem 3 is simply not proven. While I believe it could be proven by using the intuitive idea of making the 2nd order error terms sufficiently small, the authors do not do this. They regularly state that a function equals its 1st order Taylor expansion, which just isn't true. Given that the authors list Theorem 3 as one of their main contributions, this greatly detracts from the quality of the paper. In general, the proofs in the appendix could also be better written. In addition to omitting certain details, the overall format of the proofs is a little inconsistent, and could use some more careful explanation in many parts (eg. using more equations, instead of simply saying things such as "subtracting the above terms" as in the proof of Theorem 3). For example, the writing and explanation in the proof of Theorem 7 is much better and clearer than the writing and explanation in the proofs of Theorems 1-3. The experiments are solid, and encompass a variety of different machine learning tasks, which is good. The authors also do a good job of stating how each method compared is initialized, as well as how they are evaluated. That being said, there are issues that could be improved upon. First, given the very noisy nature of the plots in Figure 3, it would be hugely beneficial to include error bar plots, especially as the authors plot the average of 5 independent runs. Second, the line search method requires more function evaluations, something that the authors discuss at length. However, this may increase the amount of compute time required per iteration. While the authors state that they verify that the number of function evaluations is at most twice the number of gradient evaluations, it would be good to see this represented in a plot. For example, one simple thing would be to plot the error of various training methods with respect to compute time. This would give the reader a clearer picture of the actual runtime considerations involved in the authors' method. The main body of the paper is very clear. The authors do a good job of organizing their work according to different theoretical and practical concerns, and do a good job of educating their readers throughout. Moreover, the authors deserve commendation for being clear about how the methods are analyzed in theory versus how they are implemented in practice. The experiments are all well-detailed and clear, as is the supplied code. The work has mixed significance. Based on the abstract alone, the work is very significant, both to theoreticians and practitioners. The work advances the idea of interpolation as a useful condition for various optimization problems, and devises and easily implementable, widely useful method for training machine learning models. However, the significance of the work's theory is marred by the incorrect and non-rigorous proofs. Moreover, the practical significance could be improved by further demonstrating that the line search methods do not increase the run time needed to reach a certain accuracy.

Paper ID:	2038
Title:	Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

Reviewer 1

Reviewer 2

Reviewer 3