Paper ID: | 8546 |
---|---|

Title: | The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares |

Originality: I'm not an expert on this subfield, but as far as I know the work is original. Quality: I think this is a very nice paper. The results are interesting and clearly stated. And the work addresses an important question about the difference between theoretical learning rate schedules / averaging schemes and those used in practice. Also the work touches on the important difference between general function classes (e.g. convex) and specific interesting instances (e.g. least squares). I have a couple of major comments: 1) organisation of the paper. In my opinion the CIFAR-10 experiments should be brought forward in the paper to follow the introduction. This is because as it stands it looks like the experiments are being run to test the theory. But I believe everybody in the community already knew these experimental results and conceptually they actually motivated the development of the theory. It should be made clear in the discussion that these CIFAR-10 experiments are not an honest attempt to falsify the theory. 2) that brings me to the second point that I believe more effort could be made to experimentally test the theory. For example, I believe the authors should run the Polyak averaging scheme on CIFAR-10 with polynomially decaying step sizes. If the reason that step-decay works better than poly-decay for final iterate is related to the theoretical arguments in the paper, then we would also expect Polyak averaging to work well as it has a similar theoretical foundation to the authors' algorithm. If it doesn't then that raises more interesting questions that could be highlighted and left for future work. Clarity: the work is very clear. Significance: I believe the work to be significant. I am giving score of weak reject to give the authors an impetus to address my points, but I expect to upgrade the score provided they either argue their side or agree to change the paper. I have not checked the math but intend to do so before the next round of review. Minor comments: 1. Figure 2, the caption says the plot has a log scale, but it doesn't. 2. An inconsistent notation is used for dot product. E.g. equation 1, equation after line 147, equation 2, equation 3. 3. Please could the authors write the assumptions in latex assumption blocks so they are easily visible at a glance.

*************After author response************ Thank you for the answer. I'll keep my mark and vote for accepting this paper. **************************************************** Originality and significance: as far as I understand the paper, the tools used to analyze the last iterate of SGD are not new, yet the results on least square are novel and deserve some attention. Note that there are only a few comments on the difference of behavior of the last iterate estimator for general convex problems and least-square problems in the case of polynomially decreasing step sizes. Quality and clarity : the main quality of the paper is its clarity and its accuracy. All the insights claims are sufficiently argued and developed during the theorems and the purpose of the paper is clear.

The paper studies the SDG algorithm for least square regression problems. A new analysis with step decay schedule of the learning rate is presented which theoretically shows the near optimality of the algorithm. The paper also proves that the polynomial decay schemes learning rate is sub-optimal. Overall, the paper is well-written and theoretically sound. I have not checked the proof in detail but the analysis looks right to me. Base on the theoretical novelty of the paper, I would recommend for publication.