Paper ID: 328

Title: Towards closing the gap between the theory and practice of SVRG

==============================================================

After authors' response: the correction in Eq. (5) makes sense to me. I am not very familiar with the mini-batch analysis in SVRG. After reading the response, it seems to me that although the analysis for "arbitrary sampling" may not be that innovative, the analysis of mini-batching is new and different from previous works. From this perspective, I do agree the paper has its unique contribution.

==============================================================

The paper is written in a clean way and is easy to read; I enjoyed reading it a lot. My major concerns are as stated in the contribution section:

1. The results and techniques are not especially innovative: similar analyses for other sampling schemes or for non-averaging inner loops are done in [1, 10] and many other papers. Although those works are in different settings, the ideas and proofs share a lot of similarity. In this sense, I do not view the theoretical contribution of this paper as significant.

2. Since this paper is targeted toward closing the gap between theory and practice, the slight weakness of the theoretical results would not be a problem if the paper performed practical experiments showing the advantages of the proposed changes. However, the experiments are only on UCI datasets with strongly convex loss functions, and some observations in the experiments are not explained clearly. For instance, why would standard SVRG (blue curve) get stuck in a bad position for such a long time? It seems to me this is simply because the choice of m is too large, so that standard SVRG uses a very bad gradient estimate for a very long time. This should not be an issue in practice, however, if practitioners tune the inner-loop size instead of blindly following the conservative suggestion of the theory.

Finally, I find the suggestion on the optimal mini-batch size b (Equation 5) a bit confusing...
In fact, when m = n and n << L/mu, choosing b = 1 gives complexity n + L_max/mu, while choosing b = n gives complexity nL/mu. I wonder why the optimal choice of b is n in this case, especially when L_max = L. Do the authors implicitly consider the setting L_max >> L?
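Spelled out, the comparison above (using the total-complexity expressions quoted in the review, each up to a common $\log(1/\epsilon)$ factor) reads:

```latex
% Complexities quoted in the review for inner-loop length m = n:
%   b = 1:  n + L_max / mu
%   b = n:  n L / mu
\[
\underbrace{\,n + \frac{L_{\max}}{\mu}\,}_{b = 1}
\qquad \text{vs.} \qquad
\underbrace{\,\frac{nL}{\mu}\,}_{b = n}.
\]
% If L_max = L and n << L/mu, then
%   n + L_max/mu  ~  L/mu  <<  n L/mu,
% so b = 1 wins by a factor of roughly n. For b = n to be preferable,
% one needs n L/mu < n + L_max/mu, i.e. roughly L_max >~ n L,
% which is why the question about the implicit L_max >> L regime arises.
```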

########################################################

After reading the authors' response, I retract my concern about knowledge of the strong convexity parameter. Regarding Figure 3, I am inclined to say that the wall-clock plot does not add anything to the discussion. Because b = 1 is so slow, you cannot see anything that is happening with the other lines, and in some sense this seems like an argument for why we should just ignore your paper and use any mini-batch size we want, as long as it is larger than 1. I would suggest either 1) removing the b = 1 line from the plot so we can see what is happening with everything else, or 2) removing the plot entirely.

########################################################

This is a nicely written paper, which extends the existing theory of SVRG to cover parameter choices that are commonly made in practice. The proposed algorithms are not a dramatic improvement over SVRG, and they do not claim to be. However, they are somewhat simpler aesthetically, they decouple some of the hyperparameter choices to be more flexible, and the theory applies to any parameter choices. I see this paper mostly as reassurance for people who use SVRG that the simplifying choices they make (e.g. choosing m = n, sampling without replacement, etc.) do not significantly hurt the theory behind the algorithm. I am not convinced that the ability to optimize b and/or m according to the theory would be terribly useful in practice, but it does give some general guidance on what to do, e.g. when L_max >> L versus when L_max ≈ L.
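The parameter choices discussed above (inner-loop length m = n, mini-batch size b, sampling without replacement) can be made concrete with a minimal mini-batch SVRG sketch. This is an illustrative assumption on my part, not the paper's exact algorithm; the helper names and the toy least-squares problem are invented for the example:

```python
import numpy as np

def svrg(grad_full, grad_batch, w0, n, m, b, step, epochs, rng):
    """Minimal mini-batch SVRG sketch (illustrative, not the paper's exact variant).

    m -- inner-loop length (practitioners commonly set m = n)
    b -- mini-batch size (b = 1 recovers the classic single-sample inner loop)
    """
    w = w0.copy()
    for _ in range(epochs):
        w_ref = w.copy()
        g_ref = grad_full(w_ref)  # full gradient at the snapshot point
        for _ in range(m):
            idx = rng.choice(n, size=b, replace=False)  # without replacement
            # variance-reduced gradient estimate
            g = grad_batch(w, idx) - grad_batch(w_ref, idx) + g_ref
            w -= step * g
    return w

# Toy least-squares problem, assumed purely for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
y = A @ np.ones(5)
grad_full = lambda w: A.T @ (A @ w - y) / 50
grad_batch = lambda w, idx: A[idx].T @ (A[idx] @ w - y[idx]) / len(idx)

w = svrg(grad_full, grad_batch, np.zeros(5), n=50, m=50, b=5,
         step=0.005, epochs=200, rng=rng)
```

Larger b makes each inner step more expensive but lowers the variance of the gradient estimate, which is exactly the trade-off behind optimizing b in the theory.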

### post-rebuttal ###

The authors did not indicate whether they will allude to the related work mentioned. The papers mentioned are indeed similar (granted, this paper deals with finite sums whereas the references consider the streaming variant of the question). As a final note, the second reference does consider the precise question of what the best batch size is on every instance of a streaming least-squares problem for mini-batch SGD, something similar to the issues considered in this paper in the context of SVRG, aside from also considering parallel/distributed SGD methods.

#################

I like this paper's motivation, message, and results. Aside from the comments below, I request that the authors go over typos/bugs (if any). The paper misses some references that have considered similar issues:

[1] The work on Streaming SVRG by Frostig et al. indicates that the inner loop of SVRG needs to be run for a condition number of steps, where the notion of condition number is similar to the one described in this paper.

[2] Jain et al. (2018), "Parallelizing stochastic gradient for least squares regression: mini-batching, averaging and model misspecification", JMLR. This paper studies mini-batching in the case of least squares regression and SGD. It has similar notions of expected smoothness, the best batch size, etc., specialized to the case of least squares, which I think are discussed in a very thorough manner.