NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
This paper studies how to test the stationarity of stochastic gradient with momentum using more advanced test statistics that take time correlations into account. Extensive experiments are run to demonstrate the advantage of the proposed method over existing approaches.

Originality: The paper extends a recent paper by Yaida. It does not seem that original to me, but the authors do combine Yaida's condition with more advanced test statistics in a new way. Overall I think the extension is quite natural, so the conceptual novelty is not that high. To be fair, though, the paper still presents some solid new results.

Quality: I like the extensive experiments in this paper; their quality seems reasonably good. On the theoretical side, I like the paper's intuition that time correlations have to be taken into account (a small numerical illustration of this point follows the review). However, there seems to be a gap between the assumptions in the theory and practice. The authors assume the gradient estimate is Markov in theory, but in the simulations they run many epochs. If I understand correctly, the data are reused across epochs, so there are clearly time correlations between the reused data at different steps, and the Markov assumption seems too strong. If each data point were used only once and the gradient were sampled in an online manner, the Markov assumption would be reasonable. But the authors are clearly not considering that case, since the online setting where data is used only once does not make sense in the overparameterized regime. Overall I don't think the proposed method has a solid theoretical foundation, although intuitively the Markov assumption is better than the i.i.d. assumption. The Markov assumption is not precise when data is reused over multiple epochs. Another possibility is to fix the data set and consider only the randomness in sampling; in that case the Markov assumption seems fine, but how does one prove time homogeneity?

Clarity: The paper is well written and quite clear.

Significance: The proposed test statistics, or some variants, may be useful for practical problems, although many other heuristics can also be used to determine whether the SGD iterates have reached a plateau. In addition, the authors' approach requires setting another hyperparameter, delta, which may introduce extra difficulty in applying the approach to practical problems. Overall I think the significance is at a moderate level.

AFTER REBUTTAL: Thanks for the further explanations. The authors have addressed all concerns. I have increased my score.
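To make the reviewer's correlation point concrete, here is a minimal numerical sketch (my own illustration, not from the paper) of how a naive i.i.d. standard error understates the uncertainty of a mean computed over an autocorrelated sequence, compared with a batch-means estimator of the kind used to account for time correlations. The AR(1) stand-in and both helper functions are assumptions made for illustration only.

```python
import numpy as np

def iid_stderr(z):
    # Naive standard error that treats the samples as i.i.d.
    return np.std(z, ddof=1) / np.sqrt(len(z))

def batch_means_stderr(z, n_batches=20):
    # Batch-means standard error: averaging within contiguous batches first
    # lets slowly decaying time correlations inflate the estimate, as they should.
    b = len(z) // n_batches
    means = np.array([z[i * b:(i + 1) * b].mean() for i in range(n_batches)])
    return np.std(means, ddof=1) / np.sqrt(n_batches)

rng = np.random.default_rng(0)
z = np.empty(10_000)
z[0] = rng.normal()
for t in range(1, len(z)):
    # AR(1) noise: a stand-in for the correlated per-iteration statistics of SGM.
    z[t] = 0.9 * z[t - 1] + rng.normal()

print(iid_stderr(z))          # far too small: ignores the correlation
print(batch_means_stderr(z))  # wider, more honest error bar
```

With strong positive correlation (coefficient 0.9 here), the naive error bar is several times too small, which is exactly why a test built on it would fire prematurely.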
Reviewer 2
The paper improves upon the main work of Yaida et al. by introducing a much more rigorous test for whether the chosen statistics indeed converge to zero. I think the authors' introduction of the t-test, and especially of the improved estimators that take the temporal Markov-chain correlation into account, is indeed great. The main text is very easy to follow and quite clear, with a very clear idea and aim which the reader can follow.

Several points of discussion:

1. A more important point, though, is that the authors should have plotted results with the learning rate adaptation of equation 9 for the comparison. From the variance tests and Fig. 4 one can conclude that the newer test is better, but I think it is also important to measure how much improvement in actual optimization this translates to (a schematic of such a test-and-cut loop follows this review).

2. There is an empirical analysis of the choice of the decay constant (Fig. 3); it would be interesting to see similar plots for the confidence parameter (gamma in the paper) used in the t-test interval construction.

3. In the comparison with ADAM it is claimed that ADAM is "hand tuned". Given that SGM has a decreasing learning rate schedule, there is no reason not to have run ADAM with such a cut-and-decay schedule as well and to optimize it too.

4. Any discussion of the higher-order comparison rules from Yaida et al. would be beneficial.

5. It would be quite interesting to see whether ADAM itself can benefit from SASA, and how the statistics of interest behave under ADAM in the first place.

Figure 4 is really confusing: there is no legend or explanation in the caption of what the different curves represent. I was expecting two lines for the two tests in equations 9 and 10, but there are clearly five (and one panel has a single red curve).
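On Reviewer 2's first point, here is a self-contained, hypothetical sketch (my own construction, not the authors' algorithm) of a test-and-cut loop on a noisy quadratic: it combines a zero-mean stationarity statistic, a batch-means confidence interval, and a multiplicative learning-rate drop. The statistic's constant and the delta-tolerance rule are my assumptions; the paper's equations 9 and 10 may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: f(x) = 0.5 * ||x||^2 with additive gradient noise, so momentum
# SGD (SGM) has a stationary distribution at any fixed step size.
def noisy_grad(x):
    return x + rng.normal(size=x.shape)

alpha, beta = 0.1, 0.9        # step size and momentum
zeta, delta = 0.1, 0.02       # multiplicative drop factor and test tolerance
x = np.full(10, 5.0)
d = np.zeros_like(x)

for epoch in range(20):
    zs, vs = [], []
    for _ in range(1000):     # one "epoch" of SGM steps
        d = beta * d + noisy_grad(x)
        # Statistic computed at (x_k, d_k), i.e. before the x update. For this
        # update rule, E[<x, d>] = (alpha / 2) * E[||d||^2] holds exactly at
        # stationarity (my derivation; the paper's statistic may differ in its
        # beta-dependent constant), so z has mean zero there.
        zs.append(x @ d - 0.5 * alpha * (d @ d))
        vs.append(d @ d)
        x = x - alpha * d
    # Correlation-aware half-width via batch means: 20 batches of 50 steps.
    bm = np.asarray(zs).reshape(20, 50).mean(axis=1)
    half = 1.96 * bm.std(ddof=1) / np.sqrt(len(bm))
    # One plausible reading of the delta-tolerance test: cut the rate once the
    # confidence interval for mean(z) lies within delta * mean(||d||^2) of zero.
    if abs(np.mean(zs)) + half < delta * np.mean(vs):
        alpha *= zeta
    print(epoch, round(alpha, 4), round(0.5 * float(x @ x), 3))
```

Running a loop of this shape end to end, against a fixed schedule, is the kind of experiment that would show how much the sharper test helps actual optimization, which is what point 1 asks for.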
Reviewer 3
The paper proposes SASA, a method to drop the learning rate by a constant multiplicative factor when a certain criterion is met. The criterion is a statistical test, indicating stationarity of the Markov chain of the optimization parameters, which is performed once per epoch and computed on easy-to-evaluate statistics of momentum SGD. The method is evaluated on three real-world tasks and shows performance comparable to hand-tuned competitors.

I. Originality

Statistical tests for optimization are not new; in that regard the originality of the paper is low. However, I find the approach of testing for stationarity interesting and useful.

II. Clarity

The clarity of the paper is high. Indeed, while it proposes a novel method, it contains parts similar to a technical report, which is refreshing to read. The authors also anticipate potential questions or concerns the reader might have and try to answer them in the paper already. Here are my concerns on clarity.

i) The authors describe a "statistical test"; however, I am not sure the exact test is defined anywhere in a concise way. For example, what is the null hypothesis? (Is it "no stationarity"?)

ii) Furthermore, it is still not entirely clear to me what is at stake if the test triggers wrongly, or fails to trigger although it should. As I understand it, the same type of test is performed at each epoch, which makes it somewhat likely that both scenarios eventually happen. I could imagine, for instance, the test not triggering and the optimizer slowly diverging away from a stationary distribution again. On the other hand, non-stationarity seems to be the null hypothesis, and not rejecting it should be the safe state of the test.

iii) In my opinion it would be beneficial to spend a bit more time on the interpretation of Yaida's condition, as this is the condition underlying the proposed test. Looking closely, Yaida's condition seems to say that the ratio of the squared norm of the search direction d_k to the inner product of x_k and its gradient g_k is constant when SGM is in a stationary distribution. This does not seem trivial at all to me, and I wonder whether the authors have some intuition about why this relation holds and what it means geometrically (one possible derivation is sketched after this review).

iv) Additionally, from what I gather, the condition holds for "general functions F"; first, it is unclear to me what that means, and second, by assuming that a stationary distribution exists, one might implicitly also assume some smoothness of F that makes stationarity possible. The condition otherwise seems very specific.

v) On a practical note, it is unclear to me what the lowest number of samples N is that needs to be acquired to make the method work reliably. Especially for smaller datasets this might be an issue, both because statistics are lacking after one epoch and because the Markov chain has not mixed enough.

vi) On another practical note, I am unsure how applicable the proposed method is. For instance, is the method still applicable when the gradient g_k is a biased estimator of the true gradient? Biased estimators seem to have become popular lately, and the significance would increase or decrease depending on this applicability.

[Post-Rebuttal: Thank you for your rebuttal. I increased my score, as I believe this is an interesting paper.]
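On point iii, here is one possible intuition, hedged and in my own notation (the paper's exact statistic may use the gradient g_k and beta-dependent constants instead): write the SGM update as x_{k+1} = x_k - alpha * d_k. Under the stationary distribution pi, the expected squared norm of the iterate does not change from step to step, and expanding the square turns that invariance into a fixed ratio between the two quantities the reviewer names.

```latex
% Hedged sketch, assuming the update x_{k+1} = x_k - \alpha d_k; the paper's
% condition may involve \langle x, g \rangle and \beta-dependent constants.
\mathbb{E}_{\pi}\,\|x_{k+1}\|^2 = \mathbb{E}_{\pi}\,\|x_k\|^2
\;\Longrightarrow\;
\mathbb{E}_{\pi}\!\left[-2\alpha\,\langle x_k, d_k\rangle + \alpha^2\|d_k\|^2\right] = 0
\;\Longrightarrow\;
\frac{\mathbb{E}_{\pi}\,\langle x_k, d_k\rangle}{\mathbb{E}_{\pi}\,\|d_k\|^2} = \frac{\alpha}{2}.
```

Geometrically, the inner product measures how much each step shrinks the squared distance to the origin, while the squared-norm term measures the expansion injected by the step itself; stationarity is precisely the regime where the two balance on average. This reading also yields a concrete zero-mean statistic that a t-test can target, which speaks to the null-hypothesis question in point i), though whether the paper frames its null this way is not something this sketch can settle.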