NeurIPS 2019
Sun, Dec 8th through Sat, Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 107
Title: First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Reviewer 1

Clarity: The paper is generally clear and concise despite the substantial technicalities (but see below). The organization is sensible: the more technical proofs are left to the appendix, and the overarching proof strategy is presented in a separate section.

Originality: This line of research is very recent, and this paper is a natural and timely continuation of [1], which empirically shows that gradient noise in SGD is heavy-tailed.

Quality and significance: I believe the theoretical results are by themselves deep enough to judge this work as high quality. This is complemented by the empirical validation, which is also insightful. My only concern is that this work might be taken as an occasion for theory for its own sake, losing its connection with the broader machine learning community.

[1]

Reviewer 2

********* Edit of review report after discussion ************

I have gone through all the other review reports, the authors' feedback, the meta-reviewer's comments, and the discussion up to now. Regarding Reviewer #1's concern about theory for theory's sake, I tend to be open-minded, since I cannot find solid evidence that the paper is purely theoretical. Regarding Reviewer #4's comment that the paper over-claims its result, my take is as follows. First, for many problems the true local minimum does lie in a flat basin. A famous example I have in mind is the following paper: McGoff, Kevin A., et al. "The Local Edge Machine: inference of dynamic models of gene regulation." Genome Biology 17.1 (2016): 214. There, the fit to the gene-expression dynamics is expected to be flat near the true value, because natural evolution tends to select dynamics that are robust to changes in the environment or to external stimulation. Second, the authors have explained their motivation for using a Lévy process to model the noise. As long as that motivation is valid, it is hard to say that the jumps are unusual. Based on these reflections, I maintain my previous rating and am happy to vote for accepting the paper for publication.

***************************************************************

Originality: The task is new, and the methods, I believe, are standard in the analysis of stochastic differential equations. The work is a novel combination of techniques from stochastic differential equations with deep learning methodology. The difference between this work and previous work in the literature is clear: the introduction of S-alpha-S noise. To the best of my knowledge, the related work is sufficiently cited.

Quality: Due to time constraints I have not been able to go through the proofs in full. However, provided the analysis is correct, all the claims are well supported by it.
The work presented in this paper is complete. The authors are careful and honest in evaluating both the strengths and weaknesses of their work.

Clarity: The submission is well organized and clearly written. Since this is a theoretical paper, the proofs provided in the supplementary material form a sufficient resource for readers to "reproduce" the results.

Significance: The results are important. Since the analysis covers an important case of noise distribution whose existence is verified by experiments, the work presented in this paper has great potential to be influential. I believe other researchers and practitioners are likely to use the results of this paper. The submission addresses a difficult task in a better way than previous work and advances the state of the art in a demonstrable way. The work provides a unique theoretical analysis.

Reviewer 3

*************** After author response ************

(1), (2), (4), (5), and (6) were only recommendations and requests for clarification. (3) was a mistake on my part: I inferred from Theorem 1 that all results were only one-dimensional, but the authors pointed out my error. I apologize for this lack of attention.

The main concern I have is about the conclusions drawn from the result they prove. They deduce flatness near local minima from the fact that the basin of attraction of those minima is wide (from Theorem 2). I may be wrong, but I see no reason why this should hold, even under the Hölder-gradient condition (even in R, as they tried to show in the rebuttal). Moreover, the big novelty of this model is that the dynamics is no longer continuous and seems to jump regardless of the heights of the barriers. Do we observe these unusual jumps in practice? I am sorry to express my skepticism, because the paper is well referenced and nice in general; I only have doubts about whether the model describes neural networks well. For all these reasons, I keep my score of 5, which, as stated in the description, means "Marginally below the acceptance threshold. I tend to vote for rejecting this submission, but accepting it would not be that bad." If the other reviewers and yourself are convinced by the model, I would be fine with that.

*******************************************************

Originality: The paper is quite original, and the new assumption on the noise model of SGD leads to a new continuous-time limit SDE for the dynamics of SGD, under which the exit-time properties are markedly different (polynomial time versus the usual exponential time). I appreciate the effort to find a new, good model of SGD that explains what is observed in practice.

Quality and clarity: The paper reads very clearly and, despite its novel ideas and the technical background it demands of the ML community, is fairly understandable.
Moreover, the continuous-time limit is explained, and the exit-time result for the discretized counterpart is related to the continuous one, which is not always the case in this literature. However, I am not fully convinced by the paper's conclusion that, because the time to exit from an interval is polynomial in its width and does not depend on the height of the barriers, SGD tends to stay in wide basins around minima. To explain this phenomenon I would have expected a result depending on the flatness around local minima, whereas the result applies to every interval of width a (whether or not it is a neighbourhood of a local minimum). I would be glad to have more details about this in the rebuttal.

Significance: see the Contributions section.
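The question raised above — whether the heavy-tailed jumps and the width-dependent, barrier-height-independent exit times actually show up — can be probed with a toy simulation. The sketch below is illustrative only and is not taken from the paper: it runs SGD-like dynamics in a simple quadratic basin f(x) = x^2/2, driven either by Gaussian noise (alpha = 2) or by symmetric alpha-stable (S-alpha-S) noise sampled via the Chambers-Mallows-Stuck method, and records the first step at which the iterate leaves the interval [-a, a]. All function names and parameter values are my own choices, not the authors'.

```python
import numpy as np

def sas_sample(alpha, size, rng):
    # Symmetric alpha-stable draw (beta = 0) via the Chambers-Mallows-Stuck
    # method; alpha = 2 reduces to a Gaussian with standard deviation sqrt(2).
    if alpha == 2.0:
        return rng.normal(0.0, np.sqrt(2.0), size)
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def first_exit_time(alpha, a=1.0, eta=0.01, max_steps=200_000, seed=0):
    # SGD-like iteration x <- x - eta * f'(x) + eta^(1/alpha) * (SaS noise)
    # in the basin f(x) = x^2 / 2; returns the first step with |x| > a.
    rng = np.random.default_rng(seed)
    x = 0.0
    for t in range(1, max_steps + 1):
        x = x - eta * x + eta ** (1.0 / alpha) * sas_sample(alpha, 1, rng)[0]
        if abs(x) > a:
            return t
    return max_steps
```

Averaging `first_exit_time` over many seeds for, say, alpha = 1.8 versus alpha = 2.0 and several interval half-widths `a` lets one check the qualitative claim under discussion: the stable-driven dynamics tends to exit via a single large jump, so its mean exit time grows only polynomially with `a`, while in the Gaussian case the exit time blows up much faster as `a` grows relative to the noise scale.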