Review for NeurIPS paper: Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations

NeurIPS 2020

Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations

Review 1

Summary and Contributions: This paper analyzes stochastic gradient descent for smooth non-convex objectives with potentially heavy-tailed noise in the gradient oracle. A new method of quantifying tails, the “biased expectation” is introduced, that allows the others to generate high-probability bounds for quantities involving heavy tailed random variables. The analysis recovers standard results for subgaussian and possibly biased noise in the gradient, but also allows for proving convergence for certain classes of noise that have polynomially decaying tails.

Strengths: I liked the biased expectation trick, which seems like a very natural way to analyze the problem. The results provide concrete scalings for the learning rate parameter. Once the biased expectation is defined ,the results seem relatively straightforward applications of the definition. I do not think this is a disadvantage since easier proofs tend to be more generally useful in my opinion.

Weaknesses: One shortcoming I see is that the result holds only for some unknown iterate along the optimization path, rather than e.g. the last iterate. However, this is hard to show even for ordinary light-tailed noise distributions. I would also have liked to see more discussion of specific kinds of heavy tailed noise and their biased expectation values. ---- I have read the reviews and responses. I still think this is a good paper, although there are a few presentation issues. In line 413, I think the statement e^(sX)/E[e^(sX)] sums to one should be replaced with something like p(X,Y)e^(sX)/E[e^(sX)]. Also, please use a single counter for all theorems/lemmas/propositions. It makes it easier to binary search since any given page is likely to have at least one of these.

Correctness: I believe so, but I did not check all details in the appendix.

Clarity: yes

Relation to Prior Work: yes, to the best of my knowledge.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper presents a new framework (of 'biased expectations') to give an analysis of convergence of the plain SGD algorithm for non-convex but Lipschitz-gradient objective functions under various assumptions on the noise of the gradient estimator.

Strengths: The 'biased expectations' seems to be very closely related to log-moment generating functions as used extensively in applied probability. The authors provide claims of varying analysis under different noise assumptions.

Weaknesses: Post-feedback: Thanks for the detailed responses, they are satisfactory to me. --------- The analysis of convergence of SGD is a well-beaten track. To be able to get a new piece of work in this field, one needs to very clearly lay out the limitations of prior work that your work is able to remedy. Most of the SGD analysis concentrates on convergence-in-expectation, for which simple assumptions on bounds on bias in and variance of gradient estimation suffice to show convergence. For non-convex functions, convergence is in terms of either the average, weighed average or min of the norms of the gradients at each iterate. (The authors use the min of the norms.) However, I do not immediately get a sense on why this new framework and analysis is needed. Why should I care about convergence in (high) probability over in-expectation convergence? I understand that the characteristics of the bounds vary by the fatness of the tail of the gradient's noise because in-probability convergence now needs to worry about the rarer cases of slower convergence. But to get work on this accepted, there has to be a lot more effort, e.g. using simple concrete example objective functions and resultant iterations, or via extensive numerical experiments to show that in-probability convergence matters. There is a glimmer of this in Fig 3, but the message is muddied by the fact that both bounded but uncentered noise and heavy tailed distributions do worse. Clearly a bias-variance trade-off is happening here and so it is not just the tail of the noise that matters, but the theoretical results seem silent on this.

Correctness: I have not verified the proofs of the claims, but they seem to align correctly with what I would expect. One curious or worrisome factor seems to be their dependence on the optimality gap "\Delta" , which is my interpretation of for example the sentences in lines 184-185 after Thm 2. So, the rate of the convergence under your analysis of in-expectation convergence depends on how far the method is started, and can be a slow \sqrt{T} in some cases? That is new and an unusual observation.

Clarity: The english is clear, but I have a hard time reconciling this with existing literature given the lack of prior work discussion and comparison.

Relation to Prior Work: The single largest lack of this paper is in providing a concrete context w.r.t. to the vast literature at this point on analysis of SGD convergence. The authors must provide a comparison and contrast with existing literature to tell us why we should care about their new results.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper discussed the notion of biased expectation, and it's application in analysing the convergence rate of SGD. ====================== Have read the rebuttal. Raised to 7.

Strengths: This paper uses biased expectation, a parametrized statistics as a bridge, to provide a generic framework of analyzing the convergence rate of SGD under different noise, which include heavy tailed ones.

Weaknesses: I do not see major weaknesses of the paper.

Correctness: I think it's correct and the experiments are adequate.

Clarity: Yes. This paper does a very good job in the flow of formulating the theory and inducing the examples. At the beginning the notion of biased expectation is clearly defined, and the properties of it are listed clearly with explanations and compared with well known quantities, which enables reader to gain the sense of the quantity and to refer to the results easily when they read the later sections. The example section compares usual types of noise with figures, and proposes the universality of the formal proof strategy by discussing each type of noise.

Relation to Prior Work: This paper refers to previous SGD works, but they are less related to the analysis in terms of wide ranges of noise models. But it is possible that this area is unexploited.

Reproducibility: Yes

Additional Feedback: Overall the paper looks good, but I'd like to see more motivations on this topic, say, why is it important to study different noise, which type of noise do certain real world applications belong to, etc.

Review 4

Summary and Contributions: I have read the author rebuttal. As far as I can see there has been no further justification of the switch from constant stepsizes to time-varying stepsizes, which is still making the experiments hard to interpret vis-a-vis the theory. I would strongly recommend that the authors bridge this gap. The main focus of the paper is a new quantity, called the "biased expectation" which is a slight extension of the moment generating function. After showing several nice properties (e.g., decomposition for independent variables, conditional expectations, and a concentration inequality), the author switch to analysing stochastic gradient descent under various noise settings. The authors give a meta-theorem (Thm. 1) in terms of the biased expectation, and derive various bounds on the minimum squared norm of the gradient across T steps of gradient descent. The authors provide some experiments.

Strengths: Overall, I am intrigued by the simplicity and elegance of the paper and how it seems to generalise existing results. The main result, Thm. 1, in conjunction with Prop. 2 become a powerful tool with which we can get various bounds on the minimum norm gradients of stochastic gradient descent over T iterations. In particular, it allows bounds not only in expectation, but also in terms of absolute gradient magnitudes.

Weaknesses: While the "biased expectation" appears to be a powerful tool, the overall results are restricted to the gradients of the algorithm at _some_ time t in the last T iterates. While this is a common outcome of the standard analysis of SGD, it would be nice if (with some additional assumptions on f) the results could be transposed to f(x_t) or x_t within some basin of attraction. The special case of s = 0 needs much more detailed treatment. While the authors point out in the supplement that \phi is continuous at s = 0, much of the document switches between looking at s->0 or s = 0 without explanation. Assumption 1: I see that the authors need to contol ||X_t||^2 in Thm 1. (Eq. 20 in supplement) and they do with with a "variance" sigma^2. When the mean is not zero, there seems to be some overlap between this sigma^2 and the quantities being bounded in Eq. 7, 8. Is there no better way to "orthogonalise" these assumptions? The paper is a little bit disconnected from existing extensions of SGD. In particular, I would have loved to see how the conclusions relate to variance reduction methods for SGD, e.g., SVRG, etc. The experimental results are not very useful because the connections to the theory are at best tenuous. First of all, the plots show 90th percentile results, which would skew the results and may not present typical behaviour (and certainly not "expectations" as in Thm. 2). Secondly, the algorithm here uses eta_t instead of the fixed eta choices that were discussed before, so technically none of the previous theorems apply (granted, the authors claim the theory can be extended to handle decreasing eta_t, but the entire paper focusses on the fixed eta case). Finally, the plots for bounded uncentered noise seem to raise more questions than they answer and which (as far as I can see) are not answered by the current theory. Specifically, why does biased noise sometimes speed up convergence relative to unbiased noise? Does this have something to do with where the algorithm is initialised relative to the local optimum? The paper focusses on the non-convex case, but perhaps some of those features could be more clearly illustrated by focussing the simulations also on simpler convex settings (the reasoning being that all basins of attraction look convex locally).

Correctness: As far as I checked, the results look correct.

Clarity: In large part, yes, but it is still very dense. The proofs in the supplement actually aren't that long so I would encourage the authors to bring some more of that intuition into the main paper.

Relation to Prior Work: As far as I can see this is novel, but as mentioned above, it would be good to include more context on variance stabilisation methods.

Reproducibility: Yes

Additional Feedback: See weaknesses above