Review for NeurIPS paper: Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses

NeurIPS 2020

Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses

Review 1

Summary and Contributions: This paper provides upper and lower bounds on the stability of (S)GD when the loss is nonsmooth and convex. The techniques are novel and they use their results to analyze a new Differentially private algorithm for stochastic optimization. They also provide generalization bounds for multi-pass nonsmooth SGD.

Strengths: Overall a great paper. Very well written, and well organized. Results are definitely novel, sound and I am sure the community will receive this work well.

Weaknesses: It would be nice if you discussed in the Broader Impact section how your results may be used by others. It would also be nice to see future research directions. It looks like you already solved the problem in hand (judging from the lower bounds), if I am mistaken could you comment more about it?

Correctness: looked at some of the proofs in detail and the seem to be correct.

Clarity: Yes, very well written

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I just have a few questions. Q: is it possible to prove a statement similar to Lemma 3.1 where instead of having x^1 = y^1 you have x^1 = y^1 + v where v is such that || v || <= some small number? Q: what could be improved if you were to assume that the loss functions are also strongly-convex? also, I don’t think the equality in 182 is correct (although it is an upper bound). UPDATE AFTER AUTHOR FEEDBACK: Thanks for the clarifications! I think this is good work so will leave my score unchanged.

Review 2

Summary and Contributions: The paper addresses the uniform stability of SGD on nonsmooth convex loss functions. To be specific, the author presents sharp bounds of L2 sensitivity (UAS) of SGD variants, therefore obtaining the optimal excess empirical risk. As a consequence, the paper provides the generalization bounds for multi-pass nonsmooth SGD and utility bounds for the differentially private variant.

Strengths: This is the first work to consider the stabillity of SGD for nonsmooth convex losses with well-established theories and specific consequences in applications.

Weaknesses: It seems that this work lacks the motivating examples in practice.The paper did not mention any specific example of nonsmooth convex losses. In machine learning, the nonsmoothness of losses often comes from regularization terms. In these cases, instead of finding subgradients, it is better to use proximal gradient methods that have faster convergence rates.

Correctness: All the claims and methods are clear and correct.

Clarity: Overall, it is a well written paper.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: 1. Section 1 and Section 2 are a liittle bit messy, and some sentences are redundant. It is better to reorganize these two sections in the next draft. 2. When writing subgradients, please use $\partial$ instead of $\nabla$ to avoid confusion.

Review 3

Summary and Contributions: The authors study the uniform stability of SGD with convex and non-smooth loss functions. Based on this, the authors develop both high-probability generalization bounds and generalization bounds in expectation for multi-pass SGD, which is dimension-free. The authors further apply this result to develop a differentially private stochastic optimization algorithm for non-smooth optimization problems. -------------- After Authors' response -------------- Thank you for your response. It is still not quite clear to me how the technique in [1] applies to the uniform sampling as considered in this paper. It would be helpful for the readers if the authors can provide details in the final version. I agreed that constants can be computed. It will be better to derive explicit privacy guarantees in the revised version. I would also like to see more discussions on the paper "Fine-grained analysis" since their work also establishes stability bounds for nonsmooth convex losses. Although they consider generally on-average stability bounds, as far as I can see their analysis for non-smooth loss applies to uniform stability. Overall, I like this paper. It has interesting results on differential privacy for learning with non-smooth loss functions. The lower bounds on uniform stability are also interesting.

Strengths: Previous stability analysis of SGD requires loss functions to be strongly smooth. An essential property in the smooth case is that the gradient update is non-expansive. This paper uses the monotonicity of the gradient to tackle the non-expansiveness and successfully imply interesting stability bounds. The authors further show that the derived stability bounds cannot be further improved by establishing matching lower bounds for some elegantly constructed loss functions. An application of this is to develop optimal generalization bounds $O(1/\sqrt{n})$ for multi-pass SGD by taking step size $\eta=(Tn)^{-1/2}$ and $T=n^2$. Based on the stability analysis, the authors further develop a computationally efficient and differentially private algorithm for non-smooth learning which enjoys the optimal generalization bounds. This algorithm is significantly more computationally efficient than [2], and improves the complexity of the algorithm in [16] by removing a logarithmic factor and is substantially simple.

Weaknesses: - Below eq (3), for the upper bound of $\delta_t$ the right-hand side should be $2\sum_s\eta_sa_s$ instead of $2\sum_s\eta_sa_s\delta_s$. - It is misleading to claim that it is the first work to address the stability of SGD for non-smooth convex loss functions as there are indeed existing work which already addressed stability of stochastic optimization with non-smooth loss. It would be interesting to add some discussions or comparison with these references mentioned below: 1. “Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent”. ICML 2020. In this paper, their work relaxes the smoothness to $\alpha$-Holder continuity of (sub)gradients, which include the non-smooth loss functions in this paper as $\alpha=0$. Their stability analysis also improves the optimal generalization bounds $O(1/\sqrt{n})$ for multi-pass SGD with $T=O(n^2)$. It seems to me that the main technical novelty appeared in the proof of Lemma 3 which studied \delta_t^2 (as opposed to the study of \delta_t in Hardt et al’s paper) using the approximate contraction for the gradient mapping for the non-smooth loss which has already explored in the above paper. Similar ideas have already explored in the above reference in a more general setting. 2. Private Stochastic Convex Optimization: Efficient Algorithms for Non-smooth Objectives, Arxiv preprint (2020). In this Arxiv preprint, the authors developed a different differentially private algorithm (Private FTRL) for non-smooth learning problems which can also achieve optimal generalization bounds. - The authors indicate that Theorem 5.1 on privacy guarantees follow from the same line of Theorem 2.1 in [3] but omit the proof. Furthermore, the authors mention that they replace the privacy analysis of the Gaussian mechanism with the tighter Moments Accountant method [1]. However, the analysis in [1] consider the Poisson sampling while Algorithm 2 considers uniform sampling with replacement. Furthermore, the moment bound in [1] is asymptotic. Therefore, it is not clear to me how to derive Theorem 5.1. I would recommend the authors to include the details for completeness as the differentially private SGD is an important application of the stability analysis for non-smooth loss functions.

Correctness: The claims and method are correct.

Clarity: The paper is very well written and is very clear to follow.

Relation to Prior Work: The difference of this work from previous contributions is clearly discussed.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper provides stability guarantees on SGD for solving nonsmooth Lipschitz convex functions. It allows the authors to further derive generalization error of multi-pass SGD and develop differentially private SGD.

Strengths: The paper is relavent to the Neurips community since it is both related to learning theory and optimization. The key contribution is a uniform stability upper bound on sgd for minimizing nonsmooth convex losses, which extends the previously known bound for smooth convex minimization problems, see e.g. Train faster, generalize better: Stability of stochastic gradient descent.

Weaknesses: Some minor comments: 1. The definition of $\varepsilon_{risk}$ is not consistent in the paper. See page 1 and page 4. 2. I'm not sure if I missed something. The paper adopts the gradient notation for subgradient. Since this paper primarily deals with nonsmooth functions, it might be better to use another notation. 3. In the remark after Theorem 4.1, the authors claim the optimality of multi-pass SGD for excess risk with properly chosen stepsize. I think this result might deserve a place in the main text. 4. Since the authors have obtained bound on sgd with fixed permutation, I'm curious to see what privacy guarantee one can say about sgd with fixed permutation. Even further, what can we say about SGD without replacement (i.e. sgd with different permutations in each epoch)?

Correctness: The claims (most importantly Lemma 3.1) seem correct. I checked part of the proof.

Clarity: The paper is well written and I enjoyed reading the paper.

Relation to Prior Work: Yes, the authors provide adequate account on related and prior work.

Reproducibility: Yes

Additional Feedback: