NeurIPS 2020

Understanding Gradient Clipping in Private SGD: A Geometric Perspective


Review 1

Summary and Contributions: In this paper, the authors analyze the convergence conditions of the popular DP-SGD method by studying the geometric properties of the bias introduced by gradient clipping and subsampling. The authors show that when the distribution of the subsampling noise is symmetric and the optimization problem satisfies certain smoothness conditions, clipped SGD converges to the correct solution. They then extend the analysis to DP-SGD and show similar convergence guarantees, even when the optimization problem is non-Lipschitz. Finally, they illustrate that the gradient distribution in a certain deep learning problem is indeed nearly symmetric.
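For concreteness, here is a minimal sketch of the per-example clipping and noising step that DP-SGD performs, which is the operation whose bias the paper analyzes (parameter names such as clip_norm and noise_multiplier are illustrative, not taken from the paper):

```python
import numpy as np

def dp_sgd_step(x, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One illustrative DP-SGD step: clip each per-example gradient to
    L2 norm <= clip_norm, average, and add calibrated Gaussian noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Rescaling so that ||g|| <= clip_norm is the nonlinear operation
        # that introduces the bias studied in the paper.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Gaussian noise proportional to the clipping threshold yields privacy.
    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    noise = np.random.normal(0.0, sigma, size=x.shape)
    return x - lr * (avg + noise)
```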

Strengths: DP-SGD is a widely applied method in differentially private machine learning, and studying its convergence is very important. This paper connects two fundamental parts of DP-SGD, the subsampling-induced noise and the clipping threshold, to guarantee convergence, first in the non-DP setting and later for full DP-SGD. The clipping-induced bias is quantified in terms of a coupling between the true gradient distribution and a "close" symmetric distribution. This coupling gives a disparity measure between the two distributions, which in turn gives an easy way of analyzing convergence: when the true gradient distribution is close to symmetric, DP-SGD behaves well. Besides the main analysis of the geometry of DP-SGD, the authors propose a method to mitigate the clipping-induced bias in the case where the gradient distribution is not symmetric. They show by experiments that this method indeed works on three (toy) problems. However, the authors should better connect this solution to DP-SGD; otherwise it feels a bit disconnected from the rest of this otherwise nice work.

Weaknesses: This paper mainly deals with theoretical convergence guarantees for DP-SGD. However, the rate of convergence is not discussed in the paper. For DP-SGD the convergence rate is very important, since more iterations cause the privacy guarantees to deteriorate. I think this angle should at least be mentioned somewhere in the discussion.
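For context, a standard composition result (from the moments accountant of Abadi et al., 2016; not a contribution of this paper) quantifies this degradation: with subsampling ratio q, noise multiplier \sigma, and T iterations, DP-SGD satisfies (\varepsilon, \delta)-differential privacy with roughly

```latex
\varepsilon = O\!\left( \frac{q \sqrt{T \log(1/\delta)}}{\sigma} \right),
```

so the privacy loss grows on the order of \sqrt{T}, which makes the convergence rate directly relevant to the privacy cost.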

Correctness: I didn't go through all the proofs in the supplementary material, but the ones I read looked correct.

Clarity: The paper is well written. Typos: line 125, "Symetricity"; line 132, "sqaured"; line 157, missing space after the period in ".Note".

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Equation 2: I think the \min is not supposed to be there, or at least it should appear on both sides. Equation 5: it is a bit difficult to see how \xi affects the expression inside the expectation; maybe you could remind the reader that it is the noise that the oracle "adds" to the gradient. Line 134: is there something missing from this sentence? It seems odd if <\nabla f(x_t), g_t> were a distribution. Figure 5 has only subfigure captions. Experiments of Section 5: the legend in Figure 5 (?) says DP-SGD, but is it truly the DP version or just clipped-gradient SGD?
###############################
Post-rebuttal update: Thanks to the authors for addressing my concerns and for the extra experiments connecting DP-SGD and Section 5.


Review 2

Summary and Contributions: This paper gives a thorough analysis of gradient clipping in private SGD. More specifically, it demonstrates how gradient clipping can prevent SGD from converging to a stationary point. The results also provide an explanation of why private SGD with gradient clipping remains effective in practice despite its potential clipping bias. Finally, the paper develops a new perturbation-based technique that can provably correct the clipping bias even for instances with highly asymmetric gradient distributions.

Strengths: This paper addresses an important issue in private SGD: understanding gradient clipping. The results explain, from a geometric perspective, why private SGD with gradient clipping remains effective in practice despite its potential clipping bias. In addition, a clipping-bias correction approach is proposed via a perturbation-based technique.

Weaknesses: Experimental results should be given to support the theorems in Section 2.

Correctness: Yes

Clarity: Yes, very well-written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: The paper analyzes the DP-SGD algorithm and provides a variant with a better provable guarantee.

Strengths: The paper theoretically analyzes the DP-SGD algorithm and provides a stronger theoretical guarantee than existing results. The analysis relies on introducing an auxiliary distribution that is close to the existing gradient-noise distribution. The authors also propose a variant of the DP-SGD algorithm that has a better provable guarantee.

Weaknesses: The results depend on how close the underlying noise distribution of the stochastic gradients is to a symmetric distribution \tilde{p}. Therefore most of the results have a dependence on \tilde{p}, which is hard to interpret or verify in practice. The assumption that the noise distribution is close to a symmetric distribution may not hold for many neural networks with typically sparse gradient updates, such as LSTM models.
----
Post-rebuttal edit: Thanks for answering my questions. I have read the rebuttal and will keep my score. Minor comment: Figure 1 does not seem to be referenced anywhere in the text.

Correctness: Yes

Clarity: Yes. There are a few typos: line 132, "sqaured" --> "squared"; line 157, a space is missing after the period.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: This paper analyzes gradient clipping and how it biases convergence to the optimum. It then analyzes differentially private gradient clipping. Bounds are given in terms of a bias term, which translates to a distance measure between the distribution of the gradient noise and a symmetric distribution (or mixture). The bias term itself is not bounded, which makes the other bounds less meaningful; however, the bias term is analyzed for a few classes of gradient distributions. Adding Gaussian noise before clipping does allow one to bound the bias (but then there are no differential privacy guarantees).
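To illustrate the mechanism, here is a minimal sketch of perturbation-before-clipping; the noise scale k * clip_norm is an assumed calibration for illustration, and the paper's exact calibration and its dependence on k may differ:

```python
import numpy as np

def perturbed_clip(g, clip_norm=1.0, k=1.0):
    """Add Gaussian noise to the stochastic gradient *before* clipping.
    The added noise symmetrizes the effective gradient-noise distribution,
    which is what allows the clipping bias to be bounded.
    NOTE: the scale k * clip_norm is an assumed calibration, not the paper's."""
    z = np.random.normal(0.0, k * clip_norm, size=g.shape)
    g_pert = g + z
    norm = np.linalg.norm(g_pert)
    # Standard rescaling so the perturbed gradient has norm <= clip_norm.
    return g_pert * min(1.0, clip_norm / max(norm, 1e-12))
```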

Strengths: Significance:
- Understanding gradient clipping is an important problem, as gradient clipping is extensively used in practice.
- Bounding the bias in Theorem 6 and illustrating its performance for different values of k in Section 5 clearly demonstrates the benefit of the bounds and improves our understanding of how clipping affects optimization.
Novelty:
- The use of symmetric noise distributions to give bounds for clipping is, to my knowledge, new.
Relevance:
- The work should be relevant to both theoretical and empirical practitioners in the NeurIPS community.

Weaknesses: Significance:
- I don't find many of the bounds particularly useful, since the bias term itself is not bounded. In Section 5 a bound is given for the bias, which is great, but then you have to add noise to the gradients, which is not particularly interesting as it applies to neither DP nor non-DP clipping. Also, there is a complicated dependence on k in two terms. The work would be stronger if there were general bounds for the bias, or if the method in Section 5 were shown to also be differentially private.
Novelty:
- Clipping has been studied a fair amount in the literature. I don't know how different the conclusions of this paper are from previous work.
Other comments:
- Lines 106-107: The smoothness assumption is not valid for neural networks with ReLU activation functions. This limits the applicability of the work.
- The symmetry assumptions would probably not hold for neural networks that have exploding gradients, like RNNs.
- In the experiments of Section 4, instead of showing random projections, it would be more informative to show the projections with the least symmetry. This could be done by sampling a large number of random directions and then displaying only the few with the least symmetry (a sketch of this procedure follows below).
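To make the last suggestion concrete, here is one way it could be implemented; absolute skewness of the projected noise is used as an assumed asymmetry measure, and grads is a hypothetical (n_samples, dim) array of stochastic gradients:

```python
import numpy as np
from scipy.stats import skew

def least_symmetric_directions(grads, n_directions=1000, n_keep=4, seed=0):
    """Project the centered gradient noise onto many random unit directions
    and return the few directions whose 1-D projections are least symmetric,
    scored here by absolute skewness (one possible asymmetry proxy)."""
    rng = np.random.default_rng(seed)
    noise = grads - grads.mean(axis=0)      # gradient noise around the mean
    dirs = rng.normal(size=(n_directions, grads.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = noise @ dirs.T                   # shape (n_samples, n_directions)
    scores = np.abs(skew(proj, axis=0))     # per-direction asymmetry score
    worst = np.argsort(scores)[-n_keep:]    # indices of least-symmetric dirs
    return dirs[worst], scores[worst]
```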

Correctness: I did not check carefully, but they seemed correct.

Clarity: In general the paper is well written. There are, however, many minor errors that should be corrected:
- Lines 7-8: "to stationary point" -> "to a stationary point"
- Equation (1): "max" -> "min"
- Lines 68-69: "inner product upper bounds a constant re-scaling". What inner product?
- Line 132: "sqaured" -> "squared"
- Line 144: "treshold" -> "threshold"
- Line 157: start "Note..." on a new line; it makes sense to have a paragraph break here.
- Line 253: "seem" -> "seen"

Relation to Prior Work: Some related work is mentioned, but this section could be improved. It is not clear how the conclusions of this paper differ from previous work (e.g., how do the biases calculated in this paper differ from biases in other papers, and why is this formulation of the bias better?).

Reproducibility: Yes

Additional Feedback: