NeurIPS 2020

Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization

Review 1

Summary and Contributions: This paper proposes an epoch-wise stochastic gradient descent ascent method (Epoch-GDA) for solving strongly-convex-strongly-concave (SCSC) min-max problems. Moreover, it studies the convergence properties of the Epoch-GDA method, which achieves an optimal rate of $O(1/T)$ for the duality gap of general SCSC min-max problems without smoothness assumptions or assumptions on the function's structure.

Strengths: This paper provides a sharp convergence analysis of the Epoch-GDA method for solving strongly-convex-strongly-concave (SCSC) min-max problems without smoothness assumptions or assumptions on the function's structure. It also provides a sharp convergence analysis of Epoch-GDA for solving weakly-convex-strongly-concave (WCSC) min-max problems.

Weaknesses: The paper should provide some experimental results to demonstrate the efficiency of the proposed algorithms.

Correctness: All methods in the paper are correct.

Clarity: This paper is well written.

Relation to Prior Work: The related-work section of this paper clearly discusses the differences between the proposed methods and previous methods.

Reproducibility: Yes

Additional Feedback: Some comments are given as follows:
C1: Epoch-GDA seems very similar to Epoch-GD in terms of algorithm presentation and analysis. Can the authors highlight the key differences between the two algorithms?
C2: When comparing with [36] and [32], the authors note that deterministic updates are present in both of those algorithms. How does this paper get around the deterministic updates?
C3: When stating the lower bound in Remark 2, O(1/T) should be \Omega(1/T).
%%%%%%%%%%%%%%%%%%%%%% Thanks for your responses. I maintain my earlier positive review and recommend acceptance.

Review 2

Summary and Contributions: This paper extends the work of [13], who provide an algorithm (Epoch-GD) which returns a $O(1/T)$-approximate solution (referred to here as the objective gap) for minimization of (possibly non-smooth) strongly convex functions after $T$ stochastic gradient oracle calls. The main result in the current paper extends this result to a $O(1/T)$ convergence bound for the duality gap of a (possibly non-smooth) strongly convex-strongly concave min-max problem (instead of the objective gap), for a min-max version of this algorithm (Epoch-GDA). They also provide a related bound for finding approximate stationary points in weakly convex-strongly concave problems.

Strengths: The main strength of the work is that they extend the $O(1/T)$ bound which was previously obtained for the objective gap, to the duality gap for min-max problems. Moreover, the authors also provide a bound for convergence to an approximate stationary point for weakly convex-strongly concave problems.

Weaknesses: A weakness of the paper is that the key technical contribution (Lemma 1), which connects the duality gap to Euclidean distance, is fairly simple to prove, as a consequence of the strong convexity and strong concavity assumptions. For this reason, it is not clear why (in the strongly convex-strongly concave setting) extending the O(1/T) bound from the objective gap to the duality gap is an important technical contribution.
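For concreteness, the kind of relation the reviewer refers to can indeed be obtained in a few lines from the assumptions; the following is a sketch under a $\mu$-SCSC assumption, and the paper's actual Lemma 1 may use different constants or distance measures:

```latex
% Let f be \mu-strongly convex in x and \mu-strongly concave in y,
% with saddle point (x^*, y^*). For any (x, y), the duality gap satisfies
\max_{y'} f(x, y') - \min_{x'} f(x', y)
  \;\ge\; f(x, y^*) - f(x^*, y)
  \;=\; \bigl[f(x, y^*) - f(x^*, y^*)\bigr]
        + \bigl[f(x^*, y^*) - f(x^*, y)\bigr]
  \;\ge\; \frac{\mu}{2}\bigl(\|x - x^*\|^2 + \|y - y^*\|^2\bigr).
```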

Correctness: Although I have not checked the proof itself, the overview of the proof provided by the authors seems to be correct.

Clarity: The paper seems to be clearly written.

Relation to Prior Work: The relation to prior work is clearly discussed.

Reproducibility: Yes

Additional Feedback: I am satisfied with the authors' explanations, and have adjusted my score accordingly.

Review 3

Summary and Contributions: %%%%%%%%%%%% Update after rebuttal %%%%%%%%%%%% I would like to thank the authors for their explanation, and I will be glad to see the promised revisions appear in a new version of the paper. ---------------------------------------------------------------------------------------------------- In the paper, the authors propose the first algorithm that achieves the O(1/T) optimal rate for the duality gap (or the Nikaido-Isoda function) in stochastic strongly-convex-strongly-concave saddle-point problems without resorting to either smoothness or specific structure of the problem. The algorithm itself is a straightforward combination of Epoch-SGD for minimization and the well-known gradient descent ascent method for min-max problems. At the heart of the analysis is a new lemma that relates the duality gap to certain distance measures between two points. Based on these results, the authors also derive new algorithms for stochastic non-smooth weakly-convex-strongly-concave saddle-point problems with improved complexity.
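The epoch-wise scheme the reviewer describes can be sketched as follows. This is a hedged illustration of the standard Epoch-GD template (fixed step size within an epoch, then restart from averaged iterates with the step size halved and the epoch length doubled), not the paper's exact pseudocode; `grad_x` and `grad_y` are assumed stochastic (sub)gradient oracles, and the toy problem at the end is my own example.

```python
import numpy as np

def epoch_gda(grad_x, grad_y, x0, y0, eta0=0.1, T0=8, num_epochs=5, seed=0):
    """Sketch of epoch-wise stochastic gradient descent ascent.

    Each epoch runs plain GDA with a fixed step size; the next epoch
    restarts from the averaged iterates with the step size halved and
    the epoch length doubled, following the Epoch-GD template.
    grad_x / grad_y: stochastic (sub)gradient oracles taking (x, y, rng).
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    eta, T = eta0, T0
    for _ in range(num_epochs):
        xs, ys = [], []
        for _ in range(T):
            gx = grad_x(x, y, rng)              # stochastic subgradient in x
            gy = grad_y(x, y, rng)              # stochastic subgradient in y
            x, y = x - eta * gx, y + eta * gy   # descent in x, ascent in y
            xs.append(x.copy())
            ys.append(y.copy())
        x, y = np.mean(xs, axis=0), np.mean(ys, axis=0)  # restart point
        eta, T = eta / 2, 2 * T                 # halve step, double length
    return x, y

# Toy SCSC problem: f(x, y) = 0.5*||x||^2 + x.y - 0.5*||y||^2, whose unique
# saddle point is (0, 0); small noise simulates the stochastic oracle.
gx = lambda x, y, rng: x + y + 0.01 * rng.standard_normal(x.shape)
gy = lambda x, y, rng: x - y + 0.01 * rng.standard_normal(y.shape)
x_out, y_out = epoch_gda(gx, gy, np.ones(2), -np.ones(2))
```

On the toy problem the returned iterates end up close to the saddle point (0, 0), illustrating the per-epoch contraction that the restart schedule exploits.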

Strengths: Recently, there has been an abundance of emerging work that studies the performance of first-order methods (GDA/EG/OG) in saddle-point problems under different assumptions. I believe these fundamental problems are important for us to build a global picture of how various algorithms designed for saddle-point optimization behave in different settings. Importantly, this paper investigates the relatively less explored non-smooth setup and presents simple algorithms with a proper analysis, which is a nice contribution to the field.

Weaknesses: Please refer to the following parts for detailed comments.

Correctness: I did not check the proofs in detail, but the methodology looks reasonable, and several techniques that are commonly used in the analysis of stochastic saddle-point problems are also employed in the paper. I just want to mention that, for Theorem 2, I fail to see why we always have dist(0,∂P(\hat{x}_𝜏^*)) <= γ\|\hat{x}_𝜏^*-x_0^𝜏\|. I fully understand that this is true when the minimizer of the regularized problem lies in the interior of X. However, considering the case where x lies on the boundary, I suppose we may need to add the normal cone to the ∂P term so that the statement handles this case as well?
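For reference, the constrained first-order optimality condition the reviewer appears to have in mind is the standard one; this sketch uses the reviewer's notation, and the paper may handle the boundary case differently:

```latex
% If \hat{x}_\tau^* minimizes P(x) + \frac{\gamma}{2}\|x - x_0^\tau\|^2
% over X, the first-order optimality condition reads
0 \in \partial P(\hat{x}_\tau^*)
      + \gamma(\hat{x}_\tau^* - x_0^\tau)
      + N_X(\hat{x}_\tau^*),
% so in general one only obtains
\mathrm{dist}\bigl(0,\ \partial P(\hat{x}_\tau^*) + N_X(\hat{x}_\tau^*)\bigr)
  \le \gamma\,\|\hat{x}_\tau^* - x_0^\tau\|,
% and dropping the normal cone N_X requires \hat{x}_\tau^* \in \mathrm{int}\,X.
```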

Clarity: The paper is overall well-organized, but there seems to be too much redundancy in the description of the results. In particular, Remark 3 basically repeats what is mentioned in the paragraphs before Theorem 2, and the authors could consider removing either the remark or the corresponding paragraph to save some space. I also spotted several typos. For example, there is some confusion in the use of the words gradient and subgradient (I believe all of them should be subgradient). In Lemma 1, x* and y* are defined in the statement but only used in the proof. In line 217, one should write \hat{f}_k instead of f_k, and in line 138, in the definition of the limiting subgradient, one should put v ∈ R^d in place of v_k ∈ R^d. Finally, the authors could also consider providing more intuition behind the definition of a nearly ε-stationary point (for example, by mentioning the Moreau envelope) or pointing readers to a reference for this concept.
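The Moreau-envelope intuition the reviewer suggests is standard in the weakly convex literature; the following is a hedged sketch of the usual definitions (the paper's exact constants and notation may differ):

```latex
% Moreau envelope of a \rho-weakly-convex P, for \lambda < 1/\rho:
P_\lambda(x) = \min_{z}\Bigl\{ P(z) + \tfrac{1}{2\lambda}\|z - x\|^2 \Bigr\}.
% A point x is "nearly \epsilon-stationary" when \|\nabla P_\lambda(x)\| \le \epsilon:
% the proximal point \hat{z} = \mathrm{prox}_{\lambda P}(x) then satisfies
% \|\hat{z} - x\| = \lambda\|\nabla P_\lambda(x)\| \le \lambda\epsilon
% and \mathrm{dist}\bigl(0, \partial P(\hat{z})\bigr) \le \epsilon,
% i.e., x is close to a point with a small subgradient.
```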

Relation to Prior Work: The paper does a great job of summarizing recent works on similar topics. Nonetheless, I find that some important references are missing. In particular, O(1/k) convergence in terms of mean squared distance for non-smooth strongly-monotone variational inequalities (a generalization of strongly-convex-strongly-concave saddle-point problems) was shown in [1]. While the algorithmic idea is very different and the actual contribution also differs (duality gap versus distance to solution as the convergence measure), I think that, given the similarity of the results, this work should be mentioned by the authors. [1] Yousefian, F., Nedić, A., & Shanbhag, U. V. (2015). Self-tuned stochastic approximation schemes for non-Lipschitzian stochastic multi-user optimization and Nash games. IEEE Transactions on Automatic Control, 61(7), 1753-1766.

Reproducibility: Yes

Additional Feedback: In Theorem 1 of the paper, a high probability result is provided while in Theorem 2 a bound in expectation is proved. Could you comment on this inconsistency? I am curious to know if there is any fundamental reason behind it.