{"title": "On the convergence of single-call stochastic extra-gradient methods", "book": "Advances in Neural Information Processing Systems", "page_first": 6938, "page_last": 6948, "abstract": "Variational inequalities have recently attracted considerable interest in machine learning as a flexible paradigm for models that go beyond ordinary loss function minimization (such as generative adversarial networks and related deep learning systems). In this setting, the optimal O(1/t) convergence rate for solving smooth monotone variational inequalities is achieved by the Extra-Gradient (EG) algorithm and its variants. Aiming to alleviate the cost of an extra gradient step per iteration (which can become quite substantial in deep learning), several algorithms have been proposed as surrogates to Extra-Gradient with a single oracle call per iteration. In this paper, we develop a synthetic view of such algorithms, and we complement the existing literature by showing that they retain a $O(1/t)$ ergodic convergence rate in smooth, deterministic problems. Subsequently, beyond the monotone deterministic case, we also show that the last iterate of single-call, stochastic extra-gradient methods still enjoys a $O(1/t)$ local convergence rate to solutions of non-monotone variational inequalities that satisfy a second-order sufficient condition.", "full_text": "On the Convergence of Single-Call Stochastic Extra-Gradient Methods

Yu-Guan Hsieh (Univ. Grenoble Alpes, LJK and ENS Paris, 38000 Grenoble, France; yu-guan.hsieh@ens.fr)
Franck Iutzeler (Univ. Grenoble Alpes, LJK, 38000 Grenoble, France; franck.iutzeler@univ-grenoble-alpes.fr)
Jérôme Malick (CNRS, LJK, 38000 Grenoble, France; jerome.malick@univ-grenoble-alpes.fr)
Panayotis Mertikopoulos (Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France; panayotis.mertikopoulos@imag.fr)

Abstract

Variational inequalities have recently attracted considerable interest in machine learning as a flexible paradigm for models that go beyond ordinary loss function minimization (such as generative adversarial networks and related deep learning systems). In this setting, the optimal O(1/t) convergence rate for solving smooth monotone variational inequalities is achieved by the Extra-Gradient (EG) algorithm and its variants. Aiming to alleviate the cost of an extra gradient step per iteration (which can become quite substantial in deep learning applications), several algorithms have been proposed as surrogates to Extra-Gradient with a single oracle call per iteration. In this paper, we develop a synthetic view of such algorithms, and we complement the existing literature by showing that they retain a O(1/t) ergodic convergence rate in smooth, deterministic problems. Subsequently, beyond the monotone deterministic case, we also show that the last iterate of single-call, stochastic extra-gradient methods still enjoys a O(1/t) local convergence rate to solutions of non-monotone variational inequalities that satisfy a second-order sufficient condition.

1 Introduction

Deep learning is arguably the fastest-growing field in artificial intelligence: its applications range from image recognition and natural language processing to medical anomaly detection, drug discovery, and most fields where computers are required to make sense of massive amounts of data. In turn, this has spearheaded a prolific research thrust in optimization theory with the twofold aim of demystifying the successes of deep learning models and of providing novel methods to overcome their failures. Introduced by Goodfellow et al. 
[20], generative adversarial networks (GANs) have become the youngest torchbearers of the deep learning revolution and have occupied the forefront of this drive in more ways than one. First, the adversarial training of deep neural nets has given rise to new challenges regarding the efficient allocation of parallelizable resources, the compatibility of the chosen architectures, etc. Second, the loss landscape in GANs is no longer that of a minimization problem but that of a zero-sum, min-max game – or, more generally, a variational inequality (VI).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: The best known global convergence rates for single-call extra-gradient methods in monotone VI problems; logarithmic factors ignored throughout. Entries marked "(this paper)" are contributions from this paper (boxed in the original).

                |          Lipschitz             |        Lipschitz + Strong
                | Ergodic          | Last iterate | Ergodic           | Last iterate
 Deterministic  | 1/t (this paper) | Unknown      | 1/t               | e^(−ρt) [18, 25, 31]
 Stochastic     | 1/√t [13, 18]    | Unknown      | 1/t (this paper)  | 1/t (this paper)

Variational inequalities are a flexible and widely studied framework in optimization which, among others, incorporates minimization, saddle-point, Nash equilibrium, and fixed point problems. As such, there is an extensive literature devoted to solving variational inequalities in different contexts; for an introduction, see [4, 17] and references therein. 
In particular, in the setting of monotone variational\ninequalities with Lipschitz continuous operators, it is well known that the optimal rate of convergence\nis O(1/t), and that this rate is achieved by the Extra-Gradient (EG) algorithm of Korpelevich [23]\nand its Bregman variant, the Mirror-Prox (MP) algorithm of Nemirovski [32].1\nThese algorithms require two projections and two oracle calls per iteration, so they are more costly\nthan standard Forward-Backward / descent methods. As a result, there are two complementary\nstrands of literature aiming to reduce one (or both) of these cost multipliers \u2013 that is, the number of\nprojections and/or the number of oracle calls per iteration. The \ufb01rst class contains algorithms like\nthe Forward-Backward-Forward (FBF) method of Tseng [43], while the second focuses on gradient\nextrapolation mechanisms like Popov\u2019s modi\ufb01ed Arrow\u2013Hurwicz algorithm [37].\nIn deep learning, the latter direction has attracted considerably more interest than the former. The\nmain reason for this is that neural net training often does not involve constraints (and, when it does,\nthey are relatively cheap to handle). On the other hand, gradient calculations can become very costly,\nso a decrease in the number of oracle calls could offer signi\ufb01cant practical bene\ufb01ts. In view of this,\nour aim in this paper is (i) to develop a synthetic approach to methods that retain the anticipatory\nproperties of the Extra-Gradient algorithm while making a single oracle call per iteration; and (ii) to\nderive quantitative convergence results for such single-call extra-gradient (1-EG) algorithms.\n\nOur contributions. Our \ufb01rst contribution complements the existing literature (reviewed below and\nin Section 3) by showing that the class of 1-EG algorithms under study attains the optimal O(1/t)\nconvergence rate of the two-call method in deterministic variational inequalities with a monotone,\nLipschitz continuous operator. 
Subsequently, we show that this rate is also achieved in stochastic variational inequalities with strongly monotone operators, provided that the optimizer has access to an oracle with bounded variance (but not necessarily bounded second moments).

Importantly, this stochastic result concerns both the method's "ergodic average" (a weighted average of the sequence of points generated by the algorithm) and its "last iterate" (the last generated point). The reason for this dual focus is that averaging can be very useful in convex/monotone landscapes, but it is not as beneficial in non-monotone problems (where Jensen's inequality does not apply). On that account, last-iterate convergence results comprise an essential stepping stone for venturing beyond monotone problems.

Armed with these encouraging results, we then focus on non-monotone problems and show that, with high probability, the method's last iterate exhibits a O(1/t) local convergence rate to solutions of non-monotone variational inequalities that satisfy a second-order sufficient condition. To the best of our knowledge, this is the first convergence rate guarantee of this type for stochastic, non-monotone variational inequalities.

1 Korpelevich [23] proved the method's asymptotic convergence for pseudomonotone variational inequalities. The O(1/t) convergence rate was later established by Nemirovski [32] with ergodic averaging.

Related work. The prominence of Extra-Gradient/Mirror-Prox methods in solving variational inequalities and saddle-point problems has given rise to a vast corpus of literature which we cannot hope to do justice here. Especially in the context of adversarial networks, there has been a flurry of recent activity relating variants of the Extra-Gradient algorithm to GAN training; see e.g., [9, 14, 18, 19, 24, 28, 44] and references therein. 
For concreteness, we focus here on algorithms with a\nsingle-call structure and refer the reader to Sections 3\u20135 for additional details.\nThe \ufb01rst variant of Extra-Gradient with a single oracle call per iteration dates back to Popov [37].\nThis algorithm was subsequently studied by, among others, Chiang et al. [10], Rakhlin and Sridharan\n[38, 39] and Gidel et al. [18]; see also [13, 25] for a \u201cre\ufb02ected\u201d variant, [14, 30, 31, 36] for an\n\u201coptimistic\u201d one, and Section 3 for a discussion of the differences between these variants. In the context\nof deterministic, strongly monotone variational inequalities with Lipschitz continuous operators, the\nlast iterate of the method was shown to exhibit a geometric convergence rate [18, 25, 31, 42]; similar\ngeometric convergence results also extend to bilinear saddle-point problems [18, 36, 42], even though\nthe operator involved is not strongly monotone. In turn, this implies the convergence of the method\u2019s\nergodic average, but at a O(1/t) rate (because of the hysteresis of the average). In view of this,\nthe fact that 1-EG methods retain the optimal O(1/t) convergence rate in deterministic variational\ninequalities without strong monotonicity assumptions closes an important gap in the literature.2\nAt the local level, the geometric convergence results discussed above echo a surge of interest in local\nconvergence guarantees of optimization algorithms applied to games and saddle-point problems,\nsee e.g., [1, 3, 15, 24] and references therein. In more detail, Liang and Stokes [24] proved local\ngeometric convergence for several algorithms in possibly non-monotone saddle-point problems under\na local smoothness condition. In a similar vein, Daskalakis and Panageas [15] analyzed the limit\npoints of (optimistic) gradient descent, and showed that local saddle points are stable stationary\npoints; subsequently, Adolphs et al. [1] and Mazumdar et al. 
[27] proposed a class of algorithms that eliminate stationary points which are not local Nash equilibria.

Geometric convergence results of this type are inherently deterministic because they rely on an associated resolvent operator being firmly nonexpansive – or, equivalently, rely on the use of the center manifold theorem. In a stochastic setting, these techniques are no longer applicable because the contraction property cannot be maintained in the presence of noise; in fact, unless the problem at hand is amenable to variance reduction – e.g., as in [6, 9, 21] – geometric convergence is not possible if the noise process is even weakly isotropic. Instead, for monotone problems, Cui and Shanbhag [13] and Gidel et al. [18] showed that the ergodic average of the method attains a O(1/√t) convergence rate. Our global convergence results for stochastic variational inequalities improve this rate to O(1/t) in strongly monotone variational inequalities for both the method's ergodic average and its last iterate. In the same light, our local O(1/t) convergence results for non-monotone variational inequalities provide a key extension of local, deterministic convergence results to a fully stochastic setting, all the while retaining the fastest convergence rate for monotone variational inequalities.

For convenience, our contributions relative to the state of the art are summarized in Table 1.

2 Problem setup and blanket assumptions

Variational inequalities. We begin by presenting the basic variational inequality framework that we will consider throughout the sequel. To that end, let X be a nonempty closed convex subset of Rd, and let V : Rd → Rd be a single-valued operator on Rd. 
In its most general form, the variational inequality (VI) problem associated to V and X can be stated as:

Find x⋆ ∈ X such that ⟨V(x⋆), x − x⋆⟩ ≥ 0 for all x ∈ X.  (VI)

To provide some intuition about (VI), we discuss two important examples below:

Example 1 (Loss minimization). Suppose that V = ∇f for some smooth loss function f on X = Rd. Then, x⋆ ∈ X is a solution to (VI) if and only if ∇f(x⋆) = 0, i.e., if and only if x⋆ is a critical point of f. Of course, if f is convex, any such solution is a global minimizer.

Example 2 (Min-max optimization). Suppose that X decomposes as X = Θ × Φ with Θ = Rd1, Φ = Rd2, and assume V = (∇θL, −∇φL) for some smooth function L(θ, φ), θ ∈ Θ, φ ∈ Φ. As in

2 A few weeks after the submission of our paper, we were made aware of a very recent preprint by Mokhtari et al. [30] which also establishes a O(1/t) convergence rate for the algorithm's "optimistic" variant in saddle-point problems (in terms of the Nikaido–Isoda gap function). 
To the best of our knowledge, this is the closest result to our own in the literature.

Example 1 above, the solutions to (VI) correspond to the critical points of L; if, in addition, L is convex-concave, any solution x⋆ = (θ⋆, φ⋆) of (VI) is a global saddle-point, i.e.,

L(θ⋆, φ) ≤ L(θ⋆, φ⋆) ≤ L(θ, φ⋆)  for all θ ∈ Θ and all φ ∈ Φ.

Given the original formulation of GANs as (stochastic) saddle-point problems [20], this observation has been at the core of a vigorous literature at the interface between optimization, game theory, and deep learning; see e.g., [9, 14, 18, 24, 28, 36, 44] and references therein.

The operator analogue of convexity for a function is monotonicity, i.e.,

⟨V(x′) − V(x), x′ − x⟩ ≥ 0  for all x, x′ ∈ Rd.

Specifically, when V = ∇f for some sufficiently smooth function f, this condition is equivalent to f being convex [4]. In this case, following Nesterov [34, 35] and Juditsky et al. [22], the quality of a candidate solution x̂ ∈ X can be assessed via the so-called error (or merit) function

Err(x̂) = sup_{x ∈ X} ⟨V(x), x̂ − x⟩

and/or its restricted variant

ErrR(x̂) = max_{x ∈ XR} ⟨V(x), x̂ − x⟩,

where XR ≡ X ∩ BR(0) = {x ∈ X : ‖x‖ ≤ R} denotes the "restricted domain" of the problem. More precisely, we have the following basic result.

Lemma 1 (Nesterov, 2007). Assume V is monotone. If x⋆ is a solution of (VI), we have Err(x⋆) = 0 and ErrR(x⋆) = 0 for all sufficiently large R. 
Conversely, if ErrR(x̂) = 0 for large enough R > 0 and some x̂ ∈ XR, then x̂ is a solution of (VI).

In light of this result, Err and ErrR will be among our principal measures of convergence in the sequel.

Blanket assumptions. With all this in hand, we present below the main assumptions that will underlie the bulk of the analysis to follow.

Assumption 1. The solution set X⋆ of (VI) is nonempty.

Assumption 2. The operator V is β-Lipschitz continuous, i.e., ‖V(x′) − V(x)‖ ≤ β‖x′ − x‖ for all x, x′ ∈ Rd.

Assumption 3. The operator V is monotone.

In some cases, we will also strengthen Assumption 3 to:

Assumption 3(s). The operator V is α-strongly monotone, i.e., ⟨V(x′) − V(x), x′ − x⟩ ≥ α‖x′ − x‖² for some α > 0 and all x, x′ ∈ Rd.

Throughout our paper, we will be interested in sequences of points Xt ∈ X generated by algorithms that can access the operator V via a stochastic oracle [33].3 Formally, this is a black-box mechanism which, when called at Xt ∈ X, returns the estimate

Vt = V(Xt) + Zt,  (1)

where Zt ∈ Rd is an additive noise variable satisfying the following hypotheses:

a) Zero-mean: E[Zt | Ft] = 0.
b) Finite variance: E[‖Zt‖² | Ft] ≤ σ².

In the above, Ft denotes the history (natural filtration) of Xt, so Xt is adapted to Ft by definition; on the other hand, since the t-th instance of Zt is generated randomly from Xt, Zt is not adapted to Ft. Obviously, if σ² = 0, we have the deterministic, perfect feedback case Vt = V(Xt).

3 Depending on the algorithm, the sequence index t may take positive integer or half-integer values (or both).

3 Algorithms

The Extra-Gradient algorithm. In the general framework outlined in the previous section, the Extra-Gradient (EG) algorithm of Korpelevich [23] can be stated in recursive form as

Xt+1/2 = ΠX(Xt − γtVt)
Xt+1 = ΠX(Xt − γtVt+1/2)  (EG)

where ΠX(y) := arg min_{x∈X} ‖y − x‖ denotes the Euclidean projection of y ∈ Rd onto the closed convex set X and γt > 0 is a variable step-size sequence. Using this formulation as a starting point, the main idea behind the method can be described as follows: at each t = 1, 2, . . ., the oracle is called at the algorithm's current – or base – state Xt to generate an intermediate – or leading – state Xt+1/2; subsequently, the base state Xt is updated to Xt+1 using gradient information from the leading state Xt+1/2, and the process repeats. Heuristically, the extra oracle call allows the algorithm to "anticipate" the landscape of V and, in so doing, to achieve improved convergence results relative to standard projected gradient / forward-backward methods; for a detailed discussion, we refer the reader to [7, 17] and references therein.

Single-call variants of the Extra-Gradient algorithm. Given the significant computational overhead of gradient calculations, a key desideratum is to drop the second oracle call in (EG) while retaining the algorithm's "anticipatory" properties. In light of this, we will focus on methods that perform a single oracle call at the leading state Xt+1/2, but replace the update rule for Xt+1/2 (and, possibly, Xt as well) with a proxy that compensates for the missing gradient. Concretely, we will examine the following family of single-call extra-gradient (1-EG) algorithms:

1. Past Extra-Gradient (PEG) [10, 18, 37]:

Xt+1/2 = ΠX(Xt − γtVt−1/2)
Xt+1 = ΠX(Xt − γtVt+1/2)  (PEG)

[Proxy: use Vt−1/2 instead of Vt in the calculation of Xt+1/2]

2. 
Reflected Gradient (RG) [8, 13, 25]:

Xt+1/2 = Xt − (Xt−1 − Xt)
Xt+1 = ΠX(Xt − γtVt+1/2)  (RG)

[Proxy: use (Xt−1 − Xt)/γt instead of Vt in the calculation of Xt+1/2; no projection]

3. Optimistic Gradient (OG) [14, 30, 31, 36]:

Xt+1/2 = ΠX(Xt − γtVt−1/2)
Xt+1 = Xt+1/2 + γtVt−1/2 − γtVt+1/2  (OG)

[Proxy: use Vt−1/2 instead of Vt in the calculation of Xt+1/2; use Xt+1/2 + γtVt−1/2 instead of Xt in the calculation of Xt+1; no projection]

These are the main algorithmic schemes that we will consider, so a few remarks are in order. First, given the extensive literature on the subject, this list is not exhaustive; see e.g., [30, 31, 36] for a generalization of (OG), [26] for a variant that employs averaging to update the algorithm's base state Xt, and [19] for a proxy defined via "negative momentum". Nevertheless, the algorithms presented above appear to be the most widely used single-call variants of (EG), and they illustrate very clearly the two principal mechanisms for approximating missing gradients: (i) using past gradients (as in the PEG and OG variants); and/or (ii) using a difference of successive states (as in the RG variant).

We also take this opportunity to provide some background and clear up some issues on terminology regarding the methods presented above. First, the idea of using past gradients dates back at least to Popov [37], who introduced (PEG) as a "modified Arrow–Hurwicz" method a few years after the original paper of Korpelevich [23]; the same algorithm is called "meta" in [10] and "extrapolation from the past" in [18] (but see also the note regarding optimism below). The terminology "Reflected Gradient" and the precise formulation that we use here for (RG) are due to Malitsky [25]. The well-known primal-dual algorithm of Chambolle and Pock [8] can be seen as a one-sided, alternating variant of the method for saddle-point problems; see also [44] for a more recent take.

Finally, the terminology "optimistic" is due to Rakhlin and Sridharan [38, 39], who provided a unified view of (PEG) and (EG) based on the sequence of oracle vectors used to update the algorithm's leading state Xt+1/2.4 Because the framework of [38, 39] encompasses two different algorithms, there is some danger of confusion regarding the use of the term "optimism"; in particular, both (EG) and (PEG) can be seen as instances of optimism. The specific formulation of (OG) that we present here is the projected version of the algorithm considered by Daskalakis et al. [14];5 by contrast, the "optimistic" method of Mertikopoulos et al. [28] is equivalent to (EG) – not (PEG) or (OG).

The above shows that there can be a broad array of single-call extra-gradient methods, depending on the specific proxy used to estimate the missing gradient, whether it is applied to the algorithm's base or leading state, when (or where) a projection operator is applied, etc. The contact point of all these algorithms is the unconstrained setting (X = Rd), where they are exactly equivalent:

Proposition 1. Suppose that the 1-EG methods presented above share the same initialization, X0 = X1 ∈ X, V1/2 = 0, and are run with the same, constant step-size γt ≡ γ for all t ≥ 1. If X = Rd, the generated iterates Xt coincide for all t ≥ 1.

The proof of this proposition follows by a simple rearrangement of the update rules for (PEG), (RG) and (OG), so we omit it. 
In the projected case, the 1-EG updates presented above are no longer equivalent – though, of course, they remain closely related.

4 Deterministic analysis

We begin with the deterministic analysis, i.e., when the optimizer receives oracle feedback of the form (1) with σ = 0. In terms of presentation, we keep the global and local cases separated, and we interleave our results for the generated sequence Xt and its ergodic average. To streamline our presentation, we defer the details of the proofs to the paper's supplement and only discuss here the main ideas.

4.1 Global convergence

Our first result below shows that the algorithms under study achieve the optimal O(1/t) ergodic convergence rate in monotone problems with Lipschitz continuous operators.

Theorem 1. Suppose that V satisfies Assumptions 1–3. Assume further that a 1-EG algorithm is run with perfect oracle feedback and a constant step-size γ < 1/(cβ), where c = 1 + √2 for the RG variant and c = 2 for the PEG and OG variants. Then, for all R > 0, we have

ErrR(X̄t) ≤ (R² + ‖X1 − X1/2‖²) / (2γt),

where X̄t = t⁻¹ ∑_{s=1}^{t} Xs+1/2 is the ergodic average of the algorithm's sequence of leading states.

This result shows that the EG and 1-EG algorithms share the same convergence rate guarantees, so we can safely drop one gradient calculation per iteration in the monotone case. The proof of the theorem is based on the following technical lemma, which enables us to treat the different variants of the 1-EG method in a unified way.

Lemma 2. Assume that V satisfies Assumption 3 (monotonicity). 
Suppose further that the sequence (Xt)_{t∈N/2} of points in Rd satisfies the following "quasi-descent" inequality with µs, λs ≥ 0:

‖Xs+1 − p‖² ≤ ‖Xs − p‖² − 2λs⟨V(Xs+1/2), Xs+1/2 − p⟩ + µs − µs+1  (3)

for all p ∈ XR and all s ∈ {1, . . . , t}. Then,

ErrR( ∑_{s=1}^{t} λsXs+1/2 / ∑_{s=1}^{t} λs ) ≤ (R² + µ1) / (2 ∑_{s=1}^{t} λs).

4 More precisely, Rakhlin and Sridharan [38, 39] use the term Optimistic Mirror Descent (OMD) in reference to the Mirror-Prox method of Nemirovski [32], itself a variant of (EG) with projections defined by means of a Bregman function; for a related treatment, see Nesterov [34] and Juditsky et al. [22].

5 To see this, note that the difference between two consecutive intermediate steps Xt−1/2 and Xt+1/2 can be written as Xt+1/2 = ΠX(Xt−1/2 − (γt−1 + γt)Vt−1/2 + γt−1Vt−3/2). Writing (OG) in the form presented above shows that (OG) can also be viewed as a single-call variant of the FBF method of Tseng [43].

Remark 1. For Examples 1 and 2, it is possible to state both Theorem 1 and Lemma 2 with more adapted measures. We refer the reader to the supplement for more details.

The use of Lemma 2 is tailored to time-averaged sequences like X̄t, and relies on establishing a suitable "quasi-descent inequality" of the form (3) for the iterates of 1-EG. Doing this requires in turn a careful comparison of successive iterates of the algorithm via the Lipschitz continuity assumption for V; we defer the precise treatment of this argument to the paper's supplement.

On the other hand, because the role of averaging is essential in this argument, the convergence of the algorithm's last iterate requires significantly different techniques. 
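Before turning to the last iterate, the averaging mechanism behind Theorem 1 and Lemma 2 can be made concrete with a small numerical sketch. This is our own toy example, not one of the paper's experiments: it runs (PEG) with a constant step-size on the bilinear saddle-point problem L(θ, φ) = θφ of Example 2, whose unique solution is the origin, and tracks the distance of the ergodic average of the leading states to that solution.

```python
import numpy as np

# Monotone (but not strongly monotone) operator of the saddle-point problem
# L(θ, φ) = θφ:  V(θ, φ) = (φ, -θ), with solution x* = 0 and Lipschitz β = 1.
def V(x):
    return np.array([x[1], -x[0]])

gamma = 0.3                  # constant step-size below 1/(2β)
x = np.array([1.0, 1.0])     # X_1
v_prev = np.zeros(2)         # V_{1/2} = 0
lead_sum = np.zeros(2)       # running sum of the leading states X_{s+1/2}
avg_dist = []
for t in range(1, 2001):
    x_lead = x - gamma * v_prev   # leading state X_{t+1/2}
    v_lead = V(x_lead)            # single oracle call
    x = x - gamma * v_lead        # base state X_{t+1}
    v_prev = v_lead
    lead_sum += x_lead
    avg_dist.append(np.linalg.norm(lead_sum / t))   # ‖X̄_t - x*‖
```

On this toy problem the recorded distances avg_dist decay on the order of 1/t, in line with the ergodic rate of Theorem 1.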
To the best of our knowledge, there are no comparable convergence rate guarantees for Xt under Assumptions 1–3; however, if Assumption 3 is strengthened to Assumption 3(s), the convergence of Xt to the (necessarily unique) solution of (VI) occurs at a geometric rate. For completeness, we state here a consolidated version of the geometric convergence results of Malitsky [25], Gidel et al. [18], and Mokhtari et al. [31].

Theorem 2. Assume that V satisfies Assumptions 1, 2 and 3(s), and let x⋆ denote the (necessarily unique) solution of (VI). If a 1-EG algorithm is run with a sufficiently small step-size γ, the generated sequence Xt converges to x⋆ at a rate of ‖Xt − x⋆‖ = O(exp(−ρt)) for some ρ > 0.

4.2 Local convergence

We continue by presenting a local convergence result for deterministic, non-monotone problems. To state it, we will employ the following notion of regularity in lieu of Assumptions 1–3 and 3(s).

Definition 3. We say that x⋆ is a regular solution of (VI) if V is C¹-smooth in a neighborhood of x⋆ and the Jacobian JacV(x⋆) is positive-definite along rays emanating from x⋆, i.e.,

z⊤ JacV(x⋆) z ≡ ∑_{i,j=1}^{d} z_i (∂V_i/∂x_j)(x⋆) z_j > 0

for all z ∈ Rd \ {0} that are tangent to X at x⋆.

This notion of regularity is an extension of similar conditions that have been employed in the local analysis of loss minimization and saddle-point problems. More precisely, if V = ∇f for some loss function f, this definition is equivalent to positive-definiteness of the Hessian along qualified constraints [5, Chap. 3.2]. 
As for saddle-point problems and smooth games, variants of this condition can be found in several different sources; see e.g., [16, 24, 29, 40, 41] and references therein.

Under this condition, we obtain the following local geometric convergence result for 1-EG methods.

Theorem 4. Let x⋆ be a regular solution of (VI). If a 1-EG method is run with perfect oracle feedback and is initialized sufficiently close to x⋆ with a sufficiently small constant step-size, we have ‖Xt − x⋆‖ = O(exp(−ρt)) for some ρ > 0.

The proof of this theorem relies on showing that (i) V essentially behaves like a smooth, strongly monotone operator close to x⋆; and (ii) if the method is initialized in a small enough neighborhood of x⋆, it will remain in said neighborhood for all t. As a result, Theorem 4 essentially follows by "localizing" Theorem 2 to this neighborhood.

As a preamble to our stochastic analysis in the next section, we should state here that, albeit straightforward, the proof strategy outlined above breaks down if we have access to V only via a stochastic oracle. In this case, a single "bad" realization of the feedback noise Zt could drive the process away from the attraction region of any local solution of (VI). 
For this reason, the stochastic analysis requires significantly different tools and techniques and is considerably more intricate.

5 Stochastic analysis

We now present our analysis for stochastic variational inequalities with oracle feedback of the form (1). For concreteness, given that the PEG variant of the 1-EG method employs the most straightforward

[Figure 1 plots: ‖x − x⋆‖² versus the number of oracle calls for EG and 1-EG at several step-sizes γ. Panels: (a) Strongly monotone [ε1 = 1, ε2 = 0], deterministic, last iterate; (b) Monotone [ε1 = 0, ε2 = 1], deterministic, ergodic averaging; (c) Non-monotone [ε1 = 1, ε2 = −1], iid Z ∼ N(0, .01), last iterate (b = 15).]

Figure 1: Illustration of the performance of EG and 1-EG in the (a priori non-monotone) saddle-point problem

L(θ, φ) = 2ε1 θ⊤A1θ + ε2 (θ⊤A2θ)² − 2ε1 φ⊤B1φ − ε2 (φ⊤B2φ)² + 4θ⊤Cφ

on the full unconstrained space X = Rd = Rd1 × Rd2 with d1 = d2 = 1000 and A1, B1, A2, B2 ≻ 0. We choose three situations representative of the settings considered in the paper: (a) linear convergence of the last iterate of the deterministic methods in strongly monotone problems; (b) the O(1/t) convergence of the ergodic average in monotone, deterministic problems; and (c) the O(1/t) local convergence rate of the method's last iterate in stochastic, non-monotone problems. For (a) and (b), the origin is the unique solution of (VI), and for (c) it is a regular solution thereof. 
We observe that 1-EG consistently outperforms EG in terms of oracle calls for a fixed step-size, and the observed rates are consistent with the rates reported in Table 1.

proxy mechanism, we will focus on this variant throughout; for the other variants, the proofs and corresponding explicit expressions follow from the same rationale (as in the case of Theorem 1).

5.1 Global convergence

As we mentioned in the introduction, under Assumptions 1–3, Cui and Shanbhag [13] and Gidel et al. [18] showed that 1-EG methods attain a O(1/√t) ergodic convergence rate. By strengthening Assumption 3 to Assumption 3(s), we show that this result can be augmented in two synergistic ways: under Assumptions 1, 2 and 3(s), both the last iterate and the ergodic average of 1-EG achieve a O(1/t) convergence rate.

Theorem 5. Suppose that V satisfies Assumptions 1, 2 and 3(s), and assume that (PEG) is run with stochastic oracle feedback of the form (1) and a step-size of the form γt = γ/(t + b) for some γ > 1/α and b ≥ 4βγ. 
Then, the generated sequence of the algorithm's base states satisfies
\[
\mathbb{E}\big[\|X_t - x^\star\|^2\big]
\le \frac{6\gamma^2\sigma^2}{\alpha\gamma - 1}\,\frac{1}{t} + o\Big(\frac{1}{t}\Big),
\]
while its ergodic average $\bar{X}_t = t^{-1} \sum_{s=1}^{t} X_s$ enjoys the bound
\[
\mathbb{E}\big[\|\bar{X}_t - x^\star\|^2\big]
\le \frac{6\gamma^2\sigma^2}{\alpha\gamma - 1}\,\frac{\log t}{t} + o\Big(\frac{\log t}{t}\Big).
\]

Regarding our proof strategy for the last iterate of the process, we can no longer rely either on a contraction argument or on the averaging mechanism that yields the O(1/√t) ergodic convergence rate. Instead, we show in the appendix that Xt is (stochastically) quasi-Fejér in the sense of [11, 12]; then, leveraging the method's specific step-size, we employ successive numerical sequence estimates to control the summability error and obtain the O(1/t) rate.

5.2 Local convergence

We proceed to examine the convergence of the method in the stochastic, non-monotone case. Our main result in this regard is the following.

Theorem 6. Let x⋆ be a regular solution of (VI) and fix a tolerance level δ > 0. Suppose further that (PEG) is run with stochastic oracle feedback of the form (1) and a variable step-size of the form γt = γ/(t + b) for some γ > 1/α and large enough b. Then:

(a) There are neighborhoods U and U1 of x⋆ in X such that, if X_{1/2} ∈ U and X_1 ∈ U1, the event
\[
E_\infty = \{X_{t+1/2} \in U \text{ for all } t = 1, 2, \dots\}
\]
occurs with probability at least 1 − δ.

(b) Conditioning on the above, we have
\[
\mathbb{E}\big[\|X_t - x^\star\|^2 \,\big|\, E_\infty\big]
\le \frac{4\gamma^2(M^2 + \sigma^2)}{(\alpha\gamma - 1)(1 - \delta)}\,\frac{1}{t} + o\Big(\frac{1}{t}\Big),
\]
where $M = \sup_{x \in U} \|V(x)\| < \infty$ and $\alpha = \inf_{x \in U} \langle V(x), x - x^\star\rangle / \|x - x^\star\|^2 > 0$.

The finiteness of M and the positivity of α are both consequences of the regularity of x⋆, and their values only depend on the size of the neighborhood U. Taking a larger U would increase the algorithm's certified initialization basin, but it would also negatively impact its convergence rate (since M would increase while α would decrease). Likewise, the neighborhood U1 only depends on the size of U and, as we explain in the appendix, it suffices to take U1 to be "one fourth" of U.

From the above, it becomes clear that the situation is significantly more involved than the corresponding deterministic analysis. This is also reflected in the proof of Theorem 6, which requires completely new techniques, well beyond the straightforward localization scheme underlying Theorem 4. More precisely, a key step in the proof (which we detail in the appendix) is to show that the iterates of the method remain close to x⋆ for all t with arbitrarily high probability. In turn, this requires showing that the probability of getting a string of "bad" noise realizations of arbitrary length is controllably small. Even then, however, the global analysis still cannot be localized, because conditioning changes the probability law under which the oracle noise is unbiased.
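Setting these probabilistic subtleties aside for a moment, the single-call recursion analyzed in Theorems 5 and 6 is easy to simulate. The sketch below runs a past-gradient (Popov-style) single-call loop with the step-size schedule γt = γ/(t + b) on a toy strongly monotone operator; the operator M, noise level σ, and the constants γ and b are illustrative assumptions chosen to satisfy γ > 1/α and b ≥ 4βγ, not the paper's experimental setup:

```python
import numpy as np

# Toy monotone operator V(z) = M z with <M z, z> = a ||z||^2, so alpha = a = 1
# and the unique solution of (VI) is x* = 0. (Illustrative choice, not from the paper.)
rng = np.random.default_rng(0)
a = 1.0
M = np.array([[a, 1.0], [-1.0, a]])

def oracle(z, sigma=0.1):
    """Stochastic oracle of the form (1): V(z) plus zero-mean noise."""
    return M @ z + sigma * rng.standard_normal(2)

gamma, b = 2.0, 12.0          # gamma > 1/alpha = 1 and b >= 4*beta*gamma (beta = ||M||)
x = np.array([1.0, 1.0])      # base state X_1
g = oracle(x)                 # stands in for the oracle call at the initial leading state
for t in range(1, 5001):
    gt = gamma / (t + b)      # step-size gamma_t = gamma / (t + b)
    x_lead = x - gt * g       # leading state: extrapolate using the *past* oracle call
    g = oracle(x_lead)        # the single oracle call of this iteration
    x = x - gt * g            # base state update

err = float(np.dot(x, x))     # squared distance to x* = 0, expected to decay as O(1/t)
print(err)
```

With these choices the final squared error should be of the order of the Theorem 5 bound 6γ²σ²/(αγ − 1) · 1/t, i.e., typically well below 10⁻² after t = 5000 iterations.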
Accounting for this conditional bias requires a surprisingly delicate probabilistic argument, which we also detail in the supplement.

6 Concluding remarks

Our aim in this paper was to provide a synthetic view of single-call surrogates to the Extra-Gradient algorithm, and to establish optimal convergence rates in a range of different settings – deterministic, stochastic, and/or non-monotone. Several interesting avenues open up as a result, from extending the theory to more general Bregman proximal settings, to developing an adaptive version as in the recent work [2] for two-call methods. We defer these research directions to future work.

Acknowledgments

This work benefited from financial support by MIAI Grenoble Alpes (Multidisciplinary Institute in Artificial Intelligence). P. Mertikopoulos was partially supported by the French National Research Agency (ANR) grant ORACLESS (ANR–16–CE33–0004–01) and the EU COST Action CA16228 "European Network for Game Theory" (GAMENET).

References

[1] Adolphs, Leonard, Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann. 2019. Local saddle point optimization: a curvature exploitation approach. AISTATS '19: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics.

[2] Bach, Francis, Kfir Y. Levy. 2019. A universal algorithm for variational inequalities adaptive to smoothness and noise. COLT '19: Proceedings of the 32nd Annual Conference on Learning Theory.

[3] Balduzzi, David, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, Thore Graepel. 2018. The mechanics of n-player differentiable games. ICML '18: Proceedings of the 35th International Conference on Machine Learning.

[4] Bauschke, Heinz H., Patrick L. Combettes. 2017. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2nd ed. Springer, New York, NY, USA.

[5] Bertsekas, Dimitri P. 1997.
Nonlinear programming. Journal of the Operational Research Society 48(3) 334–334.

[6] Boţ, Radu Ioan, Panayotis Mertikopoulos, Mathias Staudigl, Phan Tu Vuong. 2019. Forward-backward-forward methods with variance reduction for stochastic variational inequalities. https://arxiv.org/abs/1902.03355.

[7] Bubeck, Sébastien. 2015. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning 8(3-4) 231–358.

[8] Chambolle, Antonin, Thomas Pock. 2011. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40(1) 120–145.

[9] Chavdarova, Tatjana, Gauthier Gidel, François Fleuret, Simon Lacoste-Julien. 2019. Reducing noise in GAN training with variance reduced extragradient. https://arxiv.org/abs/1904.08598.

[10] Chiang, Chao-Kai, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, Shenghuo Zhu. 2012. Online optimization with gradual variations. COLT '12: Proceedings of the 25th Annual Conference on Learning Theory.

[11] Combettes, Patrick L. 2001. Quasi-Fejérian analysis of some optimization algorithms. Dan Butnariu, Yair Censor, Simeon Reich, eds., Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications. Elsevier, New York, NY, USA, 115–152.

[12] Combettes, Patrick L., Jean-Christophe Pesquet. 2015. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM Journal on Optimization 25(2) 1221–1248.

[13] Cui, Shisheng, Uday V. Shanbhag. 2016. On the analysis of reflected gradient and splitting methods for monotone stochastic variational inequality problems. CDC '16: Proceedings of the 57th IEEE Annual Conference on Decision and Control.

[14] Daskalakis, Constantinos, Andrew Ilyas, Vasilis Syrgkanis, Haoyang Zeng. 2018. Training GANs with optimism.
ICLR '18: Proceedings of the 2018 International Conference on Learning Representations.

[15] Daskalakis, Constantinos, Ioannis Panageas. 2018. The limit points of (optimistic) gradient descent in min-max optimization. NIPS '18: Proceedings of the 31st International Conference on Neural Information Processing Systems.

[16] Facchinei, Francisco, Christian Kanzow. 2007. Generalized Nash equilibrium problems. 4OR 5(3) 173–210.

[17] Facchinei, Francisco, Jong-Shi Pang. 2003. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Series in Operations Research, Springer.

[18] Gidel, Gauthier, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, Simon Lacoste-Julien. 2019. A variational inequality perspective on generative adversarial networks. ICLR '19: Proceedings of the 2019 International Conference on Learning Representations.

[19] Gidel, Gauthier, Reyhane Askari Hemmat, Mohammad Pezehski, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, Ioannis Mitliagkas. 2019. Negative momentum for improved game dynamics. AISTATS '19: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics.

[20] Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. 2014. Generative adversarial nets. NIPS '14: Proceedings of the 27th International Conference on Neural Information Processing Systems.

[21] Iusem, Alfredo N., Alejandro Jofré, Roberto I. Oliveira, Philip Thompson. 2017. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization 27(2) 686–724.

[22] Juditsky, Anatoli, Arkadi Semen Nemirovski, Claire Tauvel. 2011. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems 1(1) 17–58.

[23] Korpelevich, G. M. 1976.
The extragradient method for finding saddle points and other problems. Èkonom. i Mat. Metody 12 747–756.

[24] Liang, Tengyuan, James Stokes. 2019. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. AISTATS '19: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics.

[25] Malitsky, Yura. 2015. Projected reflected gradient methods for monotone variational inequalities. SIAM Journal on Optimization 25(1) 502–520.

[26] Malitsky, Yura. 2019. Golden ratio algorithms for variational inequalities. Mathematical Programming 1–28.

[27] Mazumdar, Eric V, Michael I Jordan, S Shankar Sastry. 2019. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. https://arxiv.org/abs/1901.00838.

[28] Mertikopoulos, Panayotis, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, Georgios Piliouras. 2019. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. ICLR '19: Proceedings of the 2019 International Conference on Learning Representations.

[29] Mertikopoulos, Panayotis, Zhengyuan Zhou. 2019. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming 173(1-2) 465–507.

[30] Mokhtari, Aryan, Asuman Ozdaglar, Sarath Pattathil. 2019. Convergence rate of O(1/k) for optimistic gradient and extra-gradient methods in smooth convex-concave saddle point problems. https://arxiv.org/pdf/1906.01115.pdf.

[31] Mokhtari, Aryan, Asuman Ozdaglar, Sarath Pattathil. 2019. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: proximal point approach. https://arxiv.org/abs/1901.08511v2.

[32] Nemirovski, Arkadi Semen. 2004.
Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15(1) 229–251.

[33] Nesterov, Yurii. 2004. Introductory Lectures on Convex Optimization: A Basic Course. No. 87 in Applied Optimization, Kluwer Academic Publishers.

[34] Nesterov, Yurii. 2007. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming 109(2) 319–344.

[35] Nesterov, Yurii. 2009. Primal-dual subgradient methods for convex problems. Mathematical Programming 120(1) 221–259.

[36] Peng, Wei, Yu-Hong Dai, Hui Zhang, Lizhi Cheng. 2019. Training GANs with centripetal acceleration. https://arxiv.org/abs/1902.08949.

[37] Popov, Leonid Denisovich. 1980. A modification of the Arrow–Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR 28(5) 845–848.

[38] Rakhlin, Alexander, Karthik Sridharan. 2013. Online learning with predictable sequences. COLT '13: Proceedings of the 26th Annual Conference on Learning Theory.

[39] Rakhlin, Alexander, Karthik Sridharan. 2013. Optimization, learning, and games with predictable sequences. NIPS '13: Proceedings of the 26th International Conference on Neural Information Processing Systems.

[40] Ratliff, Lillian J, Samuel A Burden, S Shankar Sastry. 2013. Characterization and computation of local Nash equilibria in continuous games. 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 917–924.

[41] Rosen, J. B. 1965. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica 33(3) 520–534.

[42] Tseng, Paul. 1995.
On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics 60(1-2) 237–252.

[43] Tseng, Paul. 2000. A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization 38(2) 431–446.

[44] Yadav, Abhay, Sohil Shah, Zheng Xu, David Jacobs, Tom Goldstein. 2018. Stabilizing adversarial nets with prediction methods. ICLR '18: Proceedings of the 2018 International Conference on Learning Representations.