{"title": "Reducing Noise in GAN Training with Variance Reduced Extragradient", "book": "Advances in Neural Information Processing Systems", "page_first": 393, "page_last": 403, "abstract": "We study the effect of the stochastic gradient noise on the training of generative adversarial networks (GANs) and show that it can prevent the convergence of standard game optimization methods, while the batch version converges. We address this issue with a novel stochastic variance-reduced extragradient (SVRE) optimization algorithm, which for a large class of games improves upon the previous convergence rates proposed in the literature. We observe empirically that SVRE performs similarly to a batch method on MNIST while being computationally cheaper, and that SVRE yields more stable GAN training on standard datasets.", "full_text": "Reducing Noise in GAN Training with Variance\n\nReduced Extragradient\n\nTatjana Chavdarova\u21e4\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nIdiap, \u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne\n\nGauthier Gidel\u21e4\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nElement AI\n\nFran\u00e7ois Fleuret\n\nIdiap, \u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne\n\nSimon Lacoste-Julien\u2020\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nAbstract\n\nWe study the effect of the stochastic gradient noise on the training of generative ad-\nversarial networks (GANs) and show that it can prevent the convergence of standard\ngame optimization methods, while the batch version converges. We address this\nissue with a novel stochastic variance-reduced extragradient (SVRE) optimization\nalgorithm, which for a large class of games improves upon the previous conver-\ngence rates proposed in the literature. 
We observe empirically that SVRE performs similarly to a batch method on MNIST while being computationally cheaper, and that SVRE yields more stable GAN training on standard datasets.

1 Introduction

Many empirical risk minimization algorithms rely on gradient-based optimization methods. These iterative methods handle large-scale training datasets by computing gradient estimates on a subset of the samples, a mini-batch, instead of using all of them at each step, the full batch, resulting in a method called stochastic gradient descent (SGD, Robbins and Monro (1951); Bottou (2010)).
SGD methods are known to efficiently minimize single-objective loss functions, such as cross-entropy for classification or squared loss for regression. Some algorithms go beyond such training objectives and define multiple agents with different or competing objectives. The associated optimization paradigm requires a multi-objective joint minimization. An example of such a class of algorithms is generative adversarial networks (GANs, Goodfellow et al., 2014), which aim at finding a Nash equilibrium of a two-player minimax game, where the players are deep neural networks (DNNs).
Following their success on supervised tasks, SGD-based algorithms have been adopted for GAN training as well. Recently, Gidel et al. (2019a) proposed to use an optimization technique coming from the variational inequality literature called extragradient (Korpelevich, 1976), with provable convergence guarantees, to optimize games (see § 2). However, convergence failures, poor performance (sometimes referred to as "mode collapse"), or hyperparameter susceptibility are reported more commonly than in classical supervised DNN optimization.
We question the naive adoption of such methods for game optimization in order to address these reported training instabilities.
We argue that, owing to the two-player setting, noise impedes training drastically more than in the single-objective case. More precisely, we point out that the noise due to stochasticity may break the convergence of the extragradient method, by considering a simplistic stochastic bilinear game for which it provably does not converge.

*equal contribution
†Canada CIFAR AI Chair

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Comparison of variance reduced methods for games for a µ-strongly monotone operator with Li-Lipschitz stochastic operators. Our result makes the assumption that the operators are ℓi-cocoercive. Note that ℓi ∈ [Li, Li²/µ]; more details and a tighter rate are provided in § 3.2. The SVRG variants are proposed by Palaniappan and Bach (2016). µ-adaptivity indicates whether the hyper-parameters that guarantee convergence (step size & epoch length) depend on the strong monotonicity parameter µ: if not, the algorithm is adaptive to local strong monotonicity. Note that in some cases the constant ℓ may depend on µ, but SVRE is adaptive to strong convexity when ℓ̄ remains close to L̄ (see for instance Proposition 2).

Method        Complexity               µ-adaptivity
SVRG          ln(1/ε) × (n + L̄²/µ²)    no
Acc. SVRG     ln(1/ε) × (n + √n L̄/µ)   no
SVRE (§ 3.2)  ln(1/ε) × (n + ℓ̄/µ)      if ℓ̄ = O(L̄)

Algorithm 1 Pseudocode for SVRE.
1: Input: stopping time T, learning rates ηθ, ηφ, initial weights θ0, φ0. Set t = 0.
2: while t ≤ T do
3:   θS = θt and φS = φt
4:   µS_θ = (1/n) Σ_{i=1}^n ∇θ LG_i(θS, φS) and µS_φ = (1/n) Σ_{i=1}^n ∇φ LD_i(θS, φS)
5:   N ∼ Geom(1/n)   (sample epoch length)
6:   for i = 0 to N − 1 do   {beginning of the epoch}
7:     Sample iθ, iφ ∼ πθ, πφ, do extrapolation:
8:     φ̃t = φt − ηφ dD_iφ(θt, φt, θS, φS)   ▷ see (5)
9:     θ̃t = θt − ηθ dG_iθ(θt, φt, θS, φS)   ▷ see (5)
10:    Sample iθ, iφ ∼ πθ, πφ and do update:
11:    φt+1 = φt − ηφ dD_iφ(θ̃t, φ̃t, θS, φS)   ▷ see (5)
12:    θt+1 = θt − ηθ dG_iθ(θ̃t, φ̃t, θS, φS)   ▷ see (5)
13:  t ← t + 1
14: Output: θT, φT

The theoretical aspect we present in this paper is further supported empirically, since using larger mini-batch sizes for GAN training has been shown to considerably improve the quality of the samples produced by the resulting generative model: Brock et al. (2019) report a relative improvement of 46% of the Inception Score metric (see § 4) on ImageNet if the batch size is increased 8-fold. This notable improvement raises the question of whether noise-reduction optimization methods can be extended to game settings. In turn, this would allow for a principled training method with the practical benefit of not having to empirically establish this multiplicative factor for the batch size.
In this paper, we investigate the interplay between noise and multi-objective problems in the context of GAN training.
Our contributions can be summarized as follows: (i) we show in a motivating example how noise can make stochastic extragradient fail (see § 2.2); (ii) we propose a new method, "stochastic variance reduced extragradient" (SVRE), that combines variance reduction and extrapolation (see Alg. 1 and § 3.2), and show experimentally that it effectively reduces the noise; (iii) we prove the convergence of SVRE under local strong convexity assumptions, improving over the known rates of competitive methods for a large class of games (see § 3.2 for our convergence result and Table 1 for a comparison with standard methods); (iv) we test SVRE empirically to train GANs on several standard datasets, and observe that it can improve SOTA deep models in the late stage of their optimization (see § 4).

2 GANs as a Game and Noise in Games

2.1 Game theory formulation of GANs

The models in a GAN are a generator G, which maps an embedding space to the signal space and should eventually map a fixed noise distribution to the training data distribution, and a discriminator D, whose purpose is to allow the training of the generator by classifying genuine samples against generated ones. At each iteration of the algorithm, the discriminator D is updated to improve its “real vs.
generated” classification performance, and the generator G to degrade it.
From a game theory point of view, GAN training is a differentiable two-player game where the generator Gθ and the discriminator Dφ aim at minimizing their own cost functions LG and LD, respectively:

θ* ∈ arg min_{θ∈Θ} LG(θ, φ*)   and   φ* ∈ arg min_{φ∈Φ} LD(θ*, φ).   (2P-G)

When LD = −LG =: L, this game is called a zero-sum game and (2P-G) is a minimax problem:

min_{θ∈Θ} max_{φ∈Φ} L(θ, φ)   (SP)

Figure 1: Illustration of the discrepancy between games and minimization on simple examples. min: min_{θ,φ∈R} θ² + φ²; game: min_{θ∈R} max_{φ∈R} θ·φ. Left: Minimization. Up to a neighborhood, the noisy gradient always points in a direction that makes the iterate closer to the minimum (★). Right: Game. The noisy gradient may point in a direction (red arrow) that pushes the iterate away from the Nash equilibrium (★).

The gradient method does not converge for some convex-concave examples (Mescheder et al., 2017; Gidel et al., 2019a). To address this, Korpelevich (1976) proposed to use the extragradient method³, which performs a lookahead step in order to get a signal from an extrapolated point:

Extrapolation:  θ̃t = θt − η ∇θ LG(θt, φt),   φ̃t = φt − η ∇φ LD(θt, φt)
Update:         θt+1 = θt − η ∇θ LG(θ̃t, φ̃t),   φt+1 = φt − η ∇φ LD(θ̃t, φ̃t)   (EG)

Note how θt and φt are updated with a gradient from a different point, the extrapolated one. In the context of a zero-sum game, for any convex-concave function L and any closed convex sets Θ and Φ, the extragradient method converges (Harker and Pang, 1990, Thm. 12.1.11).

2.2 Stochasticity Breaks Extragradient

As (EG) converges for some examples on which gradient methods do not, it is reasonable to expect that so does its stochastic counterpart (at least to a neighborhood). However, the resulting noise in the gradient estimate may interact in a problematic way with the oscillations due to the adversarial component of the game⁴. We depict this phenomenon in Fig. 1, where we show the direction of the noisy gradient on a single-objective minimization example and contrast it with a multi-objective one.
We present a simplistic example where the extragradient method converges linearly (Gidel et al., 2019a, Corollary 1) using the full gradient but diverges geometrically when using stochastic estimates of it. Note that standard gradient methods, both batch and stochastic, diverge on this example.
In particular, we show that: (i) if we use standard stochastic estimates of the gradients of L with a simple finite-sum formulation, then the iterates ωt := (θt, φt) produced by the stochastic extragradient method (SEG) diverge geometrically; and on the other hand, (ii) the full-batch extragradient method does converge to the Nash equilibrium ω* of this game (Harker and Pang, 1990, Thm. 12.1.11).

Theorem 1 (Noise may induce divergence). For any ε ≥ 0, there exists a zero-sum (ε/2)-strongly monotone stochastic game such that if ω0 ≠ ω*, then for any step-size η > ε, the iterates (ωt) computed by the stochastic extragradient method diverge geometrically, i.e., there exists ρ > 0 such that E[‖ωt − ω*‖²] > ‖ω0 − ω*‖²(1 + ρ)^t.

Proof sketch. All detailed proofs can be found in § C of the appendix. We consider the following stochastic optimization problem (with d = n):

min_{θ∈R^d} max_{φ∈R^d} (1/n) Σ_{i=1}^n [ (ε/2) θi² + θᵀAi φ − (ε/2) φi² ],  where [Ai]kl = 1 if k = l = i and 0 otherwise.   (1)

Note that this problem is a simple dot product between θ and φ with an (ε/n)-ℓ2 norm penalization; thus we can compute the batch gradient and notice that the Nash equilibrium of this problem is (θ*, φ*) = (0, 0). However, as we shall see, this simple problem breaks standard stochastic optimization methods.

³For simplicity, we focus on the unconstrained setting where Θ = R^d. For the constrained case, a Euclidean projection onto the constraint set should be added at every update of the method.
⁴Gidel et al. (2019b) formalize the notion of "adversarial component" of a game, which yields rotational dynamics in gradient methods (oscillations in parameters), as illustrated by the gradient field of Fig. 1 (right).

Sampling a mini-batch without replacement I ⊂ {1, …, n}, we denote AI := Σ_{i∈I} Ai. The extragradient update rule can be written as:

θt+1 = (1 − ηεAI) θt − ηAI ((1 − ηεAJ) φt + ηAJ θt),
φt+1 = (1 − ηεAI) φt + ηAI ((1 − ηεAJ) θt − ηAJ φt),   (2)

where I and J are the mini-batches sampled for the update and the extrapolation step, respectively. Let us write Nt := ‖θt‖² + ‖φt‖². Noticing that [AIθ]i = [θ]i if i ∈ I and 0 otherwise, we have

E[Nt+1] = ( 1 − (|I|/n)(2ηε − η²(1 + ε²)) − (|I|²/n²)(2η² − η⁴(1 + ε²)) ) E[Nt].   (3)

Consequently, if the mini-batch size is smaller than half of the dataset size, i.e. 2|I| ≤ n, then for all η > ε there exists ρ > 0 such that E[Nt] > N0(1 + ρ)^t.
For the theorem statement, we set n = 2 and |I| = 1.
This result may seem contradictory with the standard result on SEG (Juditsky et al., 2011) saying that the average of the iterates computed by SEG does converge to the Nash equilibrium of the game. However, an important assumption made by Juditsky et al. is that the iterates are projected onto a compact set and that the estimator of the gradient has finite variance. These assumptions break in this example, since the variance of the estimator is proportional to the norm of the (unbounded) parameters. Note that constraining the optimization problem (23) to bounded domains Θ and Φ would make the finite variance assumption from Juditsky et al. (2011) hold. Consequently, the averaged iterate ω̄t := (1/t) Σ_{s=0}^{t−1} ωs would converge to ω*. In § A.1, we explain why in a non-convex setting the convergence of the last iterate is preferable.

3 Reducing Noise in Games with Variance Reduced Extragradient

One way to reduce the noise in the estimation of the gradient is to use mini-batches of samples instead of a single sample. However, mini-batch stochastic extragradient fails to converge on (23) if the mini-batch size is smaller than half of the dataset size (see § C.1). In order to get an estimator of the gradient with a vanishing variance, the optimization literature proposes to take advantage of the finite-sum formulation that often appears in machine learning (Schmidt et al., 2017, and references therein).

3.1 Variance Reduced Gradient Methods

Let us assume that the objective in (2P-G) can be decomposed as a finite sum such that⁵

LG(ω) = (1/n) Σ_{i=1}^n LG_i(ω)   and   LD(ω) = (1/n) Σ_{i=1}^n LD_i(ω),   where ω := (θ, φ).   (4)

Johnson and Zhang (2013) propose the "stochastic variance reduced gradient" (SVRG) as an unbiased estimator of the gradient with a smaller variance than the vanilla mini-batch estimate.
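The snapshot-based construction detailed next is easy to check on a toy problem. The sketch below (our own hedged illustration on a made-up least-squares finite sum, with uniform sampling πi = 1/n) verifies that the variance-reduced estimate stays unbiased while its variance collapses when the iterate is close to the snapshot:

```python
import numpy as np

# Toy finite sum: f(w) = (1/2n) * sum_i (x_i . w - y_i)^2, per-sample gradients below.
rng = np.random.default_rng(0)
n, d = 50, 3
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_i(w, i):                       # gradient of the i-th term
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return (X @ w - y) @ X / n

w_snap = rng.normal(size=d)             # snapshot point (omega^S)
mu = full_grad(w_snap)                  # stored full-batch gradient (mu^S)
w = w_snap + 0.01 * rng.normal(size=d)  # current iterate, close to the snapshot

svrg = np.array([grad_i(w, i) - grad_i(w_snap, i) + mu for i in range(n)])
sgd = np.array([grad_i(w, i) for i in range(n)])
g = full_grad(w)

# Both estimators average (over i) to the exact full gradient ...
bias_svrg = np.abs(svrg.mean(0) - g).max()
bias_sgd = np.abs(sgd.mean(0) - g).max()
# ... but the SVRG estimate has far smaller variance near the snapshot.
var_svrg = np.mean(np.sum((svrg - g) ** 2, axis=1))
var_sgd = np.mean(np.sum((sgd - g) ** 2, axis=1))
print(bias_svrg, bias_sgd, var_svrg, var_sgd)
```

The correction term vanishes as w approaches w_snap, which is why refreshing the snapshot occasionally yields an estimator with vanishing variance.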
The idea is to occasionally take a snapshot ωS of the current model's parameters, and store the full-batch gradient µS at this point. Computing the full-batch gradient µS at ωS is an expensive operation but not prohibitive if done infrequently (for instance, once every dataset pass).
Assuming that we have stored ωS and µS := (µS_θ, µS_φ), the SVRG estimates of the gradients are:

dG_i(ω) := (∇θ LG_i(ω) − ∇θ LG_i(ωS)) / (nπi) + µS_θ,   dD_i(ω) := (∇φ LD_i(ω) − ∇φ LD_i(ωS)) / (nπi) + µS_φ.   (5)

These estimates are unbiased: E[dG_i(ω)] = (1/n) Σ_{i=1}^n ∇θ LG_i(ω) = ∇θ LG(ω), where the expectation is taken over i, picked with probability πi. The non-uniform sampling probabilities πi are used to bias the sampling according to the Lipschitz constant of the stochastic gradient, in order to sample more often the gradients that change quickly. This strategy was first introduced for variance reduced methods by Xiao and Zhang (2014) for SVRG, and has been discussed for saddle point optimization by Palaniappan and Bach (2016).
Originally, SVRG was introduced as an epoch-based algorithm with a fixed epoch size: in Alg. 1, one epoch is an inner loop of size N (Line 6). However, Hofmann et al. (2015) proposed instead to sample the size of each epoch from a geometric distribution, enabling them to analyze SVRG in the same way as SAGA under a unified framework called q-memorization algorithms. We generalize their framework to handle the extrapolation step (EG) and provide a convergence proof for such q-memorization algorithms for games in § C.2.

⁵The "noise dataset" in a GAN is not finite though; see § D.1 for details on how to cope with this in practice.

One advantage of Hofmann et al. (2015)'s framework is also that the sampling of the epoch size does not depend on the condition number of the problem, whereas the original proof for SVRG had to consider an epoch size larger than the condition number (see Leblond et al. (2018, Corollary 16) for a detailed discussion on the convergence rate of SVRG). Thus, this new version of SVRG with a random epoch size becomes adaptive to local strong convexity, since none of its hyper-parameters depend on the strong convexity constant.
However, because of some new technical aspects when working with monotone operators, Palaniappan and Bach (2016)'s proofs (both for SAGA and SVRG) require a step-size (and epoch length for SVRG) that depends on the strong monotonicity constant, making these algorithms not adaptive to local strong monotonicity. This motivates the proposed SVRE algorithm, which may be adaptive to local strong monotonicity, and is thus more appropriate for non-convex optimization.

3.2 SVRE: Stochastic Variance Reduced Extragradient

We describe our proposed algorithm, called stochastic variance reduced extragradient (SVRE), in Alg. 1. In an analogous manner to how Palaniappan and Bach (2016) combined SVRG with the gradient method, SVRE combines SVRG estimates of the gradient (5) with the extragradient method (EG). With SVRE we are able to improve the convergence rates for variance reduction for a large class of stochastic games (see Table 1 and Thm. 2), and we show in § 3.3 that it is the only method which empirically converges on the simple example of § 2.2.
We now describe the theoretical setup for the convergence result. A standard assumption in convex optimization is the strong convexity of the function. However, in a game, the operator

v : ω ↦ [∇θ LG(ω), ∇φ LD(ω)]ᵀ,   (6)

associated with the updates is no longer the gradient of a single function.
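One way to see this concretely (a minimal numerical aside of ours, not from the paper): for the scalar bilinear game min_θ max_φ θ·φ, the game operator is v(θ, φ) = (φ, −θ). Its Jacobian is antisymmetric, and since a gradient field must have a symmetric Jacobian (a Hessian), v cannot be the gradient of any single function:

```python
import numpy as np

# Jacobian of v(theta, phi) = (phi, -theta), the operator of min_theta max_phi theta*phi.
J = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

print(np.allclose(J, J.T))      # False: not symmetric, so v admits no potential function
print(np.linalg.eigvals(J))     # purely imaginary pair -> rotational dynamics
```

The purely imaginary eigenvalues are the algebraic face of the rotations illustrated in Fig. 1 (right).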
To make an analogous assumption for games, the optimization literature considers the notion of strong monotonicity.

Definition 1. An operator F : ω ↦ (Fθ(ω), Fφ(ω)) ∈ R^{d+p} is said to be (µθ, µφ)-strongly monotone if for all ω, ω′ ∈ R^{p+d} we have

Ω((θ, φ), (θ′, φ′)) := µθ‖θ − θ′‖² + µφ‖φ − φ′‖² ≤ (F(ω) − F(ω′))ᵀ(ω − ω′),

where we write ω := (θ, φ) ∈ R^{d+p}. A monotone operator is a (0, 0)-strongly monotone operator.

This definition is a generalization of strong convexity for operators: if f is µ-strongly convex, then ∇f is a µ-strongly monotone operator. Another assumption is the regularity assumption.

Definition 2. An operator F : ω ↦ (Fθ(ω), Fφ(ω)) ∈ R^{d+p} is said to be (γθ, γφ)-regular if

γθ²‖θ − θ′‖² + γφ²‖φ − φ′‖² ≤ ‖F(ω) − F(ω′)‖²,   ∀ ω, ω′ ∈ R^{p+d}.   (7)

Note that an operator is always (0, 0)-regular. This assumption, originally introduced by Tseng (1995), has recently been used (Azizian et al., 2019) to improve the convergence rate of extragradient. For instance, for a full-rank bilinear matrix problem, γ is its smallest singular value. More generally, in the case γθ = γφ, the regularity constant is a lower bound on the minimal singular value of the Jacobian of F (Azizian et al., 2019).
One of our main assumptions is the cocoercivity assumption, which implies the Lipschitzness of the operator in the unconstrained case. We use the cocoercivity constant because it provides a tighter bound for general strongly monotone and Lipschitz games (see the discussion following Theorem 2).

Definition 3. An operator F : ω ↦ (Fθ(ω), Fφ(ω)) ∈ R^{d+p} is said to be (ℓθ, ℓφ)-cocoercive if for all ω, ω′ ∈ Ω we have

‖F(ω) − F(ω′)‖² ≤ ℓθ (Fθ(ω) − Fθ(ω′))ᵀ(θ − θ′) + ℓφ (Fφ(ω) − Fφ(ω′))ᵀ(φ − φ′).   (8)

Note that for an L-Lipschitz and µ-strongly monotone operator, we have ℓ ∈ [L, L²/µ] (Facchinei and Pang, 2003). For instance, when F is the gradient of a convex function, we have ℓ = L. More generally, when F(ω) = (∇f(θ) + Mφ, ∇g(φ) − Mᵀθ), where f and g are µ-strongly convex and L-smooth, we have that γ = σmin(M), and ‖M‖² = O(µL) is a sufficient condition for ℓ = O(L) (see § B). Under this assumption on each cost function of the game operator, we can define a cocoercivity constant adapted to the non-uniform sampling scheme of our stochastic algorithm:

ℓ̄(π)² := (1/n) Σ_{i=1}^n ℓi² / (nπi).   (9)

The standard uniform sampling scheme corresponds to πi := 1/n, and the optimal non-uniform sampling scheme corresponds to π̃i := ℓi / Σ_{j=1}^n ℓj. By Jensen's inequality, we have ℓ̄(π̃) ≤ ℓ̄(π) ≤ maxi ℓi.
For our main result, we make strong convexity, cocoercivity and regularity assumptions.

Assumption 1. For 1 ≤ i ≤ n, the gradients ∇θ LG_i and ∇φ LD_i are respectively ℓθ_i- and ℓφ_i-cocoercive and (γθ_i, γφ_i)-regular. The operator (6) is (µθ, µφ)-strongly monotone.

We now present our convergence result for SVRE with non-uniform sampling (to make our constants comparable to those of Palaniappan and Bach (2016)), but note that we have used uniform sampling in all our experiments (for simplicity).

Theorem 2. Under Assumption 1, after t iterations, the iterate ωt := (θt, φt) computed by SVRE (Alg. 1) with step-sizes ηθ ≤ (40ℓ̄θ)⁻¹ and ηφ ≤ (40ℓ̄φ)⁻¹ and sampling scheme (π̃θ, π̃φ) verifies:

E[‖ωt − ω*‖²] ≤ ( 1 − (1/2) min{ ηθµθ + 9ηθ²γ̄θ²/10, ηφµφ + 9ηφ²γ̄φ²/10, 4/(5n) } )^t E[‖ω0 − ω*‖²],

where ℓ̄θ(πθ) and ℓ̄φ(πφ) are defined in (9). In particular, for ηθ = 1/(40ℓ̄θ) and ηφ = 1/(40ℓ̄φ), we get

E[‖ωt − ω*‖²] ≤ ( 1 − (1/2) min{ (1/40)(µθ/ℓ̄θ + γ̄θ²/(45ℓ̄θ²)), (1/40)(µφ/ℓ̄φ + γ̄φ²/(45ℓ̄φ²)), 4/(5n) } )^t E[‖ω0 − ω*‖²].

We prove this theorem in § C.2. We can notice that the respective condition numbers of LG and LD, defined as κθ := µθ/ℓ̄θ + γ̄θ²/ℓ̄θ² and κφ := µφ/ℓ̄φ + γ̄φ²/ℓ̄φ², appear in our convergence rate. The cocoercivity constant ℓ belongs to [L, L²/µ]; thus our rate may be significantly faster⁶ than the convergence rate of the (non-accelerated) algorithm of Palaniappan and Bach (2016), which depends on the product (µθ/L̄θ)·(µφ/L̄φ). They avoid a dependence on the maximum of the squared condition numbers, max{κθ², κφ²}, by using the weighted Euclidean norm Ω(θ, φ) defined in (14) and rescaling the functions LG and LD with their strong-monotonicity constants. However, this rescaling trick suffers from two issues: (i) in practice we do not know a good estimate of the strong monotonicity constant, which was not the case in Palaniappan and Bach (2016)'s application; and (ii) the algorithm does not adapt to local strong monotonicity.
This property is important in non-convex optimization, since we want the algorithm to exploit the (potential) local stability properties of a stationary point.

3.3 Motivating example

The example (23) for ε = 0 seems to be challenging in the stochastic setting, since all the standard methods and even the stochastic extragradient method fail to find its Nash equilibrium (note that this example is not strongly monotone). We set n = d = 100, and draw [Ai]kl = δkli and [bi]k, [ci]k ∼ N(0, 1/d), 1 ≤ k, l ≤ d, where δkli = 1 if k = l = i and 0 otherwise. Our optimization problem is:

min_{θ∈R^d} max_{φ∈R^d} (1/n) Σ_{i=1}^n ( θᵀbi + θᵀAiφ + ciᵀφ ).   (10)

⁶Particularly, when F is the gradient of a convex function (or close to it), we have ℓ ≈ L and thus our rate recovers the standard ln(1/ε)L/µ, improving over the accelerated algorithm of Palaniappan and Bach (2016). More generally, under the assumptions of Proposition 2, we also recover ln(1/ε)L/µ.

We compare variants of the following algorithms (with uniform sampling, averaging our results over 5 different seeds): (i) AltSGD: the standard method to train GANs, i.e., stochastic gradient with alternating updates of each player; (ii) SVRE: Alg. 1. The AVG prefix corresponds to the uniform average of the iterates, ω̄ := (1/t) Σ_{s=0}^{t−1} ωs. We observe in Fig. 4 that AVG-SVRE converges sublinearly (whereas AVG-AltSGD fails to converge).
This motivates a new variant of SVRE, based on the idea that even if the averaged iterate converges, we do not compute the gradient at that point and thus do not benefit from the fact that this iterate is closer to the optimum (see § A.1). Thus the idea is to occasionally restart the algorithm, i.e., consider the averaged iterate as the new starting point of our algorithm and compute the gradient at that point.
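The inner/outer structure of Alg. 1 can be sketched in a few lines. The toy game below is an ε-regularized (hence strongly monotone) variant of this bilinear setup, chosen only so that the equilibrium has a closed form and convergence is guaranteed; it is an illustrative sketch of ours, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 5
eps, eta = 0.3, 0.1
b = rng.normal(size=(n, d)) / np.sqrt(d)
c = rng.normal(size=(n, d)) / np.sqrt(d)

def v_i(i, th, ph):
    """Per-sample operator (descent direction for both players) of the toy game."""
    e = np.zeros(d); e[i] = 1.0
    return eps * th + b[i] + e * ph[i], eps * ph - e * th[i] - c[i]

def v_full(th, ph):
    # Average of v_i over i, using sum_i A_i = I.
    return eps * th + b.mean(0) + ph / n, eps * ph - th / n - c.mean(0)

th, ph = np.zeros(d), np.zeros(d)
for _ in range(1000):                        # outer loop: snapshots
    th_s, ph_s = th.copy(), ph.copy()
    mu_th, mu_ph = v_full(th_s, ph_s)        # full-batch operator at the snapshot
    for _ in range(rng.geometric(1.0 / n)):  # epoch length N ~ Geom(1/n)
        i = rng.integers(n)                  # sample for the extrapolation
        g, gs = v_i(i, th, ph), v_i(i, th_s, ph_s)
        th_e = th - eta * (g[0] - gs[0] + mu_th)
        ph_e = ph - eta * (g[1] - gs[1] + mu_ph)
        i = rng.integers(n)                  # fresh sample for the update
        g, gs = v_i(i, th_e, ph_e), v_i(i, th_s, ph_s)
        th = th - eta * (g[0] - gs[0] + mu_th)
        ph = ph - eta * (g[1] - gs[1] + mu_ph)

# Equilibrium of the full (affine) operator: solve M w* = -q.
M = np.block([[eps * np.eye(d), np.eye(d) / n],
              [-np.eye(d) / n, eps * np.eye(d)]])
q = np.concatenate([b.mean(0), -c.mean(0)])
w_star = np.linalg.solve(M, -q)
err = np.linalg.norm(np.concatenate([th, ph]) - w_star)
print(err)   # small: the variance-reduced noise vanishes at the equilibrium
```

Because both the iterate and the snapshot approach the equilibrium, the correction in (5) drives the gradient noise to zero, which is what allows a constant step size.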
Restart goes well with SVRE, as we already occasionally stop the inner loop to recompute µS, at which point we decide (with a probability p to be fixed) whether or not to restart the algorithm by taking the snapshot at the point ω̄t instead of ωt. This variant of SVRE is described in Alg. 3 in § E, and the variant combining it with VRAd in § D.1.
In Fig. 4 we observe that the only methods that converge are SVRE and its variants. We do not provide convergence guarantees for Alg. 3 and leave its analysis for future work. However, it is interesting that, to our knowledge, this algorithm is the only stochastic algorithm (excluding batch extragradient, as it is not stochastic) that converges on (23). Note that we tried all the algorithms presented in Fig. 3 of Gidel et al. (2019a) on this unconstrained problem, and that all of them diverge.

4 GAN Experiments

In this section, we investigate the empirical performance of SVRE for GAN training. Note, however, that our theoretical analysis does not hold for games with non-convex objectives such as GANs.
Datasets. We used the following datasets: (i) MNIST (Lecun and Cortes), (ii) CIFAR-10 (Krizhevsky, 2009, § 3), (iii) SVHN (Netzer et al., 2011), and (iv) ImageNet ILSVRC 2012 (Russakovsky et al., 2015), using 28×28, 3×32×32, 3×32×32, and 3×64×64 resolution, respectively.
Metrics. We used the Inception score (IS, Salimans et al., 2016) and the Fréchet Inception distance (FID, Heusel et al., 2017) as performance metrics for image synthesis. To gain insight into whether SVRE indeed reduces the variance of the gradient estimates, we used the second moment estimate (SME, uncentered variance), computed with an exponentially moving average. See § F.1 for details.
DNN architectures. For experiments on MNIST, we used the DCGAN architectures (Radford et al., 2016), described in § F.2.1.
For real-world datasets, we used two architectures (see § F.2 for details and § F.2.2 for motivation): (i) SAGAN (Zhang et al., 2018), and (ii) ResNet, replicating the setup of Miyato et al. (2018), described in detail in § F.2.3 and § F.2.4, respectively. For clarity, we refer to the former as the shallow, and the latter as the deep architectures.

Optimization methods. We conduct experiments using the following optimization methods for GANs: (i) BatchE: full-batch extragradient, (ii) SG: stochastic gradient (alternating GAN), (iii) SE: stochastic extragradient, and (iv) SVRE: stochastic variance reduced extragradient. These can be combined with adaptive learning rate methods such as Adam or with parameter averaging, hereafter denoted as –A and AVG–, respectively. In § D.1, we present a variant of Adam adapted to variance reduced algorithms, referred to as –VRAd. When using the SE–A baseline and the deep architectures, convergence rapidly fails at some point of training (cf. § G.3). This motivates experiments where we start from a stored checkpoint taken before the baseline diverged, and continue training with SVRE. We denote these experiments with WS–SVRE (warm-start SVRE).

4.1 Results

Comparison on MNIST. The MNIST common benchmark allowed for a comparison with full-batch extragradient, as it is feasible to compute. Fig. 3 depicts the IS metric while using either a stochastic, full-batch or variance reduced version of extragradient (see details of SVRE-GAN in § D.2). We always combine the stochastic baseline (SE) with Adam, as proposed by Gidel et al. (2019a). In terms of the number of parameter updates, SVRE performs similarly to BatchE–A (see Fig. 5a, § G). Note that the latter requires significantly more computation: Fig. 3a depicts the IS metric using the number of mini-batch computations as the x-axis (a surrogate for wall-clock time, see below).
Figure 3: (a) IS (higher is better), MNIST; (b) Generator SME, MNIST; (c) FID (lower is better), SVHN. Figures a & b: stochastic, full-batch and variance reduced extragradient optimization on MNIST. We used η = 10⁻² for SVRE. SE–A with η = 10⁻³ achieves similar IS performance to η = 10⁻² and η = 10⁻⁴, omitted from Fig. a for clarity. Figure c: FID on SVHN, using the shallow architectures. See § 4 and § F for the naming of methods and details on the implementation, respectively.

Table 2: Best obtained FID scores for the different optimization methods using the deep architectures (see Table 8, § F.2.4). WS–SVRE starts from the best obtained scores of SE–A. See § F and § G for implementation details and additional results, respectively.

            SG-A    SE-A    SVRE    WS-SVRE
CIFAR-10    21.70   18.65   23.56   16.77
SVHN         5.66    5.14    4.81    4.88

Figure 4: Distance to the optimum of (10); see § 3.3 for the experimental setup.

We observe that, as SE–A has a slower per-iteration convergence rate, SVRE converges faster on this dataset. At the end of training, all methods reach similar performances (IS is above 8.5, see Table 9, § G).
Computational cost. The relative cost of one pass over the dataset for SVRE versus vanilla SGD is a factor of 5: the full-batch gradient is computed (on average) after one pass over the dataset, giving a slowdown of 2; the factor of 5 takes into account the extra stochastic gradient computations for the variance reduction, as well as the extrapolation step overhead. However, as SVRE provides less noisy gradients, it may converge faster per iteration, compensating for the extra per-update cost. Note that many computations can be done in parallel. In Fig. 3a, the x-axis uses an implementation-independent surrogate for wall-clock time that counts the number of mini-batch gradient computations.
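The factor of 5 quoted above can be recovered with a rough per-player accounting (our own back-of-the-envelope breakdown, one of several possible ways to count):

```python
# Amortized stochastic-gradient evaluations per inner iteration of Alg. 1 (one player).
extrapolation = 2   # d_i at the current point: gradient at the iterate + at the snapshot
update = 2          # d_i at the extrapolated point: again two gradient evaluations
snapshot = 1        # n gradients per snapshot, amortized over ~n inner iterations
sgd = 1             # vanilla SGD: one stochastic gradient per iteration

print((extrapolation + update + snapshot) / sgd)   # -> 5.0
```

The snapshot term alone accounts for the quoted slowdown of 2 per dataset pass; the remaining overhead comes from the doubled gradient evaluations of the variance-reduced extrapolation and update steps.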
Note that some training methods for GANs require multiple discriminator updates per generator update, and we observed that, to stabilize our baseline when using the deep architectures, a 1:5 update ratio of G:D was required (cf. § G.3), whereas for SVRE we used a 1:1 ratio (Tab. 2 lists the results).

Second moment estimate and Adam. Fig. 3b depicts the second-moment estimate averaged over the generator's parameters, where we observe that SVRE effectively reduces it over the iterations. The reduction of these values may be the reason why Adam combined with SVRE performs poorly (as these values appear in the denominator, see § D.1). To our knowledge, SVRE is the first optimization method with a constant step size that has worked empirically for GANs on non-trivial datasets.

Comparison on real-world datasets. In Fig. 3c, we compare SVRE with the SE–A baseline on SVHN, using the shallow architectures. We observe that, although SE–A obtains better performance in the early iterations in some experiments, SVRE achieves improved final performance. Tab. 2 summarizes the results on CIFAR-10 and SVHN with the deep architectures. We observe that, with deeper architectures, SE–A is notably more unstable, as training collapsed in 100% of the experiments. To obtain satisfying results for SE–A, we used various techniques such as a learning rate schedule and different update ratios (see § G.3). On the other hand, SVRE did not collapse in any of the experiments but took longer to converge than SE–A. Interestingly, although WS–SVRE starts from an iterate after which the baseline diverges, it continues to improve the obtained FID score and does not diverge.
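The interaction with Adam's denominator mentioned above can be illustrated numerically. The following sketch uses plain Adam moment updates with illustrative gradient sequences (it is not the paper's –VRAd variant): when variance reduction shrinks the second-moment estimate v, the effective per-coordinate step lr·m̂/(√v̂ + ε) grows for the same learning rate.

```python
import math

def adam_effective_step(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam's moment updates on a scalar gradient sequence and return the
    magnitude of the final effective step |lr * m_hat / (sqrt(v_hat) + eps)|."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment EMA
        v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
    return abs(lr * m_hat / (math.sqrt(v_hat) + eps))

# Low-variance gradients: v_hat ~ m_hat^2, so the effective step is close to lr.
step_low = adam_effective_step([0.01] * 100)
# Noisy gradients with a small mean: large v_hat crushes the effective step.
step_high = adam_effective_step([1.0 if t % 2 == 0 else -0.98 for t in range(100)])
print(step_low, step_high)  # step_low is orders of magnitude larger
```

This is consistent with the observation in the text: once SVRE drives the second-moment estimates down, the Adam denominator becomes very small and the resulting steps can be disproportionately large.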
See § G for additional experiments.

5 Related work

Surprisingly, there exist only a few works on variance reduction methods for monotone operators, namely by Palaniappan and Bach (2016) and Davis (2016). The latter requires a co-coercivity assumption on the operator, and thus only convex optimization is considered. Our work provides a new way to use variance reduction for monotone operators, based on the extragradient method (Korpelevich, 1976). Recently, Iusem et al. (2017) proposed an extragradient method with variance reduction for an infinite sum of operators. The authors use mini-batches of growing size in order to reduce the variance of their algorithm and to converge with a constant step size. However, this approach is prohibitively expensive in our application. Moreover, Iusem et al. do not use SAGA/SVRG-style updates exploiting the finite-sum formulation, leading to a sublinear convergence rate, whereas our method enjoys a linear convergence rate by exploiting the finite-sum assumption.

Daskalakis et al. (2018) proposed a method called Optimistic-Adam, inspired by game theory. This method is closely related to extragradient, with a slightly different update scheme. More recently, Gidel et al. (2019a) proposed to use extragradient to train GANs, introducing a method called ExtraAdam. This method outperformed Optimistic-Adam when trained on CIFAR-10. Our work is also an attempt to find principled ways to train GANs.
Considering that the game aspect is better handled by the extragradient method, we focus on the optimization issues arising from the noise in the training procedure, a potential issue that had so far been disregarded in GAN training.

In the context of deep learning, despite some very interesting theoretical results on non-convex minimization (Reddi et al., 2016; Allen-Zhu and Hazan, 2016), the effectiveness of variance reduced methods remains an open question, and a recent technical report by Defazio and Bottou (2018) provides negative empirical results on the variance reduction aspect. In addition, two recent large-scale studies showed that an increased batch size (i) has only a marginal impact on single-objective training (Shallue et al., 2018), but (ii) yields a surprisingly large performance improvement on GAN training (Brock et al., 2019). In our work, we are able to show positive results for variance reduction in a real-world deep learning setting. This unexpected difference seems to confirm the remarkable, and still poorly understood, discrepancy between multi-objective optimization and standard minimization.

6 Discussion

Motivated by a simple bilinear game optimization problem in which stochasticity provably breaks the convergence of previous stochastic methods, we proposed the novel SVRE algorithm, which combines SVRG with the extragradient method for optimizing games. On the theory side, SVRE improves upon the previous best results for strongly convex games, and empirically, it is the only method that converges on our stochastic bilinear game counter-example.

We empirically observed that SVRE for GAN training obtains convergence speed similar to Batch-Extragradient on MNIST, while the latter is computationally infeasible for large datasets. For shallow architectures, SVRE matched or improved over the baselines on all four datasets.
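The combination of SVRG-style variance reduction with extragradient can be sketched on a toy finite-sum two-player game. The example below is purely illustrative and not the paper's exact algorithm or experiment: for numerical robustness it adds a small ℓ2 term to each player's loss (making the game strongly monotone, the setting of the linear-rate theory), uses a snapshot refreshed with probability 1/n, and all constants are made up. Each player's gradient estimate has the SVRG form g_i(w) − g_i(w_s) + μ, and each iteration performs an extrapolation step followed by an update from the original point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, eta = 16, 3, 0.5, 0.1
a = rng.uniform(0.5, 1.5, size=n)   # per-sample coupling strengths (illustrative)
a_bar = a.mean()                    # the batch game couples with a_bar

def grad(ai, theta, phi):
    """Simultaneous gradient of the two players' sample losses:
    L_theta = (lam/2)|theta|^2 + ai*theta.phi   (theta descends),
    L_phi   = (lam/2)|phi|^2   - ai*theta.phi   (phi descends, i.e. ascends the coupling)."""
    return lam * theta + ai * phi, lam * phi - ai * theta

def vr_grad(i, w, ws, mu):
    """SVRG-style variance-reduced estimate g_i(w) - g_i(w_s) + mu."""
    gt, gp = grad(a[i], *w)
    st, sp = grad(a[i], *ws)
    return gt - st + mu[0], gp - sp + mu[1]

theta, phi = np.ones(d), np.ones(d)
ws = (theta.copy(), phi.copy())   # snapshot point w_s
mu = grad(a_bar, *ws)             # full-batch gradient at the snapshot
                                  # (exact here, since grad is linear in ai)
for step in range(4000):
    if rng.random() < 1.0 / n:    # refresh the snapshot with probability 1/n
        ws = (theta.copy(), phi.copy())
        mu = grad(a_bar, *ws)
    # extrapolation step: look ahead from (theta, phi) using one sample
    dt, dp = vr_grad(rng.integers(n), (theta, phi), ws, mu)
    look = (theta - eta * dt, phi - eta * dp)
    # update step: gradient at the extrapolated point, applied at the original point
    dt, dp = vr_grad(rng.integers(n), look, ws, mu)
    theta, phi = theta - eta * dt, phi - eta * dp

# The iterates shrink toward the unique equilibrium at (0, 0).
print(np.linalg.norm(theta) + np.linalg.norm(phi))
```

Because the variance-reduced noise scales with the distance to the snapshot, it vanishes as the iterates approach the equilibrium, which is what permits a constant step size here, in contrast to plain stochastic extragradient.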
Our experiments with deeper architectures show that SVRE is notably more stable with respect to hyperparameter choice. Moreover, while its stochastic counterpart diverged in all our experiments, SVRE did not. However, we observed that SVRE took more iterations to converge when using deeper architectures, though, notably, we were using constant step sizes, unlike the baselines, which required Adam. As adaptive step sizes often provide significant improvements, developing such a variant for SVRE is a promising direction for future work. In the meantime, the stability of SVRE suggests a practical use case for GANs: warm-starting it just before the baseline diverges and running it for further improvement, as demonstrated with the WS–SVRE method in our experiments.

Acknowledgements

This research was partially supported by the Canada CIFAR AI Chair Program, the Canada Excellence Research Chair in "Data Science for Realtime Decision-making", by the NSERC Discovery Grant RGPIN-2017-06936, by the Hasler Foundation through the MEMUDE project, and by a Google Focused Research Award. The authors would like to thank Compute Canada for providing the GPUs used for this research. TC would like to thank Sebastian Stich and Martin Jaggi, and GG and TC would like to thank Hugo Berard for helpful discussions.

References

Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In ICML, 2016.

W. Azizian, I. Mitliagkas, S. Lacoste-Julien, and G. Gidel. A tight and unified analysis of extragradient for a whole spectrum of differentiable games. arXiv:1906.05945, 2019.

L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.

C.
Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.

D. Davis. SMART: The stochastic monotone aggregated root-finding algorithm. arXiv:1601.00698, 2016.

A. Defazio and L. Bottou. On the ineffectiveness of variance reduced optimization for deep learning. arXiv:1812.04529, 2018.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.

F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems, Vol. I. Springer Series in Operations Research and Financial Engineering. Springer-Verlag, 2003.

G. Gidel, H. Berard, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial nets. In ICLR, 2019a.

G. Gidel, R. A. Hemmat, M. Pezeshki, R. L. Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas. Negative momentum for improved game dynamics. In AISTATS, 2019b.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 1990.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.

T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors.
In NIPS, 2015.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

A. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.

A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 2011.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

G. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 1976.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, 2009.

R. Leblond, F. Pedregosa, and S. Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. JMLR, 19(81):1–68, 2018.

Y. LeCun and C. Cortes. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.

J. H. Lim and J. C. Ye. Geometric GAN. arXiv:1705.02894, 2017.

L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NIPS, 2017.

T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011. URL http://ufldl.stanford.edu/housenumbers/.

B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In NIPS, 2016.

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola.
Stochastic variance reduction for nonconvex optimization. In ICML, 2016.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.

T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In ICML, 2013.

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2017.

C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv:1811.03600, 2018.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv:1512.00567, 2015.

P. Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 1995.

A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In NIPS, 2017.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena.
Self-Attention Generative Adversarial Networks. arXiv:1805.08318, 2018.