{"title": "Improved Training of Wasserstein GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 5767, "page_last": 5777, "abstract": "Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only poor samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous generators. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.", "full_text": "Improved Training of Wasserstein GANs\n\nIshaan Gulrajani1\u21e4, Faruk Ahmed1, Martin Arjovsky2, Vincent Dumoulin1, Aaron Courville1,3\n\n1 Montreal Institute for Learning Algorithms\n2 Courant Institute of Mathematical Sciences\n\n3 CIFAR Fellow\n\n{faruk.ahmed,vincent.dumoulin,aaron.courville}@umontreal.ca\n\nma4371@nyu.edu\n\nigul222@gmail.com\n\nAbstract\n\nGenerative Adversarial Networks (GANs) are powerful generative models, but\nsuffer from training instability. The recently proposed Wasserstein GAN (WGAN)\nmakes progress toward stable training of GANs, but sometimes can still generate\nonly poor samples or fail to converge. We \ufb01nd that these problems are often due\nto the use of weight clipping in WGAN to enforce a Lipschitz constraint on the\ncritic, which can lead to undesired behavior. We propose an alternative to clipping\nweights: penalize the norm of gradient of the critic with respect to its input. 
Our\nproposed method performs better than standard WGAN and enables stable training\nof a wide variety of GAN architectures with almost no hyperparameter tuning,\nincluding 101-layer ResNets and language models with continuous generators.\nWe also achieve high quality generations on CIFAR-10 and LSUN bedrooms. \u2020\n\n1 Introduction\n\nGenerative Adversarial Networks (GANs) [9] are a powerful class of generative models that cast\ngenerative modeling as a game between two networks: a generator network produces synthetic data\ngiven some noise source and a discriminator network discriminates between the generator\u2019s output\nand true data. GANs can produce very visually appealing samples, but are often hard to train, and\nmuch of the recent work on the subject [22, 18, 2, 20] has been devoted to finding ways of stabilizing\ntraining. Despite this, consistently stable training of GANs remains an open problem.\n\nIn particular, [1] provides an analysis of the convergence properties of the value function being\noptimized by GANs. Their proposed alternative, named Wasserstein GAN (WGAN) [2], leverages\nthe Wasserstein distance to produce a value function which has better theoretical properties than the\noriginal. WGAN requires that the discriminator (called the critic in that work) must lie within the\nspace of 1-Lipschitz functions, which the authors enforce through weight clipping.\n\nOur contributions are as follows:\n\n1. On toy datasets, we demonstrate how critic weight clipping can lead to undesired behavior.\n\n2. We propose gradient penalty (WGAN-GP), which does not suffer from the same problems.\n\n3. 
We demonstrate stable training of varied GAN architectures, performance improvements\nover weight clipping, high-quality image generation, and a character-level GAN language\nmodel without any discrete sampling.\n\n*Now at Google Brain\n\u2020Code for our models is available at https://github.com/igul222/improved_wgan_training.\n\n2 Background\n\n2.1 Generative adversarial networks\n\nThe GAN training strategy is to define a game between two competing networks. The generator\nnetwork maps a source of noise to the input space. The discriminator network receives either a\ngenerated sample or a true data sample and must distinguish between the two. The generator is\ntrained to fool the discriminator.\n\nFormally, the game between the generator G and the discriminator D is the minimax objective:\n\n$$\\min_G \\max_D \\; \\mathbb{E}_{x \\sim P_r}[\\log(D(x))] + \\mathbb{E}_{\\tilde{x} \\sim P_g}[\\log(1 - D(\\tilde{x}))], \\qquad (1)$$\n\nwhere Pr is the data distribution and Pg is the model distribution implicitly defined by $\\tilde{x} = G(z)$, $z \\sim p(z)$ (the input z to the generator is sampled from some simple noise distribution,\nsuch as the uniform distribution or a spherical Gaussian distribution).\n\nIf the discriminator is trained to optimality before each generator parameter update, then minimizing\nthe value function amounts to minimizing the Jensen-Shannon divergence between Pr and Pg\n[9], but doing so often leads to vanishing gradients as the discriminator saturates. In practice, [9]\nadvocates that the generator be instead trained to maximize $\\mathbb{E}_{\\tilde{x} \\sim P_g}[\\log(D(\\tilde{x}))]$, which goes some\nway to circumvent this difficulty. However, even this modified loss function can misbehave in the\npresence of a good discriminator [1].\n\n2.2 Wasserstein GANs\n\n[2] argues that the divergences which GANs typically minimize are potentially not continuous with\nrespect to the generator\u2019s parameters, leading to training difficulty. 
They propose instead using\nthe Earth-Mover (also called Wasserstein-1) distance W(q, p), which is informally defined as the\nminimum cost of transporting mass in order to transform the distribution q into the distribution p\n(where the cost is mass times transport distance). Under mild assumptions, W(q, p) is continuous\neverywhere and differentiable almost everywhere.\n\nThe WGAN value function is constructed using the Kantorovich-Rubinstein duality [24] to obtain\n\n$$\\min_G \\max_{D \\in \\mathcal{D}} \\; \\mathbb{E}_{x \\sim P_r}[D(x)] - \\mathbb{E}_{\\tilde{x} \\sim P_g}[D(\\tilde{x})] \\qquad (2)$$\n\nwhere $\\mathcal{D}$ is the set of 1-Lipschitz functions and Pg is once again the model distribution implicitly\ndefined by $\\tilde{x} = G(z)$, $z \\sim p(z)$. In that case, under an optimal discriminator (called a critic in the\npaper, since it\u2019s not trained to classify), minimizing the value function with respect to the generator\nparameters minimizes W(Pr, Pg).\n\nThe WGAN value function results in a critic function whose gradient with respect to its input is\nbetter behaved than its GAN counterpart, making optimization of the generator easier. Additionally,\nWGAN has the desirable property that its value function correlates with sample quality, which is not\nthe case for GANs.\n\nTo enforce the Lipschitz constraint on the critic, [2] propose to clip the weights of the critic to lie\nwithin a compact space [-c, c]. The set of functions satisfying this constraint is a subset of the\nk-Lipschitz functions for some k which depends on c and the critic architecture. In the following\nsections, we demonstrate some of the issues with this approach and propose an alternative.\n\n2.3 Properties of the optimal WGAN critic\n\nIn order to understand why weight clipping is problematic in a WGAN critic, as well as to motivate\nour approach, we highlight some properties of the optimal critic in the WGAN framework. We prove\nthese in the Appendix.\n\nProposition 1. 
Let Pr and Pg be two distributions in $\\mathcal{X}$, a compact metric space. Then, there is a\n1-Lipschitz function $f^*$ which is the optimal solution of $\\max_{\\|f\\|_L \\le 1} \\mathbb{E}_{y \\sim P_r}[f(y)] - \\mathbb{E}_{x \\sim P_g}[f(x)]$.\nLet $\\pi$ be the optimal coupling between Pr and Pg, defined as the minimizer of: $W(P_r, P_g) = \\inf_{\\pi \\in \\Pi(P_r, P_g)} \\mathbb{E}_{(x,y) \\sim \\pi}[\\|x - y\\|]$ where $\\Pi(P_r, P_g)$ is the set of joint distributions $\\pi(x, y)$ whose\nmarginals are Pr and Pg, respectively. Then, if $f^*$ is differentiable\u2021, $\\pi(x = y) = 0$\u00a7, and $x_t = tx + (1 - t)y$ with $0 \\le t \\le 1$, it holds that $\\mathbb{P}_{(x,y) \\sim \\pi}\\left[\\nabla f^*(x_t) = \\frac{y - x_t}{\\|y - x_t\\|}\\right] = 1$.\n\nCorollary 1. $f^*$ has gradient norm 1 almost everywhere under Pr and Pg.\n\n3 Difficulties with weight constraints\n\nWe find that weight clipping in WGAN leads to optimization difficulties, and that even when optimization\nsucceeds the resulting critic can have a pathological value surface. We explain these\nproblems below and demonstrate their effects; however we do not claim that each one always occurs\nin practice, nor that they are the only such mechanisms.\nOur experiments use the specific form of weight constraint from [2] (hard clipping of the magnitude\nof each weight), but we also tried other weight constraints (L2 norm clipping, weight normalization),\nas well as soft constraints (L1 and L2 weight decay) and found that they exhibit similar problems.\nTo some extent these problems can be mitigated with batch normalization in the critic, which [2]\nuse in all of their experiments. However even with batch normalization, we observe that very deep\nWGAN critics often fail to converge.\n\n[Figure 1a panels: 8 Gaussians, 25 Gaussians, Swiss Roll]\n\n(a) Value surfaces of WGAN critics trained to optimality\non toy datasets using (top) weight clipping\nand (bottom) gradient penalty. Critics trained with\nweight clipping fail to capture higher moments of the\ndata distribution. 
The \u2018generator\u2019 is held fixed at the\nreal data plus Gaussian noise.\n\n(b) (left) Gradient norms of deep WGAN critics during\ntraining on toy datasets either explode or vanish\nwhen using weight clipping, but not when using a\ngradient penalty. (right) Weight clipping (top) pushes\nweights towards two values (the extremes of the clipping\nrange), unlike gradient penalty (bottom).\n\n[Figure 1b plots: gradient norm (log scale) versus discriminator layer for weight clipping (c = 0.001, 0.01, 0.1) and gradient penalty; weight histograms for weight clipping and gradient penalty]\n\nFigure 1: Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping.\n\n3.1 Capacity underuse\n\nImplementing a k-Lipschitz constraint via weight clipping biases the critic towards much simpler\nfunctions. As stated previously in Corollary 1, the optimal WGAN critic has unit gradient norm\nalmost everywhere under Pr and Pg; under a weight-clipping constraint, we observe that our neural\nnetwork architectures which try to attain their maximum gradient norm k end up learning extremely\nsimple functions.\n\nTo demonstrate this, we train WGAN critics with weight clipping to optimality on several toy distributions,\nholding the generator distribution Pg fixed at the real distribution plus unit-variance Gaussian\nnoise. We plot value surfaces of the critics in Figure 1a. We omit batch normalization in the\ncritic. In each case, the critic trained with weight clipping ignores higher moments of the data distribution\nand instead models very simple approximations to the optimal functions. In contrast, our\napproach does not suffer from this behavior.\n\n\u2021We can actually assume much less, and talk only about directional derivatives on the direction of the line;\nwhich we show in the proof always exist. This would imply that in every point where $f^*$ is differentiable (and\nthus we can take gradients in a neural network setting) the statement holds.\n\n\u00a7This assumption is in order to exclude the case when the matching point of sample x is x itself. It is\nsatisfied in the case that Pr and Pg have supports that intersect in a set of measure 0, such as when they are\nsupported by two low dimensional manifolds that don\u2019t perfectly align [1].\n\nAlgorithm 1 WGAN with gradient penalty. We use default values of $\\lambda = 10$, $n_{\\text{critic}} = 5$, $\\alpha = 0.0001$, $\\beta_1 = 0$, $\\beta_2 = 0.9$.\nRequire: The gradient penalty coefficient $\\lambda$, the number of critic iterations per generator iteration $n_{\\text{critic}}$, the batch size m, Adam hyperparameters $\\alpha$, $\\beta_1$, $\\beta_2$.\nRequire: initial critic parameters $w_0$, initial generator parameters $\\theta_0$.\n1: while $\\theta$ has not converged do\n2:   for t = 1, ..., $n_{\\text{critic}}$ do\n3:     for i = 1, ..., m do\n4:       Sample real data $x \\sim P_r$, latent variable $z \\sim p(z)$, a random number $\\epsilon \\sim U[0, 1]$.\n5:       $\\tilde{x} \\leftarrow G_\\theta(z)$\n6:       $\\hat{x} \\leftarrow \\epsilon x + (1 - \\epsilon)\\tilde{x}$\n7:       $L^{(i)} \\leftarrow D_w(\\tilde{x}) - D_w(x) + \\lambda(\\|\\nabla_{\\hat{x}} D_w(\\hat{x})\\|_2 - 1)^2$\n8:     end for\n9:     $w \\leftarrow \\text{Adam}(\\nabla_w \\frac{1}{m} \\sum_{i=1}^m L^{(i)}, w, \\alpha, \\beta_1, \\beta_2)$\n10:   end for\n11:   Sample a batch of latent variables $\\{z^{(i)}\\}_{i=1}^m \\sim p(z)$.\n12:   $\\theta \\leftarrow \\text{Adam}(\\nabla_\\theta \\frac{1}{m} \\sum_{i=1}^m -D_w(G_\\theta(z)), \\theta, \\alpha, \\beta_1, \\beta_2)$\n13: end while\n\n3.2 Exploding and vanishing gradients\n\nWe observe that the WGAN optimization process is difficult because of interactions between the\nweight constraint and the cost function, which result in either vanishing or exploding gradients\nwithout careful tuning of the clipping threshold c.\nTo demonstrate this, we train WGAN on the Swiss Roll toy dataset, varying the clipping threshold c\nin $[10^{-1}, 10^{-2}, 10^{-3}]$, and plot the norm of the gradient of the critic loss with respect to successive\nlayers of activations. Both generator and critic are 12-layer ReLU MLPs without batch normalization.\nFigure 1b shows that for each of these values, the gradient either grows or decays exponentially\nas we move farther back in the network. We find our method results in more stable gradients that\nneither vanish nor explode, allowing training of more complicated networks.\n\n4 Gradient penalty\n\nWe now propose an alternative way to enforce the Lipschitz constraint. A differentiable function\nis 1-Lipschitz if and only if it has gradients with norm at most 1 everywhere, so we consider directly\nconstraining the gradient norm of the critic\u2019s output with respect to its input. To circumvent\ntractability issues, we enforce a soft version of the constraint with a penalty on the gradient norm\nfor random samples $\\hat{x} \\sim P_{\\hat{x}}$. Our new objective is\n\n$$L = \\underbrace{\\mathbb{E}_{\\tilde{x} \\sim P_g}[D(\\tilde{x})] - \\mathbb{E}_{x \\sim P_r}[D(x)]}_{\\text{Original critic loss}} + \\underbrace{\\lambda \\, \\mathbb{E}_{\\hat{x} \\sim P_{\\hat{x}}}\\left[(\\|\\nabla_{\\hat{x}} D(\\hat{x})\\|_2 - 1)^2\\right]}_{\\text{Our gradient penalty}}. \\qquad (3)$$\n\nSampling distribution We implicitly define $P_{\\hat{x}}$ sampling uniformly along straight lines between\npairs of points sampled from the data distribution Pr and the generator distribution Pg. 
This is\nmotivated by the fact that the optimal critic contains straight lines with gradient norm 1 connecting\ncoupled points from Pr and Pg (see Proposition 1). Given that enforcing the unit gradient norm\nconstraint everywhere is intractable, enforcing it only along these straight lines seems sufficient and\nexperimentally results in good performance.\n\nPenalty coefficient All experiments in this paper use $\\lambda = 10$, which we found to work well across\na variety of architectures and datasets ranging from toy tasks to large ImageNet CNNs.\n\nNo critic batch normalization Most prior GAN implementations [21, 22, 2] use batch normalization\nin both the generator and the discriminator to help stabilize training, but batch normalization\nchanges the form of the discriminator\u2019s problem from mapping a single input to a single output to\nmapping from an entire batch of inputs to a batch of outputs [22]. Our penalized training objective\nis no longer valid in this setting, since we penalize the norm of the critic\u2019s gradient with respect\nto each input independently, and not the entire batch. To resolve this, we simply omit batch normalization\nin the critic in our models, finding that they perform well without it. Our method works\nwith normalization schemes which don\u2019t introduce correlations between examples. In particular, we\nrecommend layer normalization [3] as a drop-in replacement for batch normalization.\n\nTwo-sided penalty We encourage the norm of the gradient to go towards 1 (two-sided penalty)\ninstead of just staying below 1 (one-sided penalty). Empirically this seems not to constrain the\ncritic too much, likely because the optimal WGAN critic anyway has gradients with norm 1 almost\neverywhere under Pr and Pg and in large portions of the region in between (see subsection 2.3). 
In\nour early observations we found this to perform slightly better, but we don\u2019t investigate this fully.\nWe describe experiments on the one-sided penalty in the appendix.\n\n5 Experiments\n\n5.1 Training random architectures within a set\n\nWe experimentally demonstrate our model\u2019s ability to train a large number of architectures which\nwe think are useful to be able to train. Starting from the DCGAN architecture, we define a set of\narchitecture variants by changing model settings to random corresponding values in Table 1. We\nbelieve that reliable training of many of the architectures in this set is a useful goal, but we do not\nclaim that our set is an unbiased or representative sample of the whole space of useful architectures:\nit is designed to demonstrate a successful regime of our method, and readers should evaluate whether\nit contains architectures similar to their intended application.\n\nTable 1: We evaluate WGAN-GP\u2019s ability to train the architectures in this set.\n\nNonlinearity (G) | [ReLU, LeakyReLU, softplus(2x+2)/2 - 1, tanh]\nNonlinearity (D) | [ReLU, LeakyReLU, softplus(2x+2)/2 - 1, tanh]\nDepth (G) | [4, 8, 12, 20]\nDepth (D) | [4, 8, 12, 20]\nBatch norm (G) | [True, False]\nBatch norm (D; layer norm for WGAN-GP) | [True, False]\nBase filter count (G) | [32, 64, 128]\nBase filter count (D) | [32, 64, 128]\n\nFrom this set, we sample 200 architectures and train each on 32\u00d732 ImageNet with both WGAN-GP\nand the standard GAN objectives. Table 2 lists the number of instances where either: only the standard\nGAN succeeded, only WGAN-GP succeeded, both succeeded, or both failed, where success\nis defined as inception score > min_score. For most choices of score threshold, WGAN-GP\nsuccessfully trains many architectures from this set which we were unable to train with the standard\nGAN objective.\n\nTable 2: Outcomes of training 200 random architectures, for different success thresholds. 
For\ncomparison, our standard DCGAN achieved a score of 7.24. A longer version of this table can be\nfound in the appendix.\n\nMin. score | Only GAN | Only WGAN-GP | Both succeeded | Both failed\n1.0 | 0 | 8 | 192 | 0\n3.0 | 1 | 88 | 110 | 1\n5.0 | 0 | 147 | 42 | 11\n7.0 | 1 | 104 | 5 | 90\n9.0 | 0 | 0 | 0 | 200\n\n[Figure 2 image grid. Columns (methods): DCGAN, LSGAN, WGAN (clipping), WGAN-GP (ours). Rows (architectures): Baseline (G: DCGAN, D: DCGAN); G: No BN and a constant number of filters, D: DCGAN; G: 4-layer 512-dim ReLU MLP, D: DCGAN; No normalization in either G or D; Gated multiplicative nonlinearities everywhere in G and D; tanh nonlinearities everywhere in G and D; 101-layer ResNet G and D]\n\nFigure 2: Different GAN architectures trained with different methods. We only succeeded in training\nevery architecture with a shared set of hyperparameters using WGAN-GP.\n\n5.2 Training varied architectures on LSUN bedrooms\n\nTo demonstrate our model\u2019s ability to train many architectures with its default settings, we train six\ndifferent GAN architectures on the LSUN bedrooms dataset [30]. In addition to the baseline DCGAN\narchitecture from [21], we choose six architectures whose successful training we demonstrate:\n(1) no BN and a constant number of filters in the generator, as in [2], (2) 4-layer 512-dim ReLU\nMLP generator, as in [2], (3) no normalization in either the discriminator or generator, (4) gated\nmultiplicative nonlinearities, as in [23], (5) tanh nonlinearities, and (6) 101-layer ResNet generator\nand discriminator.\n\nAlthough we do not claim it is impossible without our method, to the best of our knowledge this\nis the first time very deep residual networks were successfully trained in a GAN setting. For each\narchitecture, we train models using four different GAN methods: WGAN-GP, WGAN with weight\nclipping, DCGAN [21], and Least-Squares GAN [17]. 
For each objective, we used the default set\nof optimizer hyperparameters recommended in that work (except LSGAN, where we searched over\nlearning rates).\n\nFor WGAN-GP, we replace any batch normalization in the discriminator with layer normalization\n(see section 4). We train each model for 200K iterations and present samples in Figure 2. We only\nsucceeded in training every architecture with a shared set of hyperparameters using WGAN-GP.\nFor every other training method, some of these architectures were unstable or suffered from mode\ncollapse.\n\n5.3 Improved performance over weight clipping\n\nOne advantage of our method over weight clipping is improved training speed and sample quality.\nTo demonstrate this, we train WGANs with weight clipping and our gradient penalty on CIFAR-10\n[13] and plot Inception scores [22] over the course of training in Figure 3. For WGAN-GP,\nwe train one model with the same optimizer (RMSProp) and learning rate as WGAN with weight\nclipping, and another model with Adam and a higher learning rate. Even with the same optimizer,\nour method converges faster and to a better score than weight clipping. Using Adam further improves\nperformance. We also plot the performance of DCGAN [21] and find that our method converges\nmore slowly (in wall-clock time) than DCGAN, but its score is more stable at convergence.\n\nFigure 3: CIFAR-10 Inception score over generator iterations (left) or wall-clock time (right) for\nfour models: WGAN with weight clipping, WGAN-GP with RMSProp and Adam (to control for\nthe optimizer), and DCGAN. WGAN-GP significantly outperforms weight clipping and performs\ncomparably to DCGAN.\n\n5.4 Sample quality on CIFAR-10 and LSUN bedrooms\n\nFor equivalent architectures, our method achieves comparable sample quality to the standard GAN\nobjective. However the increased stability allows us to improve sample quality by exploring a wider\nrange of architectures. 
To demonstrate this, we find an architecture which establishes a new state of\nthe art Inception score on unsupervised CIFAR-10 (Table 3). When we add label information (using\nthe method in [19]), the same architecture outperforms all other published models except for SGAN.\n\nTable 3: Inception scores on CIFAR-10. Our unsupervised model achieves state-of-the-art performance,\nand our conditional model outperforms all others except SGAN.\n\nUnsupervised\nMethod | Score\nALI [8] (in [26]) | 5.34 \u00b1 .05\nBEGAN [4] | 5.62\nDCGAN [21] (in [11]) | 6.16 \u00b1 .07\nImproved GAN (-L+HA) [22] | 6.86 \u00b1 .06\nEGAN-Ent-VI [7] | 7.07 \u00b1 .10\nDFM [26] | 7.72 \u00b1 .13\nWGAN-GP ResNet (ours) | 7.86 \u00b1 .07\n\nSupervised\nMethod | Score\nSteinGAN [25] | 6.35\nDCGAN (with labels, in [25]) | 6.58\nImproved GAN [22] | 8.09 \u00b1 .07\nAC-GAN [19] | 8.25 \u00b1 .07\nSGAN-no-joint [11] | 8.37 \u00b1 .08\nWGAN-GP ResNet (ours) | 8.42 \u00b1 .10\nSGAN [11] | 8.59 \u00b1 .12\n\nWe also train a deep ResNet on 128 \u00d7 128 LSUN bedrooms and show samples in Figure 4. We\nbelieve these samples are at least competitive with the best reported so far on any resolution for this\ndataset.\n\n5.5 Modeling discrete data with a continuous generator\n\nTo demonstrate our method\u2019s ability to model degenerate distributions, we consider the problem of\nmodeling a complex discrete distribution with a GAN whose generator is defined over a continuous\nspace. As an instance of this problem, we train a character-level GAN language model on the Google\nBillion Word dataset [6]. Our generator is a simple 1D CNN which deterministically transforms a\nlatent vector into a sequence of 32 one-hot character vectors through 1D convolutions. 
We apply a\nsoftmax nonlinearity at the output, but use no sampling step: during training, the softmax output is\npassed directly into the critic (which, likewise, is a simple 1D CNN). When decoding samples, we\njust take the argmax of each output vector.\n\nFigure 4: Samples of 128 \u00d7 128 LSUN bedrooms. We believe these samples are at least comparable\nto the best published results so far.\n\nWe present samples from the model in Table 4. Our model makes frequent spelling errors (likely\nbecause it has to output each character independently) but nonetheless manages to learn quite a lot\nabout the statistics of language. We were unable to produce comparable results with the standard\nGAN objective, though we do not claim that doing so is impossible.\n\nTable 4: Samples from a WGAN character-level language model trained with our method on sentences\nfrom the Billion Word dataset, truncated to 32 characters. The model learns to directly output\none-hot character embeddings from a latent vector without any discrete sampling step. We were\nunable to achieve comparable results with the standard GAN objective and a continuous generator.\n\nWGAN with gradient penalty (1D CNN)\nBusino game camperate spent odea\nIn the bankaway of smarling the\nSingersMay , who kill that imvic\nKeray Pents of the same Reagun D\nManging include a tudancs shat \"\nHis Zuith Dudget , the Denmbern\nIn during the Uitational questio\nDivos from The \u2019 noth ronkies of\nShe like Monday , of macunsuer S\n\nSolice Norkedin pring in since\nThiS record ( 31. 
) UBS ) and Ch\nIt was not the annuas were plogr\nThis will be us , the ect of DAN\nThese leaded as most-worsd p2 a0\nThe time I paidOa South Cubry i\nDour Fraps higs it was these del\nThis year out howneed allowed lo\nKaulna Seto consficutes to repor\n\nThe difference in performance between WGAN and other GANs can be explained as follows. Consider\nthe simplex $\\Delta_n = \\{p \\in \\mathbb{R}^n : p_i \\ge 0, \\sum_i p_i = 1\\}$, and the set of vertices on the simplex (or\none-hot vectors) $V_n = \\{p \\in \\mathbb{R}^n : p_i \\in \\{0, 1\\}, \\sum_i p_i = 1\\} \\subseteq \\Delta_n$. If we have a vocabulary of\nsize n and we have a distribution Pr over sequences of size T, we have that Pr is a distribution on\n$V_n^T = V_n \\times \\cdots \\times V_n$. Since $V_n^T$ is a subset of $\\Delta_n^T$, we can also treat Pr as a distribution on $\\Delta_n^T$ (by\nassigning zero probability mass to all points not in $V_n^T$).\n\nPr is discrete (or supported on a finite number of elements, namely $V_n^T$) on $\\Delta_n^T$, but Pg can easily\nbe a continuous distribution over $\\Delta_n^T$. The KL divergences between two such distributions are\ninfinite, and so the JS divergence is saturated. In practice, this means a discriminator might quickly\nlearn to reject all samples that don\u2019t lie on $V_n^T$ (sequences of one-hot vectors) and give meaningless\ngradients to the generator. However, it is easily seen that the conditions of Theorem 1 and Corollary\n1 of [2] are satisfied even on this non-standard learning scenario with $\\mathcal{X} = \\Delta_n^T$. This means that\nW(Pr, Pg) is still well defined, continuous everywhere and differentiable almost everywhere, and\nwe can optimize it just like in any other continuous variable setting. The way this manifests is that in\nWGANs, the Lipschitz constraint forces the critic to provide a linear gradient from all $\\Delta_n^T$ towards\nthe real points in $V_n^T$.\n\nOther attempts at language modeling with GANs [31, 14, 29, 5, 15, 10] typically use discrete models\nand gradient estimators [27, 12, 16]. Our approach is simpler to implement, though whether it scales\nbeyond a toy language model is unclear.\n\nFigure 5: (a) The negative critic loss of our model on LSUN bedrooms converges toward a minimum\nas the network trains. (b) WGAN training and validation losses on a random 1000-digit subset of\nMNIST show overfitting when using either our method (left) or weight clipping (right). In particular,\nwith our method, the critic overfits faster than the generator, causing the training loss to increase\ngradually over time even as the validation loss drops.\n\n5.6 Meaningful loss curves and detecting overfitting\n\nAn important benefit of weight-clipped WGANs is that their loss correlates with sample quality\nand converges toward a minimum. To show that our method preserves this property, we train a\nWGAN-GP on the LSUN bedrooms dataset [30] and plot the negative of the critic\u2019s loss in Figure 5a.\nWe see that the loss converges as the generator minimizes W(Pr, Pg).\n\nGANs, like all models trained on limited data, will eventually overfit. To explore the loss curve\u2019s\nbehavior when the network overfits, we train large unregularized WGANs on a random 1000-image\nsubset of MNIST and plot the negative critic loss on both the training and validation sets in Figure 5b.\nIn both WGAN and WGAN-GP, the two losses diverge, suggesting that the critic overfits\nand provides an inaccurate estimate of W(Pr, Pg), at which point all bets are off regarding correlation\nwith sample quality. However in WGAN-GP, the training loss gradually increases even while\nthe validation loss drops.\n\n[28] also measure overfitting in GANs by estimating the generator\u2019s log-likelihood. 
Compared\nto that work, our method detects overfitting in the critic (rather than the generator) and measures\noverfitting against the same loss that the network minimizes.\n\n6 Conclusion\n\nIn this work, we demonstrated problems with weight clipping in WGAN and introduced an alternative\nin the form of a penalty term in the critic loss which does not exhibit the same problems. Using\nour method, we demonstrated strong modeling performance and stability across a variety of architectures.\nNow that we have a more stable algorithm for training GANs, we hope our work opens\nthe path for stronger modeling performance on large-scale image datasets and language. Another\ninteresting direction is adapting our penalty term to the standard GAN objective function, where it\nmight stabilize training by encouraging the discriminator to learn smoother decision boundaries.\n\nAcknowledgements\n\nWe would like to thank Mohamed Ishmael Belghazi, L\u00e9on Bottou, Zihang Dai, Stefan Doerr,\nIan Goodfellow, Kyle Kastner, Kundan Kumar, Luke Metz, Alec Radford, Sai Rajeshwar, Aditya\nRamesh, Tom Sercu, Zain Shah and Jake Zhao for insightful comments.\n\nReferences\n\n[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial\nnetworks. 2017.\n\n[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875,\n2017.\n\n[3] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,\n2016.\n\n[4] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial\nnetworks. arXiv preprint arXiv:1703.10717, 2017.\n\n[5] T. Che, Y. Li, R. Zhang, R. D. Hjelm, W. Li, Y. Song, and Y. Bengio. 
Maximum-likelihood\naugmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.\n\n[6] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion\nword benchmark for measuring progress in statistical language modeling. arXiv preprint\narXiv:1312.3005, 2013.\n\n[7] Z. Dai, A. Almahairi, P. Bachman, E. Hovy, and A. Courville. Calibrating energy-based generative\nadversarial networks. arXiv preprint arXiv:1702.01691, 2017.\n\n[8] V. Dumoulin, M. I. D. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and\nA. Courville. Adversarially learned inference. 2017.\n\n[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,\nand Y. Bengio. Generative adversarial nets. In Advances in neural information processing\nsystems, pages 2672\u20132680, 2014.\n\n[10] R. D. Hjelm, A. P. Jacob, T. Che, K. Cho, and Y. Bengio. Boundary-seeking generative adversarial\nnetworks. arXiv preprint arXiv:1702.08431, 2017.\n\n[11] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial\nnetworks. arXiv preprint arXiv:1612.04357, 2016.\n\n[12] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv\npreprint arXiv:1611.01144, 2016.\n\n[13] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.\n\n[14] J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue\ngeneration. arXiv preprint arXiv:1701.06547, 2017.\n\n[15] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing. Recurrent topic-transition gan for visual\nparagraph generation. arXiv preprint arXiv:1703.07022, 2017.\n\n[16] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation\nof discrete random variables. arXiv preprint arXiv:1611.00712, 2016.\n\n[17] X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. 
Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.\n\n[18] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.\n\n[19] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.\n\n[20] B. Poole, A. A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for GANs. arXiv preprint arXiv:1612.02780, 2016.\n\n[21] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\n[22] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226\u20132234, 2016.\n\n[23] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790\u20134798, 2016.\n\n[24] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.\n\n[25] D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.\n\n[26] D. Warde-Farley and Y. Bengio. Improving generative adversarial networks with denoising feature matching. 2017.\n\n[27] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229\u2013256, 1992.\n\n[28] Y. Wu, Y. Burda, R. Salakhutdinov, and R. Grosse. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.\n\n[29] Z. Yang, W. Chen, F. Wang, and B. Xu. 
Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887, 2017.\n\n[30] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.\n\n[31] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473, 2016.", "award": [], "sourceid": 2945, "authors": [{"given_name": "Ishaan", "family_name": "Gulrajani", "institution": "Google"}, {"given_name": "Faruk", "family_name": "Ahmed", "institution": "MILA"}, {"given_name": "Martin", "family_name": "Arjovsky", "institution": "New York University"}, {"given_name": "Vincent", "family_name": "Dumoulin", "institution": "Universit\u00e9 de Montr\u00e9al"}, {"given_name": "Aaron", "family_name": "Courville", "institution": "U. Montreal"}]}