{"title": "Stabilizing Training of Generative Adversarial Networks through Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 2018, "page_last": 2028, "abstract": "Deep generative models based on Generative Adversarial Networks (GANs) have demonstrated impressive sample quality, but in order to work they require a careful choice of architecture, parameter initialization, and selection of hyper-parameters. This fragility is in part due to a dimensional mismatch or non-overlapping support between the model distribution and the data distribution, causing their density ratio and the associated f-divergence to be undefined. We overcome this fundamental limitation and propose a new regularization approach with low computational cost that yields a stable GAN training procedure. We demonstrate the effectiveness of this regularizer across several architectures trained on common benchmark image generation tasks. Our regularization turns GAN models into reliable building blocks for deep learning.", "full_text": "Stabilizing Training of Generative Adversarial Networks through Regularization\n\nKevin Roth\nDepartment of Computer Science\nETH Zürich\nkevin.roth@inf.ethz.ch\n\nAurelien Lucchi\nDepartment of Computer Science\nETH Zürich\naurelien.lucchi@inf.ethz.ch\n\nSebastian Nowozin\nMicrosoft Research\nCambridge, UK\nsebastian.Nowozin@microsoft.com\n\nThomas Hofmann\nDepartment of Computer Science\nETH Zürich\nthomas.hofmann@inf.ethz.ch\n\nAbstract\n\nDeep generative models based on Generative Adversarial Networks (GANs) have demonstrated impressive sample quality, but in order to work they require a careful choice of architecture, parameter initialization, and selection of hyper-parameters. This fragility is in part due to a dimensional mismatch or non-overlapping support between the model distribution and the data distribution, causing their density ratio and the associated f-divergence to 
be undefined. We overcome this fundamental limitation and propose a new regularization approach with low computational cost that yields a stable GAN training procedure. We demonstrate the effectiveness of this regularizer across several architectures trained on common benchmark image generation tasks. Our regularization turns GAN models into reliable building blocks for deep learning.1\n\n1 Introduction\n\nA recent trend in the world of generative models is the use of deep neural networks as data generating mechanisms. Two notable approaches in this area are variational auto-encoders (VAEs) [14, 28] as well as generative adversarial networks (GANs) [8]. GANs are especially appealing as they move away from the common likelihood maximization viewpoint and instead use an adversarial game approach for training generative models. Let us denote by P(x) and Q_θ(x) the data and model distribution, respectively. The basic idea behind GANs is to pair up a θ-parametrized generator network that produces Q_θ with a discriminator which aims to distinguish between P and Q_θ, whereas the generator aims for making Q_θ indistinguishable from P. Effectively the discriminator represents a class of objective functions F that measures dissimilarity of pairs of probability distributions. The final objective is then formed via a supremum over F, leading to the saddle point problem\n\nmin_θ ℓ(Q_θ; F) := sup_{F ∈ F} F(P, Q_θ).    (1)\n\nThe standard way of representing a specific F is through a family of statistics or discriminants φ ∈ Φ, typically realized by a neural network [8, 26]. 
In GANs, we use these discriminators in a logistic classification loss as follows\n\nF(P, Q; φ) = E_P[g(φ(x))] + E_Q[g(−φ(x))],    (2)\n\n1Code available at https://github.com/rothk/Stabilizing_GANs\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nwhere g(z) = ln(σ(z)) is the log-logistic function (for reference, σ(φ(x)) = D(x) in [8]). As shown in [8], for the Bayes-optimal discriminator φ* ∈ Φ, the above generator objective reduces to the Jensen-Shannon (JS) divergence between P and Q. The work of [25] later generalized this to a more general class of f-divergences, which gives more flexibility in cases where the generative model may not be expressive enough or where data may be scarce.\n\nWe consider three different challenges for learning the model distribution:\n\n(A) empirical estimation: the model family may contain the true distribution or a good approximation thereof, but one has to identify it based on a finite training sample drawn from P. This is commonly addressed by the use of regularization techniques to avoid overfitting, e.g. in the context of estimating f-divergences with M-estimators [24]. In our work, we suggest a novel (Tikhonov) regularizer, derived and motivated from a training-with-noise scenario, where P and Q are convolved with white Gaussian noise [30, 3], namely\n\nF_γ(P, Q; ψ) := F(P ∗ Λ, Q ∗ Λ; ψ),  Λ = N(0, γ·I).    (3)\n\n(B) density misspecification: the model distribution and true distribution both have a density function with respect to the same base measure, but there exists no parameter for which these densities are sufficiently similar. Here, the principle of parameter estimation via divergence minimization is provably sound in that it achieves a well-defined limit [1, 21]. 
It therefore provides a solid foundation for statistical inference that is robust with regard to model misspecifications.\n\n(C) dimensional misspecification: the model distribution and the true distribution do not have a density function with respect to the same base measure or – even worse – supp(P) ∩ supp(Q) may be negligible. This may occur whenever the model and/or data are confined to low-dimensional manifolds [3, 23]. As pointed out in [3], a geometric mismatch can be detrimental for f-GAN models as the resulting f-divergence is not finite (the sup in Eq. (1) is +∞). As a remedy, it has been suggested to use an alternative family of distance functions known as integral probability metrics [22, 31]. These include the Wasserstein distance used in Wasserstein GANs (WGAN) [3] as well as RKHS-induced maximum mean discrepancies [9, 16, 6], which all remain well-defined. We will provide evidence (analytically and experimentally) that the noise-induced regularization method proposed in this paper effectively makes f-GAN models robust against dimensional misspecifications. While this introduces some dependency on the (Euclidean) metric of the ambient data space, it does so on a well-controlled length scale (the amplitude of noise or strength of the regularization γ) and by retaining the benefits of f-divergences. This is a rather gentle modification compared to the more radical departure taken in Wasserstein GANs, which rely solely on the ambient space metric (through the notion of optimal mass transport).\n\nIn what follows, we will take Eq. (3) as the starting point and derive an approximation via a regularizer that is simple to implement as an integral operator penalizing the squared gradient norm. As opposed to a naïve norm penalization, each f-divergence has its own characteristic weighting function over the input space, which depends on the discriminator output. 
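The effect of the noise convolution in Eq. (3) on the dimensional misspecification challenge can be illustrated with a minimal numerical sketch (my own example, not from the paper's code): two point masses with disjoint supports have a saturated, uninformative JS divergence, but after convolving both with N(0, γ) the smoothed divergence is finite, bounded by ln 2, and varies smoothly with the separation.

```python
import numpy as np

def js_divergence(p, q, dx):
    # JS(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M) with M = (P+Q)/2, on a 1-D grid.
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b)) * dx
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def smoothed_jsd(mu_p, mu_q, gamma):
    # P, Q are point masses at mu_p, mu_q (disjoint supports for mu_p != mu_q).
    # Convolving with Lambda = N(0, gamma) turns them into overlapping Gaussians.
    x = np.linspace(-10.0, 10.0, 4001)
    dx = x[1] - x[0]
    gauss = lambda mu: np.exp(-(x - mu) ** 2 / (2 * gamma)) / np.sqrt(2 * np.pi * gamma)
    return js_divergence(gauss(mu_p), gauss(mu_q), dx)
```

As the separation grows relative to the noise scale, the smoothed divergence approaches but never exceeds ln 2, instead of being flatly saturated as in the unsmoothed degenerate case.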
We demonstrate the effectiveness of our approach on a simple Gaussian mixture as well as on several benchmark image datasets commonly used for generative models. In both cases, our proposed regularization yields stable GAN training and produces samples of higher visual quality. We also perform pairwise tests of regularized vs. unregularized GANs using a novel cross-testing protocol.\n\nIn summary, we make the following contributions:\n\n• We systematically derive a novel, efficiently computable regularization method for f-GAN.\n• We show how this addresses the dimensional misspecification challenge.\n• We empirically demonstrate stable GAN training across a broad set of models.\n\n2 Background\n\nThe fundamental way to learn a generative model in machine learning is to (i) define a parametric family of probability densities {Q_θ}, θ ∈ Θ ⊂ R^d, and (ii) find parameters θ* ∈ Θ such that Q_θ is closest (in some sense) to the true distribution P. There are various ways to measure how close model and real distribution are, or equivalently, various ways to define a distance or divergence function between P and Q. In the following we review different notions of divergences used in the literature.\n\nf-divergence. GANs [8] are known to minimize the Jensen-Shannon divergence between P and Q. This was generalized in [25] to f-divergences induced by a convex function f. An interesting property of f-divergences is that they permit a variational characterization [24, 27] via\n\nD_f(P||Q) := E_Q[f(dP/dQ)] = ∫_X sup_u (u · dP/dQ − f^c(u)) dQ,    (4)\n\nwhere dP/dQ is the Radon-Nikodym derivative and f^c(t) ≡ sup_{u ∈ dom_f} {ut − f(u)} is the Fenchel dual of f. By defining an arbitrary class of statistics ψ ∈ Ψ, ψ : X → R, we arrive at the bound\n\nD_f(P||Q) ≥ sup_ψ ∫ (ψ · dP/dQ − f^c ∘ ψ) dQ = sup_ψ {E_P[ψ] − E_Q[f^c ∘ ψ]}.    (5)\n\nEq. 
(5) thus gives us a variational lower bound on the f-divergence as an expectation over P and Q, which is easier to evaluate (e.g. via sampling from P and Q, respectively) than the density based formulation. We can see that by identifying ψ = g ∘ φ and with the choice of f such that f^c = −ln(1 − exp), we get f^c ∘ ψ = −ln(1 − σ(φ)) = −g(−φ), thus recovering Eq. (2).\n\nIntegral Probability Metrics (IPM). An alternative family of divergences are integral probability metrics [22, 31], which find a witness function to distinguish between P and Q. This class of methods yields an objective similar to Eq. (2) that requires optimizing a distance function between two distributions over a function class F. Particular choices for F yield the kernel maximum mean discrepancy approach of [9, 16] or Wasserstein GANs [3]. The latter distance is defined as\n\nW(P, Q) = sup_{||f||_L ≤ 1} {E_P[f] − E_Q[f]},    (6)\n\nwhere the supremum is taken over functions f which have a bounded Lipschitz constant.\n\nAs shown in [3], the Wasserstein metric implies a different notion of convergence compared to the JS divergence used in the original GAN. Essentially, the Wasserstein metric is said to be weak as it requires the use of a weaker topology, thus making it easier for a sequence of distributions to converge. The use of a weaker topology is achieved by restricting the function class to the set of bounded Lipschitz functions. This yields a hard constraint on the function class that is empirically hard to satisfy. In [3], this constraint is implemented via weight clipping, which is acknowledged to be a "terrible way" to enforce the Lipschitz constraint. As will be shown later, our regularization penalty can be seen as a soft constraint on the Lipschitz constant of the function class which is easy to implement in practice. 
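The identification recovering Eq. (2) from the f-divergence bound above can be spot-checked numerically (a NumPy sketch of my own, not from the paper's code): with ψ = g ∘ φ and f^c(t) = −ln(1 − e^t), the term f^c ∘ ψ appearing in −E_Q[f^c ∘ ψ] coincides with −g(−φ).

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
g = lambda z: np.log(sigmoid(z))           # log-logistic: g(z) = ln sigma(z)
f_c = lambda t: -np.log(1.0 - np.exp(t))   # Fenchel dual for the JS-generating f

z = np.linspace(-4.0, 4.0, 101)            # discriminator outputs phi(x)
psi = g(z)                                 # the identification psi = g o phi
lhs = f_c(psi)                             # appears in -E_Q[f^c o psi]
rhs = -g(-z)                               # appears in  E_Q[g(-phi)] of Eq. (2)
```

The two arrays agree to floating-point precision, since 1 − σ(z) = σ(−z).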
Recently, [10] has also proposed a similar regularization; while their proposal was motivated by Wasserstein GANs and does not extend to f-divergences, it is interesting to observe that both their and our regularization work on the gradient.\n\nTraining with Noise. As suggested in [3, 30], one can break the dimensional misspecification discussed in Section 1 by adding continuous noise to the inputs of the discriminator, therefore smoothing the probability distribution. However, this requires adding high-dimensional noise, which introduces significant variance in the parameter estimation process. Counteracting this requires a lot of samples and therefore ultimately leads to a costly or impractical solution. Instead we propose an approach that relies on analytic convolution of the densities P and Q with Gaussian noise. As we demonstrate below, this yields a simple weighted penalty function on the norm of the gradients. Conceptually we think of this noise not as being part of the generative process (as in [3]), but rather as a way to define a smoother family of discriminants for the variational bound of f-divergences.\n\nRegularization for Mode Dropping. Other regularization techniques address the problem of mode dropping and are complementary to our approach. This includes the work of [7], which incorporates a supervised training signal as a regularizer on top of the discriminator target. To implement supervision the authors use an additional auto-encoder as well as a two-step training procedure, which might be computationally expensive. A similar approach was proposed by [20] that stabilizes GANs by unrolling the optimization of the discriminator. The main drawback of this approach is that the computational cost scales with the number of unrolling steps. 
In general, it is not clear to what extent these methods not only stabilize GAN training, but also address the conceptual challenges listed in Section 1.\n\n3 Noise-Induced Regularization\n\nFrom now onwards, we consider the general f-GAN [25] objective defined as\n\nF(P, Q; ψ) ≡ E_P[ψ] − E_Q[f^c ∘ ψ].    (7)\n\n3.1 Noise Convolution\n\nFrom a practitioner's point of view, training with noise can be realized by adding zero-mean random variables ξ to samples x ∼ P, Q during training. Here we focus on normal white noise ξ ∼ Λ = N(0, γ·I) (the same analysis goes through with a Laplacian noise distribution, for instance). From a theoretical perspective, adding noise is tantamount to convolving the corresponding distribution, as\n\nE_P E_Λ[ψ(x + ξ)] = ∫ ψ(x) ∫ p(x − ξ) λ(ξ) dξ dx = ∫ ψ(x) (p ∗ λ)(x) dx = E_{P∗Λ}[ψ],    (8)\n\nwhere p and λ are probability densities of P and Λ, respectively, with regard to the Lebesgue measure. The noise distribution Λ as well as the resulting P ∗ Λ are guaranteed to have full support in the ambient space, i.e. λ(x) > 0 and (p ∗ λ)(x) > 0 (∀x). Technically, applying this to both P and Q makes the resulting generalized f-divergence well-defined, even when the generative model is dimensionally misspecified. Note that approximating E_Λ through sampling was previously investigated in [30, 3].\n\n3.2 Convolved Discriminants\n\nWith symmetric noise, λ(ξ) = λ(−ξ), we can write Eq. (8) equivalently as\n\nE_{P∗Λ}[ψ] = E_P E_Λ[ψ(x + ξ)] = ∫ p(x) ∫ ψ(x − ξ) λ(ξ) dξ dx = E_P[λ ∗ ψ].    (9)\n\nFor the Q-expectation in Eq. 
(7) one gets, by the same argument, E_{Q∗Λ}[f^c ∘ ψ] = E_Q[λ ∗ (f^c ∘ ψ)]. Formally, this generalizes the variational bound for f-divergences in the following manner:\n\nF(P ∗ Λ, Q ∗ Λ; ψ) = F(P, Q; λ ∗ ψ, λ ∗ (f^c ∘ ψ)),  F(P, Q; ρ, τ) := E_P[ρ] − E_Q[τ].    (10)\n\nAssuming that Ψ is closed under λ-convolutions, the regularization will result in a relative weakening of the discriminator as we take the sup over a smaller, more regular family. Clearly, the low-pass effect of λ-convolutions can be well understood in the Fourier domain. In this equivalent formulation, we leave P and Q unchanged, yet we change the view the discriminator can take on the ambient data space: metaphorically speaking, the generator is paired up with a short-sighted adversary.\n\n3.3 Analytic Approximations\n\nIn general, it may be difficult to analytically compute λ ∗ ψ or – equivalently – E_Λ[ψ(x + ξ)]. However, for small γ we can use a Taylor approximation of ψ around ξ = 0 (cf. [5]):\n\nψ(x + ξ) = ψ(x) + [∇ψ(x)]^T ξ + (1/2) ξ^T [∇²ψ(x)] ξ + O(ξ³),    (11)\n\nwhere ∇²ψ denotes the Hessian, whose trace Tr(∇²) = Δ is known as the Laplace operator. The properties of white noise result in the approximation\n\nE_Λ[ψ(x + ξ)] = ψ(x) + (γ/2) Δψ(x) + O(γ²)    (12)\n\nand thereby lead directly to an approximation of F_γ (see Eq. (3)) via F_γ = F_0 plus a correction, i.e.\n\nF_γ(P, Q; ψ) = F(P, Q; ψ) + (γ/2) {E_P[Δψ] − E_Q[Δ(f^c ∘ ψ)]} + O(γ²).    (13)\n\nWe can interpret Eq. (13) as follows: the Laplacian measures how much the scalar fields ψ and f^c ∘ ψ differ at each point from their local average. It is thereby an infinitesimal proxy for the (exact) convolution.\n\nThe Laplace operator is a sum of d terms, where d is the dimensionality of the ambient data space. As such it does not suffer from the quadratic blow-up involved in computing the Hessian. 
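The second-order approximation in Eq. (12) can be verified on a one-dimensional toy function (my own sketch, not the paper's code): for ψ(x) = sin(x), the smoothed value E_Λ[ψ(x + ξ)] has the closed form e^{−γ/2} sin(x), which a Monte Carlo estimate reproduces and which the Taylor correction ψ(x) + (γ/2)Δψ(x) matches up to O(γ²).

```python
import numpy as np

# psi(x) = sin(x) has Laplacian -sin(x); for xi ~ N(0, gamma) the smoothed
# discriminant E[sin(x + xi)] = exp(-gamma/2) * sin(x) in closed form.
def smoothed_psi_mc(x, gamma, n=400_000, seed=0):
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.0, np.sqrt(gamma), size=n)
    return np.mean(np.sin(x + xi))

x, gamma = 1.3, 0.05
taylor = np.sin(x) + 0.5 * gamma * (-np.sin(x))   # psi + (gamma/2) * laplacian(psi)
exact = np.exp(-gamma / 2) * np.sin(x)            # analytic convolution
mc = smoothed_psi_mc(x, gamma)                    # sampling-based estimate
```

The gap between `taylor` and `exact` shrinks like γ², while the Monte Carlo estimate carries the sampling variance that the paper's analytic approach avoids.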
If we realize the discriminator via a deep network, however, then we need to be able to compute the Laplacian of composed functions. For concreteness, let us assume that ψ = h ∘ G, G = (g_1, ..., g_k), and look at a single input x, i.e. g_i : R → R; then\n\n(h ∘ G)′ = Σ_i g_i′ · (∂_i h ∘ G),  (h ∘ G)″ = Σ_i g_i″ · (∂_i h ∘ G) + Σ_{i,j} g_i′ · g_j′ · (∂_i ∂_j h ∘ G).    (14)\n\nSo at the intermediate layer, we would need to effectively operate with a full Hessian, which is computationally demanding, as has already been observed in [5].\n\n3.4 Efficient Gradient-Based Regularization\n\nWe would like to derive a (more) tractable strategy for regularizing ψ, which (i) avoids the detrimental variance that comes from sampling ξ, (ii) does not rely on explicitly convolving the distributions P and Q, and (iii) avoids the computation of Laplacians as in Eq. (13). Clearly, this requires to make further simplifications. We suggest to exploit properties of the maximizer ψ* of F that can be characterized by [24]\n\n(f^c′ ∘ ψ*) dQ = dP  ⟹  E_P[h] = E_Q[(f^c′ ∘ ψ*) · h]  (∀h, integrable).    (15)\n\nThe relevance of this becomes clear, if we apply the chain rule to Δ(f^c ∘ ψ), assuming that f^c is twice differentiable:\n\nΔ(f^c ∘ ψ) = (f^c″ ∘ ψ) · ||∇ψ||² + (f^c′ ∘ ψ) · Δψ,    (16)\n\nas now we get a convenient cancellation of the Laplacians at ψ = ψ* + O(γ):\n\nF_γ(P, Q; ψ*) = F(P, Q; ψ*) − (γ/2) E_Q[(f^c″ ∘ ψ*) · ||∇ψ*||²] + O(γ²).    (17)\n\nWe can (heuristically) turn this into a regularizer by taking the leading terms,\n\nF_γ(P, Q; ψ) ≈ F(P, Q; ψ) − (γ/2) Ω_f(Q; ψ),  Ω_f(Q; ψ) := E_Q[(f^c″ ∘ ψ) · ||∇ψ||²].    (18)\n\nNote that we do not assume that the Laplacian terms cancel far away from the optimum, i.e. we do not assume Eq. (15) to hold for ψ far away from ψ*. 
Instead, the underlying assumption we make is that optimizing the gradient-norm regularized objective F_γ(P, Q; ψ) makes ψ converge to ψ* + O(γ), for which we know that the Laplacian terms cancel [5, 2].\n\nThe convexity of f^c implies that the weighting function of the squared gradient norm is non-negative, i.e. f^c″ ≥ 0, which in turn implies that the regularizer −(γ/2) Ω_f(Q; ψ) is upper bounded (by zero). Maximization of F_γ(P, Q; ψ) with respect to ψ is therefore well-defined. Further considerations regarding the well-definedness of the regularizer can be found in Sec. 7.2 in the Appendix.\n\n4 Regularizing GANs\n\nWe have shown that training with noise is equivalent to regularizing the discriminator. Inspired by the above analysis, we propose the following class of f-GAN regularizers:\n\nRegularized f-GAN\n\nF_γ(P, Q; ψ) = E_P[ψ] − E_Q[f^c ∘ ψ] − (γ/2) Ω_f(Q; ψ),  Ω_f(Q; ψ) := E_Q[(f^c″ ∘ ψ) ||∇ψ||²].    (19)\n\nThe regularizer corresponding to the commonly used parametrization of the Jensen-Shannon GAN can be derived analogously, as shown in the Appendix. We obtain,\n\nRegularized Jensen-Shannon GAN\n\nF_γ(P, Q; φ) = E_P[ln(φ)] + E_Q[ln(1 − φ)] − (γ/2) Ω_JS(P, Q; φ),  Ω_JS(P, Q; φ) := E_P[(1 − φ(x))² ||∇ψ(x)||²] + E_Q[φ(x)² ||∇ψ(x)||²],    (20)\n\nwhere ψ = σ^{-1}(φ) denotes the logit of the discriminator φ. We prefer to compute the gradient of ψ, as it is easier to implement and more robust than computing gradients after applying the sigmoid.\n\nAlgorithm 1 Regularized JS-GAN. 
Default values: γ_0 = 2.0, α = 0.01 (with annealing), γ = 0.1 (without annealing), n_φ = 1.\nRequire: Initial noise variance γ_0, annealing decay rate α, number of discriminator update steps n_φ per generator iteration, minibatch size m, number of training iterations T.\nRequire: Initial discriminator parameters ω_0, initial generator parameters θ_0.\nfor t = 1, ..., T do\n  γ ← γ_0 · α^{t/T}  # annealing\n  for 1, ..., n_φ do\n    Sample minibatch of real data {x^(1), ..., x^(m)} ∼ P.\n    Sample minibatch of latent variables from prior {z^(1), ..., z^(m)} ∼ p(z).\n    F(ω, θ) = (1/m) Σ_{i=1}^m [ln φ_ω(x^(i)) + ln(1 − φ_ω(G_θ(z^(i))))]\n    Ω(ω, θ) = (1/m) Σ_{i=1}^m [(1 − φ_ω(x^(i)))² ||∇_x ψ_ω(x^(i))||² + φ_ω(G_θ(z^(i)))² ||∇_x̃ ψ_ω(x̃)|_{x̃ = G_θ(z^(i))}||²]\n    ω ← ω + ∇_ω(F(ω, θ) − (γ/2) Ω(ω, θ))  # gradient ascent\n  end for\n  Sample minibatch of latent variables from prior {z^(1), ..., z^(m)} ∼ p(z).\n  F(ω, θ) = (1/m) Σ_{i=1}^m ln(1 − φ_ω(G_θ(z^(i))))  or  F_alt(ω, θ) = −(1/m) Σ_{i=1}^m ln φ_ω(G_θ(z^(i)))\n  θ ← θ − ∇_θ F(ω, θ)  # gradient descent\nend for\nThe gradient-based updates can be performed with any gradient-based learning rule. We used Adam in our experiments.\n\n4.1 Training Algorithm\n\nRegularizing the discriminator provides an efficient way to convolve the distributions and is thereby sufficient to address the dimensional misspecification challenges outlined in the introduction. This leaves open the possibility to use the regularizer also in the objective of the generator. On the one hand, optimizing the generator through the regularized objective may provide useful gradient signal and therefore accelerate training. On the other hand, it destabilizes training close to convergence (if not dealt with properly), since the generator is incentivized to put probability mass where the discriminator has large gradients. 
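A runnable toy instance of Algorithm 1 can be sketched in a few lines (my own 1-D illustration, not the released implementation): data is N(2, 1), the generator is G_θ(z) = θ + z, and the discriminator logit is linear, ψ(x) = a·x + b, so that ||∇_x ψ||² = a² and all gradients, including those of the regularizer Ω_JS, can be written out by hand. Plain gradient steps stand in for the Adam updates used in the paper, and γ is annealed as γ_0 · α^{t/T}.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy setup: P = N(2, 1); generator G_theta(z) = theta + z, z ~ N(0, 1);
# discriminator phi(x) = sigmoid(a*x + b), hence grad_x psi = a.
theta, a, b = -1.0, 0.1, 0.0
gamma0, alpha, T, m, lr = 2.0, 0.01, 3000, 64, 0.05

for t in range(T):
    gamma = gamma0 * alpha ** (t / T)            # annealed noise variance
    xr = rng.normal(2.0, 1.0, m)                 # real minibatch
    xf = theta + rng.normal(0.0, 1.0, m)         # generated minibatch
    pr, pf = sigmoid(a * xr + b), sigmoid(a * xf + b)

    # Discriminator: ascend F - (gamma/2)*Omega, where
    # Omega = E_P[(1-phi)^2 a^2] + E_Q[phi^2 a^2].
    dF_da = np.mean((1 - pr) * xr) - np.mean(pf * xf)
    dF_db = np.mean(1 - pr) - np.mean(pf)
    dOm_da = (np.mean(-2 * (1 - pr) ** 2 * pr * xr * a**2 + 2 * (1 - pr) ** 2 * a)
              + np.mean(2 * pf**2 * (1 - pf) * xf * a**2 + 2 * pf**2 * a))
    dOm_db = (np.mean(-2 * (1 - pr) ** 2 * pr * a**2)
              + np.mean(2 * pf**2 * (1 - pf) * a**2))
    a += lr * (dF_da - 0.5 * gamma * dOm_da)
    b += lr * (dF_db - 0.5 * gamma * dOm_db)

    # Generator: descend the alternative loss -E[ln phi(G_theta(z))].
    zf = theta + rng.normal(0.0, 1.0, m)
    theta += lr * np.mean(1 - sigmoid(a * zf + b)) * a
```

After training, θ sits close to the data mean of 2 while the regularizer keeps the discriminator slope, and hence its gradients, moderate.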
In the case of JS-GANs, we recommend to pair up the regularized objective of the discriminator with the "alternative" or "non-saturating" objective for the generator, proposed in [8], which is known to provide strong gradients out of the box (see Algorithm 1).\n\n4.2 Annealing\n\nThe regularizer variance γ lends itself nicely to annealing. Our experimental results indicate that a reasonable annealing scheme consists in regularizing with a large initial γ early in training and then (exponentially) decaying it to a small non-zero value. We leave to future work the question of how to determine an optimal annealing schedule.\n\n5 Experiments\n\n5.1 2D submanifold mixture of Gaussians in 3D space\n\nTo demonstrate the stabilizing effect of the regularizer, we train a simple GAN architecture [20] on a 2D submanifold mixture of seven Gaussians arranged in a circle and embedded in 3D space (further details and an illustration of the mixture distribution are provided in the Appendix). We emphasize that this mixture is degenerate with respect to the base measure defined in ambient space as it does not have fully dimensional support, thus precisely representing one of the failure scenarios commonly\n\nFigure 1: 2D submanifold mixture. The first row shows one of several unstable unregularized GANs trained to learn the dimensionally misspecified mixture distribution. The remaining rows show regularized GANs (with regularized objective for the discriminator and unregularized objective for the generator) for different levels of regularization γ (rows: UNREG., γ = 0.01, γ = 0.1). Even for small but non-zero noise variance, the regularized GAN can essentially be trained indefinitely without collapse. The color of the samples is proportional to the density estimated from a Gaussian KDE fit. The target distribution is shown in Fig. 5. 
GANs were trained with one discriminator update per generator update step (indicated).\n\ndescribed in the literature [3]. The results are shown in Fig. 1 for both standard unregularized GAN\ntraining as well as our regularized variant.\nWhile the unregularized GAN collapses in literally every run after around 50k iterations, due to the\nfact that the discriminator concentrates on ever smaller differences between generated and true data\n(the stakes are getting higher as training progresses), the regularized variant can be trained essentially\ninde\ufb01nitely (well beyond 200k iterations) without collapse for various degrees of noise variance, with\nand without annealing. The stabilizing effect of the regularizer is even more pronounced when the\nGANs are trained with \ufb01ve discriminator updates per generator update step, as shown in Fig. 6.\n\n5.2 Stability across various architectures\nTo demonstrate the stability of the regularized training procedure and to showcase the excellent\nquality of the samples generated from it, we trained various network architectures on the CelebA [17],\nCIFAR-10 [15] and LSUN bedrooms [32] datasets. In addition to the deep convolutional GAN\n(DCGAN) of [26], we trained several common architectures that are known to be hard to train\n[4, 26, 19], therefore allowing us to establish a comparison to the concurrently proposed gradient-\npenalty regularizer for Wasserstein GANs [10]. Among these architectures are a DCGAN without\nany normalization in either the discriminator or the generator, a DCGAN with tanh activations and a\ndeep residual network (ResNet) GAN [11]. 
We used the open-source implementation of [10] for our experiments on CelebA and LSUN, with one notable exception: we use batch normalization also for the discriminator (as our regularizer does not depend on the optimal transport plan or, more precisely, the gradient penalty being imposed along it).\n\nAll networks were trained using the Adam optimizer [13] with learning rate 2 × 10⁻⁴ and hyper-parameters recommended by [26]. We trained all datasets using batches of size 64, for a total of 200k generator iterations in the case of LSUN and 100k iterations on CelebA. The results of these experiments are shown in Figs. 3 & 2. Further implementation details can be found in the Appendix.\n\n5.3 Training time\n\nWe empirically found regularization to increase the overall training time by a marginal factor of roughly 1.4 (due to the additional backpropagation through the computational graph of the discriminator gradients). More importantly, however, (regularized) f-GANs are known to converge (or at least generate good looking samples) faster than their WGAN relatives [10].\n\nFigure 2: Stability across various architectures: ResNet, DCGAN, DCGAN without normalization and DCGAN with tanh activations (details in the Appendix). All samples were generated from regularized GANs with exponentially annealed γ_0 = 2.0 (and alternative generator loss) as described in Algorithm 1. Samples were produced after 200k generator iterations on the LSUN dataset (see also Fig. 8 for a full-resolution image of the ResNet GAN). Samples for the unregularized architectures can be found in the Appendix.\n\nFigure 3: Annealed Regularization. CelebA samples generated by (un)regularized ResNet GANs (panels: UNREG., γ_0 = 0.5, 1.0, 2.0). The initial level of regularization γ_0 is shown below each batch of images. γ_0 was exponentially annealed as described in Algorithm 1. 
The regularized GANs can be trained essentially indefinitely without collapse; the superior quality is again evident. Samples were produced after 100k generator iterations.\n\n5.4 Regularization vs. explicitly adding noise\n\nWe compare our regularizer against the common practitioner's approach of explicitly adding noise to images during training. In order to compare both approaches (analytic regularizer vs. explicit noise), we fix a common batch size (64 in our case) and subsequently train with different noise-to-signal ratios (NSR): we take (batch-size/NSR) samples (both from the dataset and generated ones), to each of which a number of NSR noise vectors is added, and feed them to the discriminator (so that overall both models are trained on the same batch size). We experimented with NSR 1, 2, 4, 8 and show the best performing ratio (further ratios in the Appendix). Explicitly adding noise in high-dimensional ambient spaces introduces additional sampling variance which is not present in the regularized variant. The results, shown in Fig. 4, confirm that the regularizer stabilizes across a broad range of noise levels and manages to produce images of considerably higher quality than the unregularized variants.\n\n5.5 Cross-testing protocol\n\nWe propose the following pairwise cross-testing protocol to assess the relative quality of two GAN models: unregularized GAN (Model 1) vs. regularized GAN (Model 2). We first report the confusion matrix (classification of 10k samples from the test set against 10k generated samples) for each model separately. We then classify 10k samples generated by Model 1 with the discriminator of Model 2 and vice versa. For both models, we report the fraction of false positives (FP) (Type I error) and false negatives (FN) (Type II error). 
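The bookkeeping for this protocol can be sketched in a few lines (my own illustration; the helper `cross_test` and the `disc` interface are hypothetical names, not from the paper's code): each discriminator is scored on held-out real samples, its own generator's samples, and the rival generator's samples.

```python
import numpy as np

def cross_test(disc, real_test, fake_own, fake_other, thresh=0.5):
    """Confusion-matrix entries for one discriminator plus the cross-testing
    false-positive rate on the rival model's samples. `disc` maps a batch of
    samples to probabilities of being real."""
    tp = np.mean(disc(real_test) >= thresh)         # real classified as real
    fp_own = np.mean(disc(fake_own) >= thresh)      # own fakes classified as real
    fp_cross = np.mean(disc(fake_other) >= thresh)  # Type I error on rival fakes
    return {"TP": tp, "FN": 1 - tp, "FP_own": fp_own, "FP_cross": fp_cross}

# Synthetic sanity check with a well-separated 1-D "discriminator".
disc = lambda x: 1.0 / (1.0 + np.exp(-10.0 * (x - 0.5)))
report = cross_test(disc, np.ones(100), np.zeros(100), np.full(100, 0.4))
```

A low `FP_cross` means the discriminator rejects the rival generator's samples, which is the criterion the protocol uses to rank the two models.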
The discriminator with the lower FP (and/or lower FN) rate defines the better model, in the sense that it is able to more accurately classify out-of-data samples, which indicates better generalization properties. We obtained the following results on CIFAR-10:\n\nFigure 4: CIFAR-10 samples generated by (un)regularized DCGANs (with alternative generator loss), as well as by training a DCGAN with explicitly added noise (noise-to-signal ratio 4). The level of regularization or noise is shown above each batch of images. The regularizer stabilizes across a broad range of noise levels and manages to produce images of higher quality than the unregularized variants. Samples were produced after 50 training epochs.\n\nRegularized GAN (γ = 0.1) — confusion matrix (columns: true positive / true negative):\nPredicted positive: 0.9688 / 0.0002\nPredicted negative: 0.0312 / 0.9998\nCross-testing FP: 0.0\n\nUnregularized GAN — confusion matrix (columns: true positive / true negative):\nPredicted positive: 0.9987 / 0.0\nPredicted negative: 0.0013 / 1.0\nCross-testing FP: 1.0\n\nFor both models, the discriminator is able to recognize its own generator's samples (low FP in the confusion matrix). The regularized GAN also manages to perfectly classify the unregularized GAN's samples as fake (cross-testing FP 0.0), whereas the unregularized GAN classifies the samples of the regularized GAN as real (cross-testing FP 1.0). In other words, the regularized model is able to fool the unregularized one, whereas the regularized variant cannot be fooled.\n\n6 Conclusion\n\nWe introduced a regularization scheme to train deep generative models based on generative adversarial networks (GANs). 
While dimensional misspecifications or non-overlapping support between the data and model distributions can cause severe failure modes for GANs, we showed that this can be addressed by adding a penalty on the weighted gradient-norm of the discriminator. Our main result is a simple yet effective modification of the standard training algorithm for GANs, turning them into reliable building blocks for deep learning that can essentially be trained indefinitely without collapse. Our experiments demonstrate that our regularizer improves stability, prevents GANs from overfitting and therefore leads to better generalization properties (cf. the cross-testing protocol). Further research on the optimization of GANs as well as their convergence and generalization can readily be built upon our theoretical results.\n\nAcknowledgements\n\nWe would like to thank Devon Hjelm for pointing out that the regularizer works well with ResNets. KR is thankful to Yannic Kilcher, Lars Mescheder and the dalab team for insightful discussions. Big thanks also to Ishaan Gulrajani and Taehoon Kim for their open-source GAN implementations. This work was supported by Microsoft Research through its PhD Scholarship Programme.\n\nReferences\n\n[1] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry. American Mathematical Soc., 2007.\n\n[2] Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. Neural Comput., pages 643–674, 1996.\n\n[3] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.\n\n[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. Proceedings of Machine Learning Research. PMLR, 2017.\n\n[5] Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. 
Neural computa-\n\ntion, 7:108\u2013116, 1995.\n\n[6] Diane Bouchacourt, Pawan K Mudigonda, and Sebastian Nowozin. Disco nets: Dissimilarity\ncoef\ufb01cients networks. In Advances in Neural Information Processing Systems, pages 352\u2013360,\n2016.\n\n[7] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized\n\ngenerative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.\n\n[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. In Advances in\nNeural Information Processing Systems, pages 2672\u20132680, 2014.\n\n[9] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch\u00f6lkopf, and Alexander\nSmola. A kernel two-sample test. Journal of Machine Learning Research, 13:723\u2013773, 2012.\n\n[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville.\nImproved training of wasserstein gans. In Advances in Neural Information Processing Systems,\n2017.\n\n[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page\n770\u2013778, 2016.\n\n[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training\nby reducing internal covariate shift. Proceedings of Machine Learning Research, pages 448\u2013456.\nPMLR, 2015.\n\n[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. The\n\nInternational Conference on Learning Representations (ICLR), 2014.\n\n[14] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. The International\n\nConference on Learning Representations (ICLR), 2013.\n\n[15] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.\n\n2009.\n\n[16] Yujia Li, Kevin Swersky, and Richard S Zemel. Generative moment matching networks. 
In\n\nICML, pages 1718\u20131727, 2015.\n\n10\n\n\f[17] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in\nthe wild. In Proceedings of the IEEE International Conference on Computer Vision, pages\n3730\u20133738, 2015.\n\n[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the\n\nwild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.\n\n[19] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances\n\nin Neural Information Processing Systems, 2017.\n\n[20] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial\n\nnetworks. In International Conference on Learning Representations (ICLR), 2016.\n\n[21] Tom Minka. Divergence measures and message passing. Technical report, Microsoft Research,\n\n2005.\n\n[22] Alfred M\u00fcller. Integral probability metrics and their generating classes of functions. Advances\n\nin Applied Probability, 29:429\u2013443, 1997.\n\n[23] Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis.\n\nIn Advances in Neural Information Processing Systems, pages 1786\u20131794, 2010.\n\n[24] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence func-\ntionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information\nTheory, 56(11):5847\u20135861, 2010.\n\n[25] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neu-\nral samplers using variational divergence minimization. In Advances in Neural Information\nProcessing Systems, pages 271\u2013279, 2016.\n\n[26] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with\ndeep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\n[27] Mark D Reid and Robert C Williamson. 
Information, divergence and risk for binary experiments.\n\nJournal of Machine Learning Research, 12:731\u2013817, 2011.\n\n[28] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and\napproximate inference in deep generative models. In Proceedings of the 31st International\nConference on Machine Learning, 2014.\n\n[29] David W Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley\n\n& Sons, 2015.\n\n[30] Casper Kaae S\u00f8nderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Husz\u00e1r. Amortised\n\nmap inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.\n\n[31] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Sch\u00f6lkopf, and Gert RG\nLanckriet. On integral probability metrics, phi-divergences and binary classi\ufb01cation. arXiv\npreprint arXiv:0901.2698, 2009.\n\n[32] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction\nof a large-scale image dataset using deep learning with humans in the loop. arXiv preprint\narXiv:1506.03365, 2015.\n\n11\n\n\f", "award": [], "sourceid": 1236, "authors": [{"given_name": "Kevin", "family_name": "Roth", "institution": "ETH"}, {"given_name": "Aurelien", "family_name": "Lucchi", "institution": "ETH Zurich"}, {"given_name": "Sebastian", "family_name": "Nowozin", "institution": "Microsoft Research Cambridge"}, {"given_name": "Thomas", "family_name": "Hofmann", "institution": "ETH Zurich"}]}
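As a supplementary illustration (not part of the original paper), the cross-testing protocol can be sketched in a few lines of numpy. The toy discriminator, sample distributions, and decision threshold below are hypothetical stand-ins for the trained networks; the point is only how the column-normalized confusion matrix and the cross-testing FP rate are computed.

```python
import numpy as np

def confusion_matrix(d, real, fake, thresh=0.5):
    """Column-normalized confusion matrix for a discriminator d.

    Rows: predicted positive / predicted negative.
    Columns: true positive (data samples) / true negative (generated
    samples).  Each column sums to one by construction.
    """
    tp = np.mean(d(real) >= thresh)   # real samples classified as real
    fp = np.mean(d(fake) >= thresh)   # fakes classified as real (FP rate)
    return np.array([[tp, fp],
                     [1.0 - tp, 1.0 - fp]])

# Hypothetical stand-ins: 1-D "data" and two generators' samples.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=5000)    # data distribution
fake_a = rng.normal(3.0, 1.0, size=5000)  # model A samples (easy to spot)
fake_b = rng.normal(0.5, 1.0, size=5000)  # model B samples (closer to data)

# Toy discriminator trained to separate `real` from `fake_a`.
d_a = lambda x: 1.0 / (1.0 + np.exp(4.0 * (x - 1.5)))

cm_own = confusion_matrix(d_a, real, fake_a)    # A's discriminator, own fakes
cm_cross = confusion_matrix(d_a, real, fake_b)  # cross-testing against B

print("own-fake FP rate:  ", cm_own[0, 1])   # low: recognizes its own fakes
print("cross-test FP rate:", cm_cross[0, 1]) # high: fooled by better samples
```

The qualitative pattern mirrors the table above: a discriminator easily rejects its own generator's samples, but a weak one is fooled by samples from a generator that sits closer to the data distribution.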
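Also as a supplementary sketch, the weighted gradient-norm penalty described in the conclusion can be written down in numpy for a toy linear discriminator whose gradient is available analytically. The weighting below (squared sigmoid of the logits on model samples, squared complement on data samples) follows the paper's JS-regularizer in spirit; the discriminator, sample distributions, and γ value are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_norm_penalty(f, grad_f, real, fake):
    """Weighted gradient-norm penalty on the discriminator logits f.

    The squared gradient norm ||grad f(x)||^2 is weighted by
    (1 - sigma(f(x)))^2 on data samples and by sigma(f(x))^2 on model
    samples, then averaged over each batch and summed.
    """
    w_real = (1.0 - sigmoid(f(real))) ** 2
    w_fake = sigmoid(f(fake)) ** 2
    g_real = np.sum(grad_f(real) ** 2, axis=-1)
    g_fake = np.sum(grad_f(fake) ** 2, axis=-1)
    return np.mean(w_real * g_real) + np.mean(w_fake * g_fake)

# Toy linear discriminator f(x) = <w, x> + b with analytic gradient.
w, b = np.array([2.0, -1.0]), 0.5
f = lambda x: x @ w + b
grad_f = lambda x: np.broadcast_to(w, x.shape)

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(4096, 2))  # data samples
fake = rng.normal(2.0, 1.0, size=(4096, 2))  # generator samples

gamma = 0.1  # regularization strength (gamma in the paper)
penalty = 0.5 * gamma * grad_norm_penalty(f, grad_f, real, fake)
```

In training, this `penalty` term would simply be added to the discriminator loss before the gradient step; for a neural discriminator, the gradient of the logits with respect to the input would come from automatic differentiation rather than a closed form.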