{"title": "A Convex Duality Framework for GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 5248, "page_last": 5258, "abstract": "Generative adversarial network (GAN) is a minimax game between a generator mimicking the true model and a discriminator distinguishing the samples produced by the generator from the real training samples. Given an unconstrained discriminator able to approximate any function, this game reduces to finding the generative model minimizing a divergence measure, e.g. the Jensen-Shannon (JS) divergence, to the data distribution. However, in practice the discriminator is constrained to be in a smaller class F such as neural nets. Then, a natural question is how the divergence minimization interpretation changes as we constrain F. In this work, we address this question by developing a convex duality framework for analyzing GANs. For a convex set F, this duality framework interprets the original GAN formulation as finding the generative model with minimum JS-divergence to the distributions penalized to match the moments of the data distribution, with the moments specified by the discriminators in F. We show that this interpretation more generally holds for f-GAN and Wasserstein GAN. As a byproduct, we apply the duality framework to a hybrid of f-divergence and Wasserstein distance. Unlike the f-divergence, we prove that the proposed hybrid divergence changes continuously with the generative model, which suggests regularizing the discriminator's Lipschitz constant in f-GAN and vanilla GAN. 
We numerically evaluate the power of the suggested regularization schemes for improving GAN's training performance.", "full_text": "A Convex Duality Framework for GANs\n\nFarzan Farnia\u2217\n\nfarnia@stanford.edu\n\nDavid Tse\u2217\n\ndntse@stanford.edu\n\nAbstract\n\nGenerative adversarial network (GAN) is a minimax game between a generator\nmimicking the true model and a discriminator distinguishing the samples produced\nby the generator from the real training samples. Given an unconstrained discrimi-\nnator able to approximate any function, this game reduces to \ufb01nding the generative\nmodel minimizing a divergence score, e.g. the Jensen-Shannon (JS) divergence, to\nthe data distribution. However, in practice the discriminator is constrained to be\nin a smaller class F such as convolutional neural nets. Then, a natural question is\nhow the divergence minimization interpretation will change as we constrain F. In\nthis work, we address this question by developing a convex duality framework for\nanalyzing GAN minimax problems. For a convex set F, this duality framework\ninterprets the original vanilla GAN problem as \ufb01nding the generative model with\nthe minimum JS-divergence to the distributions penalized to match the moments\nof the data distribution, with the moments speci\ufb01ed by the discriminators in F.\nWe show that this interpretation more generally holds for f-GAN and Wasserstein\nGAN. We further apply the convex duality framework to explain why regularizing\nthe discriminator\u2019s Lipschitz constant, e.g. via spectral normalization or gradi-\nent penalty, can greatly improve the training performance in a general f-GAN\nproblem including the vanilla GAN formulation. We prove that Lipschitz regu-\nlarization can be interpreted as convolving the original divergence score with the\n\ufb01rst-order Wasserstein distance, which results in a continuously-behaving target\ndivergence measure. 
We numerically explore the power of Lipschitz regularization\nfor improving the continuity behavior and training performance in GAN problems.\n\n1 Introduction\n\nLearning a probability model from data samples is a fundamental task in unsupervised learning. The recently developed generative adversarial network (GAN) [1] leverages the power of deep neural networks to successfully address this task across various domains [2]. In contrast to traditional methods of parameter fitting like maximum likelihood estimation, the GAN approach views the problem as a game between a generator G whose goal is to generate fake samples that are close to the real data training samples and a discriminator D whose goal is to distinguish between the real and fake samples. The generator creates the fake samples by mapping from random noise input.\nThe following minimax problem is the original GAN problem, also called vanilla GAN, introduced in [1]:\n\nmin_{G∈G} max_{D∈F} E[log D(X)] + E[log(1 − D(G(Z)))].    (1)\n\nHere Z denotes the generator's noise input, X represents the random vector for the real data distributed as PX, and G and F respectively represent the generator and discriminator function sets. 
Implementing this minimax game using deep neural network classes G and F has led to state-of-the-art generative models for many different tasks.\n\n\u2217Department of Electrical Engineering, Stanford University.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: (a) Divergence minimization in vanilla GAN with D unconstrained, between the generative models and PX; (b) divergence minimization in vanilla GAN with D constrained to a linear space F, between the generative models and the discriminator moment matching models formed around PX.\n\nTo shed light on the probabilistic meaning of vanilla GAN, [1] shows that given an unconstrained discriminator D, i.e. if F contains all possible functions, the minimax problem (1) reduces to\n\nmin_{G∈G} JSD(PX, PG(Z)),    (2)\n\nwhere JSD denotes the Jensen-Shannon (JS) divergence. The optimization problem (2) can be interpreted as finding the closest generative model to the data distribution PX (Figure 1a), where distance is measured using the JS-divergence. Various GAN formulations were later proposed by changing the divergence measure in (2). f-GAN [3] generalizes vanilla GAN by minimizing a general f-divergence. Wasserstein GAN (WGAN) [4] is based on the first-order Wasserstein (earth-mover's) distance. MMD-GAN [5, 6, 7] considers the maximum mean discrepancy. Energy-based GAN [8] uses the total variation distance. Quadratic GAN [9] finds the distribution minimizing the second-order Wasserstein distance.\nHowever, GANs trained in practice differ from this minimum-divergence formulation, since their discriminator is not optimized over an unconstrained set and is constrained to smaller classes such as convolutional neural nets. As shown in [9, 10], constraining the discriminator is in fact necessary to guarantee good generalization properties for a GAN's learned model. 
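As a quick sanity check on the minimax objective (1), the following numpy sketch (our illustration, not the paper's code; all names are ours) estimates the discriminator objective from samples. At the unconstrained equilibrium, D(x) = 1/2 everywhere and the objective equals −log 4, consistent with the JS-divergence reduction in (2) up to constants.

```python
import numpy as np

def vanilla_gan_objective(d_real, d_fake):
    """Monte-Carlo estimate of the inner objective in (1):
    E[log D(X)] + E[log(1 - D(G(Z)))], given the discriminator's
    outputs on real and generated samples (values in (0, 1))."""
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# At the equilibrium of the unconstrained game, D(x) = 1/2 everywhere:
d_real = np.full(1000, 0.5)
d_fake = np.full(1000, 0.5)
value = vanilla_gan_objective(d_real, d_fake)  # equals -log 4, about -1.3863
```

Replacing the constant arrays with actual discriminator outputs gives the quantity that the generator minimizes and the discriminator maximizes during training.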
Then, how does the minimum divergence interpretation illustrated in Figure 1a change after we constrain the discriminator? An existing approach used in [10, 11] is to view the maximum discriminator objective as a discriminator class F-based distance between probability distributions. For unconstrained F, the F-based distance reduces to the original divergence measure, e.g. the JS-divergence in vanilla GAN.\nWhile [10] demonstrates a useful application of F-based distances in analyzing GANs' generalization properties, the connection between F-based distances and the original divergence score remains unclear for a constrained F. What, then, is the probabilistic interpretation of the GAN minimax game in practice, where a constrained discriminator is used? In this work, we address this question by interpreting the dual problem to the discriminator maximization problem. To analyze the dual problem, we develop a convex duality framework for divergence minimization problems with generalized moment matching constraints. We apply this convex duality framework to the f-divergence and Wasserstein distance families, providing interpretations for f-GAN, including vanilla GAN minimizing the JS-divergence, and Wasserstein GAN.\nSpecifically, we generalize [1]'s interpretation of the vanilla GAN problem (1), which only holds for an unconstrained discriminator set, to the more general case with linear space discriminator sets. 
Under this assumption, we interpret vanilla GAN as the following JS-divergence minimization between two sets of probability distributions (Figure 1b), the generative models and the discriminator moment-matching models:\n\nmin_{G∈G} min_{Q∈PF(PX)} JSD(PG(Z), Q).    (3)\n\nHere PF(PX) denotes the set of discriminator moment matching models, containing any distribution Q that satisfies the moment matching constraints E_Q[D(X)] = E_{PX}[D(X)] for every discriminator D ∈ F.\nMore generally, we show that a similar interpretation holds for GANs trained over convex discriminator sets. We also discuss the application of our duality framework to neural net discriminators with bounded Lipschitz constants. While a set of neural network functions is not necessarily convex, we prove that a convex combination of Lipschitz-bounded neural nets can be approximated by uniformly combining boundedly-many neural net functions. This result applied to our duality framework shows that the convex duality interpretation approximately holds for neural net discriminators.\nAs a byproduct, we apply the duality framework to the infimal convolution hybrid of f-divergence and the first-order Wasserstein (W1) distance, e.g. the following hybrid of the JS-divergence and the W1 distance:\n\ndJSD,W1(P1, P2) := min_Q W1(P1, Q) + JSD(Q, P2).    (4)\n\nWe prove that, unlike the JS-divergence, this hybrid divergence changes continuously and remedies the undesired discontinuous behavior of the JS-divergence in optimizing generator parameters for vanilla GAN. [4] observes this issue with minimizing the JS-divergence in vanilla GAN and proposes to instead minimize the continuously-changing W1 distance in WGAN. However, as empirically demonstrated in [12], vanilla GAN with a Lipschitz-bounded discriminator results in superior and state-of-the-art generative models over multiple benchmark tasks. 
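The JS-divergence at the heart of (3) and (4) can be probed numerically on small discrete examples. The sketch below (our illustration; function names are ours) computes JSD from its KL form, JSD(P, Q) = (1/2)KL(P‖M) + (1/2)KL(Q‖M) with M = (P + Q)/2, as recalled in Section 2, and verifies symmetry, the log 2 bound, and agreement with the f-divergence generated by fJSD.

```python
import numpy as np

def kl(p, q):
    # KL(P||Q) for discrete distributions with full support
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    m = 0.5 * (p + q)                      # mid-distribution M = (P+Q)/2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def f_divergence(p, q, f):
    # d_f(P,Q) = E_P[f(q(X)/p(X))], the paper's convention in Section 2.2
    return float(np.sum(p * f(q / p)))

# generator of the JS-divergence: f(t) = (t/2) log t - ((t+1)/2) log((t+1)/2)
f_jsd = lambda t: 0.5 * t * np.log(t) - 0.5 * (t + 1.0) * np.log(0.5 * (t + 1.0))

p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
```

Both routes, the KL form and the f-divergence form, compute the same quantity, which is the identity the vanilla GAN analysis relies on.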
In this paper, we leverage the convex duality framework to prove that the infimal convolution hybrid dJSD,W1, possessing the same desired continuity property as the W1-distance, is in fact the divergence score minimized in vanilla GAN with a Lipschitz-bounded discriminator. Hence, our analysis provides an explanation for why regularizing the discriminator's Lipschitz constant via gradient penalty [13] or spectral normalization [12] greatly improves the training performance in vanilla GAN. We then extend our focus to the infimal convolution hybrid between the f-divergence and the second-order Wasserstein (W2) distance. In this case, we derive the f-GAN (e.g. vanilla GAN) problem with its discriminator being adversarially trained over the generator's samples. We numerically evaluate the power of these hybrid divergences and their implied regularization schemes for training GANs.\n\n2 Divergence Measures\n\n2.1 Jensen-Shannon divergence\n\nThe Jensen-Shannon divergence is defined in terms of the KL-divergence (denoted by KL) as\n\nJSD(P, Q) := (1/2) KL(P‖M) + (1/2) KL(Q‖M),\n\nwhere M = (P + Q)/2 is the mid-distribution between P and Q. Unlike the KL-divergence, the JS-divergence is symmetric, JSD(P, Q) = JSD(Q, P), and bounded, 0 ≤ JSD(P, Q) ≤ log 2.\n\n2.2 f-divergence\n\nThe f-divergence family [14] generalizes the KL and JS divergence measures. Given a convex lower semicontinuous function f with f(1) = 0, the f-divergence df is defined as\n\ndf(P, Q) := E_P[f(q(X)/p(X))] = ∫ p(x) f(q(x)/p(x)) dx.    (5)\n\nHere E_P denotes expectation over distribution P, and p, q denote the density functions of distributions P, Q, respectively. The KL-divergence and the JS-divergence are members of the f-divergence family, corresponding respectively to fKL(t) = t log t and fJSD(t) = (t/2) log t − ((t+1)/2) log((t+1)/2).\n\n2.3 Optimal transport cost, Wasserstein distance\n\nThe optimal transport cost for cost function c(x, x′), which we denote by Wc, is defined as\n\nWc(P, Q) := inf_{M∈Π(P,Q)} E[c(X, X′)],    (6)\n\nwhere Π(P, Q) contains all couplings with marginals P, Q. The Kantorovich duality [15] shows that for a non-negative lower semi-continuous cost c,\n\nWc(P, Q) = max_{D c-concave} E_P[D(X)] − E_Q[Dc(X)],    (7)\n\nwhere we use Dc to denote D's c-transform, defined as Dc(x) := sup_{x′} D(x′) − c(x, x′), and call D c-concave if D is the c-transform of a valid function. An important special case is the first-order Wasserstein (W1) distance, corresponding to the norm cost c(x, x′) = ‖x − x′‖, i.e.\n\nW1(P, Q) := inf_{M∈Π(P,Q)} E[‖X − X′‖].    (8)\n\nFor the norm cost function, a function D is c-concave if and only if D is 1-Lipschitz, and the c-transform Dc = D holds for a 1-Lipschitz D. Therefore, the Kantorovich duality result (7) implies\n\nW1(P, Q) = max_{D 1-Lipschitz} E_P[D(X)] − E_Q[D(X)].    (9)\n\nIn this paper, we also consider and analyze the second-order Wasserstein (W2) distance, corresponding to the norm-squared cost c(x, x′) = ‖x − x′‖², defined as\n\nW2(P, Q) := inf_{M∈Π(P,Q)} E[‖X − X′‖²]^{1/2}.    (10)\n\n3 Divergence minimization in GANs: a convex duality framework\n\nIn this section, we develop a convex duality framework for analyzing divergence minimization problems subject to moment-matching constraints. Our framework generalizes the duality framework developed in [16] for the f-divergence family.\nFor a general divergence measure d(P, Q), we define d's convex conjugate for distribution P, which we denote by d*_P, as the following operator mapping a real-valued function with domain X to a real number:\n\nd*_P(D) := sup_Q E_Q[D(X)] − d(P, Q).    (11)\n\nHere the supremum is over all distributions on the support set X. The following theorem connects this operation to divergence minimization problems under moment matching constraints. In the next section, we discuss the application of this theorem in deriving several well-known GAN formulations for the divergence measures discussed in Section 2.\nTheorem 1. Suppose divergence d(P, Q) is non-negative, lower semicontinuous and convex in distribution Q. Consider a convex set of continuous functions F and assume the support set X is compact. Then,\n\nmin_{G∈G} max_{D∈F} E_{PX}[D(X)] − d*_{PG(Z)}(D) = min_{G∈G} min_Q { d(PG(Z), Q) + max_{D∈F} { E_{PX}[D(X)] − E_Q[D(X)] } }.    (12)\n\nProof. 
We defer the proof to the Appendix.\n\nTheorem 1 interprets the LHS minimax problem in (12) as finding the closest generative model to a set of distributions penalized to share the same generalized moments, specified by the discriminators in F, with PX. The following corollary of Theorem 1 shows that if we further assume F is a linear space, then the additive penalty term penalizing the worst-case moment mismatch turns into hard constraints in the discriminator optimization problem. This result reveals a divergence minimization problem between the generative models and the following set PF(P), which we call the discriminator moment matching models:\n\nPF(P) := { Q : ∀D ∈ F, E_Q[D(X)] = E_P[D(X)] }.    (13)\n\nCorollary 1. In Theorem 1, suppose F is also a linear space, i.e. for any D1, D2 ∈ F and λ ∈ R we have D1 + D2 ∈ F and λD1 ∈ F. Then,\n\nmin_{G∈G} max_{D∈F} E_{PX}[D(X)] − d*_{PG(Z)}(D) = min_{G∈G} min_{Q∈PF(PX)} d(PG(Z), Q).    (14)\n\nIn the next section, we apply this duality framework to the divergence measures discussed in Section 2 and show how to derive various GAN problems through this convex duality framework.\n\n4 Duality framework applied to different divergence measures\n\n4.1 f-divergence: f-GAN and vanilla GAN\n\nTheorem 2 shows the application of Theorem 1 to an f-divergence. Here we use f* to denote f's convex conjugate [17], defined as f*(u) := sup_t ut − f(t). Theorem 2 applies to a general f-divergence df as long as the convex conjugate f* is a non-decreasing function, a condition met by all f-divergence examples discussed in [3] with the only exception of the Pearson χ²-divergence.\nTheorem 2. Consider an f-divergence df where the corresponding f has a non-decreasing convex conjugate f*. In addition to the assumptions in Theorem 1, suppose that F is closed under adding constants, i.e. 
D + λ ∈ F for any D ∈ F, λ ∈ R. Then, the minimax problem in the LHS of (12) and (14) reduces to\n\nmin_{G∈G} max_{D∈F} E[D(X)] − E[f*(D(G(Z)))].    (15)\n\nProof. We defer the proof to the Appendix.\n\nThe minimax problem (15) is the f-GAN problem introduced and discussed in [3]. Therefore, Theorem 2 reveals that f-GAN searches for the generative model minimizing the f-divergence to the discriminator moment matching models specified by discriminator set F. The following example shows the application of this result to the vanilla GAN introduced in the original GAN work [1].\nExample 1. Consider the JS-divergence, i.e. the f-divergence corresponding to fJSD(t) = (t/2) log t − ((t+1)/2) log((t+1)/2). Then, (15) up to additive and multiplicative constants reduces to\n\nmin_{G∈G} max_{D∈F} E[D(X)] + E[log(1 − exp(D(G(Z))))].    (16)\n\nMoreover, if for function set F̃ the corresponding F = {D : D(x) = −log(1 + exp(D̃(x))), D̃ ∈ F̃} is a convex set, then (16) reduces to the following minimax game, which is the vanilla GAN problem (1) with the sigmoid activation applied to the discriminator output:\n\nmin_{G∈G} max_{D̃∈F̃} E[log(1/(1 + exp(D̃(X))))] + E[log(exp(D̃(G(Z)))/(1 + exp(D̃(G(Z)))))].    (17)\n\n4.2 Optimal Transport Cost: Wasserstein GAN\n\nTheorem 3. Let divergence d be an optimal transport cost Wc where c is a non-negative lower semi-continuous cost function. Then, the minimax problem in the LHS of (12) and (14) reduces to\n\nmin_{G∈G} max_{D∈F} E[D(X)] − E[Dc(G(Z))].    (18)\n\nProof. We defer the proof to the Appendix.\n\nTherefore, the minimax game between G and D in (18) can be viewed as minimizing the optimal transport cost between the generative models and the distributions matching moments over F with PX's moments. The following example applies this result to the first-order Wasserstein distance and recovers the WGAN problem [4] with a constrained 1-Lipschitz discriminator.\nExample 2. Let the optimal transport cost in (18) be the W1 distance, and suppose F is a convex subset of 1-Lipschitz functions. Then, the minimax problem (18) reduces to\n\nmin_{G∈G} max_{D∈F} E[D(X)] − E[D(G(Z))].    (19)\n\nTherefore, the moment-matching interpretation also holds for WGAN: for a convex set F of 1-Lipschitz functions, WGAN finds the generative model with minimum W1 distance to the distributions penalized to share the same moments over F with the data distribution. We discuss two more examples in the Appendix: 1) for the indicator cost cI(x, x′) = I(x ≠ x′) corresponding to the total variation distance, we draw the connection to the energy-based GAN [8]; 2) for the second-order cost c2(x, x′) = ‖x − x′‖², we recover [9]'s quadratic GAN formulation under the LQG setting assumptions, i.e. linear generator, quadratic discriminator and Gaussian input data.\n\n5 Duality framework applied to neural net discriminators\n\nWe applied the duality framework to analyze GAN problems with convex discriminator sets. However, a neural net set Fnn = {fw : w ∈ W}, where fw denotes a neural net function with a fixed architecture and weights w in feasible set W, does not generally satisfy this convexity assumption. Note that a linear combination of several neural net functions in Fnn may not remain in Fnn. Therefore, we apply the duality framework to Fnn's convex hull, which we denote by conv(Fnn), containing any convex combination of neural net functions in Fnn. 
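Stepping back to the f-GAN objective (15): the convex conjugate f*(u) = sup_t ut − f(t) used in Theorem 2 can be approximated by a simple grid search, which is a convenient way to validate a closed form. The sketch below (ours, not from the paper) does this for fKL(t) = t log t, whose conjugate is known to be f*(u) = e^(u−1), and then evaluates the f-GAN objective on given discriminator outputs.

```python
import numpy as np

def conjugate(f, u, ts):
    # numerical convex conjugate: f*(u) = sup_t (u*t - f(t)), over grid ts
    return float(np.max(u * ts - f(ts)))

def f_gan_objective(d_real, d_fake, f_star):
    # E[D(X)] - E[f*(D(G(Z)))], the objective of (15)
    return float(np.mean(d_real) - np.mean(f_star(d_fake)))

f_kl = lambda t: t * np.log(t)            # generator of the KL-divergence
ts = np.linspace(1e-6, 20.0, 200001)      # grid over t > 0, step 1e-4

# f*_KL(u) = exp(u - 1); the grid estimate at u = 1 should be close to 1:
approx = conjugate(f_kl, 1.0, ts)
```

Swapping in fJSD and its conjugate recovers the vanilla GAN objective of Example 1; only the conjugate changes across the f-GAN family.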
However, a convex combination of infinitely-many neural nets from Fnn is characterized by infinitely-many parameters, which makes optimizing the discriminator over conv(Fnn) computationally intractable. In the following theorem, we show that although a function in conv(Fnn) is a combination of infinitely-many neural nets, that function can be approximated by uniformly combining boundedly-many neural nets in Fnn.\nTheorem 4. Suppose any function fw ∈ Fnn is L-Lipschitz and bounded as |fw(x)| ≤ M. Also, assume that the k-dimensional random input X is norm-bounded as ‖X‖2 ≤ R. Then, any function in conv(Fnn) can be uniformly approximated over the ball ‖x‖2 ≤ R within ε-error by a uniform combination f̂(x) = (1/m) Σ_{i=1}^m fwi(x) of m = O(M²k log(LR/ε)/ε²) functions (fwi)_{i=1}^m ∈ Fnn.\n\nProof. We defer the proof to the Appendix.\n\nThe above theorem suggests using a uniform combination of multiple discriminator nets to find a better approximation of the solution to the divergence minimization problem in Theorem 1 solved over conv(Fnn). Note that this approach is different from MIX-GAN [10], proposed for achieving equilibrium in the GAN minimax game. While our approach considers a uniform combination of multiple neural nets as the discriminator, MIX-GAN considers a randomized combination of the minimax game over multiple neural net discriminators and generators.\n\n6 Infimal Convolution hybrid of f-divergence and Wasserstein distance: GAN with Lipschitz or adversarially-trained discriminator\n\nHere we apply the convex duality framework to a novel class of divergence measures. 
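The approximation phenomenon behind Theorem 4 can be visualized with a toy experiment (our illustration; the "nets" here are single tanh units and all sizes are arbitrary): sampling m indices from the mixture weights and averaging uniformly tracks the convex combination over a grid, with uniform error on the order of 1/√m.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nets, m = 500, 2000
W = rng.normal(size=n_nets)
b = rng.normal(size=n_nets)
alpha = rng.dirichlet(np.ones(n_nets))      # convex-combination weights
xs = np.linspace(-3.0, 3.0, 301)            # evaluation grid

F = np.tanh(np.outer(W, xs) + b[:, None])   # each row: one "net" on the grid
g = alpha @ F                               # target element of conv(F_nn)

idx = rng.choice(n_nets, size=m, p=alpha)   # sample m nets from the mixture
g_hat = F[idx].mean(axis=0)                 # uniform combination of m nets
err = float(np.max(np.abs(g - g_hat)))      # uniform (sup) approximation error
```

With bounded functions the pointwise standard deviation of the m-term average is at most of order 1/√m, so the uniform error here stays far below the trivial range of the functions.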
For an f-divergence df, we define the divergence score df,W1, which we call the infimal convolution hybrid of the df and W1 divergence measures, as follows:\n\ndf,W1(P1, P2) := inf_Q W1(P1, Q) + df(Q, P2).    (20)\n\nThe above infimum is taken over all distributions on the support set X, finding the distribution Q* minimizing the sum of the Wasserstein distance between P1 and Q and the f-divergence from Q to P2. Earlier in the introduction, we mentioned and discussed a special case of the above definition for the hybrid between the JS-divergence and the W1-distance. While the f-divergence in f-GAN, e.g. the JS-divergence in vanilla GAN, does not change continuously with the generator parameters, the following theorem proves that, similar to the continuous behavior of the W1-distance shown in [18, 4], the infimal convolution hybrid divergence changes continuously with the generative model.\nTheorem 5. Suppose Gθ ∈ G is continuously changing with parameters θ. Then, for any Q and Z, df,W1(PGθ(Z), Q) will behave continuously as a function of θ. Moreover, if Gθ is assumed to be locally Lipschitz, then df,W1(PGθ(Z), Q) will be differentiable w.r.t. θ almost everywhere.\n\nProof. We defer the proof to the Appendix.\n\nOur next result reveals the minimax problem dual to minimizing this hybrid divergence with a symmetric f-divergence component. We note that this symmetry condition is met by the JS-divergence and the squared Hellinger divergence among the f-divergence examples discussed in [3].\nTheorem 6. Consider df,W1 with a symmetric f-divergence df, i.e. df(P, Q) = df(Q, P), satisfying the assumptions in Theorem 2. If the composition f* ∘ D is 1-Lipschitz for all D ∈ F, the minimax problem in Theorem 1 for the hybrid df,W1 reduces to the f-GAN problem, i.e.\n\nmin_{G∈G} max_{D∈F} E[D(X)] − E[f*(D(G(Z)))].    (21)\n\nProof. 
We defer the proof to the Appendix.\n\nThe above theorem reveals that if the discriminator's Lipschitz constant in f-GAN is properly regularized, then solving the f-GAN problem over the regularized discriminator in fact minimizes the continuously-changing divergence df,W1. As a special case, in vanilla GAN (17) we only need to constrain the discriminator D̃ to be 1-Lipschitz, which can be done via the gradient penalty regularization [13] or the spectral normalization of D̃'s weight matrices [12]. Therefore, using these techniques we indeed minimize the continuously-behaving divergence score dJSD,W1. These results are consistent with [12]'s empirical results indicating that regularizing the discriminator's Lipschitz constant improves the training performance in vanilla GAN.\nOur discussion has so far focused on convolving the f-divergence with the first-order Wasserstein distance, which translates into training f-GAN with a Lipschitz-bounded discriminator. As another solution, we show that the desired continuity property can also be achieved through the following infimal convolution with the second-order Wasserstein (W2) distance squared:\n\ndf,W2(P1, P2) := inf_Q W2²(P1, Q) + df(Q, P2).    (22)\n\nTheorem 7. Suppose Gθ ∈ G continuously changes with parameters θ ∈ R^k. Then, for any distribution Q and random vector Z, df,W2(PGθ(Z), Q) will be continuous in θ. Also, if we further assume Gθ is bounded and locally-Lipschitz w.r.t. θ, then the divergence df,W2(PGθ(Z), Q) is almost everywhere differentiable w.r.t. θ.\n\nProof. We defer the proof to the Appendix.\n\nThe following result shows that minimizing df,W2 reduces to the f-GAN problem where the discriminator is being adversarially trained.\nTheorem 8. Assume df and F satisfy the assumptions in Theorem 6. 
Then, the minimax problem in Theorem 1 corresponding to the hybrid df,W2 divergence reduces to\n\nmin_{G∈G} max_{D∈F} E[D(X)] + E[min_u −f*(D(G(Z) + u)) + ‖u‖²].    (23)\n\nProof. We defer the proof to the Appendix.\n\nThe above result reduces minimizing the hybrid df,W2 divergence to an f-GAN minimax game with an additional third player. Here, the third player assists the generator by perturbing the generated fake samples in order to make them harder to distinguish from the real samples by the discriminator. The cost for perturbing a fake sample G(Z) to G(Z) + u is proportional to ‖u‖², constraining the power of the third player, who plays adversarially against the discriminator. To implement the game between the three players, we can adversarially learn the discriminator while we are training the GAN, via the Wasserstein risk minimization (WRM) adversarial learning scheme discussed in [19].\n\n7 Numerical Experiments\n\nTo evaluate our theoretical results, we used the CelebA [20] and LSUN-bedroom [21] datasets. Furthermore, in the Appendix we include the results of our experiments over the MNIST [22] dataset. We considered vanilla GAN [1] with the minimax formulation in (17) and the DCGAN [23] convolutional architecture for the neural net discriminator and generator. We used the code provided by [13] and trained DCGAN via the Adam optimizer [24] for 200,000 generator iterations. We applied 5 discriminator updates per generator update.\nFigure 2 shows how the discriminator loss evaluated over 2000 validation samples, which serves as an estimate of the divergence measure, changed as we trained the DCGAN over LSUN samples. 
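The WRM-trained setting reported below implements the third player of (23); before using a gradient-based inner solver, the inner minimization over u can be prototyped by grid search, as in the 1-D toy sketch below (ours; the choices D(x) = x and f*(v) = v are illustrative placeholders, not the paper's setup). For these choices the inner problem is the quadratic −(g + u) + u², minimized at u = 1/2 with value −g − 1/4.

```python
import numpy as np

def perturbed_fake_term(fakes, D, f_star, us):
    """Third-player term of (23): for each fake sample g, evaluate
    min_u  -f*(D(g + u)) + u**2  by grid search over candidates us."""
    vals = [np.min(-f_star(D(g + us)) + us ** 2) for g in fakes]
    return float(np.mean(vals))

D = lambda x: x                   # toy 1-Lipschitz "discriminator"
f_star = lambda v: v              # toy non-decreasing convex conjugate
us = np.linspace(-2.0, 2.0, 401)  # candidate perturbations, step 0.01

fakes = np.array([0.0, 1.0])
term = perturbed_fake_term(fakes, D, f_star, us)  # (-1/4 + -5/4) / 2 = -3/4
```

In the actual experiments the inner problem is solved approximately by gradient steps on u (WRM [19]) rather than by grid search, which only scales to toy dimensions.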
Using standard DCGAN regularized by only batch normalization (BN) [25], we observed (Figure 2, top left) that the JS-divergence estimate always remained close to its maximum value log_2 2 = 1 and also correlated poorly with the visual quality of the generated samples. In this experiment, the vanilla GAN training failed and led to mode collapse starting at about the 110,000th iteration. On the other hand, after replacing BN with two different Lipschitz regularization techniques, spectral normalization (SN) [12] and gradient penalty (GP) [13], to ensure that the discriminator is 1-Lipschitz, the discriminator loss decreased in a continuous, monotonic fashion (Figure 2, top right and bottom left).\nThese observations are consistent with Theorems 5 and 6, which show that the discriminator loss becomes an estimate of the infimal convolution hybrid dJSD,W1 divergence, which behaves continuously with the generator parameters. Also, the samples generated by the Lipschitz-regularized DCGAN looked qualitatively better and correlated well with the estimate of the dJSD,W1 divergence. Figure 2, bottom right, shows that a similar desired behavior, with a nicely monotonic decrease in the discriminator's loss, can also be achieved by minimizing the second-order hybrid divergence dJSD,W2. In this experiment, we trained the discriminator in vanilla GAN via the Wasserstein risk minimization (WRM) adversarial learning scheme [19].\n\nFigure 2: Divergence estimate in DCGAN trained over LSUN samples, (top-left) JS-divergence in DCGAN regularized with batch normalization (BN), (top-right) hybrid dJSD,W1 in DCGAN with a 1-Lipschitz spectrally-normalized (SN) discriminator, (bottom-left) hybrid dJSD,W1 in DCGAN with a 1-Lipschitz discriminator regularized via the gradient penalty (GP), (bottom-right) hybrid dJSD,W2 in DCGAN with the discriminator being adversarially-trained using WRM.\n\nFigure 3 shows the results of similar experiments over the CelebA dataset. 
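The SN curves above rely on spectral normalization, which divides each weight matrix by an estimate of its largest singular value obtained via power iteration; with 1-Lipschitz activations, the product of per-layer spectral norms then upper-bounds the discriminator's Lipschitz constant. The numpy sketch below shows the core computation on a synthetic weight matrix with a known spectrum ([12] instead runs one power-iteration step per training update inside the network, which we do not reproduce).

```python
import numpy as np

def spectral_norm(W, n_iter=100, seed=1):
    # power-iteration estimate of the largest singular value of W
    u = np.random.default_rng(seed).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)

# Build an 8x5 "layer" with known singular values (3, 2, 1, 0.5, 0.1):
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(8, 5)))
V, _ = np.linalg.qr(rng.normal(size=(5, 5)))
W = U @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1]) @ V.T

W_sn = W / spectral_norm(W)   # spectrally-normalized layer, spectral norm ~ 1
```

Because only a matrix-vector product pair is needed per step, this estimate is cheap enough to maintain for every layer during training.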
Again, we observed (Figure 3, top left) that the JS-divergence estimate remains close to 1 while training DCGAN with BN. However, after applying the two different Lipschitz regularization methods, SN and GP (Figure 3, top right and bottom left), we observed that the hybrid dJSD,W1 changed nicely and monotonically, and correlated well with the quality of the generated samples. Figure 3, bottom right, shows that a similar desired behavior can also be obtained by minimizing the second-order infimal convolution hybrid dJSD,W2 divergence. We defer the presentation of some random samples generated by the generators trained in these experiments to the Appendix.\n\n8 Related Work\n\nTheoretical studies of GANs have focused on three different aspects: approximation, generalization, and optimization. Regarding the approximation properties of GANs, [11] studies GANs' approximation power through a moment-matching approach. The authors view the maximized discriminator objective as an F-based adversarial divergence, showing that the adversarial divergence between two distributions will be at its minimum value if the two distributions have the same generalized moments specified by F. Our convex duality framework provides a dual interpretation of their results and draws the connection between the adversarial divergence and the original divergence scores. 
[26] studies the f-GAN problem through an information-geometric approach and the connection between the Bregman divergence and the f-divergence.\n\nFigure 3: Divergence estimate in DCGAN trained over CelebA samples, (top-left) JS-divergence in DCGAN regularized with batch normalization, (top-right) hybrid dJSD,W1 in DCGAN with a 1-Lipschitz spectrally-normalized discriminator, (bottom-left) hybrid dJSD,W1 in DCGAN with a 1-Lipschitz discriminator regularized via the gradient penalty, (bottom-right) hybrid dJSD,W2 in DCGAN with its discriminator being adversarially-trained using WRM.\n\nAnalyzing the generalization performance of GANs has been another problem of interest in the machine learning literature. [10] proves generalization guarantees for GANs in terms of the F-based distance measures. [27] uses an elegant approach based on the birthday paradox to empirically study the generalizability of a GAN's learned models. [28] develops a quantitative approach for examining diversity and generalization for a GAN's learned distribution. [29] studies approximation-generalization trade-offs in GANs by analyzing the discriminative power of F-based distances.\nRegarding the optimization aspects of GANs, [30, 31] propose duality-based methods for improving optimization performance in training deep generative models. [32] suggests convolving the data distribution with a Gaussian distribution for regularizing the learning problem in f-GANs. Moreover, several other works including [33, 34, 35, 9, 36] explore the optimization and stability properties of GANs. 
We also note that the same convex analysis approach used in this paper for studying GANs has also provided several powerful frameworks for analyzing other supervised and unsupervised learning problems [37, 38, 39, 40, 41].

Acknowledgments: We are grateful for support under a Stanford Graduate Fellowship, the National Science Foundation grant under CCF-1563098, and the Center for Science of Information (CSoI), an NSF Science and Technology Center under grant agreement CCF-0939370.

References

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[2] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

[3] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. International Conference on Machine Learning, 2017.

[5] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.

[6] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.

[7] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2200–2210, 2017.

[8] Junbo Zhao, Michael Mathieu, and Yann LeCun.
Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

[9] Soheil Feizi, Farzan Farnia, Tony Ginart, and David Tse. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.

[10] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.

[11] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5551–5559, 2017.

[12] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.

[13] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[14] Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.

[15] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[16] Yasemin Altun and Alex Smola. Unifying divergence minimization and statistical inference via convex duality. In International Conference on Computational Learning Theory, pages 139–153, 2006.

[17] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[18] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

[19] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training.
In International Conference on Learning Representations, 2018.

[20] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

[21] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[22] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[23] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[26] Richard Nock, Zac Cranko, Aditya K Menon, Lizhen Qu, and Robert C Williamson. f-GANs in an information geometric nutshell. In Advances in Neural Information Processing Systems, pages 456–464, 2017.

[27] Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224, 2017.

[28] Shibani Santurkar, Ludwig Schmidt, and Aleksander Madry. A classification-based perspective on GAN distributions. arXiv preprint arXiv:1711.00970, 2017.

[29] Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discrimination-generalization tradeoff in GANs. International Conference on Learning Representations, 2018.

[30] Xu Chen, Jiang Wang, and Hao Ge. Training generative adversarial networks via primal-dual subgradient methods: a Lagrangian perspective on GAN.
In International Conference on Learning Representations, 2018.

[31] Shengjia Zhao, Jiaming Song, and Stefano Ermon. The information autoencoding family: A Lagrangian perspective on latent variable generative models. arXiv preprint arXiv:1806.06514, 2018.

[32] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2015–2025, 2017.

[33] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.

[34] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems, pages 1823–1833, 2017.

[35] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

[36] Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. Solving approximate Wasserstein GANs to stationarity. arXiv preprint arXiv:1802.08249, 2018.

[37] Miroslav Dudík, Steven J Phillips, and Robert E Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8(Jun):1217–1260, 2007.

[38] Meisam Razaviyayn, Farzan Farnia, and David Tse. Discrete Rényi classifiers. In Advances in Neural Information Processing Systems, pages 3276–3284, 2015.

[39] Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4240–4248, 2016.

[40] Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective.
In Advances in Neural Information Processing Systems, pages 559–567, 2016.

[41] Rizal Fathony, Mohammad Ali Bashiri, and Brian Ziebart. Adversarial surrogate losses for ordinal regression. In Advances in Neural Information Processing Systems, pages 563–573, 2017.