{"title": "A Domain Agnostic Measure for Monitoring and Evaluating GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 12092, "page_last": 12102, "abstract": "Generative Adversarial Networks (GANs) have shown remarkable results in modeling complex distributions, but their evaluation remains an unsettled issue. Evaluations are essential for: (i) relative assessment of different models and (ii) monitoring the progress of a single model throughout training. The latter cannot be determined by simply inspecting the generator and discriminator loss curves as they behave non-intuitively. We leverage the notion of duality gap from game theory to propose a measure that addresses both (i) and (ii) at a low computational cost. Extensive experiments show the effectiveness of this measure to rank different GAN models and capture the typical GAN failure scenarios, including mode collapse and non-convergent behaviours. This evaluation metric also provides meaningful monitoring on the progression of the loss during training. It highly correlates with FID on natural image datasets, and with domain specific scores for text, sound and cosmology data where FID is not directly suitable. In particular, our proposed metric requires no labels or a pretrained classifier, making it domain agnostic.", "full_text": "A Domain Agnostic Measure for Monitoring and\n\nEvaluating GANs\n\nPaulina Grnarova\u2217\n\nK\ufb01r Y. Levy\n\nETH Zurich\n\nTechnion-Israel Institute of Technology\n\nAurelien Lucchi\n\nETH Zurich\n\nNathana\u00ebl Perraudin\n\nIan Goodfellow\n\nThomas Hofmann\n\nAndreas Krause\n\nSwiss Data Science Center\n\nETH Zurich\n\nETH Zurich\n\nAbstract\n\nGenerative Adversarial Networks (GANs) have shown remarkable results in mod-\neling complex distributions, but their evaluation remains an unsettled issue. 
Evaluations are essential for: (i) relative assessment of different models and (ii) monitoring the progress of a single model throughout training. The latter cannot be determined by simply inspecting the generator and discriminator loss curves as they behave non-intuitively. We leverage the notion of duality gap from game theory to propose a measure that addresses both (i) and (ii) at a low computational cost. Extensive experiments show the effectiveness of this measure to rank different GAN models and capture the typical GAN failure scenarios, including mode collapse and non-convergent behaviours. This evaluation metric also provides meaningful monitoring on the progression of the loss during training. It highly correlates with FID on natural image datasets, and with domain specific scores for text, sound and cosmology data where FID is not directly suitable. In particular, our proposed metric requires no labels or a pretrained classifier, making it domain agnostic.

1 Introduction

In recent years, a large body of research has focused on practical and theoretical aspects of Generative adversarial networks (GANs) [9]. This has led to the development of several GAN variants [24, 2] as well as some evaluation metrics such as FID or the Inception score that are both data-dependent and dedicated to images. A domain independent quantitative metric is, however, still a key missing ingredient that hinders further developments.

One of the main reasons behind the lack of such a metric originates from the nature of GANs, which implement an adversarial game between two players, namely a generator and a discriminator. Let us denote the data distribution by p_data(x), the model distribution by p_u(x) and the prior over latent variables by p_z. A probabilistic discriminator is denoted by D_v : x → [0, 1] and a generator by G_u : z → x. 
The GAN objective is:

min_u max_v M(u, v) = 1/2 E_{x∼p_data} log D_v(x) + 1/2 E_{z∼p_z} log(1 − D_v(G_u(z))).   (1)

Each of the two players tries to optimize their own objective, which is exactly balanced by the loss of the other player, thus yielding a two-player zero-sum minimax game. The minimax nature of the objective and the use of neural networks as players make the process of learning a generative model challenging. We focus our attention on two of the central open issues behind these difficulties and how they translate to a need for an assessment metric.

∗Correspondence to paulina.grnarova@inf.ethz.ch

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Comparison of information obtained by different metrics for likelihood and minmax-based models. The red dashed line corresponds to the optimal point for stopping the training.

i) Convergence metric. The need for an adequate convergence metric is especially relevant given the difficulty of training GANs: current approaches often fail to converge [31] or oscillate between different modes of the data distribution [21]. The ability to reliably detect non-convergent behavior has been pointed out as an open problem in many previous works, e.g., by [20], as a stepping stone towards a deeper analysis of which GAN variants converge. Such a metric is not only important for driving the research efforts forward, but from a practical perspective as well. Deciding when to stop training is difficult as the curves of the discriminator and generator losses oscillate (see Fig. 1) and are non-informative as to whether the model is improving or not [2]. 
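Concretely, the value M(u, v) of Eq. (1) can be estimated from two batches of discriminator outputs. A minimal numpy sketch (the function name and the eps clamp are our own, for illustration only):

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Empirical estimate of the zero-sum objective M(u, v) of Eq. (1):
    1/2 E[log D(x)] + 1/2 E[log(1 - D(G(z)))], computed from the
    discriminator's probabilities on a real batch and a generated batch."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(0.5 * np.mean(np.log(d_real + eps))
                 + 0.5 * np.mean(np.log(1.0 - d_fake + eps)))

# A maximally confused discriminator (D = 0.5 everywhere) yields -log 2,
# the value of the game when the generator matches p_data exactly.
v_balanced = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

A discriminator that separates real from fake samples well pushes this value above −log 2, which is what both the minimax loss and the duality gap below exploit.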
This is especially troublesome when\na GAN is trained on non-image data in which case one might not be able to use visual inspection or\nFID/Inception scores as a proxy.\n\nii) Evaluation metric Another key problem we address is the relative comparison of the learned\ngenerative models. While several evaluation metrics exist, there is no clear consensus regarding\nwhich metric is the most appropriate. Many metrics achieve reasonable discriminability (i.e., ability\nto distinguish generated samples from real ones), but also tend to have a high computational cost.\nSome popular metrics are also speci\ufb01c to image data. We refer the reader to [3] for an in-depth\ndiscussion of the merits and drawbacks of existing evaluation metrics.\n\nIn more traditional likelihood-based models, the train/test curves do address the problems raised in i)\nand ii). For GANs, the generator/discriminator curves (see Fig. 1) are however largely uninformative\ndue to the minimax nature of GANs where both players can undo each other\u2019s progress.\n\nIn this paper, we leverage ideas from game theory to propose a simple and computationally ef\ufb01cient\nmetric for GANs. Our approach is to view GANs as a zero-sum game between a generator G and\ndiscriminator D. From this perspective, \u201csolving\" the game is equivalent to \ufb01nding an equilibrium,\ni.e., a pair (G\u2217, D\u2217) such that no side may increase its utility by unilateral deviation. A natural metric\nfor measuring the sub-optimality (w.r.t. an equilibrium) of a given solution (G, D) is the duality gap\n[33, 22]. We therefore suggest to use it as a metric for GANs akin to a test loss in the likelihood case\n(See Fig. 1 - duality gap2).\n\nThere are several important issues that we address in order to make the duality gap an appropriate\nand practical metric for GANs. 
Our contributions include the following:

• We show that the duality gap allows us to assess the similarity between the generated data and the true data distribution (see Theorem 1).
• We show how to appropriately estimate the duality gap in the typical machine learning scenario where our access to the GAN learning objective is only through samples.
• We provide a computationally efficient way to estimate the duality gap during training.
• In scenarios where one is interested in assessing the quality of the learned generator, we show how to use a related metric – the minimax loss – that takes only the generator into consideration in order to detect mode collapse and measure sample quality.
• We extensively demonstrate the effectiveness of these metrics on a range of datasets, GAN variants and failure modes. Unlike the FID or Inception score that require labelled data or a domain dependent classifier, our metrics are domain independent and do not require labels.

2The curves are obtained for a progressive GAN trained on CelebA

Related work. While several evaluation metrics have been proposed [31, 30, 12, 18], previous research has pointed out various limitations of these metrics, thus leaving the evaluation of GANs as an unsettled issue [20]. Since the data log-likelihood is commonly used to train generative models, it may appear to be a sensible metric for GANs. However, its computation is often intractable, and [32] also demonstrate that it has severe limitations, as it might yield low visual quality samples despite a high likelihood. Perhaps the most popular evaluation metric for GANs is the Inception score [31], which measures both diversity of the generated samples and discriminability. While diversity is measured as the entropy of the output distribution, the discriminability aspect requires a pretrained neural network to assign high scores to images close to training images. 
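The Inception score combines these two aspects in a single number via IS = exp(E_x KL(p(y|x) ‖ p(y))). A minimal sketch of this standard formula, assuming the per-sample class probabilities from the pretrained classifier are already given (the function name is ours):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x KL( p(y|x) || p(y) ) ), computed from an array
    `probs` of per-sample class probabilities, shape (n_samples, n_classes).
    The pretrained classifier producing `probs` is assumed given."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)  # p(y): marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Collapsed output (every sample assigned to the same class) gives the
# minimal score of 1, since p(y|x) coincides with p(y).
is_collapsed = inception_score([[1.0, 0.0], [1.0, 0.0]])
```

This makes the dependence on a domain-specific classifier explicit: without meaningful class probabilities, the score is uninformative, which motivates the domain agnostic metrics proposed here.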
Various modi\ufb01cations of\nthe inception score have been suggested. The Frechet Inception Distance (FID) [12] models features\nfrom a hidden layer as two multivariate Gaussians for the generated and true data. However, the\nGaussian assumption might not hold in practice and labelled data is required in order to train a\nclassi\ufb01er. Without labels, transfer learning is possible to datasets under limited conditions (i.e., the\nsource and target distributions should not be too dissimilar). In [26], two metrics are introduced to\nevaluate a single model playing against past and future versions of itself, as well as to measure the\naptitude of two different fully trained models. In some way, this can be seen as an approximation of\nthe minimax value we advocate in this paper, where instead of doing a full-on optimization in order\nto \ufb01nd the best adversary for the \ufb01xed generator, the search space is limited to discriminators that are\nsnapshots from training, or discriminators trained with different seeds.\n\nThe ideas of duality and equilibria developed in the seminal work of\n[33, 22] have become a\ncornerstone in many \ufb01elds of science, but are relatively unexplored for GANs. Some exceptions are\n[5, 10, 8, 13] but these works do not address the problem of evaluation. Closer to us, game theoretic\nmetrics were previously mentioned in [25], but without a discussion addressing the stochastic nature\nand other practical dif\ufb01culties of GANs, thus not yielding a practical applicable method. We conclude\nour discussion by pointing out the vast literature on duality used in the optimization community as a\nconvergence criterion for min-max saddle point problems, see e.g. [23, 14]. Some recent work uses\nLagrangian duality in order to derive an objective to train GANs [4] or to dualize the discriminator,\ntherefore reformulating the saddle point objective as a maximization problem [17]. 
A similar approach proposed by [7] uses the dual formulation of Wasserstein GANs to train the decoder. Although we also make use of duality, there are significant differences. Unlike prior work, our contribution does not relate to optimising GANs. Instead, we focus on establishing that the duality gap acts as a proxy to measure convergence, which we do theoretically (Th. 1) as well as empirically, the latter requiring a new efficient estimation procedure discussed in Sec. 3.

2 Duality Gap as Performance Measure

Standard learning tasks are often described as (stochastic) optimization problems; this applies to common Deep Learning scenarios as well as to classical tasks such as logistic and linear regression. This formulation gives rise to a natural performance measure, namely the test loss3. In contrast, GANs are formulated as (stochastic) zero-sum games. Unfortunately, this fundamentally different formulation does not allow us to use the same performance metric. In this section, we describe a performance measure for GANs, which naturally arises from a game theoretic perspective. We start with a brief overview of zero-sum games, including a description of the duality gap metric.

A zero-sum game is defined by two players P1 and P2 who choose a decision from their respective decision sets K1 and K2. A game objective M : K1 × K2 → R sets the utilities of the players. Concretely, upon choosing a pure strategy (u, v) ∈ K1 × K2, the utility of P1 is −M(u, v), while the utility of P2 is M(u, v). The goal of either P1/P2 is to maximize their worst case utility:

min_{u∈K1} max_{v∈K2} M(u, v)   (Goal of P1),      max_{v∈K2} min_{u∈K1} M(u, v)   (Goal of P2)   (2)

This formulation raises the question of whether there exists a solution (u∗, v∗) to which both players may jointly converge. 
The latter only occurs if there exists a pair (u∗, v∗) such that neither P1 nor P2 may increase their utility by unilateral deviation. Such a solution is called a pure equilibrium; formally,

max_{v∈K2} M(u∗, v) = min_{u∈K1} M(u, v∗)   (Pure Equilibrium).

3For classification tasks using the zero-one test error is also very natural. Nevertheless, in regression tasks the test loss is often the only reasonable performance measure.

While a pure equilibrium does not always exist, the seminal work of [22] shows that an extended notion of equilibrium always does. Specifically, there always exists a distribution D1 over elements of K1, and a distribution D2 over elements of K2, such that the following holds:

max_{v∈K2} E_{u∼D1} M(u, v) = min_{u∈K1} E_{v∼D2} M(u, v)   (MNE).

Such a solution is called a Mixed Nash Equilibrium (MNE). This notion of equilibrium gives rise to the following natural performance measure of a given pure/mixed strategy.

Definition 1 (Duality Gap). Let D1 and D2 be fixed distributions over elements from K1 and K2 respectively. Then the duality gap DG of (D1, D2) is defined as follows:

DG(D1, D2) := max_{v∈K2} E_{u∼D1} M(u, v) − min_{u∈K1} E_{v∼D2} M(u, v).   (3)

Particularly, for a given pure strategy (u, v) ∈ K1 × K2 we define

DG(u, v) := max_{v′∈K2} M(u, v′) − min_{u′∈K1} M(u′, v).   (4)

Two well-known properties of the duality gap are that it is always non-negative and is exactly zero in (mixed) Nash Equilibria. 
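For intuition, Definition 1 can be evaluated in closed form in a finite zero-sum matrix game, where the best response to a fixed mixed strategy is always attained at a pure strategy. A small numpy sketch (illustrative only, not the GAN setting):

```python
import numpy as np

# Payoff matrix of a finite zero-sum game: entry M[i, j] is the utility of
# P2 (and the loss of P1) when P1 plays pure strategy i and P2 plays j.
M = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])  # rock-paper-scissors

def duality_gap(d1, d2, payoff):
    """DG(D1, D2) = max_v E_{u~D1} M(u, v) - min_u E_{v~D2} M(u, v);
    for a matrix game the inner max/min are attained at pure strategies."""
    d1, d2 = np.asarray(d1, dtype=float), np.asarray(d2, dtype=float)
    return float(np.max(d1 @ payoff) - np.min(payoff @ d2))

uniform = np.ones(3) / 3
dg_mne = duality_gap(uniform, uniform, M)               # zero at the MNE
dg_rock = duality_gap([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], M)  # positive
```

Playing uniformly at random is the mixed Nash equilibrium of rock-paper-scissors, so its gap is exactly zero, while the pure strategy pair (rock, rock) has a strictly positive gap, matching the two properties above.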
These properties are very appealing from a practical point of view, since it\nmeans that the duality gap gives us an immediate handle for measuring convergence.\n\nNext we illustrate the usefulness of the duality gap metric by analyzing the ideal case where both G\nand D have unbounded capacity. The latter notion introduced by [9] means that the generator can\nrepresent any distribution, and the discriminator can represent any decision rule. The next proposition\nshows that in this case, as long as G is not equal to the true distribution then the duality gap is\nalways positive. In particular, we show that the duality gap is at least as large as the Jensen-Shannon\ndivergence between true and fake distributions. We also show that if G outputs the true distribution,\nthen there exists a discriminator such that the duality gap (DG) is zero. See a proof in the Appendix.\n\nTheorem 1 (DG and JSD). Consider the GAN objective in Eq. [1], and assume that the generator\nand discriminator networks have unbounded capacity. Then the duality gap of a given \ufb01xed solution\n(Gu, Dv) is lower bounded by the Jensen-Shannon divergence between the true distribution pdata\nand the fake distribution qu generated by Gu, i.e. DG(u, v) \u2265 JSD(pdata || qu). Moreover, if Gu\noutputs the true distribution, then there exists a discriminator Dv such that DG(Gu, Dv) = 0.\n\nNote that different GAN objectives are known to be related to other types of divergences [24], and\nwe believe that the Theorem above can be generalized to other GAN objectives [2, 11].\n\n3 Estimating the Duality Gap for GANs\n\nAppropriately estimating the duality gap from samples. Supervised learning problems are often\nformulated as stochastic optimization programs, meaning that we may only access estimates of the\nexpected loss by using samples. One typically splits the data into training and test sets 4. 
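Returning briefly to Theorem 1, its bound can be checked on a finite sample space: the pointwise-optimal discriminator D*(x) = p_data(x)/(p_data(x) + q_u(x)) attains max_v M(u, v) = JSD(p_data ‖ q_u) − log 2, while min_{u′} M(u′, v) ≤ −log 2 for any fixed v (take q_{u′} = p_data), which together give DG ≥ JSD. A small numpy sketch of the first identity (helper names are ours; strictly positive distributions assumed):

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two strictly
    positive discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

def best_response_value(p_data, q_gen):
    """max_v M(u, v) for Eq. (1) on a finite sample space: the pointwise
    optimal discriminator is D*(x) = p_data(x) / (p_data(x) + q_gen(x))."""
    p, q = np.asarray(p_data, dtype=float), np.asarray(q_gen, dtype=float)
    d = p / (p + q)
    return float(0.5 * np.sum(p * np.log(d)) + 0.5 * np.sum(q * np.log(1.0 - d)))

p = [0.5, 0.3, 0.2]  # "true" distribution
q = [0.2, 0.3, 0.5]  # "generated" distribution
# Identity underlying Theorem 1: max_v M(u, v) = JSD(p || q) - log 2.
gap_term = best_response_value(p, q)
```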
The training set is used to find a solution whose quality is estimated using a separate test set (which provides an unbiased estimate of the true expected loss). Similarly, GANs are formulated as stochastic zero-sum games (Eq. (1)), but the issue of evaluating the duality gap metric is more delicate. This is because we have three phases in the evaluation: (i) training a model (u, v), (ii) finding the worst case discriminator/generator, v_worst ← arg max_{v∈K2} M(u, v) and u_worst ← arg min_{u∈K1} M(u, v), and (iii) computing the duality gap by estimating DG := M(u, v_worst) − M(u_worst, v). Now since we do not have direct access to the expected objective, one should use different samples for each of the three mentioned phases in order to maintain an unbiased estimate of the expected duality gap. Thus we split our dataset into three disjoint subsets: a training set, an adversary finding set, and a test set, which are respectively used in phases (i), (ii) and (iii).

4Of course, one should also use a validation set, but this is less important for our discussion here.

[Figure 2 here: heatmaps of generated samples for the RING, SPIRAL and GRID datasets at increasing training steps, for stable (top) and unstable (bottom) runs.]

Figure 2: Progression of duality gap (DG) throughout training and heatmaps of generated
samples.\n\nMinimax Loss as a metric for evaluating generators. For all experiments, we report both the\nduality gap (DG) and the minimax loss M (u, vworst). The latter is the \ufb01rst term in the expression of\nthe DG and intuitively measures the \u2018goodness\u2019 of a generator Gu. If Gu is optimal and covers pdata,\nthe minimax loss achieves its optimal value as well. This happens when Dvworst outputs 0.5 for both\nthe real and generated samples. Whenever the generated distribution does not cover the entire support\nof pdata or compromises the sample quality, this is detected by Dvworst and hence, the minimax loss\nincreases. This makes it a compelling metric for detecting mode collapse and evaluating sample\nquality. Note that in order to compute this metric one only needs a batch of generated samples, i.e.\nthe generator can be used as a black-box. Hence, this metric is not limited to generators trained as\npart of a GAN, but can instead be used for any generator that can be sampled from.\n\nPractical and ef\ufb01cient estimation of the duality gap for GANs.\nIn practice, the metrics are\ncomputed by optimizing a separate generator/discriminator using a gradient based algorithm. To\nspeed up the optimization, we initialize the networks using the parameters of the adversary at the\nstep being evaluated. Hence, if we are evaluating the GAN at step t, we train vworst for ut and\nuworst for vt by using vt as a starting point for vworst and analogously, ut as a starting point for\nuworst for a number of \ufb01xed steps. We also explored further approximations of DG, where instead of\nusing optimization to \ufb01nd vworst and uworst, we limit the search space to a set of discriminators and\ngenerators stored as snapshots throughout the training, similarly to [26] (see results in Appendix C.5).\nIn Appendix D we include an in-depth analysis of the quality of the approximation of the DG and\nhow it compares to the true theoretical DG.\n\nNon-negativity DG. 
While DG is non-negative in theory, this might not hold in practice since we only find approximate (u_worst, v_worst). Nevertheless, the practical scheme that we describe above makes sure that we do not get negative values for DG in practice. We elaborate on this in Appendix B.

4 Experimental results

We carefully design a series of experiments to examine commonly encountered failure modes of GANs and analyze how this is reflected by the two metrics. Specifically, we show the sensitivity of the duality gap metric to (non-)convergence and the susceptibility of the minimax loss to reflect the sample quality. Further details and additional extensive experiments can be found in the Appendix. Note that our goal is not to provide a rigorous comparative analysis between different GAN variants, but to show that both metrics capture properties that are particularly useful to monitor training.

4.1 Mixture of Gaussians

We train a vanilla GAN on three toy datasets with increasing difficulty: a) RING: a mixture of 8 Gaussians, b) SPIRAL: a mixture of 20 Gaussians and c) GRID: a mixture of 25 Gaussians. As the true data distribution is known, this setting allows for tracking convergence.

(a) DG   (b) Minimax   (c) Modes   (d) Std

Figure 3: DG, minimax, modes covered and std. Tab. 7 in App. shows Pearson correlation between the metrics.

Duality gap and convergence. Our first goal is to illustrate the relation between convergence and the duality gap. To that end, we analyze the progression of DG throughout training in stable and unstable settings. One common problem of GANs is unstable mode collapse, where the generator alternates between generating different modes. 
We simulate such instabilities and compare them\nagainst successful GANs in Fig. 2. The gap goes to zero for all stable models after convergence to\nthe true data distribution. Conversely, unstable training is re\ufb02ected both in terms of the large value\nreached by DG as well as its trend over iterations (e.g., large oscillations and an increasing trend\nindicate unstable behavior). Thus the duality gap is a powerful tool for monitoring the training and\ndetecting unstable collapse.\n\nMinimax loss re\ufb02ects sample quality. As previously argued, another useful metric to look at is\nthe minimax loss which focuses solely on the generator. For the toy datasets, we measure the sample\nquality using (i) the number of covered modes and (ii) the number of generated samples that fall\nwithin 3 standard deviations of the modes (std). Fig. 3 shows signi\ufb01cant anti-correlation, which\nindicates the minimax loss can be used as a proxy for determining the overall sample quality.\n\n4.2 Duality gap and stable mode collapse\n\nThe previous experiment shows how unstable mode col-\nlapse is captured by DG - the trend is unstable and is\ntypically within a high range. We are now interested in\nthe case of stable mode collapse, where the model does\nconverge, but only to a subset of the modes.\n\nWe train a GAN on MNIST where the generator collapses\nto generating from only one class and does not change\nthe mode as the number of training steps increases. Fig. 4\nshows the DG curve. The trend is \ufb02at and stable, but the\nvalue of DG is not zero, thus showing that looking at both\nthe trend and value of the DG is helpful for detecting stable mode collapse as well.\n\nFigure 4: DG evolution detects stable mode\ncollapse.\n\n4.3 The trend of the duality gap progression curve\n\nWe now analyze the trend of the DG curves. The plots (see Fig. 2) show it does not always\nmonotonically decrease throughout training as we do observe non-smooth spikes. 
This raises the question of whether these spikes are the result of instabilities in the training or of the metric itself. To address this, we train a GAN on a 2D submanifold Gaussian mixture embedded in 3D space. Such a setting captures a commonly encountered GAN failure, as this mixture is degenerate with respect to the base measure defined in ambient space due to the lack of fully dimensional support.

It has been shown [29] that an unregularized GAN collapses in every run after 50K iterations (see Fig. 5, right) because the discriminator focuses on ever smaller differences between the true and generated samples, whereas the training of a regularized version can essentially avoid collapsing even well beyond 200K iterations (see Fig. 5, left). Thus we would expect the DG curves for the two settings to show different trends, which is indeed what we observe. For the unregularized version, DG decreases to small values and is stable until 40K steps, after which it starts increasing, reflecting the collapse of the generator. A practitioner looking at the DG curve can thus learn that (i) the training should be stopped between 20-40K steps and (ii) there is a collapse in the training (information that is especially valuable when the generated data is of non-image type or cannot be visualised). Conversely, the DG trend for the regularized version is stable and very quickly converges to values close to zero, which reflects the improved quality. This suggests that the usage of DG opens avenues to further understand and compare the effects of various regularizers.

Figure 5: DG progression for a regularized (left) and unregularized GAN (right). Generated samples are shown for various steps. DG is able to capture instabilities (see green box). 
Note that in [29] different\nlevels of the regularizer were only visually compared due to the lack of a proper metric.\n\n4.4 Comparison with image-speci\ufb01c criteria\n\nWe further analyze the sensitivity of the minimax loss to various changes in the sample quality for\nnatural images that fall broadly in two categories: (i) mode sensitivity and (ii) visual sample quality.\nWe compare against the commonly used Inception Score (INC) and Frechet Inception Distance (FID).\nBoth metrics use the generator as a black-box through sampling. We follow the same setup for the\nevaluation of minimax and use the GAN zero-sum objective. Note that changing the objective to\nWGAN formulation makes it closely related to the Wasserstein critic [2].\n\nSensitivity to modes. As natural images are inherently multimodal, the generated distribution\ncommonly ignores some of the true modes, which is a phenomenon known as mode dropping.\nAnother common issue is intra-mode collapse that occurs when the generator is generating from all\nmodes, but there is no variety within a mode. We then turn to mode invention where the generator\ncreates non-existent modes. Fig. 18 in the Appendix shows the trends of all metrics for various\ndegrees of mode dropping, invention and intra-mode collapse, where a class label is considered a\nmode. INC is unable to detect both intra-mode collapse and invented modes. On the other hand, both\nFID and minimax loss exhibit desirable sensitivity to various mode changing.\n\nSample quality. We study the metrics\u2019 ability to detect compromised sample quality by distorting\nreal images using Gaussian noise, blur and swirl at an increasing intensity. As shown in Fig. 19 in the\nAppendix, all metrics, including minimax, detect different degrees of visual sample quality.\n\nIn Appendix C.3.1 we also show that the metric is computationally ef\ufb01cient to be used in practice.\nTab. 
1 gives a summary of all the results.

Property \ Metric | INC | FID | minimax
Sensitivity to mode collapse | moderate | high | high
Sensitivity to mode invention | low | high | high
Sensitivity to intra-mode collapse | low | high | high
Sensitivity to visual quality and transformations | moderate | high | high
Computational: Fast | yes | yes | yes
Computational: Needs labeled data or a pretrained classifier | yes | yes | no
Computational: Can be applied to any domain without change | no | no | yes

Table 1: Comparison of INC, FID and minimax on various properties.

4.5 DG as a measure for the image domain

The previous section illustrates that the metrics have desirable properties that make them effective in capturing different failure modes in terms of image-specific criteria. Since generating images is one of the most common use cases of GANs, we further explore the usefulness of the DG on generating faces through a ProgGAN trained on CelebA. Figure 6 shows that unlike the GAN losses, the DG trend can capture the progress, which is also in agreement with the trend of the largest singular values of the convolutional layers of G and D.

Figure 6: ProgGAN trained on CelebA: (left) losses vs. DG; (right) largest singular values of the conv layers

4.6 Generalization to other domains and GAN losses

These experiments test the ability of the two metrics to adapt to a different GAN loss formulation (WGAN-GP [11] and SeqGAN [34]), as well as other domains (cosmology, audio, text).

N-body simulations in cosmology. We consider the field of observational cosmology, which relies on computationally expensive simulations with very different statistics from natural images. In an attempt to reduce this burden, [28] trained a WGAN-GP to replace the traditional N-body simulators, relying on three statistics to assess the quality of the generated samples: mass histogram, peak count and power spectral density. 
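Agreement with each such summary statistic is scored by the squared norm of the difference between the statistic on real and on generated samples (lower is better). A trivial sketch of this scoring, assuming the statistics are precomputed vectors (the function name is ours):

```python
import numpy as np

def agreement_score(stat_real, stat_gen):
    """Squared L2 norm between a summary statistic (e.g. a mass histogram,
    peak count vector, or PSD) computed on real and on generated samples;
    lower scores indicate better agreement."""
    d = np.asarray(stat_real, dtype=float) - np.asarray(stat_gen, dtype=float)
    return float(np.dot(d, d))

perfect = agreement_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # exact match
```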
A random selection of real and generated samples shown in Fig. 21 in the Appendix demonstrates the high visual quality achieved by the generator.

We evaluate the agreement between the statistics of the real and generated samples using the squared norm of the statistics differences (lower scores are therefore better). In Fig. 7, we show the evolution of the scores corresponding to the three statistics as well as DG. We observe a strong correlation, especially between the peaks. Furthermore, it seems that the duality gap takes all the statistics into account. In Tab. 2, we observe a strong empirical correlation between the duality gap, the minimax value and the cosmological scores. We also observe that the FID is significantly less correlated than the duality gap, which is explained by the fact that the images we use are not natural images, and hence the statistics of the Inception Network are not suitable to evaluate the quality of cosmological samples.

Metric \ Score | PSD | Mass hist. | Peak hist.
Dual gap | 0.71 | 0.53 | 0.38
Minmax value | 0.75 | 0.66 | 0.51
FID | 0.34 | 0.40 | 0.21

Table 2: Pearson correlation between cosmological scores (mass histogram, peak histogram and Power Spectral Density (PSD)) and metrics (dual gap, minimax and FID).

Figure 7: DG and cosmo-score evolution. DG strongly correlates with all 3 scores.

Audio Time-Frequency consistency. Generating an audio waveform is a challenging problem as it requires an agreement between scales from the range of milliseconds to tens of seconds. To overcome this challenge, one may use more powerful and intuitive features such as a Time-Frequency (TF) representation. Nevertheless, in order to obtain a natural listening experience, a TF representation needs to be consistent, i.e., there must exist an audio signal which leads to this TF representation. [19] define a measure estimating the consistency of a representation and use it to assess the convergence of their GAN. 
In Fig. 8, we show the evolution of the consistency measure and of DG and minimax. We observe a clear correlation, especially with the minimax value, which is expected as both the consistency measure and minimax evaluate only the generated samples.

Text generation. Another challenging modality for generative models is text. SeqGAN [34] is a GAN-based sequence framework evaluated by negative log-likelihood: nll-oracle and nll-test. The first metric computes the likelihood of generated data against an oracle, whereas the second makes use of the generator and the likelihood of real test data. Fig. 9 shows that DG and minimax correlate well with both metrics. Moreover, both DG and minimax can determine the optimal stopping point.

Figure 8: Evolution of the consistency measure vs. the DG.

Figure 9: Evolution of nll-oracle and nll-test (left) vs. DG and minimax (right). The dashed line represents the optimal stopping point.

4.7 Comparison of different models

So far, we have seen that our estimation of DG generalizes to various loss functions and data distributions. We now turn our attention to evaluating how well it generalizes to models from different classes (i.e., neural networks with different architectures). We propose to carry out this evaluation by taking the original GAN minimax objective as a reference. The reasoning behind this choice is that a better discriminator would always be fooled less by a worst-case generator, irrespective of how it was trained.
The analogue holds for the generator as well, resulting in a lower DG/minimax value on a selected objective.

Table 3: Pearson correlation between FID and INC with DG and minimax (mm) for different GAN variants.

                INC/mm   FID/mm   INC/DG   FID/DG
NS              -0.87    0.88     -0.34    0.32
NS + SN         -0.91    0.85     -0.93    0.93
NS + SN + GP    -0.95    0.94     -0.91    0.94
WGAN + GP        0.91    0.90     -0.87    0.87

Table 4: Scores and ranking for various GAN models using different metrics. The final row gives the computation time in seconds.

                  FID           INC           mm            DG
                  score  rank   score  rank   score  rank   score  rank
NS                28.65  3      7.1    3      -1.07  3      2.36   3
NS + SN           22.25  1      7.78   1      -1.31  1      0.21   1
NS + SN + GP      23.46  2      7.75   2      -1.30  2      0.23   2
WGAN + GP         72.23  4      4.93   4      -0.2   4      4.53   4
compute time (s)  120.50        47.33         27.22         7.38

To that end, we compare different commonly used ResNet-based GAN variants on Cifar10: a GAN using the non-saturating update rule (NS), a spectrally normalized GAN (NS + SN), a spectrally normalized GAN with the addition of a gradient penalty (NS + SN + GP) and a WGAN with gradient penalty (WGAN + GP). We use the optimal hyperparameters suggested in [16]. Table 3 shows the Pearson correlation between DG and the minimax metric against FID and the Inception score (INC), which are known to work well on this dataset. We find that the minimax metric always correlates highly with both FID and INC. This is expected as (i) minimax evaluates the generator only, just as FID and INC, which do not take the discriminator into consideration, and (ii) as previously shown, minimax is sensitive to mode changes and sample quality. Interestingly, DG also correlates highly whenever the discriminator is properly regularized. The level of correlation is, however, reduced for the unregularized variants.
We hypothesise this is due to instabilities in the training, during which the generator might be improving while the discriminator is becoming worse.

Tab. 4 shows that the ranking of the models is the same for all four metrics, suggesting that DG/minimax is a sensible choice for comparing different GAN models.

5 Conclusion

We propose a domain agnostic evaluation measure for GANs that relies on the duality gap (DG) and upper bounds the JS divergence between real and generated data. This measure allows for meaningful monitoring of the progress made during training, which was lacking until now. We demonstrate that DG and its minimax part are able to detect various GAN failure modes (stable and unstable mode collapse, divergence, sample quality, etc.) and rank different models successfully. These metrics thus address two problems commonly faced by practitioners: 1) when should one stop training? and 2) if the training procedure has converged, have we reached the optimum? Finally, a significant advantage of these metrics is that, unlike many existing approaches, they require no labelled data and no domain specific classifier. They are therefore well-suited for applications of GANs other than the traditional generation task for images.

Acknowledgements. The authors would like to thank August DuMont Schütte for helping with the ProgGAN experiment and Gokula Krishnan Santhanam and Gary Bécigneul for useful discussions.

References

[1] A tensorflow implementation of "deep convolutional generative adversarial networks". https://github.com/carpedm20/DCGAN-tensorflow. Accessed: 2018-11-16.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[3] Ali Borji. Pros and cons of GAN evaluation measures. arXiv preprint arXiv:1802.03446, 2018.

[4] Xu Chen, Jiang Wang, and Hao Ge. Training generative adversarial networks via primal-dual subgradient methods: A Lagrangian perspective on GAN. arXiv preprint arXiv:1802.01765, 2018.

[5] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. International Conference on Learning Representations (ICLR), 2018.

[6] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

[7] Mevlana Gemici, Zeynep Akata, and Max Welling. Primal-dual Wasserstein GAN. arXiv preprint arXiv:1805.09575, 2018.

[8] Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial nets. International Conference on Learning Representations (ICLR), 2019.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[10] Paulina Grnarova, Kfir Y Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause. An online learning approach to generative adversarial networks. International Conference on Learning Representations (ICLR), 2018.

[11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[13] Ya-Ping Hsieh, Chen Liu, and Volkan Cevher. Finding mixed Nash equilibria of generative adversarial networks. arXiv preprint arXiv:1811.02002, 2018.

[14] Nikos Komodakis and Jean-Christophe Pesquet. Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine, 32(6):31–54, 2015.

[15] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.

[16] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The GAN landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.

[17] Yujia Li, Alexander Schwing, Kuan-Chieh Wang, and Richard Zemel. Dualing GANs. In Advances in Neural Information Processing Systems, pages 5606–5616, 2017.

[18] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.

[19] Andrés Marafioti, Nicki Holighaus, Nathanaël Perraudin, and Piotr Majdak. Adversarial generation of time-frequency features with application in audio synthesis. arXiv preprint arXiv:1902.04072, 2019.

[20] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3478–3487, 2018.

[21] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[22] John F Nash et al. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950.

[23] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming.
SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[24] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 271–279. 2016.

[25] Frans A Oliehoek, Rahul Savani, Jose Gallego, Elise van der Pol, and Roderich Groß. Beyond local Nash equilibria for adversarial networks. arXiv preprint arXiv:1806.07268, 2018.

[26] Catherine Olsson, Surya Bhupatiraju, Tom Brown, Augustus Odena, and Ian Goodfellow. Skill rating for generative models. arXiv preprint arXiv:1808.04888, 2018.

[27] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[28] Andres C. Rodriguez, Tomasz Kacprzak, Aurelien Lucchi, Adam Amara, Raphael Sgier, Janis Fluri, Thomas Hofmann, and Alexandre Réfrégier. Fast cosmic web simulations with generative adversarial networks. arXiv preprint arXiv:1801.09070, 2018.

[29] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.

[30] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pages 5228–5237, 2018.

[31] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[32] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models.
arXiv preprint arXiv:1511.01844, 2015.

[33] J Von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.

[34] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[35] Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018.