{"title": "AdaGAN: Boosting Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5424, "page_last": 5433, "abstract": "Generative Adversarial Networks (GAN) are an effective method for training generative models of complex data such as natural images. However, they are notoriously hard to train and can suffer from the problem of missing modes where the model is not able to produce examples in certain regions of the space. We propose an iterative procedure, called AdaGAN, where at every step we add a new component into a mixture model by running a GAN algorithm on a re-weighted sample. This is inspired by boosting algorithms, where many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. We prove analytically that such an incremental procedure leads to convergence to the true distribution in a finite number of steps if each step is optimal, and convergence at an exponential rate otherwise. We also illustrate experimentally that this procedure addresses the problem of missing modes.", "full_text": "AdaGAN: Boosting Generative Models\n\nIlya Tolstikhin\n\nMPI for Intelligent Systems\n\nT\u00fcbingen, Germany\nilya@tue.mpg.de\n\nSylvain Gelly\nGoogle Brain\n\nZ\u00fcrich, Switzerland\n\nsylvaingelly@google.com\n\nOlivier Bousquet\n\nGoogle Brain\n\nZ\u00fcrich, Switzerland\n\nobousquet@google.com\n\nCarl-Johann Simon-Gabriel\nMPI for Intelligent Systems\n\nT\u00fcbingen, Germany\n\ncjsimon@tue.mpg.de\n\nBernhard Sch\u00f6lkopf\n\nMPI for Intelligent Systems\n\nT\u00fcbingen, Germany\nbs@tue.mpg.de\n\nAbstract\n\nGenerative Adversarial Networks (GAN) are an effective method for training\ngenerative models of complex data such as natural images. However, they are\nnotoriously hard to train and can suffer from the problem of missing modes where\nthe model is not able to produce examples in certain regions of the space. 
We\npropose an iterative procedure, called AdaGAN, where at every step we add a new\ncomponent into a mixture model by running a GAN algorithm on a re-weighted\nsample. This is inspired by boosting algorithms, where many potentially weak\nindividual predictors are greedily aggregated to form a strong composite predictor.\nWe prove analytically that such an incremental procedure leads to convergence\nto the true distribution in a \ufb01nite number of steps if each step is optimal, and\nconvergence at an exponential rate otherwise. We also illustrate experimentally\nthat this procedure addresses the problem of missing modes.\n\n1\n\nIntroduction\n\nImagine we have a large corpus, containing unlabeled pictures of animals, and our task is to build a\ngenerative probabilistic model of the data. We run a recently proposed algorithm and end up with a\nmodel which produces impressive pictures of cats and dogs, but not a single giraffe. A natural way to\n\ufb01x this would be to manually remove all cats and dogs from the training set and run the algorithm on\nthe updated corpus. The algorithm would then have no choice but to produce new animals and, by\niterating this process until there\u2019s only giraffes left in the training set, we would arrive at a model\ngenerating giraffes (assuming suf\ufb01cient sample size). At the end, we aggregate the models obtained\nby building a mixture model. Unfortunately, the described meta-algorithm requires manual work for\nremoving certain pictures from the unlabeled training set at every iteration.\nLet us turn this into an automatic approach, and rather than including or excluding a picture, put\ncontinuous weights on them. To this end, we train a binary classi\ufb01er to separate \u201ctrue\u201d pictures of\nthe original corpus from the set of \u201csynthetic\u201d pictures generated by the mixture of all the models\ntrained so far. 
We would expect the classifier to make confident predictions for the true pictures of animals missed by the model (giraffes), because there are no synthetic pictures nearby to be confused with them. By a similar argument, the classifier should make less confident predictions for the true pictures containing animals already generated by one of the trained models (cats and dogs). For each picture in the corpus, we can thus use the classifier's confidence to compute a weight which we use for that picture in the next iteration, to be performed on the re-weighted dataset.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

ALGORITHM 1 AdaGAN, a meta-algorithm to construct a "strong" mixture of T individual generative models (e.g. GANs), trained sequentially.
Input: Training sample S_N := {X_1, . . . , X_N}.
Output: Mixture generative model G = G_T.
  Train vanilla GAN G_1 = GAN(S_N, W_1) with a uniform weight W_1 = (1/N, . . . , 1/N) over the training points
  for t = 2, . . . , T do
    # Choose the overall weight of the next mixture component
    β_t = ChooseMixtureWeight(t)
    # Update the weight of each training example
    W_t = UpdateTrainingWeights(G_{t-1}, S_N, β_t)
    # Train the t-th "weak" component generator G^c_t
    G^c_t = GAN(S_N, W_t)
    # Update the overall generative model: form a mixture of G_{t-1} and G^c_t
    G_t = (1 − β_t) G_{t-1} + β_t G^c_t
  end for

The present work provides a principled way to perform this re-weighting, with theoretical guarantees showing that the resulting mixture models indeed approach the true data distribution.1
Before discussing how to build the mixture, let us consider the question of building a single generative model. A recent trend in modelling high dimensional data such as natural images is to use neural networks [1, 2].
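In code, the control flow of Algorithm 1 is a short loop. The sketch below is illustrative only, not the authors' implementation: `train_gan` and `update_weights` are hypothetical stand-ins for the base GAN trainer and for the re-weighting rule of Section 3.

```python
import numpy as np

def adagan(samples, T, train_gan, update_weights):
    """Sketch of Algorithm 1: sequentially train T 'weak' generators and
    aggregate them into a mixture, using the beta_t = 1/t heuristic of Section 3."""
    N = len(samples)
    w = np.full(N, 1.0 / N)                  # W_1: uniform over training points
    generators = [train_gan(samples, w)]     # G_1: vanilla GAN
    mix_weights = [1.0]
    for t in range(2, T + 1):
        beta = 1.0 / t                       # ChooseMixtureWeight(t)
        w = update_weights(generators, mix_weights, samples, beta)
        generators.append(train_gan(samples, w))   # weak component G^c_t
        # G_t = (1 - beta_t) G_{t-1} + beta_t G^c_t
        mix_weights = [a * (1.0 - beta) for a in mix_weights] + [beta]
    return generators, mix_weights
```

Note that with β_t = 1/t the final mixture assigns equal weight 1/T to every component.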
One popular approach is Generative Adversarial Networks (GAN) [2], where the generator is trained adversarially against a classifier, which tries to differentiate the true from the generated data. While the original GAN algorithm often produces realistic-looking data, several issues were reported in the literature, among which the missing modes problem, where the generator converges to only one or a few modes of the data distribution, thus not providing enough variability in the generated data. This seems to match the situation described earlier, which is why we will most often illustrate our algorithm with a GAN as the underlying base generator. We call it AdaGAN, for Adaptive GAN, but we could actually use any other generator: a Gaussian mixture model, a VAE [1], a WGAN [3], or even an unrolled [4] or mode-regularized GAN [5], which were both already specifically developed to tackle the missing modes problem. Thus, we do not aim at improving the original GAN or any other generative algorithm. We rather propose and analyse a meta-algorithm that can be used on top of any of them. This meta-algorithm is similar in spirit to AdaBoost in the sense that each iteration corresponds to learning a "weak" generative model (e.g., GAN) with respect to a re-weighted data distribution. The weights change over time to focus on the "hard" examples, i.e. those that the mixture has not been able to properly generate so far.
Related Work Several authors [6, 7, 8] have proposed to use boosting techniques in the context of density estimation by incrementally adding components in the log domain. This idea was applied to GANs in [8].
A major downside of these approaches is that the resulting mixture is a product of components, and sampling from such a model is nontrivial (at least when applied to GANs, where the model density is not expressed analytically) and requires techniques such as Annealed Importance Sampling [9] for the normalization.
When the log likelihood can be computed, [10] proposed to use an additive mixture model. They derived the update rule by computing the steepest descent direction when adding a component with infinitesimal weight. However, their results do not apply once the weight β becomes non-infinitesimal. In contrast, for any fixed weight of the new component our approach gives the overall optimal update (rather than just the best direction) for a specified f-divergence. In both theories, improvements of the mixture are guaranteed only if the new "weak" learner is still good enough (see Conditions 10 and 11).
Similarly, [11] studied the construction of mixtures minimizing the Kullback divergence and proposed a greedy procedure for doing so. They also proved that, under certain conditions, finite mixtures can approximate arbitrary mixtures at a rate 1/k, where k is the number of components in the mixture, when the weight of each newly added component is 1/k. These results are specific to the Kullback divergence but are consistent with our more general results.
An additive procedure similar to ours was proposed in [12], but with a different re-weighting scheme, which is not motivated by a theoretical analysis of optimality conditions.
On every new iteration the authors run a GAN on the k training examples with the largest values of the discriminator from the previous iteration.

1Note that the term "mixture" should not be interpreted to imply that each component models only one mode: the models to be combined into a mixture can themselves cover multiple modes.

Finally, many papers investigate completely different approaches for addressing the same issue by directly modifying the training objective of an individual GAN. For instance, [5] add an autoencoding cost to the training objective of GAN, while [4] allow the generator to "look a few steps ahead" when making a gradient step.
The paper is organized as follows. In Section 2 we present our main theoretical results regarding iterative optimization of mixture models under general f-divergences. In Section 2.4 we show that if optimization at each step is perfect, the process converges to the true data distribution at an exponential rate (or even in a finite number of steps, for which we provide a necessary and sufficient condition). Then we show in Section 2.5 that imperfect solutions still lead to the exponential rate of convergence under certain "weak learnability" conditions. These results naturally lead to a new boosting-style iterative procedure for constructing generative models. When used with GANs, it results in our AdaGAN algorithm, detailed in Section 3. Finally, we report initial empirical results in Section 4, where we compare AdaGAN with several benchmarks, including the original GAN and a uniform mixture of multiple independently trained GANs.
Some of the new theoretical results are stated without proofs, which can be found in the appendices.

2 Minimizing f-divergence with Mixtures

2.1 Preliminaries and notations

Generative Density Estimation In density estimation, one tries to approximate a real data distribution P_d, defined over the data space X, by a model distribution P_model. In the generative approach one builds a function G : Z → X that transforms a fixed probability distribution P_Z (often called the noise distribution) over a latent space Z into a distribution over X. Hence P_model is the pushforward of P_Z, i.e. P_model(A) = P_Z(G^{−1}(A)). With this approach it is in general impossible to compute the density dP_model(x) and the log-likelihood of the training data under the model, but one can easily sample from P_model by sampling from P_Z and applying G. Thus, to construct G, instead of comparing P_model directly with P_d, one compares their samples. To do so, one uses a similarity measure D(P_model ‖ P_d) which can be estimated from samples of those distributions, and thus approximately minimized over a class G of functions.
f-Divergences In order to measure the agreement between the model distribution and the true distribution we will use an f-divergence defined in the following way:

    D_f(Q ‖ P) := ∫ f( dQ/dP (x) ) dP(x)    (1)

for any pair of distributions P, Q with densities dP, dQ with respect to some dominating reference measure μ (we refer to Appendix D for more details about such divergences and their domain of definition). Here we assume that f is convex, defined on (0, ∞), and satisfies f(1) = 0. We will denote by F the set of such functions.2
As demonstrated in [16, 17], several commonly used symmetric f-divergences are Hilbertian metrics, which in particular means that their square root satisfies the triangle inequality.
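For intuition, definition (1) can be evaluated in closed form for discrete distributions. The sketch below is an illustration (not from the paper); it assumes full-support distributions and natural logarithms, uses the f functions of footnote 2, and checks numerically that the square root of the Jensen-Shannon divergence behaves like a metric:

```python
import numpy as np

def f_divergence(f, q, p):
    """Discrete case of Eq. (1): D_f(Q || P) = sum_x f(q(x)/p(x)) * p(x)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(f(q / p) * p))

f_kl = lambda x: x * np.log(x)                                    # Kullback-Leibler
f_js = lambda x: -(x + 1) * np.log((x + 1) / 2) + x * np.log(x)   # Jensen-Shannon

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
r = np.array([0.4, 0.4, 0.2])

js = lambda a, b: f_divergence(f_js, a, b)
assert f_divergence(f_kl, p, p) == 0.0    # f(1) = 0, so D_f(P || P) = 0
assert abs(js(p, q) - js(q, p)) < 1e-12   # JS is symmetric
# sqrt(JS) satisfies the triangle inequality (Hilbertian metric property):
assert np.sqrt(js(p, q)) <= np.sqrt(js(p, r)) + np.sqrt(js(r, q))
```

The last assertion illustrates, for one triple of distributions, the Hilbertian property used throughout Section 2.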
This is true for the Jensen-Shannon divergence3, the Hellinger distance and the Total Variation among others. We will denote by F_H the set of functions f such that D_f is a Hilbertian metric.
GAN and f-divergences The original GAN algorithm [2] optimizes the following criterion:

    min_G max_D  E_{P_d}[log D(X)] + E_{P_Z}[log(1 − D(G(Z)))],    (2)

where D and G are two functions represented by neural networks. This optimization is performed on a pair of samples (a training sample from P_d and a "fake" sample from P_Z), which corresponds to approximating the above criterion by using the empirical distributions. In the non-parametric limit for D, this is equivalent to minimizing the Jensen-Shannon divergence [2]. This point of view can be generalized to any other f-divergence [13]. Because of this strong connection between adversarial training of generative models and minimization of f-divergences, we cast the results of this section into the context of general f-divergences.

2Examples of f-divergences include the Kullback-Leibler divergence (obtained for f(x) = x log x) and the Jensen-Shannon divergence (f(x) = −(x + 1) log((x + 1)/2) + x log x). Other examples can be found in [13]. For further details we refer to Section 1.3 of [14] and [15].

3which means such a property can be used in the context of the original GAN algorithm.

Generative Mixture Models In order to model complex data distributions, it can be convenient to use a mixture model of the following form: P^T_model := Σ_{i=1}^T α_i P_i, where α_i ≥ 0, Σ_i α_i = 1, and each of the T components is a generative density model.
This is natural in the generative context, since sampling from a mixture corresponds to a two-step sampling, where one first picks the mixture component (according to the multinomial distribution with parameters α_i) and then samples from it. Also, this allows us to construct complex models from simpler ones.

2.2 Incremental Mixture Building

We restrict ourselves to the case of f-divergences and assume that, given an i.i.d. sample from any unknown distribution P, we can construct a simple model Q ∈ G which approximately minimizes4

    min_{Q ∈ G} D_f(Q ‖ P).    (3)

Instead of modelling the data with a single distribution, we now want to model it with a mixture of distributions P_i, where each P_i is obtained by a training procedure of the form (3) with (possibly) different target distributions P for each i. A natural way to build a mixture is to do it incrementally: we train the first model P_1 to minimize D_f(P_1 ‖ P_d) and set the corresponding weight to α_1 = 1, leading to P^1_model = P_1. Then, after having trained t components P_1, . . . , P_t ∈ G, we can form the (t + 1)-st mixture model by adding a new component Q with weight β as follows:

    P^{t+1}_model := Σ_{i=1}^{t} (1 − β) α_i P_i + β Q,    (4)

where β ∈ [0, 1] and Q ∈ G is computed by minimizing

    min_Q D_f((1 − β)P_g + βQ ‖ P_d),    (5)

where we denoted P_g := P^t_model the current generative mixture model before adding the new component. We do not expect to find the optimal Q that minimizes (5) at each step, but we aim at constructing some Q that slightly improves our current approximation of P_d, i.e. such that for some c < 1

    D_f((1 − β)P_g + βQ ‖ P_d) ≤ c · D_f(P_g ‖ P_d).    (6)

This greedy approach has a significant drawback in practice.
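Before turning to that drawback, note that the two-step sampling and the incremental update (4) are straightforward to implement. A toy sketch (illustrative only; 1-D Gaussian lambdas stand in for trained generators):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(generators, alphas, n):
    """Two-step sampling from sum_i alpha_i P_i: first pick a component index
    from Multinomial(alphas), then sample from that component."""
    idx = rng.choice(len(generators), size=n, p=alphas)
    return np.array([generators[i]() for i in idx])

def add_component(alphas, beta):
    """Incremental update (4): old weights shrink by (1 - beta), new one gets beta."""
    return [a * (1.0 - beta) for a in alphas] + [beta]

g1 = lambda: rng.normal(-5.0, 1.0)     # component covering the left mode
g2 = lambda: rng.normal(+5.0, 1.0)     # new component covering the right mode
alphas = add_component([1.0], beta=0.3)          # -> [0.7, 0.3]
x = sample_mixture([g1, g2], alphas, n=2000)
assert abs(np.mean(x < 0) - 0.7) < 0.05          # ~70% of draws come from g1
```

The final assertion checks that the empirical component frequencies match the mixture weights, i.e. the two-step scheme indeed samples from (4).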
As we build up the mixture, we need to make β decrease (as P^t_model approximates P_d better and better, one should make the correction at each step smaller and smaller). Since we are approximating (5) using samples from both distributions, this means that a sample from the mixture will only contain a fraction β of examples from Q. So, as t increases, getting meaningful information from a sample so as to tune Q becomes harder and harder (the information is "diluted"). To address this issue, we propose to optimize an upper bound on (5) which involves a term of the form D_f(Q ‖ R) for some distribution R, which can be computed as a re-weighting of the original data distribution P_d. This procedure is reminiscent of the AdaBoost algorithm [18], which combines multiple weak predictors into one strong composite predictor. On each step AdaBoost adds a new predictor to the current composition, trained to minimize the binary loss on the re-weighted training set. The weights are constantly updated to bias the next weak learner towards the "hard" examples, which were incorrectly classified during previous stages.
In the following we will analyze the properties of (5) and derive upper bounds that provide practical optimization criteria for building the mixture.
We will also show that, under certain assumptions, the minimization of the upper bound leads to the optimum of the original criterion.

2.3 Upper Bounds

We provide two upper bounds on the divergence of the mixture in terms of the divergence of the additive component Q with respect to some reference distribution R.

4One example of such a setting is running GANs.

Lemma 1 Given two distributions P_d, P_g and some β ∈ [0, 1], then, for any Q and R, and f ∈ F_H:

    √(D_f((1 − β)P_g + βQ ‖ P_d)) ≤ √(β D_f(Q ‖ R)) + √(D_f((1 − β)P_g + βR ‖ P_d)).    (7)

If, more generally, f ∈ F, but β dR ≤ dP_d, then:

    D_f((1 − β)P_g + βQ ‖ P_d) ≤ β D_f(Q ‖ R) + (1 − β) D_f( P_g ‖ (P_d − βR)/(1 − β) ).    (8)

We can thus exploit those bounds by introducing some well-chosen distribution R and then minimizing them with respect to Q. A natural choice for R is a distribution that minimizes the last term of the upper bound (which does not depend on Q). Our main result indicates the shape of the distributions minimizing the right-most terms in those bounds.

Theorem 1 For any f-divergence D_f, with f ∈ F and f differentiable, any fixed distributions P_d, P_g, and any β ∈ (0, 1], the minimizer of (5) over all probability distributions Q has the density

    dQ*_β(x) = (1/β) (λ* dP_d(x) − (1 − β) dP_g(x))_+ = (dP_d(x)/β) (λ* − (1 − β) dP_g/dP_d (x))_+    (9)

for the unique λ* ∈ [β, 1] satisfying ∫ dQ*_β = 1. Also, λ* = 1 if and only if P_d((1 − β)dP_g > dP_d) = 0, which is equivalent to β dQ*_β = dP_d − (1 − β)dP_g.

Theorem 2 Given two distributions P_d, P_g and some β ∈ (0, 1], assume P_d(dP_g = 0) < β. Let f ∈ F. The problem

    min_{Q : β dQ ≤ dP_d} D_f( P_g ‖ (P_d − βQ)/(1 − β) )

has a solution with the density dQ†_β(x) = (1/β) (dP_d(x) − λ†(1 − β) dP_g(x))_+ for the unique λ† ≥ 1 that satisfies ∫ dQ†_β = 1.

Surprisingly, in both Theorems 1 and 2 the solutions do not depend on the choice of the function f, which means that the solution is the same for any f-divergence5. Note that λ* is implicitly defined by a fixed-point equation. In Section 3 we will show how it can be computed efficiently in the case of empirical distributions.

2.4 Convergence Analysis for Optimal Updates

In the previous section we derived analytical expressions for the distributions R minimizing the last terms of the upper bounds (7) and (8). Assuming Q can perfectly match R, i.e. D_f(Q ‖ R) = 0, we are now interested in the convergence of the mixture (4) to the true data distribution P_d when Q = Q*_β or Q = Q†_β. We start with simple results showing that adding Q*_β or Q†_β to the current mixture would yield a strict improvement of the divergence.

Lemma 2 (Property (6): exponential improvements) Under the conditions of Theorem 1, we have

    D_f((1 − β)P_g + βQ*_β ‖ P_d) ≤ D_f((1 − β)P_g + βP_d ‖ P_d) ≤ (1 − β) D_f(P_g ‖ P_d).

Under the conditions of Theorem 2, we have

    D_f( P_g ‖ (P_d − βQ†_β)/(1 − β) ) ≤ D_f(P_g ‖ P_d)  and  D_f((1 − β)P_g + βQ†_β ‖ P_d) ≤ (1 − β) D_f(P_g ‖ P_d).

Imagine repeatedly adding T new components to the current mixture P_g, where on every step we use the same weight β and choose the components described in Theorem 1. In this case Lemma 2 guarantees that the original objective value D_f(P_g ‖ P_d) would be reduced at least to (1 − β)^T D_f(P_g ‖ P_d).

5in particular, by replacing f with f°(x) := x f(1/x), we get the same solution for the criterion written in the other direction. Hence the order in which we write the divergence does not matter and the optimal solution is optimal for both orders.

This exponential rate of convergence, which at first may look surprisingly good, is simply explained by the fact that Q*_β depends on the true distribution P_d, which is of course unknown. Lemma 2 also suggests setting β as large as possible, since we assume we can compute the optimal mixture component (which for β = 1 is P_d). However, in practice we may prefer to keep β relatively small, preserving what we learned so far through P_g: for instance, when P_g already covered part of the modes of P_d and we want Q to cover the remaining ones.
We provide further discussions on choosing β in Section 3.

2.5 Weak to Strong Learnability

In practice the component Q that we add to the mixture is not exactly Q*_β or Q†_β, but rather an approximation to them. In this section we show that if this approximation is good enough, then we retain the exponential improvement property (6).
Looking again at Lemma 1, we notice that the second upper bound is less tight than the first one. Indeed, take the optimal distributions provided by Theorems 1 and 2 and plug them back as R into the upper bounds of Lemma 1. Also assume that Q can match R exactly, i.e. D_f(Q ‖ R) = 0. In this case both sides of (7) are equal to D_f((1 − β)P_g + βQ*_β ‖ P_d), which is the optimal value for the original objective (5). On the other hand, (8) does not become an equality and its r.h.s. is not the optimal value of (5). However, earlier we agreed that our aim is to reach the modest goal (6), and next we show that this is indeed possible. Corollaries 1 and 2 provide sufficient conditions for strict improvements when we use the upper bounds (8) and (7), respectively.

Corollary 1 Given P_d, P_g, and some β ∈ (0, 1], assume P_d(dP_g/dP_d = 0) < β. Let Q†_β be as defined in Theorem 2. If Q is such that

    D_f(Q ‖ Q†_β) ≤ γ D_f(P_g ‖ P_d)    (10)

for γ ∈ [0, 1], then D_f((1 − β)P_g + βQ ‖ P_d) ≤ (1 − β(1 − γ)) D_f(P_g ‖ P_d).

Corollary 2 Let f ∈ F_H. Take any β ∈ (0, 1], P_d, P_g, and let Q*_β be as defined in Theorem 1.
If Q is such that

    D_f(Q ‖ Q*_β) ≤ γ D_f(P_g ‖ P_d)    (11)

for some γ ∈ [0, 1], then D_f((1 − β)P_g + βQ ‖ P_d) ≤ C_{γ,β} · D_f(P_g ‖ P_d), where C_{γ,β} = (√(γβ) + √(1 − β))² is strictly smaller than 1 as soon as γ < β/4 (and β > 0).

Conditions 10 and 11 may be compared to the "weak learnability" condition of AdaBoost. As long as our weak learner is able to solve the surrogate problem (3) of matching respectively Q†_β or Q*_β accurately enough, the original objective (5) is guaranteed to decrease as well. It should however be noted that Condition 11 with γ < β/4 is perhaps too strong to call it "weak learnability". Indeed, as already mentioned, the weight β is expected to decrease to zero as the number of components in the mixture distribution P_g increases. This leads to γ → 0, making it harder to meet Condition 11. This obstacle may be partially resolved by the fact that we will use a GAN to fit Q, which corresponds to a relatively rich6 class of models G in (3). In other words, our weak learner is not so weak. On the other hand, Condition 10 of Corollary 1 is milder. For any γ ∈ [0, 1) and β ∈ (0, 1], the new component Q is guaranteed to strictly improve the objective functional. This comes at the price of the additional condition P_d(dP_g/dP_d = 0) < β, which asserts that β should be larger than the mass of true data P_d missed by the current model P_g. We argue that this is a rather reasonable condition: if P_g misses many modes of P_d we would prefer assigning a relatively large weight β to the new component Q. However, in practice, both Conditions 10 and 11 are difficult to check.
A rigorous analysis of situations in which they are guaranteed to hold is a direction for future research.

6The hardness of meeting Condition 11 of course largely depends on the class of models G used to fit Q in (3). For now we ignore this question and leave it for future research.

3 AdaGAN

We now describe the functions ChooseMixtureWeight and UpdateTrainingWeights of Algorithm 1. The complete AdaGAN meta-algorithm, with the details of UpdateTrainingWeights and ChooseMixtureWeight, is summarized in Algorithm 3 of Appendix A.
UpdateTrainingWeights At each iteration we add a new component Q to the current mixture P_g with weight β. The component Q should approach the "optimal target" Q*_β provided by (9) in Theorem 1. This distribution depends on the density ratio dP_g/dP_d, which is not directly accessible, but it can be estimated using adversarial training. Indeed, we can train a separate mixture discriminator D_M to distinguish between samples from P_d and samples from the current mixture P_g. It is known [13] that for an arbitrary f-divergence, there exists a corresponding function h such that the values of the optimal discriminator D_M are related to the density ratio by

    dP_g/dP_d (x) = h(D_M(x)).    (12)

We can replace dP_g(x)/dP_d(x) in (9) with h(D_M(x)). For the Jensen-Shannon divergence, used by the original GAN algorithm, h(z) = (1 − z)/z. In practice, when we compute dQ*_β on the training sample S_N = (X_1, . . . , X_N), each example X_i receives the weight given in (13) below. The only remaining task is to determine λ*.
As the weights w_i in (13),

    w_i = (1/(βN)) (λ* − (1 − β) h(d_i))_+ ,  where d_i = D_M(X_i),    (13)

must sum to 1, we get

    λ* = ( β / Σ_{i ∈ I(λ*)} p_i ) ( 1 + ((1 − β)/β) Σ_{i ∈ I(λ*)} p_i h(d_i) ),    (14)

where I(λ) := {i : λ > (1 − β)h(d_i)} and p_i = 1/N is the weight of X_i under the empirical data distribution. To find I(λ*), we sort the values h(d_i) in increasing order: h(d_1) ≤ . . . ≤ h(d_N). Then I(λ*) is a set consisting of the first k indices. We successively test all values of k until the λ given by (14) verifies (1 − β)h(d_k) < λ ≤ (1 − β)h(d_{k+1}). This procedure is guaranteed to converge by Theorem 1. It is summarized in Algorithm 2 of Appendix A.
ChooseMixtureWeight For every β there is an optimal re-weighting scheme with weights given by (13). If the GAN could perfectly approximate its target Q*_β, then choosing β = 1 would be optimal, because Q*_1 = P_d. But in practice GANs cannot do that, so we propose to choose β heuristically by imposing that each generator of the final mixture model has the same weight. This yields β_t = 1/t, where t is the iteration index. Other heuristics are proposed in Appendix B, but they did not lead to any significant difference.
The optimal discriminator In practice it is of course hard to find the optimal discriminator D_M achieving the global maximum of the variational representation of the f-divergence and verifying (12). For the JS-divergence this would mean that D_M is the classifier achieving minimal expected cross-entropy loss in the binary classification between P_g and P_d.
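The re-weighting (13)-(14) and the sorting search for λ* translate directly into code. A minimal sketch (assumptions: Jensen-Shannon h(z) = (1 − z)/z and uniform p_i = 1/N; Algorithm 2 of Appendix A is the authoritative version):

```python
import numpy as np

def training_weights(d, beta):
    """Compute example weights (13) from discriminator outputs d_i = D_M(X_i),
    solving the fixed point (14) for lambda* by scanning the sorted h(d_i)."""
    d = np.asarray(d, float)
    N = len(d)
    p = np.full(N, 1.0 / N)                  # p_i = 1/N (empirical distribution)
    h = (1.0 - d) / d                        # JS case: h(z) = (1 - z)/z
    order = np.argsort(h)                    # h(d_(1)) <= ... <= h(d_(N))
    hs, ps = h[order], p[order]
    lam = None
    for k in range(1, N + 1):                # candidate I(lambda*) = first k indices
        s_p = ps[:k].sum()
        s_ph = (ps[:k] * hs[:k]).sum()
        lam = (beta / s_p) * (1.0 + (1.0 - beta) / beta * s_ph)    # Eq. (14)
        hi = (1.0 - beta) * hs[k] if k < N else np.inf
        if (1.0 - beta) * hs[k - 1] < lam <= hi:
            break
    return np.maximum(lam - (1.0 - beta) * h, 0.0) / (beta * N)    # Eq. (13)

# A point with high D_M (true-looking, i.e. missed by the mixture) has small h
# and therefore receives a large weight:
w = training_weights([0.9, 0.5, 0.2], beta=0.5)
assert abs(w.sum() - 1.0) < 1e-9
assert w[0] > w[1] > w[2] == 0.0
```

Points the mixture already covers well (small D_M, large h) can receive exactly zero weight, which is how the next component is pushed towards the missing modes.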
In practice, we observed that the reweighting (13) leads to the desired property of emphasizing at least some of the missing modes, as long as D_M distinguishes reasonably between data points already covered by the current model P_g and those which are still missing. We found early stopping (while training D_M) sufficient to achieve this. In the worst case, when D_M overfits and returns 1 for all true data points, the reweighting simply leads to the uniform distribution over the training set.

4 Experiments

We ran AdaGAN7 on toy datasets, for which we can interpret the missing modes in a clear and reproducible way, and on MNIST, which is a high-dimensional dataset. The goal of these experiments was not to evaluate the visual quality of individual sample points, but to demonstrate that the re-weighting scheme of AdaGAN promotes diversity and effectively covers the missing modes.

7Code available online at https://github.com/tolstikhin/adagan

Toy Datasets Our target distribution is a mixture of isotropic Gaussians over R². The distances between the means are large enough to roughly avoid overlaps between different Gaussian components. We vary the number of modes to test how well each algorithm performs when there are fewer or more expected modes. We compare the baseline GAN algorithm with AdaGAN variations, and with other meta-algorithms that all use the same underlying GAN procedure. For details on these algorithms and on the architectures of the underlying generator and discriminator, see Appendix B.
To evaluate how well the generated distribution matches the target distribution, we use a coverage metric C: the probability mass of the true data "covered" by the model P_model. More precisely, we compute C := P_d(dP_model > t) with t such that P_model(dP_model > t) = 0.95. This metric is more interpretable than the likelihood, making it easier to assess the difference in performance of the algorithms.
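The coverage metric can be sketched as follows, using a simple fixed-bandwidth Gaussian KDE in one dimension (the actual experiments are in R² and select the bandwidth by cross-validation; the example distributions below are made up for illustration):

```python
import numpy as np

def coverage(model_samples, data_samples, bandwidth=0.3, level=0.95):
    """C = P_d(dP_model > t), with t such that P_model(dP_model > t) = level."""
    def kde(points, x):   # Gaussian kernel density estimate of `points` at `x`
        z = (x[:, None] - points[None, :]) / bandwidth
        return np.exp(-0.5 * z ** 2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
    # Threshold t keeping `level` of the model's own mass above it:
    t = np.quantile(kde(model_samples, model_samples), 1.0 - level)
    return float(np.mean(kde(model_samples, data_samples) > t))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-5, 0.5, 500), rng.normal(5, 0.5, 500)])
collapsed = rng.normal(-5, 0.5, 1000)          # a model that missed the right mode
assert 0.35 < coverage(collapsed, data) < 0.6  # only ~half the true mass covered
assert coverage(data, data) > 0.85             # a full model covers ~95%
```

A mode-collapsed model scores near the mass of the modes it covers, while a model matching the data scores near the chosen level, which is what makes C easy to interpret.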
To approximate the density of P_model we use a kernel density estimate, where the bandwidth is chosen by cross-validation. We repeat each run 35 times with the same parameters (but different random seeds). For each run, the learning rate is optimized using a grid search on a validation set. We report the median over those multiple runs, and the interval corresponding to the 5% and 95% percentiles.
Figure 1 summarizes the performance of the algorithms as a function of the number of iterations T. Both the ensemble and the boosting approaches significantly outperform the vanilla GAN and the "best of T" algorithm. Interestingly, the improvements are significant even after just one or two additional iterations (T = 2 or 3). Our boosting approach converges much faster. In addition, its variance is much lower, improving the likelihood that a given run gives good results. On this setup, the vanilla GAN approach has a significant number of catastrophic failures (visible in the lower bounds of the intervals). Further empirical results are available in Appendix B, where we compare AdaGAN variations to several other baseline meta-algorithms in more detail (Table 1) and combine AdaGAN with unrolled GANs (UGAN) [4] (Figure 3). Interestingly, Figure 3 shows that AdaGAN run with UGAN outperforms the vanilla UGAN on the toy datasets, demonstrating the advantage of using AdaGAN as a way to further improve the mode coverage of any existing GAN implementation.

Figure 1: Coverage C of the true data by the model distribution P^T_model, as a function of iterations T. Experiments correspond to the data distribution with 5 modes. Each blue point is the median over 35 runs. Green intervals are defined by the 5% and 95% percentiles (see Section 4). Iteration 0 is equivalent to one vanilla GAN.
The left plot corresponds to taking the best generator out of T runs. The middle plot is an "ensemble" GAN, simply taking a uniform mixture of T independently trained GAN generators. The right plot corresponds to our boosting approach (AdaGAN), with β_t = 1/t.

MNIST and MNIST3 We ran experiments both on the original MNIST and on the 3-digit MNIST (MNIST3) [5, 4] dataset, obtained by concatenating 3 randomly chosen MNIST images to form a 3-digit number between 0 and 999. According to [5, 4], MNIST contains 10 modes, while MNIST3 contains 1000 modes, and these modes can be detected using a pre-trained MNIST classifier. We combined AdaGAN both with simple MLP GANs and with DCGANs [19]. We used T ∈ {5, 10}, tried models of various sizes, and performed a reasonable amount of hyperparameter search.
Similarly to [4, Sec 3.3.1], we failed to reproduce the missing modes problem for MNIST3 reported in [5] and found that simple GAN architectures are capable of generating all 1000 numbers. The authors of [4] proposed to artificially reintroduce the missing modes by limiting the generators' flexibility. In our experiments, GANs trained with the architectures reported in [4] were often generating poor-looking digits. As a result, the pre-trained MNIST classifier was outputting random labels, which again led to full coverage of the 1000 numbers. We tried to threshold the confidence of the pre-trained classifier, but decided that this metric was too ad hoc.

For MNIST we noticed that the re-weighted distribution was often concentrating its mass on digits having very specific strokes: on different rounds it could highlight thick, thin, vertical, or diagonal digits, indicating that these traits were underrepresented in the generated samples (see Figure 2).
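(As an aside from the discussion: the MNIST3 dataset used in these experiments is simple to construct, since it only requires concatenating three uniformly sampled MNIST digits. The sketch below is our own illustration, not the paper's code; the helper name and array shapes are assumptions.)

```python
import numpy as np

def make_mnist3(images, labels, n, rng=None):
    """Build n MNIST3 samples by concatenating 3 random 28x28 MNIST digits.

    images: [N, 28, 28] array; labels: [N] array of digits in 0-9.
    Returns [n, 28, 84] images and [n] labels in {0, ..., 999}.
    """
    if rng is None:
        rng = np.random.RandomState(0)
    idx = rng.randint(len(images), size=(n, 3))   # 3 random digit indices per sample
    triples = images[idx]                          # [n, 3, 28, 28]
    # Concatenate the three digits side by side along the width axis.
    mnist3 = np.concatenate([triples[:, 0], triples[:, 1], triples[:, 2]], axis=2)
    digits = labels[idx]                           # [n, 3]
    mnist3_labels = 100 * digits[:, 0] + 10 * digits[:, 1] + digits[:, 2]
    return mnist3, mnist3_labels
```

The resulting label in {0, ..., 999} is what defines the 1000 modes that the pre-trained classifier is used to detect.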
This suggests that AdaGAN does a reasonable job at picking up different modes of the dataset, but also that there are more than 10 modes in MNIST (and more than 1000 in MNIST3). It is not clear how to evaluate the quality of generative models in this context.
We also tried to use the "inversion" metric discussed in Section 3.4.1 of [4]. For MNIST3 we noticed that a single GAN was capable of reconstructing most of the training points very accurately, both visually and in the ℓ2-reconstruction sense. The "inversion" metric tests whether the trained model can generate certain examples or not, but unfortunately it does not take into account the probability of doing so.

Figure 2: Digits from the MNIST dataset corresponding to the smallest (left) and largest (right) weights, obtained by the AdaGAN procedure (see Section 3) in one of the runs. The bold digits (left) are already covered, and the next GAN will concentrate on the thin (right) digits.

5 Conclusion

We studied the problem of minimizing general f-divergences with additive mixtures of distributions. The main contribution of this work is a detailed theoretical analysis, which naturally leads to an iterative greedy procedure: on every iteration the mixture is updated with a new component that minimizes the f-divergence to a re-weighted target distribution. We provided conditions under which this procedure is guaranteed to converge to the target distribution at an exponential rate. While our results can be combined with any generative modelling technique, we focused on GANs and provided a boosting-style algorithm, AdaGAN. Preliminary experiments show that AdaGAN successfully produces a mixture which iteratively covers the missing modes.

References

[1] D. P. Kingma and M. Welling. Auto-encoding variational Bayes.
In ICLR, 2014.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.

[4] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv:1611.02163, 2017.

[5] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv:1612.02136, 2016.

[6] Max Welling, Richard S. Zemel, and Geoffrey E. Hinton. Self supervised boosting. In Advances in Neural Information Processing Systems, pages 665–672, 2002.

[7] Zhuowen Tu. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.

[8] Aditya Grover and Stefano Ermon. Boosted generative models. ICLR 2017 conference submission, 2016.

[9] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[10] Saharon Rosset and Eran Segal. Boosting density estimation. In Advances in Neural Information Processing Systems, pages 641–648, 2002.

[11] A. Barron and J. Li. Mixture density estimation. Biometrics, 53:603–618, 1997.

[12] Yaxing Wang, Lichao Zhang, and Joost van de Weijer. Ensembles of generative adversarial networks. arXiv:1612.00991, 2016.

[13] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, 2016.

[14] F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.

[15] M. D. Reid and R. C. Williamson.
Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731–817, 2011.

[16] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, page 31, 2004.

[17] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136–143, 2005.

[18] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[19] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.