{"title": "Rethinking Generative Mode Coverage: A Pointwise Guaranteed Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 2088, "page_last": 2099, "abstract": "Many generative models have to combat missing modes. The conventional wisdom to this end is by reducing through training a statistical distance (such as f -divergence) between the generated distribution and provided data distribution. But this is more of a heuristic than a guarantee. The statistical distance measures a global, but not local, similarity between two distributions. Even if it is small, it does not imply a plausible mode coverage. Rethinking this problem from a game-theoretic perspective, we show that a complete mode coverage is firmly attainable. If a generative model can approximate a data distribution moderately well under a global statistical distance measure, then we will be able to find a mixture of generators that collectively covers every data point and thus every mode, with a lower-bounded generation probability. Constructing the generator mixture has a connection to the multiplicative weights update rule, upon which we propose our algorithm. We prove that our algorithm guarantees complete mode coverage. And our experiments on real and synthetic datasets confirm better mode coverage over recent approaches, ones that also use generator mixtures but rely on global statistical distances.", "full_text": "Rethinking Generative Mode Coverage:\n\nA Pointwise Guaranteed Approach\n\nPeilin Zhong\u21e4\n\nYuchen Mo\u21e4\n\nChang Xiao\u21e4\nColumbia University\n\nPengyu Chen\n\nChangxi Zheng\n\n{peilin, chang, cxz}@cs.columbia.edu\n{yuchen.mo, pengyu.chen}@columbia.edu\n\nAbstract\n\nMany generative models have to combat missing modes. 
The conventional wisdom to this end is to reduce, through training, a statistical distance (such as an f-divergence) between the generated distribution and the provided data distribution. But this is more of a heuristic than a guarantee. The statistical distance measures a global, but not local, similarity between two distributions. Even if it is small, it does not imply a plausible mode coverage. Rethinking this problem from a game-theoretic perspective, we show that a complete mode coverage is firmly attainable. If a generative model can approximate a data distribution moderately well under a global statistical distance measure, then we will be able to find a mixture of generators that collectively covers every data point and thus every mode, with a lower-bounded generation probability. Constructing the generator mixture has a connection to the multiplicative weights update rule, upon which we propose our algorithm. We prove that our algorithm guarantees complete mode coverage. And our experiments on real and synthetic datasets confirm better mode coverage over recent approaches, ones that also use generator mixtures but rely on global statistical distances.

1 Introduction

A major pillar of machine learning, the generative approach aims at learning a data distribution from a provided training dataset. While strikingly successful, many generative models suffer from missing modes. Even after a painstaking training process, the generated samples represent only a limited subset of the modes in the target data distribution, yielding a much lower entropy distribution.

Behind the missing mode problem is the conventional wisdom of training a generative model. Formulated as an optimization problem, the training process reduces a statistical distance between the generated distribution and the target data distribution.
The statistical distance, such as an f-divergence or the Wasserstein distance, is often a global measure. It evaluates an integral of the discrepancy between two distributions over the data space (or a summation over a discrete dataset). In practice, reducing the global statistical distance to a perfect zero is virtually impossible. Yet a small statistical distance does not certify that the generator achieves complete mode coverage. The generator may neglect underrepresented modes, ones that are less frequent in the data space, in exchange for better matching the distribution of well-represented modes, thereby lowering the statistical distance. In short, a global statistical distance is not ideal for promoting mode coverage (see Figure 1 for a 1D motivating example and later Figure 2 for examples of a few classic generative models).

This inherent limitation is evident in various types of generative models (see Appendix A for the analysis of a few classic generative models). Particularly in generative adversarial networks (GANs), mode collapse has been known as a prominent issue. Despite a number of recent improvements toward alleviating it [1, 2, 3, 4, 5, 6], none of them offers a complete mode coverage.
In fact, even the fundamental question remains unanswered: what precisely does a complete mode coverage mean? After all, the definition of "modes" in a dataset is rather vague, depending on what specific distance metric is used for clustering data items (as discussed and illustrated in [4]).

*equal contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Motivating example. Consider a 1D target distribution P with three modes, i.e., a mixture of three Gaussians, P = 0.9·N(0, 1) + 0.05·N(10, 1) + 0.05·N(−10, 1) (solid orange curve). If we learn this distribution using a single Gaussian Q (black dashed curve), the statistical distance between the two is small: D_TV(Q ‖ P) ≤ 0.1 and D_KL(Q ‖ P) ≤ 0.16. The probability of drawing samples from the side modes (in [−14, −6] and [6, 14]) of the target distribution P is Pr_{x∼P}[6 ≤ |x| ≤ 14] ≈ 0.1, but the probability of generating samples from Q in the same intervals is Pr_{x∼Q}[6 ≤ |x| ≤ 14] ≈ 10^−9. The side modes are missed!

We introduce an explicit notion of complete mode coverage, by switching from the global statistical distance to local pointwise coverage: provided a target data distribution P with a probability density p(x) at each point x of the data space X, we claim that a generator G has a complete mode coverage of P if the generator's probability density g(x) for generating x is pointwise lower bounded, that is,

    g(x) ≥ γ · p(x), ∀x ∈ X,    (1)

for a reasonably large relaxation constant γ ∈ (0, 1). This notion of mode coverage ensures that every point x in the data space X will be generated by G with a finite and lower-bounded probability density g(x).
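The gap illustrated in Figure 1 can be checked numerically with a few lines of stdlib Python. As an assumption consistent with the stated distances (but not taken from the paper), the single-Gaussian approximation Q is taken to be the standard normal N(0, 1) fitted to the dominant mode:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def interval_prob(mu, lo, hi):
    """Pr[lo <= X <= hi] for X ~ N(mu, 1)."""
    return phi(hi - mu) - phi(lo - mu)

def side_mass(components):
    """Mass a Gaussian mixture puts on the side-mode intervals [6,14] and [-14,-6]."""
    return sum(w * (interval_prob(m, 6, 14) + interval_prob(m, -14, -6))
               for w, m in components)

# Target P = 0.9*N(0,1) + 0.05*N(10,1) + 0.05*N(-10,1), as in Figure 1.
p_side = side_mass([(0.9, 0.0), (0.05, 10.0), (0.05, -10.0)])
# Assumed single-Gaussian approximation Q = N(0,1).
q_side = side_mass([(1.0, 0.0)])

print(p_side)  # ~0.1: P puts a tenth of its mass on the side modes
print(q_side)  # ~2e-9: Q essentially never samples them
```

Under this assumption the side-mode probabilities match the figure's claim: about 0.1 under P versus on the order of 10^−9 under Q.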
Thereby, in contrast to a generator trained by reducing a global statistical distance (recall Figure 1), no mode will have an arbitrarily small generation probability, and thus no mode will be missed. Meanwhile, our mode coverage notion (1) stays compatible with the conventional heuristic of reducing a global statistical distance, as the satisfaction of (1) implies that the total variation distance between P and G is upper bounded by 1 − γ (see a proof in Appendix C).

At first sight, the pointwise condition (1) seems more stringent than reducing a global statistical distance, and pursuing it might require a new formulation of generative models. Perhaps somewhat surprisingly, a rethink from a game-theoretic perspective reveals that this notion of mode coverage is viable without formulating any new models. Indeed, a mixture of existing generative models (such as GANs) suffices. In this work, we provide an algorithm for constructing the generator mixture and a theoretical analysis showing the guarantee of our mode coverage notion (1).

1.1 A Game-Theoretic Analysis

Before delving into our algorithm, we offer an intuitive view of why our mode coverage notion (1) is attainable through a game-theoretic lens. Consider a two-player game between Alice and Bob: given a target data distribution P and a family G of generators², Alice chooses a generator G ∈ G, and Bob chooses a data point x ∈ X. If the probability density g(x) of Alice's G generating Bob's choice of x satisfies g(x) ≥ (1/4)·p(x), the game produces a value v(G, x) = 1; otherwise it produces v(G, x) = 0. Here 1/4 is used purposely as an example to concretize our intuition. Alice's goal is to maximize the game value, while Bob's goal is to minimize it.

Now, consider two situations. In the first situation, Bob first chooses a mixed strategy, that is, a distribution Q over X.
Then, Alice chooses the best generator G ∈ G according to Bob's distribution Q. When the game starts, Bob samples a point x using his choice of distribution Q. Together with Alice's choice G, the game produces a value. Since x is now a random variable over Q, the expected game value is max_{G∈G} E_{x∼Q}[v(G, x)]. In the second situation, Alice first chooses a mixed strategy, that is, a distribution R_G of generators over G. Then, given Alice's choice R_G, Bob chooses the best data point x ∈ X. When the game starts, Alice samples a generator G from the chosen distribution R_G. Together with Bob's choice of x, the game produces a value, and the expected value is min_{x∈X} E_{G∼R_G}[v(G, x)].

According to von Neumann's minimax theorem [7, 8], Bob's optimal expected value in the first situation must be the same as Alice's optimal value in the second situation:

    min_Q max_{G∈G} E_{x∼Q}[v(G, x)] = max_{R_G} min_{x∈X} E_{G∼R_G}[v(G, x)].    (2)

²An example of the generator family is GANs. The definition will be made clear later in this paper.

With this equality realized, our agenda in the rest of the analysis is as follows. First, we show a lower bound of the left-hand side of (2), and then we use the right-hand side to reach the lower bound of g(x) as in (1), for Alice's generator G. To this end, we need to depart from the current game-theoretic analysis and discuss the properties of existing generative models for a moment.

Existing generative models such as GANs [9, 1, 10] aim to reproduce arbitrary data distributions. While it remains intractable to have the generated distribution match exactly the data distribution, the approximations are often plausible.
One reason behind the plausible performance is that the data space encountered in practice is "natural" and restricted (all English sentences, all natural object images, or all images on a manifold), but not a space of arbitrary data. Therefore, it is reasonable to expect the generators in G (e.g., all GANs) to meet the following requirement³ (without conflicting with the no-free-lunch theorem [11]): for any distribution Q over a natural data space X encountered in practice, there exists a generator G ∈ G such that the total variation distance between G and Q is upper bounded by a constant ε, that is, (1/2)∫_X |q(x) − g(x)| dx ≤ ε, where q(·) and g(·) are the probability densities of Q and of the generated samples of G, respectively. Again as a concrete example, we use ε = 0.1. With this property in mind, we now go back to our game-theoretic analysis.

Back to the first situation described above. Once Bob's distribution Q (over X) and Alice's generator G are identified, then given a target distribution P over X and an x drawn by Bob from Q, the probability of having Alice's G cover P (i.e., g(x) ≥ (1/4)·p(x)) at x is lower bounded. In our current example, we have the following lower bound:

    Pr_{x∼Q}[g(x) ≥ 1/4 · p(x)] ≥ 0.4.    (3)

Here 0.4 is related to the total variation distance bound (i.e., ε = 0.1) between G and Q, and this lower bound value is derived in Appendix D. Next, notice that on the left-hand side of (2), the expected value E_{x∼Q}[v(G, x)] is equivalent to the probability in (3).
Thus, we have

    min_Q max_{G∈G} E_{x∼Q}[v(G, x)] ≥ 0.4.    (4)

Because of the equality in (2), this is also the lower bound of its right-hand side, from which we know that there exists a distribution R_G of generators such that for any x ∈ X, we have

    E_{G∼R_G}[v(G, x)] = Pr_{G∼R_G}[g(x) ≥ 1/4 · p(x)] ≥ 0.4.    (5)

This expression shows that for any x ∈ X, if we draw a generator G from R_G, then with a probability of at least 0.4, G's generation probability density satisfies g(x) ≥ (1/4)·p(x). Thus, we can think of R_G as a "collective" generator G*, or a mixture of generators. When generating a sample x, we first choose a generator G according to R_G and then sample an x using G. The overall probability density g*(x) of generating x satisfies g*(x) ≥ 0.1·p(x), precisely the pointwise lower bound that we pose in (1).

Takeaway from the analysis. This analysis reveals that a complete mode coverage is firmly viable. Yet it offers no recipe for how to construct the mixture of generators and their distribution R_G using existing generative models. Interestingly, as pointed out by Arora et al. [12], a constructive version of von Neumann's minimax theorem is related to the general idea of multiplicative weights update. Therefore, our key contributions in this work are i) the design of a multiplicative weights update algorithm (in Sec. 3) to construct a generator mixture, and ii) a theoretical analysis showing that our generator mixture indeed obtains the pointwise data coverage (1).
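The last step above can be checked with a toy numeric example (the per-generator densities below are made up for illustration, not produced by any model): if at a point x at least a 0.4-fraction of the generators in a uniform mixture satisfies g(x) ≥ p(x)/4, then the mixture density obeys g*(x) ≥ 0.4 · p(x)/4 = 0.1·p(x).

```python
p_x = 0.2                           # hypothetical target density p(x) at some x
g_x = [0.06, 0.08, 0.0, 0.01, 0.0]  # hypothetical densities g_i(x) of 5 generators

# Fraction of generators that (1/4)-cover x, as in (5).
covered = sum(1 for g in g_x if g >= p_x / 4) / len(g_x)
# Density of the uniform mixture at x.
g_star = sum(g_x) / len(g_x)

assert covered >= 0.4          # premise of (5) holds at this x
assert g_star >= 0.1 * p_x     # implied pointwise bound of form (1)
print(covered, g_star)         # 0.4 0.03
```

Here two of five generators exceed p(x)/4 = 0.05, and the mixture density 0.03 indeed exceeds 0.1·p(x) = 0.02.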
In fact, we only need a small number of generators to construct the mixture (i.e., it is easy to train), and the distribution R_G for using the mixture is as simple as a uniform distribution (i.e., it is easy to use).

2 Related Work

There exists a rich set of works improving classic generative models to alleviate missing modes, especially in the framework of GANs, by altering objective functions [13, 14, 15, 10, 16, 17], changing training methods [18, 19], modifying neural network architectures [2, 20, 21, 22, 23], or regularizing latent space distributions [4, 24]. The general philosophy behind these improvements is to reduce the statistical distance between the generated distribution and the target distribution by making the models easier to train. Despite their technical differences, their optimization goals are all toward reducing a global statistical distance.

³This requirement is weaker than the mainstream goal of generative models, which all aim to approximate a target data distribution as closely as possible. Here we only require that the approximation error be upper bounded.

The idea of constructing a mixture of generators has been explored, with two ways of construction. In the first way, a set of generators is trained simultaneously. For example, Locatello et al. [25] used multiple generators, each responsible for sampling a subset of data points decided in a k-means clustering fashion. Other methods focus on the use of multiple GANs [26, 27, 28]. The theoretical intuition behind these approaches comes from viewing a GAN as a two-player game and extending it to reach a Nash equilibrium with a mixture of generators [26]. In contrast, our method does not depend specifically on GANs, and our game-theoretic view is fundamentally different (recall Sec. 1.1).

Another way of training a mixture of generators takes a sequential approach. This is related to boosting algorithms in machine learning. Grnarova et al. [29] viewed the problem of training GANs as finding a mixed strategy in a zero-sum game, and used the Follow-the-Regularized-Leader algorithm [30] to train a mixture of generators iteratively. Inspired by AdaBoost [31], other approaches train a "weak" generator that fits a reweighted data distribution in each iteration, and all iterations together form an additive mixture of generators [32, 33] or a multiplicative mixture of generators [34].

Our method can also be viewed as a boosting strategy. From this perspective, the most related work is AdaGAN [33], but significant differences exist. Theoretically, AdaGAN (and other boosting-like algorithms) is based on the assumption that the reweighted data distribution in each iteration becomes progressively easier to learn. It requires the generator in each iteration to have a smaller statistical distance to the reweighted distribution than in the previous iteration. As we will discuss in Sec. 5, this assumption is not always feasible. We make no such assumption. Our method can use a weak generator in each iteration. If the generator is more expressive, the theoretical lower bound of our pointwise coverage becomes larger (i.e., a larger γ in (1)). Algorithmically, our reweighting scheme is simpler than and different from AdaGAN's, only doubling the weights or leaving them unchanged in each iteration. Also, the generators in our mixture are treated uniformly, and no mixture weights are needed, whereas AdaGAN needs a set of weights that are heuristically chosen.

To summarize, in stark contrast to all prior methods, our approach is rooted in a different philosophy of training generative models. Rather than striving to reduce a global statistical distance, our method revolves around an explicit notion of complete mode coverage as defined in (1).
Unlike other boosting algorithms, our algorithm for constructing the mixture of generators guarantees complete mode coverage, and this guarantee is theoretically proved.

3 Algorithm

A mixture of generators. Provided a target distribution P on a data domain X, we train a mixture of generators to pursue pointwise mode coverage (1). Let G* = {G1, . . . , GT} denote the resulting mixture of T generators. Each of them (Gt, t = 1...T) may use any existing generative model such as a GAN. Existing methods that also rely on a mixture of generators associate with each generator a nonuniform weight αt and choose a generator for producing a sample randomly based on the weights. Often, these weights are chosen heuristically, e.g., in AdaGAN [33]. Our mixture is conceptually and computationally simpler. Each generator is treated equally. When using G* to generate a sample, we first choose a generator Gi uniformly at random, and then use Gi to generate the sample.

Algorithm overview. Our algorithm for training G* can be understood as a specific rule design in the framework of multiplicative weights update [12]. Outlined in Algorithm 1, it runs iteratively. In each iteration, a generator Gt is trained using an updated data distribution Pt (see Lines 6-7 of Algorithm 1). The intuition here is simple: if in certain data domain regions the current generator fails to cover the target distribution sufficiently well, then we update the data distribution to emphasize those regions for the next round of generator training (see Line 9 of Algorithm 1). In this way, each generator can focus on the data distribution in individual data regions. Collectively, they are able to cover the distribution over the entire data domain, and thus guarantee pointwise data coverage.

Training. Each iteration of our algorithm trains an individual generator Gt, for which many existing generative models, such as GANs [9], can be used.
The only prerequisite is that Gt needs to be trained to approximate the data distribution Pt moderately well. This requirement arises from our game-theoretic analysis (Sec. 1.1), wherein the total variation distance between Gt's distribution and Pt needs to be upper bounded. Later in our theoretical analysis (Sec. 4), we will formally state this requirement, which, in practice, is easily satisfied by most existing generative models.

Algorithm 1 Constructing a mixture of generators
1: Parameters: T, a positive integer number of generators, and δ ∈ (0, 1), a covering threshold.
2: Input: a target distribution P on a data domain X.
3: For each x ∈ X, initialize its weight w1(x) = p(x).
4: for t = 1 → T do
5:   Construct a distribution Pt over X as follows:
6:     For every x ∈ X, normalize the probability density pt(x) = wt(x)/Wt, where Wt = ∫_X wt(x) dx.
7:   Train a generative model Gt on the distribution Pt.
8:   Estimate the generated density gt(x) for every x ∈ X.
9:   For each x ∈ X, if gt(x) < δ·p(x), set wt+1(x) = 2·wt(x). Otherwise, set wt+1(x) = wt(x).
10: end for
11: Output: a mixture of generators G* = {G1, . . . , GT}.

Estimation of generated probability density. In Line 8 of Algorithm 1, we need to estimate the probability density gt(x) of the current generator sampling a data point x. Our estimation follows the idea of adversarial training, similar to AdaGAN [33]. First, we train a discriminator Dt to distinguish between samples from Pt and samples from Gt. The optimization objective of Dt is defined as

    max_{Dt} E_{x∼Pt}[log Dt(x)] + E_{x∼Gt}[log(1 − Dt(x))].

Unlike AdaGAN [33], here Pt is the currently updated data distribution, not the original target distribution, and Gt is the generator trained in the current round, not a mixture of the generators of all past rounds.
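The reweighting loop of Algorithm 1 can be sketched on a toy discrete dataset in a few lines of Python. This is a minimal illustration, not the paper's implementation: TopKGenerator is a deliberately weak, hypothetical stand-in for a trained model (it drops modes on purpose), and the coverage test uses its density directly, whereas in practice the test comes from the discriminator estimate described above.

```python
class TopKGenerator:
    """Toy stand-in for a trained generator: puts uniform mass on the k
    highest-weighted points, mimicking a mode-dropping weak learner."""
    def __init__(self, p_t, k):
        top = sorted(p_t, key=p_t.get, reverse=True)[:k]
        self.mass = {x: 1.0 / k for x in top}

    def density(self, x):
        return self.mass.get(x, 0.0)

def train_mixture(points, T, delta, k):
    """Empirical version of Algorithm 1: double the weight of every point
    the current generator fails to delta-cover; return the generator list."""
    n = len(points)
    w = {x: 1.0 for x in points}               # uniform empirical p(x) = 1/n
    mixture = []
    for _ in range(T):
        W = sum(w.values())
        p_t = {x: w[x] / W for x in points}    # Line 6: reweighted distribution
        G_t = TopKGenerator(p_t, k)            # Line 7: "train" on P_t
        mixture.append(G_t)
        for x in points:                       # Line 9: multiplicative update
            if G_t.density(x) < delta / n:     # i.e., g_t(x) < delta * p(x)
                w[x] *= 2.0
    return mixture

points = list(range(10))
mix = train_mixture(points, T=4, delta=0.25, k=5)
g_star = {x: sum(G.density(x) for G in mix) / len(mix) for x in points}
print(g_star)  # every point ends with mixture density 0.1 = p(x)
```

Even though each round's generator covers only half the points, the doubled weights steer later rounds to the neglected half, and the uniform mixture covers every point with density 0.1, matching the empirical p(x) = 1/10.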
As pointed out previously [35, 33], once Dt is optimized, we have Dt(x) = pt(x)/(pt(x) + gt(x)) for all x ∈ X, and equivalently gt(x)/pt(x) = 1/Dt(x) − 1. Using this property in Line 9 of Algorithm 1 (for testing the data coverage), we rewrite the condition gt(x) < δ · p(x) as

    gt(x)/p(x) = (gt(x)/pt(x)) · (pt(x)/p(x)) = (1/Dt(x) − 1) · wt(x)/(p(x)·Wt) < δ,

where the second equality utilizes the evaluation of pt(x) in Line 6 (i.e., pt(x) = wt(x)/Wt).

Note that if the generators Gt are GANs, then the discriminator of each Gt can be reused as Dt here. Reusing Dt introduces no additional computation. In contrast, AdaGAN [33] always has to train an additional discriminator Dt in each round using the mixture of generators of all past rounds.

Working with an empirical dataset. In practice, the true data distribution P is often unknown; instead, an empirical dataset X = {x_i}_{i=1}^n is given, considered as n i.i.d. samples drawn from P. According to the Glivenko-Cantelli theorem [36], the uniform distribution over n i.i.d. samples from P will converge to P as n approaches infinity. Therefore, provided the empirical dataset, we do not need to know the probability density p(x) of P, as every sample x_i ∈ X is considered to have a finite and uniform probability measure. An empirical version of Algorithm 1 and more explanation are presented in the supplementary document (Algorithm 2 and Appendix B).

4 Theoretical Analysis

We now provide a theoretical understanding of our algorithm, showing that the pointwise data coverage (1) is indeed obtained. Our analysis also sheds some light on how to choose the parameters of Algorithm 1.

4.1 Preliminaries

We first clarify a few notational conventions and introduce two new theoretical notions for our subsequent analysis. Our analysis is in the continuous setting; results on discrete datasets follow directly.

Notation.
Formally, we consider a d-dimensional measurable space (X, B(X)), where X is the d-dimensional data space and B(X) is the Borel σ-algebra over X, enabling probability measures. We use a capital letter (e.g., P) to denote a probability measure on this space. When there is no ambiguity, we also refer to them as probability distributions (or distributions). For any subset S ∈ B(X), the probability of S under P is P(S) := Pr_{x∼P}[x ∈ S]. We use G to denote a generator. When there is no ambiguity, G also denotes the distribution of its generated samples. All distributions are assumed absolutely continuous. Their probability density functions (i.e., the derivatives with respect to the Lebesgue measure) are referred to by their corresponding lowercase letters (e.g., p(·), q(·), and g(·)). Moreover, we use [n] to denote the set {1, 2, ..., n}, N_{>0} for the set of all positive integers, and 1(E) for the indicator function whose value is 1 if the event E happens and 0 otherwise.

f-divergence. Widely used in objective functions for training generative models, f-divergence is a statistical distance between two distributions. Let P and Q be two distributions over X. Provided a convex function f on (0, ∞) such that f(1) = 0, the f-divergence of Q from P is defined as

    D_f(Q ‖ P) := ∫_X f(q(x)/p(x)) p(x) dx.

Various choices of f lead to some commonly used f-divergence metrics such as the total variation distance D_TV, Kullback-Leibler divergence D_KL, Hellinger distance D_H, and Jensen-Shannon divergence D_JS [35, 37]. Among them, the total variation distance is upper bounded by many other f-divergences. For instance, D_TV(Q ‖ P) is upper bounded by √(D_KL(Q ‖ P)/2), √2·D_H(Q ‖ P), and √(2·D_JS(Q ‖ P)), respectively. Thus, if two distributions are close under those f-divergence measures, so are they under the total variation distance. For this reason, our theoretical analysis is based on the total variation distance.

δ-cover and (δ, β)-cover.
We introduce two new notions for analyzing our algorithm. The first is the notion of δ-cover. Given a data distribution P over X and a value δ ∈ (0, 1], if a generator G satisfies g(x) ≥ δ·p(x) at a data point x ∈ X, we say that x is δ-covered by G under distribution P. Using this notion, the pointwise mode coverage (1) states that x is γ-covered by G under distribution P for all x ∈ X. We also extend this notion to a measurable subset S ∈ B(X): we say that S is δ-covered by G under distribution P if G(S) ≥ δ·P(S) is satisfied.

Next, consider another distribution Q over X. We say that G can (δ, β)-cover (P, Q) if the following condition holds:

    Pr_{x∼Q}[x is δ-covered by G under distribution P] ≥ β.    (6)

For instance, using this notation, Equation (3) in our game-theoretic analysis states that G can (0.25, 0.4)-cover (P, Q).

4.2 Guarantee of Pointwise Data Coverage

In each iteration of Algorithm 1, we expect the generator Gt to approximate the given data distribution Pt sufficiently well. We now formalize this expectation and examine its implication. Our intuition is that by finding a property similar to (3), we should be able to establish a pointwise coverage lower bound in a way similar to our analysis in Sec. 1.1. Such a property is given by the following lemma (proved in Appendix E.1).

Lemma 1. Consider two distributions, P and Q, over the data space X, and a generator G producing samples in X. For any ε, δ ∈ (0, 1], if D_TV(G ‖ Q) ≤ ε, then G can (δ, 1 − 2δ − ε)-cover (P, Q).

Intuitively, when G and Q are identified, ε is set. If δ is reduced, then more data points in X can be δ-covered by G under P. Thus, the probability defined in (6) becomes larger, as reflected by the increasing 1 − 2δ − ε. On the other hand, consider a fixed δ. As the discrepancy between G and Q becomes larger, ε increases.
Then, sampling an x according to Q will have a smaller chance of landing at a point that is δ-covered by G under P, as reflected by the decreasing 1 − 2δ − ε.

Next, we consider Algorithm 1 and identify a sufficient condition under which the output mixture of generators G* covers every data point with a lower-bounded guarantee (i.e., our goal (1)). Simply speaking, this sufficient condition is as follows: in each round t, the generator Gt is trained such that, given an x drawn from distribution Pt, the probability of x being δ-covered by Gt under P is also lower bounded. A formal statement is given in the next lemma (proved in Appendix E.2).

Lemma 2. Recall that T ∈ N_{>0} and δ ∈ (0, 1) are the input parameters of Algorithm 1. For any ε ∈ [0, 1) and any measurable subset S ∈ B(X) whose probability measure satisfies P(S) ≥ 1/2^{ηT} for some η ∈ (0, 1), if in every round t ∈ [T], Gt can (δ, 1 − ε)-cover (P, Pt), then the resulting mixture of generators G* can (δ, 1 − ε/ln 2 − η)-cover S under distribution P.

This lemma is about lower-bounded coverage of a measurable subset S, not a point x ∈ X. At first sight, it is not of the exact form in (1) (i.e., pointwise γ-coverage). This is because, formally speaking, it makes no sense to talk about covering probability at a single point (whose measure is zero). But as T approaches ∞, an S that satisfies P(S) ≥ 1/2^{ηT} can also approach a point (and η can approach zero). Thus, Lemma 2 provides a condition for pointwise lower-bounded coverage in the limiting sense. In practice, the provided dataset is always discrete, and the probability measure at each discrete data point is finite. Then, Lemma 2 is indeed a sufficient condition for pointwise lower-bounded coverage.

From Lemma 1, we see that the condition posed by Lemma 2 is indeed satisfied by our algorithm, and combining both lemmas yields our final theorem (proved in Appendix E.3).

Theorem 1.
Recall that T ∈ N_{>0} and δ ∈ (0, 1) are the input parameters of Algorithm 1. For any measurable subset S ∈ B(X) whose probability measure satisfies P(S) ≥ 1/2^{ηT} for some η ∈ (0, 1), if in every round t ∈ [T], D_TV(Gt ‖ Pt) ≤ ε, then the resulting mixture of generators G* can (δ, 1 − (ε + 2δ)/ln 2 − η)-cover S under distribution P.

In practice, existing generative models (such as GANs) can approximate Pt sufficiently well, and thus D_TV(Gt ‖ Pt) ≤ ε is always satisfied for some ε. According to Theorem 1, a pointwise lower-bounded coverage can then be obtained by our Algorithm 1. If we choose to use a more expressive generative model (e.g., a GAN with a stronger network architecture), then Gt can better fit Pt in each round, yielding a smaller ε in Theorem 1. Consequently, the pointwise lower bound of the data coverage becomes larger, and effectively the coefficient γ in (1) becomes larger.

4.3 Insights from the Analysis

ε, η, δ, and T in Theorem 1. In Theorem 1, ε depends on the expressive power of the generators being used. It is therefore determined once the generator class G is chosen. But η can be directly set by the user, and a smaller η demands a larger T to ensure that P(S) ≥ 1/2^{ηT} is satisfied. Once ε and η are determined, we can choose the best δ by maximizing the resulting coverage bound in Theorem 1. For example, if ε ≤ 0.1 and η ≤ 0.01, then δ ≈ 1/4 would optimize the coverage bound (see Appendix E.4 for more details), and in this case the coefficient γ in (1) is at least 1/30.

Theorem 1 also sets the tone for the training cost. As explained in Appendix E.4, given a training dataset of size n, the size of the generator mixture, T, needs to be at most O(log n). This theoretical bound is consistent with our experimental results presented in Sec. 5. In practice, only a small number of generators are needed.

Estimated density function gt.
The analysis in Sec. 4.2 assumes that the generated probability density gt of the generator Gt in each round is known, while in practice we have to estimate gt by training a discriminator Dt (recall Section 3). Fortunately, only mild assumptions on the quality of Dt are needed to retain the pointwise lower-bounded coverage. Roughly speaking, Dt needs to meet two conditions: 1) In each round t, only a fraction of the covered data points (i.e., those with gt(x) ≥ δ·p(x)) is falsely classified by Dt and has their weights doubled. 2) In each round t, if the weight of a data point x is not doubled based on the estimation of Dt(x), then there is a good chance that x is truly covered by Gt (i.e., gt(x) ≥ δ·p(x)). A detailed and formal discussion is presented in Appendix E.5. In short, our estimation of gt does not deteriorate the efficacy of the algorithm, as also confirmed in our experiments.

Generalization. An intriguing question for all generative models is their generalization performance: how well can a generator trained on an empirical distribution (with a finite number of data samples) generate samples that follow the true data distribution? While generalization has long been studied for supervised classification, generalization of generative models remains a widely open theoretical question. We propose a notion of generalization for our method and provide a preliminary theoretical analysis. All the details are presented in Appendix E.6.

5 Experiments

We now present our major experimental results, while referring to Appendix F for network details and more results. We show that our mixture of generators is able to cover all the modes in various synthetic and real datasets, while existing methods always miss some modes.

Previous works on generative models used the Inception Score [1] or the Fréchet Inception Distance [18] as their evaluation metric.
But we do not use them, because they are both global measures, not reflecting mode coverage in local regions [38]. Moreover, these metrics are designed to measure the quality of generated images, which is orthogonal to our goal. For example, one can always use a more expressive GAN in each iteration of our algorithm to obtain better image quality and thus better inception scores.

Figure 2: Generative models on synthetic dataset. (a) The dataset consists of two modes: one major mode as an expanding sine curve (y = x sin(4πx)) and a minor mode as a Gaussian located at (10, 0) (highlighted in the red box). (b-f) We show color-coded distributions of generated samples from (b) EM, (c) GAN, (d) AdaGAN, (e) VAE, and (f) our method (i.e., a mixture of GANs). Only our method is able to cover the second mode (highlighted in the green box; zoom in to view).

Since the phenomenon of missing modes is particularly prominent in GANs, our experiments emphasize the mode coverage performance of GANs and compare our method (using a mixture of GANs) with DCGAN [39], MGAN [27], and AdaGAN. The latter two also use multiple GANs to improve mode coverage, although they do not aim for the same mode coverage notion as ours.
Overview. We first outline all our experiments, including those presented in Appendix F. i) We compare our method with a number of classic generative models on a synthetic dataset. ii) In Appendix F.3, we also compare our method with AdaGAN [33] on other synthetic datasets as well as the stacked MNIST dataset, because both are boosting algorithms aiming at improving mode coverage. iii) We further compare our method with a single large DCGAN, AdaGAN, and MGAN on the Fashion-MNIST dataset [40] mixed with a very small portion of the MNIST dataset [41].
Various generative models on synthetic dataset. As we show in Appendix A, many generative models, such as expectation-maximization (EM) methods, VAEs, and GANs, all rely on a global statistical distance in their training. We therefore test their mode coverage and compare with ours. We construct on ℝ² a synthetic dataset with two modes. The first mode consists of data points whose x-coordinate is uniformly sampled by x_i ∼ [−10, 10] and whose y-coordinate is y_i = x_i sin(4πx_i). The second mode has
data points forming a Gaussian at (0, 10). The total number of data points in the first mode is 400× that of the second. As shown in Figure 2, generative models including EM, GAN, VAE, and AdaGAN [33] all fail to cover the second mode. Our method, in contrast, captures both modes. We run KDE to estimate the likelihood of our generated samples in our synthetic data experiments (using KDE bandwidth = 0.1). We compute L = (1/N) Σ_i Pmodel(x_i), where x_i is a sample in the minor mode. For the minor mode, our method has a mean log likelihood of −1.28, while AdaGAN has only −967.64 (almost no samples from AdaGAN).

Table 1: Ratios of generated images classified as "1". We generate 9×10⁵ images from each method. The second column indicates the numbers of samples being classified as "1", and the third column indicates the ratio. In the fourth column, we average the prediction probabilities over all generated images that are classified as "1".

Method        "1"s        Frequency     Avg Prob.
DCGAN         13          0.14×10⁻⁴     0.49
MGAN          collapsed   −             −
AdaGAN        60          0.67×10⁻⁴     0.45
Our method    289         3.2×10⁻⁴      0.68

Fashion-MNIST and partial MNIST. Our next experiment is to challenge different GAN models with a real dataset that has separated and unbalanced modes. This dataset consists of the entire training dataset of Fashion-MNIST (with 60k images) mixed with 100 randomly sampled MNIST images labeled as "1". The size of the generator mixture is always set to 30 for AdaGAN, MGAN, and our method, and all generators share the same network structure. Additionally, when comparing with a single DCGAN, we ensure that the DCGAN's total number of parameters is comparable to the total number of parameters of the 30 generators in AdaGAN, MGAN, and ours. To evaluate the results, we train an 11-class classifier to distinguish the 10 classes in Fashion-MNIST and the one class in MNIST (i.e., "1"). First, we check how many samples from each method are classified as "1". The test setup and results are shown in Table 1 and its caption. The results suggest that our method can generate more "1" samples with higher prediction confidence. Note that MGAN has a strong mode collapse and fails to produce "1" samples. While DCGAN and AdaGAN generate some samples that are classified as
"1", inspecting the generated images reveals that those samples are all visually far from "1"s, but incorrectly classified by the pre-trained classifier (see Figure 3). In contrast, our method is able to generate samples close to "1". We also note that our method can produce higher-quality images if the underlying generative models in each round become stronger.

Figure 3: Most confident "1" samples. Here we show samples that are generated by each tested method (AdaGAN, MGAN, DCGAN, and our approach) and also classified by the pre-trained classifier most confidently as "1" images (i.e., top 10 in terms of the classified probability). Samples of our method are visually much closer to "1".

Figure 5: Distribution of generated samples. Training samples are drawn uniformly from each class, but generated samples by AdaGAN and MGAN are considerably nonuniform, while those from DCGAN and our method are more uniform. This experiment suggests that the conventional heuristic of reducing a statistical distance might not merit its use in training generative models.

Figure 4: Weight ratio of "1"s. We calculate the ratio of the total weights of training images labeled "1" to the total weights of all training images in each round, and plot how the ratio changes with respect to the iterations in our algorithm.

Another remarkable feature is observed in our algorithm. In each round of our training algorithm, we calculate the total weight w̄_t of provided training samples classified as "1" as well as the total weight W_t of all training samples. When plotting the ratio w̄_t/W_t with respect to the number of rounds (Figure 4), interestingly, we found that this ratio has a maximum value at around 0.005 in this example. We conjecture that if, in the training dataset, the ratio of "1" images among all training images is around 1/200, then a single generator may learn and generate "1" images (the minority mode).
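The rise-and-plateau behavior of w̄_t/W_t can be illustrated with a toy simulation of the weight-doubling rule. This is a minimal sketch under stated assumptions, not our actual training loop: it tracks whole modes rather than individual points, and it assumes (per the conjecture above) that a round's generator covers a mode once that mode's share of the total weight reaches the hypothetical threshold of 1/200.

```python
# Toy illustration of Algorithm 1's multiplicative-weights reweighting on a
# two-mode dataset. Assumption (not from the paper's implementation): a round's
# generator covers a mode once that mode holds at least a 1/200 share of the
# total weight; modes left uncovered have their weights doubled each round.
def weight_ratio_trajectory(n_major=60000, n_minor=100, rounds=50, threshold=1 / 200):
    w_major, w_minor = float(n_major), float(n_minor)  # total weight per mode
    ratios = []
    for _ in range(rounds):
        total = w_major + w_minor
        ratios.append(w_minor / total)  # this is the plotted ratio w_t / W_t
        # double the weight of any mode the hypothetical generator misses
        if w_minor / total < threshold:
            w_minor *= 2.0
        if w_major / total < threshold:
            w_major *= 2.0
    return ratios

ratios = weight_ratio_trajectory()
```

With 60k majority and 100 minority points, the minority share starts near 0.0017, doubles each round while uncovered, and plateaus just above 1/200 once the assumed generator picks up the minority mode, qualitatively matching the behavior reported in Figure 4.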
To verify this conjecture, we trained a GAN (with the same network structure) on another training dataset with 60k training images from Fashion-MNIST mixed with 300 MNIST "1" images. We then use the trained generator to sample 100k images. As a result, a fraction of 4.2×10⁻⁴ of those images are classified as "1". Figure 8 in Appendix F shows some of those images. This result confirms our conjecture and suggests that w̄_t/W_t may be used as a measure of mode bias in a dataset.
Lastly, in Figure 5, we show the generated distribution over the 10 Fashion-MNIST classes from each tested method. We neglect the class "1", as MGAN fails to generate it. The generated samples of AdaGAN and MGAN are highly nonuniform, though in the training dataset the 10 classes of images are uniformly distributed. Our method and DCGAN produce more uniform samples. This suggests that although other generative models (such as AdaGAN and MGAN) aim to reduce a global statistical distance, the generated samples may not easily match the empirical distribution (in this case, a uniform distribution). Our method, while not aiming to reduce the statistical distance in the first place, matches the target empirical distribution plausibly, as a byproduct.

6 Conclusion

We have presented an algorithm that iteratively trains a mixture of generators, driven by an explicit notion of complete mode coverage. With this notion for designing generative models, our work poses an alternative goal, one that differs from the conventional training philosophy: instead of reducing a global statistical distance between the target distribution and the generated distribution toward a perfect zero, one only needs to make the distance mildly small, and our method is able to boost the generative model with theoretically guaranteed mode coverage.
Acknowledgments. This work was supported in part by the National Science Foundation (CAREER-1453101, 1816041, 1910839, 1703925, 1421161, 1714818, 1617955, 1740833), Simons Foundation (#491119 to Alexandr Andoni), a Google Research Award, a Google PhD Fellowship, a Snap Research Fellowship, a Columbia SEAS CKGSB Fellowship, and SoftBank Group.

References

[1] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and
Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[2] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[3] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[4] Chang Xiao, Peilin Zhong, and Changxi Zheng. BourGAN: Generative networks with metric embeddings. In Advances in Neural Information Processing Systems, pages 2269–2280, 2018.

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[6] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, 2016.

[7] J. v. Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.

[8] Ding-Zhu Du and Panos M. Pardalos. Minimax and Applications, volume 4. Springer Science & Business Media, 2013.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[10] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[11] David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.

[12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[13] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.

[14] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

[15] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813–2821. IEEE, 2017.

[16] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[17] Yunus Saatci and Andrew G. Wilson. Bayesian GAN. In Advances in Neural Information Processing Systems, pages 3622–3631, 2017.

[18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[19] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[20] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[21] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples in generative adversarial networks. arXiv preprint arXiv:1712.04086, 2017.

[22] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U. Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pages 3310–3320, 2017.

[23] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[24] Chongxuan Li, Max Welling, Jun Zhu, and Bo Zhang. Graphical generative adversarial networks. In Advances in Neural Information Processing Systems 31, pages 6072–6083. Curran Associates, Inc., 2018.

[25] Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar Rätsch, Sylvain Gelly, and Bernhard Schölkopf. Clustering meets implicit generative models. arXiv preprint arXiv:1804.11130, 2018.

[26] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.

[27] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. MGAN: Training generative adversarial nets with multiple generators. In International Conference on Learning Representations, 2018.

[28] David Keetae Park, Seungjoo Yoo, Hyojin Bahng, Jaegul Choo, and Noseong Park. MEGAN: Mixture of experts of generative adversarial networks for multimodal image generation. arXiv preprint arXiv:1805.02481, 2018.

[29] Paulina Grnarova, Kfir Y. Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause. An online learning approach to generative adversarial networks. arXiv preprint arXiv:1706.03269, 2017.

[30] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[31] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[32] Yaxing Wang, Lichao Zhang, and Joost van de Weijer. Ensembles of generative adversarial networks. arXiv preprint arXiv:1612.00991, 2016.

[33] Ilya O. Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5430–5439, 2017.

[34] Aditya Grover and Stefano Ermon. Boosted generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[35] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[36] Francesco Paolo Cantelli. Sulla determinazione empirica delle leggi di probabilità. Giorn. Ist. Ital. Attuari, 4(421-424), 1933.

[37] Shun-ichi Amari. Information Geometry and Its Applications. Springer, 2016.

[38] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.

[39] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[40] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[41] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[42] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[43] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.