{"title": "Disconnected Manifold Learning for Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7343, "page_last": 7353, "abstract": "Natural images may lie on a union of disjoint manifolds rather than one globally connected manifold, and this can cause several difficulties for the training of common Generative Adversarial Networks (GANs). In this work, we first show that single generator GANs are unable to correctly model a distribution supported on a disconnected manifold, and investigate how sample quality, mode dropping and local convergence are affected by this. Next, we show how using a collection of generators can address this problem, providing new insights into the success of such multi-generator GANs. Finally, we explain the serious issues caused by considering a fixed prior over the collection of generators and propose a novel approach for learning the prior and inferring the necessary number of generators without any supervision. Our proposed modifications can be applied on top of any other GAN model to enable learning of distributions supported on disconnected manifolds. We conduct several experiments to illustrate the aforementioned shortcoming of GANs, its consequences in practice, and the effectiveness of our proposed modifications in alleviating these issues.", "full_text": "Disconnected Manifold Learning for Generative\n\nAdversarial Networks\n\nMahyar Khayatkhoei\n\nDepartment of Computer Science\n\nRutgers University\n\nm.khayatkhoei@cs.rutgers.edu\n\nAhmed Elgammal\n\nDepartment of Computer Science\n\nRutgers University\n\nelgammal@cs.rutgers.edu\n\nManeesh Singh\nVerisk Analytics\n\nmaneesh.singh@verisk.com\n\nAbstract\n\nNatural images may lie on a union of disjoint manifolds rather than one globally\nconnected manifold, and this can cause several dif\ufb01culties for the training of\ncommon Generative Adversarial Networks (GANs). 
In this work, we first show that single generator GANs are unable to correctly model a distribution supported on a disconnected manifold, and investigate how sample quality, mode dropping and local convergence are affected by this. Next, we show how using a collection of generators can address this problem, providing new insights into the success of such multi-generator GANs. Finally, we explain the serious issues caused by considering a fixed prior over the collection of generators and propose a novel approach for learning the prior and inferring the necessary number of generators without any supervision. Our proposed modifications can be applied on top of any other GAN model to enable learning of distributions supported on disconnected manifolds. We conduct several experiments to illustrate the aforementioned shortcoming of GANs, its consequences in practice, and the effectiveness of our proposed modifications in alleviating these issues.

1 Introduction

Consider two natural images, say a picture of a bird and a picture of a cat: can we continuously transform the bird into the cat without ever generating a picture that is neither bird nor cat? In other words, is there a continuous transformation between the two that never leaves the manifold of "real looking" images? It is often the case that real world data falls on a union of several disjoint manifolds and such a transformation does not exist, i.e.
the real data distribution is supported on a disconnected manifold, and an effective generative model needs to be able to learn such manifolds.
Generative Adversarial Networks (GANs) [10] model the problem of finding the unknown distribution of real data as a two player game where one player, called the discriminator, tries to perfectly separate real data from the data generated by a second player, called the generator, while the second player tries to generate data that can perfectly fool the first player. Under certain conditions, Goodfellow et al. [10] proved that this process will result in a generator that generates data from the real data distribution, hence finding the unknown distribution implicitly. However, later works uncovered several shortcomings of the original formulation, mostly due to violation of one or several of its assumptions in practice [1, 2, 20, 24]. Most notably, the proof only holds when optimizing in the function space of generator and discriminator (and not in the parameter space) [10], the Jensen-Shannon divergence is maxed out when the generated and real data distributions have disjoint supports, resulting in vanishing or unstable gradients [1], and finally the mode dropping problem where the generator fails to correctly capture all the modes of the data distribution, for which to the best of our knowledge there is no definitive reason yet.
One major assumption for the convergence of GANs is that the generator and discriminator both have unlimited capacity [10, 2, 24, 14], and modeling them with neural networks is then justified through the Universal Approximation Theorem. However, we should note that this theorem is only valid for continuous functions. Moreover, neural networks are far from universal approximators in practice.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
In fact, we often explicitly restrict neural networks through various regularizers to stabilize\ntraining and enhance generalization. Therefore, when generator and discriminator are modeled by\nstable regularized neural networks, they may no longer enjoy a good convergence as promised by the\ntheory.\nIn this work, we focus on learning distributions with disconnected support, and show how limitations\nof neural networks in modeling discontinuous functions can cause dif\ufb01culties in learning such\ndistributions with GANs. We study why these dif\ufb01culties arise, what consequences they have in\npractice, and how one can address these dif\ufb01culties by using a collection of generators, providing\nnew insights into the recent success of multi-generator models. However, while all such models\nconsider the number of generators and the prior over them as \ufb01xed hyperparameters [3, 14, 9], we\npropose a novel prior learning approach and show its necessity in effectively learning a distribution\nwith disconnected support. We would like to stress that we are not trying to achieve state of the art\nperformance in our experiments in the present work, rather we try to illustrate an important limitation\nof common GAN models and the effectiveness of our proposed modi\ufb01cations. 
We summarize the contributions of this work below:

• We identify a shortcoming of GANs in modeling distributions with disconnected support, and investigate its consequences, namely mode dropping, worse sample quality, and worse local convergence (Section 2).
• We illustrate how using a collection of generators can solve this shortcoming, providing new insights into the success of multi generator GAN models in practice (Section 3).
• We show that choosing the number of generators and the probability of selecting them are important factors in correctly learning a distribution with disconnected support, and propose a novel prior learning approach to address these factors (Section 3.1).
• Our proposed model can effectively learn distributions with disconnected supports and infer the number of necessary disjoint components through prior learning. Instead of one large neural network as the generator, it uses several smaller neural networks, making it more suitable for parallel learning and less prone to bad weight initialization. Moreover, it can be easily integrated with any GAN model to enjoy their benefits as well (Section 5).

2 Difficulties of Learning Disconnected Manifolds

A GAN as proposed by Goodfellow et al. [10], and most of its successors (e.g. [2, 11]), learn a continuous G : Z → X, which receives samples from some prior p(z) as input and generates real data as output. The prior p(z) is often a standard multivariate normal distribution N(0, I) or a bounded uniform distribution U(−1, 1). This means that p(z) is supported on a globally connected subspace of Z. Since a continuous function always keeps the connectedness of space intact [15], the probability distribution induced by G is also supported on a globally connected space. Thus G, a continuous function by design, can not correctly model a union of disjoint manifolds in X.
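This connectedness argument can be sanity-checked numerically in one dimension: by the intermediate value theorem, any continuous generator fed a connected prior must place samples between the two target points. The sketch below uses a smooth tanh "generator" as a hypothetical stand-in for what a stable, regularized network might learn; the function and its sharpness parameter are illustrative assumptions, not part of the paper's models.

```python
import numpy as np

# Target support: {-2, +2}. A continuous "generator" that a stable,
# regularized network might plausibly learn: a smooth step from -2 to +2.
def G(z, sharpness=10.0):
    return 2.0 * np.tanh(sharpness * z)

z = np.linspace(-1.0, 1.0, 10001)  # connected prior support, as in U(-1, 1)
x = G(z)

# Most samples land near the two modes...
near_modes = np.mean(np.abs(np.abs(x) - 2.0) < 0.1)
# ...but continuity forces off-manifold samples into the gap (-2, 2).
gap_samples = np.mean(np.abs(x) < 1.0)

print(f"fraction near modes: {near_modes:.3f}, fraction deep in gap: {gap_samples:.3f}")
```

Raising `sharpness` shrinks the gap fraction but, for any finite value, never removes it; this is exactly the higher frequency function whose learning stable training procedures avoid.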
We highlight this fact in Figure 1 using an illustrative example where the support of real data is {+2, −2}. We will look at some consequences of this shortcoming in the next part of this section. For the remainder of this paper, we assume the real data is supported on a manifold Sr which is a union of disjoint globally connected manifolds each denoted by Mi; we refer to each Mi as a submanifold (note that we are overloading the topological definition of submanifolds in favor of brevity):

Sr = ∪_{i=1}^{nr} Mi,   ∀i ≠ j : Mi ∩ Mj = ∅

(a) Suboptimal Continuous G   (b) Optimal G*

Figure 1: Illustrative example of continuous generator G(z) : Z → X with prior z ~ U(−1, 1), trying to capture real data coming from p(x) = (1/2)(δ(x − 2) + δ(x + 2)), a distribution supported on a union of two disjoint manifolds. (a) shows an example of what a stable neural network is capable of learning for G (a continuous and smooth function), (b) shows an optimal generator G*(z). Note that since z is uniformly sampled, G(z) is necessarily generating off manifold samples (in [−2, 2]) due to its continuity.

Sample Quality. Since GAN's generator tries to cover all submanifolds of real data with a single globally connected manifold, it will inevitably generate off real-manifold samples. Note that to avoid off manifold regions, one should push the generator to learn a higher frequency function, the learning of which is explicitly avoided by stable training procedures and means of regularization. Therefore the GAN model in a stable training, in addition to real looking samples, will also generate low quality off real-manifold samples. See Figure 2 for an example of this problem.
Mode Dropping.
In this work, we use the term mode dropping to refer to the situation where one or several submanifolds of real data are not completely covered by the support of the generated distribution. Note that mode collapse is a special case of this definition where all but a small part of a single submanifold are dropped. When the generator can only learn a distribution with globally connected support, it has to learn a cover of the real data submanifolds; in other words, the generator can not reduce the probability density of the off real-manifold space beyond a certain value. However, the generator can try to minimize the volume of the off real-manifold space to minimize the probability of generating samples there. For example, see how in Figure 2b the learned globally connected manifold has minimum off real-manifold volume; for instance, it does not learn a cover that crosses the center (the same manifold is learned in 5 different runs). So, in learning the cover, there is a trade off between covering all real data submanifolds, and minimizing the volume of the off real-manifold space in the cover. This trade off means that the generator may sacrifice certain submanifolds, entirely or partially, in favor of learning a cover with less off real-manifold volume, hence mode dropping.
Local Convergence. Nagarajan and Kolter [21] recently proved that the training of GANs is locally convergent when generated and real data distributions are equal near the equilibrium point, and Mescheder et al. [19] showed the necessity of this condition on a prototypical example. Therefore when the generator can not learn the correct support of the real data distribution, as in our discussion, the resulting equilibrium may not be locally convergent.
In practice, this means the generator's support keeps oscillating near the data manifold.

3 Disconnected Manifold Learning

There are two ways to achieve disconnectedness in X: making Z disconnected, or making G : Z → X discontinuous. The former needs considerations for how to make Z disconnected, for example adding discrete dimensions [6], or using a mixture of Gaussians [12]. The latter solution can be achieved by introducing a collection of independent neural networks as G. In this work, we investigate the latter solution since it is more suitable for parallel optimization and can be more robust to bad initialization.
We first introduce a set of generators Gc : Z → X instead of a single one, independently constructed on a uniform prior in the shared latent space Z. Each generator can therefore potentially learn a separate connected manifold. However, we need to encourage these generators to each focus on a different submanifold of the real data, otherwise they may all learn a cover of the submanifolds and

(a) Real Data   (b) WGAN-GP   (c) DMWGAN   (d) DMWGAN-PL

Figure 2: Comparing Wasserstein GAN (WGAN) and its Disconnected Manifold version with and without prior learning (DMWGAN-PL, DMWGAN) on disjoint line segments dataset when ng = 10. Different colors indicate samples from different generators. Notice how WGAN-GP fails to capture the disconnected manifold of real data, learning a globally connected cover instead, and thus generating off real-manifold samples. DMWGAN also fails due to incorrect number of generators. In contrast, DMWGAN-PL is able to infer the necessary number of disjoint components without any supervision and learn the correct disconnected manifold of real data. Each figure shows 10K samples from the respective model.
We train each model 5 times; the results shown are consistent across different runs.

(a) WGAN-GP   (b) DMWGAN   (c) DMWGAN-PL

Figure 3: Comparing WGAN-GP, DMWGAN and DMWGAN-PL convergence on unbalanced disjoint line segments dataset when ng = 10. The real data is the same line segments as in Figure 2, except the top right line segment has higher probability. Different colors indicate samples from different generators. Notice how DMWGAN-PL (c) has eliminated the contribution of redundant generators without any supervision. Each figure shows 10K samples from the respective model. We train each model 5 times; the results shown are consistent across different runs.

experience the same issues as a single generator GAN. Intuitively, we want the samples generated by each generator to be perfectly unique to that generator; in other words, each sample should be a perfect indicator of which generator it came from. Naturally, we can achieve this by maximizing the mutual information I(c; x), where c is the generator id and x is the generated sample. As suggested by Chen et al.
[6], we can implement this by maximizing a lower bound on the mutual information between generator ids and generated images:

I(c; x) = H(c) − H(c|x)
= H(c) + E_{c~p(c), x~pg(x|c)}[E_{c′~p(c′|x)}[ln p(c′|x)]]
= H(c) + E_{x~pg(x)}[KL(p(c|x) || q(c|x))] + E_{c~p(c), x~pg(x|c), c′~p(c′|x)}[ln q(c′|x)]
≥ H(c) + E_{c~p(c), x~pg(x|c), c′~p(c′|x)}[ln q(c′|x)]
= H(c) + E_{c~p(c), x~pg(x|c)}[ln q(c|x)]

where q(c|x) is the distribution approximating p(c|x), pg(x|c) is induced by each generator Gc, KL is the Kullback-Leibler divergence, and the last equality is a consequence of Lemma 5.1 in [6]. Therefore, by modeling q(c|x) with a neural network Q(x; γ), the encoder network, maximizing I(c; x) boils down to minimizing a cross entropy loss:

Lc = −E_{c~p(c), x~pg(x|c)}[ln q(c|x)]    (1)

Utilizing the Wasserstein GAN [2] objectives, the discriminator (critic) and generator maximize the following, where D(x; w) : X → R is the critic function:

Vd = E_{x~pr(x)}[D(x; w)] − E_{c~p(c), x~pg(x|c)}[D(x; w)]    (2)
Vg = E_{c~p(c), x~pg(x|c)}[D(x; w)] − λLc    (3)

We call this model Disconnected Manifold Learning WGAN (DMWGAN) in our experiments. We can similarly apply our modifications to the original GAN [10] to construct DMGAN. We add the single sided version of the gradient penalty regularizer [11] to the discriminator/critic objectives of both models and all baselines. See Appendix A for details of our algorithm and the DMGAN objectives. See Appendix F for more details and experiments on the importance of the mutual information term.
The original convergence theorems of Goodfellow et al. [10] and Arjovsky et al.
[2] hold for the proposed DM versions, respectively, because all our modifications concern the internal structure of the generator, and can be absorbed into the unlimited capacity assumption. More concretely, all generators together can be viewed as a unified generator where p(c)pg(x|c) becomes the generator probability, and Lc can be considered as a constraint on the generator function space incorporated using a Lagrange multiplier. While most multi-generator models consider p(c) as a uniform distribution over generators, this naive choice of prior can cause certain difficulties in learning a disconnected support. We will discuss this point, and also introduce and motivate the metrics we use for evaluations, in the next two subsections.

3.1 Learning the Generator's Prior

In practice, we can not assume that the true number of submanifolds in real data is known a priori. So let us consider two cases regarding the number of generators ng, compared to the true number of submanifolds in data nr, under a fixed uniform prior p(c). If ng < nr, then some generators have to cover several submanifolds of the real data, thus partially experiencing the same issues discussed in Section 2. If ng > nr, then some generators have to share one real submanifold, and since we are forcing the generators to maintain disjoint supports, this results in partially covered real submanifolds, causing mode dropping. See Figures 2c and 3b for examples of this issue. Note that an effective solution to the latter problem reduces the former problem into a trade off: the more the generators, the better the cover. We can address the latter problem by learning the prior p(c) such that the contribution of redundant generators vanishes. Even when ng = nr, what if the distribution of data over submanifolds is not uniform?
Since we are forcing each generator to learn a different submanifold, a uniform prior over the generators would result in a suboptimal distribution. This issue further shows the necessity of learning the prior over generators.
We are interested in finding the best prior p(c) over generators. Notice that q(c|x) is implicitly learning the probability of x ∈ X belonging to each generator Gc, hence q(c|x) is approximating the true posterior p(c|x). We can take an EM approach to learning the prior: the expected value of q(c|x) over the real data distribution gives us an approximation of p(c) (E step), which we can use to train the DMGAN model (M step). Instead of using an empirical average to learn p(c) directly, we learn it with a model r(c; ζ), which is a softmax function over parameters {ζi}_{i=1}^{ng} corresponding to each generator. This enables us to control the learning of p(c), the advantage of which we will discuss shortly. We train r(c) by minimizing the cross entropy as follows:

H(p(c), r(c)) = −E_{c~p(c)}[log r(c)] = −E_{x~pr(x), c~p(c|x)}[log r(c)] = E_{x~pr(x)}[H(p(c|x), r(c))]

where H(p(c|x), r(c)) is the cross entropy between the model distribution r(c) and the true posterior p(c|x), which we approximate by q(c|x). However, learning the prior from the start, when the generators are still mostly random, may prevent most generators from learning by vanishing their probability too early. To avoid this problem, we add an entropy regularizer and decay its weight λ'' with time to gradually shift the prior r(c) away from the uniform distribution.
Thus the final loss for training r(c) becomes:

Lprior = E_{x~pr(x)}[H(q(c|x), r(c))] − α^t λ'' H(r(c))    (4)

where H(r(c)) is the entropy of the model distribution, α is the decay rate, and t is the training timestep. The model is not very sensitive to λ'' and α; any combination that ensures a smooth transition away from the uniform distribution is valid. We call this augmented model Disconnected Manifold Learning GAN with Prior Learning (DMGAN-PL) in our experiments. See Figures 2 and 3 for examples showing the advantage of learning the prior.

3.2 Choice of Metrics

We require metrics that can assess inter-mode variation, intra-mode variation and sample quality. The common metric, Inception Score [23], has several drawbacks [4, 18], most notably it is indifferent to intra-class variations and favors generators that achieve a close to uniform distribution over classes of data. Instead, we consider more direct metrics together with the FID score [13] for natural images.
For inter mode variation, we use the Jensen Shannon Divergence (JSD) between the class distribution of a pre-trained classifier over real data and the generator's data. This can directly tell us how well the distribution over classes is captured. JSD is preferable to KL due to being bounded and symmetric.
For intra mode variation, we define mean square geodesic distance (MSD): the average squared geodesic distance between pairs of samples classified into each class. To compute the geodesic distance, Euclidean distance is used in a small neighborhood of each sample to construct the Isomap graph [26] over which a shortest path distance is calculated. This shortest path distance is an approximation to the geodesic distance on the true image manifold [25]. Note that average square distance, for Euclidean distance, is equal to twice the trace of the covariance matrix, i.e.
the sum of the eigenvalues of the covariance matrix, and therefore can be an indicator of the variance within each class:

E_{x,y}[||x − y||^2] = 2E_x[x^T x] − 2E_x[x]^T E_x[x] = 2Tr(Cov(x))

In our experiments, we choose the smallest k for which the constructed k nearest neighbors graph (Isomap) is connected in order to have a better approximation of the geodesic distance (k = 18).
Another concept we would like to evaluate is sample quality. Given a pretrained classifier with small test error, samples that are classified with high confidence can be reasonably considered good quality samples. We plot the ratio of samples classified with confidence greater than a threshold, versus the confidence threshold, as a measure of sample quality: the more off real-manifold samples, the lower the resulting curve. Note that the results from this plot are exclusively indicative of sample quality and should be considered in conjunction with the aforementioned metrics.
What if the generative model memorizes the dataset that it is trained on? Such a model would score perfectly on all our metrics, while providing no generalization at all. First, note that a single generator GAN model can not memorize the dataset because it can not learn a distribution supported on N disjoint components as discussed in Section 2. Second, while our modifications introduce disconnectedness to GANs, the number of generators we use in our proposed modifications is on the order of the number of data submanifolds, which is several orders of magnitude less than common dataset sizes. Note that if we were to assign one unique point of the Z space to each dataset sample, then a neural network could learn to memorize the dataset by mapping each selected z ∈ Z to its corresponding real sample (we have introduced N disjoint components in Z space in this case), however this is not how GANs are modeled.
Therefore, the memorization issue is not of concern for common GANs and our proposed models (note that this argument is addressing the very narrow case of dataset memorization, not over-fitting in general).

4 Related Works

Several recent works have directly targeted the mode collapse problem by introducing a network F : X → Z that is trained to map back the data into the latent space prior p(z). It can therefore provide a learning signal if the generated data has collapsed. ALI [8] and BiGAN [7] consider pairs of data and corresponding latent variable (x, z), and construct their discriminator to distinguish such pairs of real and generated data. VEEGAN [24] uses the same discriminator, but also adds an explicit reconstruction loss E_{z~p(z)}[||z − F_θ(G_γ(z))||^2_2]. The main advantage of these models is to prevent loss of information by the generator (mapping several z ∈ Z to a single x ∈ X). However, in case of distributions with disconnected support, these models do not provide much advantage over common GANs and suffer from the same issues we discussed in Section 2 due to having a single generator.
Another set of recent works have proposed using multiple generators in GANs in order to improve their convergence. MIX+GAN [3] proposes using a collection of generators based on the well-known advantage of learning a mixed strategy versus a pure strategy in game theory. MGAN [14] similarly uses a collection of k generators in order to model a mixture distribution, and trains them together with a k-class classifier to encourage them to each capture a different component of the real mixture distribution.
MAD-GAN [9] also uses k generators, together with a (k + 1)-class discriminator which is trained to correctly classify samples from each generator and true data (hence a k + 1 classifier), in order to increase the diversity of generated images.

Model      | JSD MNIST ×10^-2 | JSD Face-Bed ×10^-4 | FID Face-Bed
WGAN-GP    | 0.13 std 0.05    | 0.23 std 0.15       | 8.30 std 0.27
MIX+GAN    | 0.17 std 0.08    | 0.83 std 0.57       | 8.02 std 0.14
DMWGAN     | 0.23 std 0.06    | 0.46 std 0.25       | 7.96 std 0.08
DMWGAN-PL  | 0.06 std 0.02    | 0.10 std 0.05       | 7.67 std 0.16

Table 1: Inter-class variation measured by Jensen Shannon Divergence (JSD) with the true class distribution for MNIST and the Face-Bedroom dataset, and FID score for Face-Bedroom (smaller is better). We run each model 5 times with random initialization, and report average values with one standard deviation interval.

While these models provide reasons for why multiple generators can model mixture distributions and achieve more diversity, they do not address why single generator GANs fail to do so. In this work, we explain why it is the disconnectedness of the support that single generator GANs are unable to learn, not the fact that real data comes from a mixture distribution. Moreover, all of these works use a fixed number of generators and do not have any prior learning, which can cause serious problems in learning of distributions with disconnected support as we discussed in Section 3.1 (see Figures 2c and 3b for examples of this issue).
Finally, several works have targeted the problem of learning the correct manifold of data. MDGAN [5] uses a two step approach to closely capture the manifold of real data. They first approximate the data manifold by learning a transformation from encoded real images into real looking images, and then train a single generator GAN to generate images similar to the transformed encoded images of the previous step.
However, MDGAN can not model distributions with disconnected supports. InfoGAN [6] introduces auxiliary dimensions to the latent space Z, and maximizes the mutual information between these extra dimensions and generated images in order to learn disentangled representations in the latent space. DeLiGAN [12] uses a fixed mixture of Gaussians as its latent prior, and does not have any mechanisms to encourage diversity. While InfoGAN and DeLiGAN can generate disconnected manifolds, they both assume a fixed number of discrete components equal to the number of underlying classes and have no prior learning over these components, thus suffering from the issues discussed in Section 3.1. Also, neither of these works discusses the incapability of single generator GANs to learn disconnected manifolds and its consequences.

5 Experiments

In this section we present several experiments to investigate the issues and proposed solutions mentioned in Sections 2 and 3 respectively. The same network architecture is used for the discriminator and generator networks of all models under comparison, except we use 1/4 the number of filters in each layer of multi-generator models compared to the single generator models, to control the effect of complexity. In all experiments, we train each model for a total of 200 epochs with a five to one update ratio between discriminator and generator. Q, the encoder network, is built on top of the discriminator's last hidden layer, and is trained simultaneously with the generators. Each data batch is constructed by first selecting 32 generators according to the prior r(c; ζ), and then sampling each one using z ~ U(−1, 1). See Appendix B for details of our networks and the hyperparameters.
Disjoint line segments. This dataset is constructed by sampling data with uniform distribution over four disjoint line segments to achieve a distribution supported on a union of disjoint low-dimensional manifolds.
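A minimal sketch of constructing such a dataset is below. The 0.7/0.1 weights for the unbalanced variant follow the description in the text, but the segment endpoints are our own illustrative choice; the paper does not specify exact coordinates.

```python
import numpy as np

def sample_disjoint_segments(n, rng, probs=None):
    """Sample points uniformly from four disjoint 1-D segments in R^2.

    Endpoints below are hypothetical placements; `probs` enables the
    unbalanced variant where one segment carries more mass.
    """
    segments = np.array([
        [[-1.0,  1.0], [-0.5,  1.0]],   # top left
        [[ 0.5,  1.0], [ 1.0,  1.0]],   # top right
        [[-1.0, -1.0], [-0.5, -1.0]],   # bottom left
        [[ 0.5, -1.0], [ 1.0, -1.0]],   # bottom right
    ])
    idx = rng.choice(len(segments), size=n, p=probs)   # pick a segment per point
    t = rng.uniform(0.0, 1.0, size=(n, 1))             # position along the segment
    start, end = segments[idx, 0], segments[idx, 1]
    return start + t * (end - start), idx

rng = np.random.default_rng(0)
x, idx = sample_disjoint_segments(10000, rng)                      # balanced version
x_u, idx_u = sample_disjoint_segments(10000, rng,
                                      probs=[0.1, 0.7, 0.1, 0.1])  # unbalanced: 0.7 on top right
```

Since the four segments are pairwise disjoint, the support of this distribution is a union of disjoint connected manifolds, matching the assumption on Sr from Section 2.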
See Figure 2 for the results of experiments on this dataset. In Figure 3, an unbalanced\nversion of this dataset is used, where 0.7 probability is placed on the top right line segment, and the\nother segments have 0.1 probability each. The generator and discriminator are both MLPs with two\nhidden layers, and 10 generators are used for multi-generator models. We choose WGAN-GP as the\nstate of the art GAN model in these experiments (we observed similar or worse convergence with\nother \ufb02avors of single generator GANs). MGAN achieves similar results to DMWGAN.\nMNIST dataset. MNIST [16] is particularly suitable since samples with different class labels can\nbe reasonably interpreted as lying on disjoint manifolds (with minor exceptions like certain 4s and\n9s). The generator and discriminator are DCGAN like networks [22] with three convolution layers.\nFigure 4 shows the mean squared geodesic distance (MSD) and Table 1 reports the corresponding\n\n7\n\n\f(a) Intra-class variation MNIST\n\n(b) Sample quality MNIST\n\n(c) Sample quality Face-Bed\n\nFigure 4: (a) Shows intra-class variation in MNIST. Bars show the mean square distance (MSD)\nwithin each class of the dataset. On average, DMGAN-PL outperforms WGAN-GP in capturing\nintra class variation, as measured by MSD, with larger signi\ufb01cance on certain classes. (b) Shows the\nsample quality in MNIST experiment. (c) Shows sample quality in Face-Bed experiment. Notice\nhow DMWGAN-PL outperforms other models due to fewer off real-manifold samples. We run each\nmodel 5 times with random initialization, and report average values with one standard deviation\nintervals in both \ufb01gures. 10K samples are used for metric evaluations.\n\n(a) WGAN-GP\n\n(b) DMWGAN\n\n(c) DMWGAN-PL\n\nFigure 5: Samples randomly generated by GAN models trained on Face-Bed dataset. 
Notice how WGAN-GP generates combined face-bedroom images (red boxes) in addition to faces and bedrooms, due to learning a connected cover of the real data support. DMWGAN does not generate such samples, however it generates completely off manifold samples (red boxes) due to having redundant generators and a fixed prior. DMWGAN-PL is able to correctly learn the disconnected support of real data. The samples and trained models are not cherry picked.

divergences in order to compare their inter mode variation. 20 generators are used for multi-generator models. See Appendix C for experiments using the modified GAN objective. Results demonstrate the advantage of adding our proposed modification on both GAN and WGAN. See Appendix D for qualitative results.
Face-Bed dataset. We combine 20K face images from the CelebA dataset [17] and 20K bedroom images from the LSUN Bedrooms dataset [27] to construct a natural image dataset supported on a disconnected manifold. We center crop and resize images to 64 × 64. 5 generators are used for multi-generator

(a)   (c)   (b)   (d)

Figure 6: DMWGAN-PL prior learning during training on MNIST with 20 generators (a, b) and on Face-Bed with 5 generators (c, d). (a, c) show samples from top generators with prior greater than 0.05 and 0.2 respectively. (b, d) show the probability of selecting each generator r(c; ζ) during training, each color denotes a different generator. The color identifying each generator in (b) and the border color of each image in (a) are corresponding, similarly for (d) and (c).
Notice how prior learning has correctly learned the probability of selecting each generator and dropped redundant generators without any supervision.

models. Figures 4c, 5 and Table 1 show the results of this experiment. See Appendix E for more qualitative results.

6 Conclusion and Future Works

In this work we showed why single-generator GANs cannot correctly learn distributions supported on disconnected manifolds, what consequences this shortcoming has in practice, and how multi-generator GANs can effectively address these issues. Moreover, we showed the importance of learning a prior over the generators rather than using a fixed prior in multi-generator models. However, it is important to highlight that throughout this work we assumed the disconnectedness of the real data support. Verifying this assumption in major datasets, and studying the topological properties of these datasets in general, are interesting future works. Extending the prior learning to other methods, such as learning a prior over the shape of the Z space, and investigating the effects of adding diversity to the discriminator as well as the generators, also remain exciting future paths for research.

Acknowledgement

This work was supported by Verisk Analytics and NSF-USA award number 1409683.

References

[1] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[3] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs).
arXiv preprint arXiv:1703.00573, 2017.

[4] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.

[5] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.

[6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[7] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[9] Arnab Ghosh, Viveka Kulharia, Vinay Namboodiri, Philip H. S. Torr, and Puneet K. Dokania. Multi-agent diverse generative adversarial networks. arXiv preprint arXiv:1704.02906, 2017.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[12] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and V Babu Radhakrishnan. DeLiGAN: Generative adversarial networks for diverse and limited data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.

[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.

[14] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. MGAN: Training generative adversarial nets with multiple generators. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkmu5b0a-.

[15] John L. Kelley. General Topology. Courier Dover Publications, 2017.

[16] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[17] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[18] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. arXiv preprint arXiv:1711.10337, 2017.

[19] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406, 2018.

[20] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[21] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.

[22] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[23] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[24] Akash Srivastava, Lazar Valkoz, Chris Russell, Michael U. Gutmann, and Charles Sutton.
VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pages 3310–3320, 2017.

[25] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[26] Ming-Hsuan Yang. Extended Isomap for pattern classification. In AAAI/IAAI, pages 224–229, 2002.

[27] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
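Illustrative note. The sampling side of the prior learning discussed above (a categorical prior r(c; ζ) over a collection of generators, with redundant generators receiving vanishing probability) can be sketched in a few lines. The snippet below is a toy NumPy illustration with made-up stand-in "generators" that each cover one 2D line segment, loosely mimicking the disconnected line-segment dataset; it is not the paper's trained networks or training procedure, only a sketch of sampling from a mixture of generators under a learned prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(zeta):
    # r(c; zeta): categorical prior over generators, parameterized by logits zeta
    e = np.exp(zeta - zeta.max())
    return e / e.sum()

# Toy stand-ins for trained generators: each maps a latent z onto one 2D line
# segment (hypothetical, for illustration only; not the paper's networks).
segments = [((0.0, 0.0), (1.0, 0.0)), ((0.0, 1.0), (1.0, 1.0)), ((2.0, 2.0), (3.0, 3.0))]
generators = [
    (lambda z, a=np.array(a), b=np.array(b): a + (1.0 / (1.0 + np.exp(-z[0]))) * (b - a))
    for a, b in segments
]

def sample(zeta, n):
    """Sample n points from the mixture: c ~ r(c; zeta), z ~ N(0, I), x = G_c(z)."""
    r = softmax(zeta)
    cs = rng.choice(len(generators), size=n, p=r)
    xs = np.stack([generators[c](rng.standard_normal(2)) for c in cs])
    return xs, cs

# A learned prior can effectively "drop out" a redundant generator by driving
# its logit to a large negative value, so it is (almost) never selected:
zeta = np.array([2.0, 2.0, -10.0])
xs, cs = sample(zeta, 1000)
print(softmax(zeta))                 # third generator has near-zero prior mass
print(np.bincount(cs, minlength=3))  # third generator is essentially never picked
```

With a fixed uniform prior, every generator (including redundant ones) would be selected equally often, producing the off-manifold samples seen for DMWGAN in Figure 5; the learned prior concentrates mass on the generators that actually cover components of the data support.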