{"title": "Bayesian GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 3622, "page_last": 3631, "abstract": "Generative adversarial networks (GANs) can implicitly learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood. We present a practical Bayesian formulation for unsupervised and semi-supervised learning with GANs. Within this framework, we use stochastic gradient Hamiltonian Monte Carlo to marginalize the weights of the generator and discriminator networks. The resulting approach is straightforward and obtains good performance without any standard interventions such as feature matching or mini-batch discrimination. By exploring an expressive posterior over the parameters of the generator, the Bayesian GAN avoids mode-collapse, produces interpretable and diverse candidate samples, and provides state-of-the-art quantitative results for semi-supervised learning on benchmarks including SVHN, CelebA, and CIFAR-10, outperforming DCGAN, Wasserstein GANs, and DCGAN ensembles.", "full_text": "Bayesian GAN\n\nYunus Saatchi\nUber AI Labs\n\nAndrew Gordon Wilson\n\nCornell University\n\nAbstract\n\nGenerative adversarial networks (GANs) can implicitly learn rich distributions over\nimages, audio, and data which are hard to model with an explicit likelihood. We\npresent a practical Bayesian formulation for unsupervised and semi-supervised\nlearning with GANs. Within this framework, we use stochastic gradient Hamilto-\nnian Monte Carlo to marginalize the weights of the generator and discriminator\nnetworks. The resulting approach is straightforward and obtains good performance\nwithout any standard interventions such as label smoothing or mini-batch discrimi-\nnation. 
By exploring an expressive posterior over the parameters of the generator, the Bayesian GAN avoids mode-collapse, produces interpretable and diverse candidate samples, and provides state-of-the-art quantitative results for semi-supervised learning on benchmarks including SVHN, CelebA, and CIFAR-10, outperforming DCGAN, Wasserstein GANs, and DCGAN ensembles.

1 Introduction

Learning a good generative model for high-dimensional natural signals, such as images, video and audio, has long been one of the key milestones of machine learning. Powered by the learning capabilities of deep neural networks, generative adversarial networks (GANs) [4] and variational autoencoders [6] have brought the field closer to attaining this goal.

GANs transform white noise through a deep neural network to generate candidate samples from a data distribution. A discriminator learns, in a supervised manner, how to tune its parameters so as to correctly classify whether a given sample has come from the generator or the true data distribution. Meanwhile, the generator updates its parameters so as to fool the discriminator. As long as the generator has sufficient capacity, it can approximate the CDF inverse-CDF composition required to sample from a data distribution of interest. Since convolutional neural networks by design provide reasonable metrics over images (unlike, for instance, Gaussian likelihoods), GANs using convolutional neural networks can in turn provide a compelling implicit distribution over images.

Although GANs have been highly impactful, their learning objective can lead to mode collapse, where the generator simply memorizes a few training examples to fool the discriminator. This pathology is reminiscent of maximum likelihood density estimation with Gaussian mixtures: by collapsing the variance of each component we achieve infinite likelihood and memorize the dataset, which is not useful for a generalizable density estimate.
Moreover, a large degree of intervention is required to stabilize GAN training, including label smoothing and mini-batch discrimination [9, 10]. To help alleviate these practical difficulties, recent work has focused on replacing the Jensen-Shannon divergence implicit in standard GAN training with alternative metrics, such as f-divergences [8] or Wasserstein divergences [1]. Much of this work is analogous to introducing various regularizers for maximum likelihood density estimation. But just as it can be difficult to choose the right regularizer, it can also be difficult to decide which divergence we wish to use for GAN training.

It is our contention that GANs can be improved by fully probabilistic inference. Indeed, a posterior distribution over the parameters of the generator could be broad and highly multimodal. GAN training, which is based on mini-max optimization, always estimates this whole posterior distribution over the network weights as a point mass centred on a single mode. Thus even if the generator does not memorize training examples, we would expect samples from the generator to be overly compact relative to samples from the data distribution. Moreover, each mode in the posterior over the network weights could correspond to wildly different generators, each with their own meaningful interpretations. By fully representing the posterior distribution over the parameters of both the generator and discriminator, we can more accurately model the true data distribution. The inferred data distribution can then be used for accurate and highly data-efficient semi-supervised learning.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we propose a simple Bayesian formulation for end-to-end unsupervised and semi-supervised learning with generative adversarial networks.
Within this framework, we marginalize the posteriors over the weights of the generator and discriminator using stochastic gradient Hamiltonian Monte Carlo. We interpret data samples from the generator, showing exploration across several distinct modes in the generator weights. We show data- and iteration-efficient learning of the true distribution, and demonstrate state-of-the-art semi-supervised learning performance on several benchmarks, including SVHN, MNIST, CIFAR-10, and CelebA. The simplicity of the proposed approach is one of its greatest strengths: inference is straightforward, interpretable, and stable. Indeed, all of the experimental results were obtained without many of the ad-hoc techniques often used to train standard GANs. We have made code and tutorials available at https://github.com/andrewgordonwilson/bayesgan.

2 Bayesian GANs

Given a dataset D = {x^(i)} of variables x^(i) ∼ p_data(x^(i)), we wish to estimate p_data(x). We transform white noise z ∼ p(z) through a generator G(z; θ_g), parametrized by θ_g, to produce candidate samples from the data distribution. We use a discriminator D(x; θ_d), parametrized by θ_d, to output the probability that any x comes from the data distribution. Our considerations hold for general G and D, but in practice G and D are often neural networks with weight vectors θ_g and θ_d.

By placing distributions over θ_g and θ_d, we induce distributions over an uncountably infinite space of generators and discriminators, corresponding to every possible setting of these weight vectors. The generator now represents a distribution over distributions of data. Sampling from the induced prior distribution over data instances proceeds as follows:

(1) Sample θ_g ∼ p(θ_g); (2) Sample z^(1), . . .
, z^(n) ∼ p(z); (3) x̃^(j) = G(z^(j); θ_g) ∼ p_generator(x).

For posterior inference, we propose unsupervised and semi-supervised formulations in Sections 2.1 and 2.2. We note that in an exciting recent work, Tran et al. [11] briefly mention using a variational approach to marginalize weights in a generative model, as part of a general exposition on hierarchical implicit models (see also Karaletsos [5] for a nice theoretical exploration of related topics in graphical model message passing). While this related work is promising, our approach has several key differences: (1) our GAN representation is quite different, with novel formulations for the conditional posteriors, preserving a clear competition between generator and discriminator; (2) our representation for the posteriors is straightforward, provides novel formulations for unsupervised and semi-supervised learning, and has state-of-the-art results on many benchmarks, whereas Tran et al. [11] is pursued only for fully supervised learning on a few small datasets; (3) we use sampling to explore a full posterior over the weights, whereas Tran et al. [11] perform a variational approximation centred on one of the modes of the posterior (and, due to the properties of the KL divergence, is prone to an overly compact representation of even that mode); (4) we marginalize z in addition to θ_g, θ_d; and (5) the ratio estimation approach in [11] limits the size of the neural networks they can use, whereas in our experiments we can use comparably deep networks to maximum likelihood approaches.
In the experiments we illustrate the practical value of our formulation. Although the high-level concept of a Bayesian GAN has been informally mentioned in various contexts, to the best of our knowledge we present the first detailed treatment of Bayesian GANs, including novel formulations, sampling based inference, and rigorous semi-supervised learning experiments.

2.1 Unsupervised Learning

To infer posteriors over θ_g, θ_d, we propose to iteratively sample from the following conditional posteriors:

p(θ_g | z, θ_d) ∝ ( ∏_{i=1}^{n_g} D(G(z^(i); θ_g); θ_d) ) p(θ_g | α_g)    (1)

p(θ_d | z, X, θ_g) ∝ ∏_{i=1}^{n_d} D(x^(i); θ_d) × ∏_{i=1}^{n_g} (1 − D(G(z^(i); θ_g); θ_d)) × p(θ_d | α_d).    (2)

p(θ_g | α_g) and p(θ_d | α_d) are priors over the parameters of the generator and discriminator, with hyperparameters α_g and α_d, respectively. n_d and n_g are the numbers of mini-batch samples for the discriminator and generator, respectively.¹ We define X = {x^(i)}_{i=1}^{n_d}.

We can intuitively understand this formulation starting from the generative process for data samples. Suppose we were to sample weights θ_g from the prior p(θ_g | α_g), and then condition on this sample of the weights to form a particular generative neural network. We then sample white noise z from p(z), and transform this noise through the network G(z; θ_g) to generate candidate data samples. The discriminator, conditioned on its weights θ_d, outputs a probability that these candidate samples came from the data distribution. Eq. (1) says that if the discriminator outputs high probabilities, then the posterior p(θ_g | z, θ_d) will increase in a neighbourhood of the sampled setting of θ_g (and hence decrease for other settings).
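As a concrete illustration, the unnormalized log versions of Eqs. (1) and (2) can be written down for a toy linear generator and logistic discriminator with Gaussian priors. This is a minimal numpy sketch, not the paper's DCGAN setup; the network shapes and hyperparameters here are hypothetical.

```python
import numpy as np

def generator(z, theta_g):
    # Toy linear generator G(z; theta_g): maps noise into data space.
    return z @ theta_g

def discriminator(x, theta_d):
    # Toy logistic discriminator D(x; theta_d) in (0, 1).
    return 1.0 / (1.0 + np.exp(-(x @ theta_d)))

def log_post_g(theta_g, z, theta_d, alpha_g=10.0):
    # Unnormalized log of Eq. (1): sum_i log D(G(z_i)) + log N(theta_g | 0, alpha_g I).
    d = discriminator(generator(z, theta_g), theta_d)
    return np.sum(np.log(d + 1e-12)) - 0.5 * np.sum(theta_g ** 2) / alpha_g

def log_post_d(theta_d, z, X, theta_g, alpha_d=10.0):
    # Unnormalized log of Eq. (2): real data labelled 1, generated data labelled 0.
    d_real = discriminator(X, theta_d)
    d_fake = discriminator(generator(z, theta_g), theta_d)
    log_lik = np.sum(np.log(d_real + 1e-12)) + np.sum(np.log(1.0 - d_fake + 1e-12))
    return log_lik - 0.5 * np.sum(theta_d ** 2) / alpha_d

rng = np.random.default_rng(0)
theta_g = rng.normal(size=(2, 3))   # theta_g ~ p(theta_g | alpha_g)
theta_d = rng.normal(size=3)
z = rng.normal(size=(5, 2))         # n_g = 5 noise samples, z ~ p(z)
X = rng.normal(size=(5, 3))         # n_d = 5 data samples (stand-in minibatch)
print(log_post_g(theta_g, z, theta_d), log_post_d(theta_d, z, X, theta_g))
```

Iterating between sampling θ_g with Eq. (1) fixed at the current θ_d, and θ_d with Eq. (2) fixed at the current θ_g, preserves the adversarial game inside a posterior-sampling loop.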
For the posterior over the discriminator weights θ_d, the first two terms of Eq. (2) form a discriminative classification likelihood, labelling samples from the actual data versus the generator as belonging to separate classes. And the last term is the prior on θ_d.

Classical GANs as maximum likelihood.  Moreover, our proposed probabilistic approach is a natural Bayesian generalization of the classical GAN: if one uses uniform priors for θ_g and θ_d, and performs iterative MAP optimization instead of posterior sampling over Eq. (1) and (2), then the local optima will be the same as for Algorithm 1 of Goodfellow et al. [4]. We thus sometimes refer to the classical GAN as the ML-GAN. Moreover, even with a flat prior, there is a big difference between Bayesian marginalization over the whole posterior versus approximating this (often broad, multimodal) posterior with a point mass as in MAP optimization (see Figure 3, Supplement).

Marginalizing the noise.  In prior work, GAN updates are implicitly conditioned on a set of noise samples z. We can instead marginalize z from our posterior updates using simple Monte Carlo:

p(θ_g | θ_d) = ∫ p(θ_g, z | θ_d) dz = ∫ p(θ_g | z, θ_d) p(z | θ_d) dz ≈ (1/J_g) ∑_{j=1}^{J_g} p(θ_g | z^(j), θ_d),  z^(j) ∼ p(z),

where p(z | θ_d) = p(z) since z is independent of θ_d. By following a similar derivation, p(θ_d | θ_g) ≈ (1/J_d) ∑_{j=1}^{J_d} p(θ_d | z^(j), X, θ_g), z^(j) ∼ p(z).

This specific setup has several nice features for Monte Carlo integration. First, p(z) is a white noise distribution from which we can take efficient and exact samples. Secondly, both p(θ_g | z, θ_d) and p(θ_d | z, X, θ_g), when viewed as a function of z, should be reasonably broad over z by construction, since z is used to produce candidate data samples in the generative procedure.
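The simple Monte Carlo sum above can be computed stably in log space with a log-sum-exp. A minimal sketch, using a hypothetical toy stand-in for p(θ_g | z, θ_d) rather than the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_cond_post_g(theta_g, z, theta_d):
    # Hypothetical stand-in for log p(theta_g | z, theta_d), up to a constant:
    # log-likelihood of fooling a toy logistic discriminator, plus a N(0, I) prior.
    d = 1.0 / (1.0 + np.exp(-((z @ theta_g) @ theta_d)))
    return np.sum(np.log(d + 1e-12)) - 0.5 * np.sum(theta_g ** 2)

def mc_marginal_log_post_g(theta_g, theta_d, J_g=100, n_g=5, d_z=2):
    # log p(theta_g | theta_d) ~= log( (1/J_g) * sum_j p(theta_g | z^(j), theta_d) ),
    # with z^(j) ~ p(z); the average is computed via log-sum-exp for stability.
    logs = np.array([log_cond_post_g(theta_g, rng.normal(size=(n_g, d_z)), theta_d)
                     for _ in range(J_g)])
    m = logs.max()
    return m + np.log(np.mean(np.exp(logs - m)))

theta_g = rng.normal(size=(2, 3))
theta_d = rng.normal(size=3)
print(mc_marginal_log_post_g(theta_g, theta_d))
```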
Thus each term in the simple Monte Carlo sum typically makes a reasonable contribution to the total marginal posterior estimates. We do note, however, that the approximation will typically be worse for p(θ_d | θ_g) due to the conditioning on a minibatch of data in Equation 2.

Posterior samples.  By iteratively sampling from p(θ_g | θ_d) and p(θ_d | θ_g) at every step of an epoch one can, in the limit, obtain samples from the approximate posteriors over θ_g and θ_d. Having such samples can be very useful in practice. Indeed, one can use different samples for θ_g to alleviate GAN collapse and generate data samples with an appropriate level of entropy, as well as forming a committee of generators to strengthen the discriminator. The samples for θ_d in turn form a committee of discriminators which amplifies the overall adversarial signal, thereby further improving the unsupervised learning process. Arguably, the most rigorous method to assess the utility of these posterior samples is to examine their effect on semi-supervised learning, which is a focus of our experiments in Section 4.

¹ For mini-batches, one must make sure the likelihood and prior are scaled appropriately. See Supplement.

2.2 Semi-supervised Learning

We extend the proposed probabilistic GAN formalism to semi-supervised learning. In the semi-supervised setting for K-class classification, we have access to a set of n unlabelled observations, {x^(i)}, as well as a (typically much smaller) set of N_s labelled observations, {(x_s^(i), y_s^(i))}_{i=1}^{N_s}, with class labels y_s^(i) ∈ {1, . . . , K}. Our goal is to jointly learn statistical structure from both the unlabelled and labelled examples, in order to make much better predictions of class labels for new test examples x* than if we only had access to the labelled training inputs.

In this context, we redefine the discriminator such that D(x^(i) = y^(i); θ_d) gives the probability that sample x^(i) belongs to class y^(i). We reserve the class label 0 to indicate that a data sample is the output of the generator. We then infer the posterior over the weights as follows:

p(θ_g | z, θ_d) ∝ ( ∏_{i=1}^{n_g} ∑_{y=1}^{K} D(G(z^(i); θ_g) = y; θ_d) ) p(θ_g | α_g)    (3)

p(θ_d | z, X, y_s, θ_g) ∝ ∏_{i=1}^{n_d} ∑_{y=1}^{K} D(x^(i) = y; θ_d) × ∏_{i=1}^{n_g} D(G(z^(i); θ_g) = 0; θ_d) × ∏_{i=1}^{N_s} D(x_s^(i) = y_s^(i); θ_d) p(θ_d | α_d)    (4)

During every iteration we use n_g samples from the generator, n_d unlabeled samples, and all of the N_s labeled samples, where typically N_s ≪ n. As in Section 2.1, we can approximately marginalize z using simple Monte Carlo sampling.

Much like in the unsupervised learning case, we can marginalize the posteriors over θ_g and θ_d.
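The (K+1)-class discriminator likelihood in Eq. (4) combines three terms: unlabelled data assigned to any real class 1..K, generated data assigned to class 0, and labelled data assigned to their observed class. A minimal numpy sketch with a hypothetical linear softmax discriminator (the paper uses a DCGAN-style network):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def disc_probs(x, theta_d):
    # Toy linear (K+1)-way discriminator; column 0 is the "generated" class.
    return softmax(x @ theta_d)

def log_post_d_semisup(theta_d, x_unl, x_gen, x_lab, y_lab, alpha_d=10.0):
    # Unnormalized log of Eq. (4):
    #   unlabelled x: probability of belonging to any real class y = 1..K
    #   generated x:  probability of class 0
    #   labelled x:   probability of the observed class
    p_unl = disc_probs(x_unl, theta_d)[:, 1:].sum(axis=1)
    p_gen = disc_probs(x_gen, theta_d)[:, 0]
    p_lab = disc_probs(x_lab, theta_d)[np.arange(len(y_lab)), y_lab]
    log_prior = -0.5 * np.sum(theta_d ** 2) / alpha_d
    return (np.log(p_unl + 1e-12).sum() + np.log(p_gen + 1e-12).sum()
            + np.log(p_lab + 1e-12).sum() + log_prior)

rng = np.random.default_rng(0)
K, D = 10, 8
theta_d = rng.normal(size=(D, K + 1))
x_unl = rng.normal(size=(6, D))   # n_d unlabelled samples
x_gen = rng.normal(size=(4, D))   # n_g generated samples
x_lab = rng.normal(size=(3, D))   # N_s labelled samples
y_lab = np.array([1, 5, 9])       # labels in 1..K
print(log_post_d_semisup(theta_d, x_unl, x_gen, x_lab, y_lab))
```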
To compute the predictive distribution for a class label y* at a test input x*, we use a model average over all collected samples with respect to the posterior over θ_d:

p(y* | x*, D) = ∫ p(y* | x*, θ_d) p(θ_d | D) dθ_d ≈ (1/T) ∑_{k=1}^{T} p(y* | x*, θ_d^(k)),  θ_d^(k) ∼ p(θ_d | D).    (5)

We will see that this model average is effective for boosting semi-supervised learning performance. In Section 3 we present an approach to MCMC sampling from the posteriors over θ_g and θ_d.

3 Posterior Sampling with Stochastic Gradient HMC

In the Bayesian GAN, we wish to marginalize the posterior distributions over the generator and discriminator weights, for unsupervised learning in Section 2.1 and semi-supervised learning in Section 2.2. For this purpose, we use Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [3] for posterior sampling. The reason for this choice is three-fold: (1) SGHMC is very closely related to momentum-based SGD, which we know empirically works well for GAN training; (2) we can import parameter settings (such as learning rates and momentum terms) from SGD directly into SGHMC; and, most importantly, (3) many of the practical benefits of a Bayesian approach to GAN inference come from exploring a rich multimodal distribution over the weights θ_g of the generator, which is enabled by SGHMC. Alternatives, such as variational approximations, will typically centre their mass around a single mode, and thus provide a unimodal and otherwise compact representation for the distribution, due to asymmetric biases of the KL-divergence.

The posteriors in Equations 3 and 4 are both amenable to HMC techniques as we can compute the gradients of the loss with respect to the parameters we are sampling. SGHMC extends HMC to the case where we use noisy estimates of such gradients in a manner which guarantees mixing in the limit of a large number of minibatches.
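A single SGHMC update of the form used in Algorithm 1 (position update, then a friction-damped momentum update with gradient and injected noise) can be sketched as follows; the demo targets a standard normal rather than a GAN posterior, and the step sizes here are illustrative choices, not the paper's settings.

```python
import numpy as np

def sghmc_step(theta, v, grad_log_post, eta=1e-2, alpha=0.1, rng=None):
    # One SGHMC update (cf. Eq. 15 of Chen et al. [3]):
    #   theta <- theta + v
    #   v     <- (1 - alpha) * v + eta * grad_log_post(theta) + n,  n ~ N(0, 2*alpha*eta*I)
    # alpha is the friction term; the stochastic-gradient discretization noise is
    # assumed dominated by this friction (hence small step sizes).
    if rng is None:
        rng = np.random.default_rng()
    theta = theta + v
    noise = rng.normal(scale=np.sqrt(2.0 * alpha * eta), size=theta.shape)
    v = (1.0 - alpha) * v + eta * grad_log_post(theta) + noise
    return theta, v

# Demo: sample from N(0, 1), where grad log p(theta) = -theta.
rng = np.random.default_rng(0)
theta, v = np.zeros(1), np.zeros(1)
samples = []
for t in range(20000):
    theta, v = sghmc_step(theta, v, lambda th: -th, eta=1e-2, alpha=0.1, rng=rng)
    if t > 2000:  # discard burn-in
        samples.append(theta[0])
print(np.mean(samples), np.var(samples))
```

With the noise scaled as 2αη, the long-run samples should have mean near 0 and variance near 1 for this toy target.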
For a detailed review of SGHMC, please see Chen et al. [3]. Using the update rules from Eq. (15) in Chen et al. [3], we propose to sample from the posteriors over the generator and discriminator weights as in Algorithm 1. Note that Algorithm 1 outlines standard momentum-based SGHMC: in practice, we found it helpful to speed up the "burn-in" process by replacing the SGD part of this algorithm with Adam for the first few thousand iterations, after which we revert back to momentum-based SGHMC. As suggested in Appendix G of Chen et al. [3], we employed a learning rate schedule which decayed according to γ/d, where d is set to the number of unique "real" datapoints seen so far. Thus, our learning rate schedule converges to γ/N in the limit, where we have defined N = |D|.

Algorithm 1  One iteration of sampling for the Bayesian GAN. α is the friction term for SGHMC, η is the learning rate. We assume that the stochastic gradient discretization noise term β̂ is dominated by the main friction term (this assumption constrains us to use small step sizes). We take J_g and J_d simple MC samples for the generator and discriminator respectively, and M SGHMC samples for each simple MC sample. We rescale to accommodate minibatches as in the supplementary material.

• Represent posteriors with samples {θ_g^{j,m}}_{j=1,m=1}^{J_g,M} and {θ_d^{j,m}}_{j=1,m=1}^{J_d,M} from the previous iteration
for number of MC iterations J_g do
  • Sample J_g noise samples {z^(1), . . . , z^(J_g)} from the noise prior p(z). Each z^(i) has n_g samples.
  • Update the sample set representing p(θ_g | θ_d) by running SGHMC updates for M iterations:
      θ_g^{j,m} ← θ_g^{j,m} + v;  v ← (1 − α)v + η ∂ log( ∑_i ∑_k p(θ_g | z^(i), θ_d^{k,m}) ) / ∂θ_g + n;  n ∼ N(0, 2αηI)
  • Append θ_g^{j,m} to the sample set.
end for
for number of MC iterations J_d do
  • Sample a minibatch of J_d noise samples {z^(1), . . . , z^(J_d)} from the noise prior p(z).
  • Sample a minibatch of n_d data samples x.
  • Update the sample set representing p(θ_d | z, θ_g) by running SGHMC updates for M iterations:
      θ_d^{j,m} ← θ_d^{j,m} + v;  v ← (1 − α)v + η ∂ log( ∑_i ∑_k p(θ_d | z^(i), x, θ_g^{k,m}) ) / ∂θ_d + n;  n ∼ N(0, 2αηI)
  • Append θ_d^{j,m} to the sample set.
end for

4 Experiments

We evaluate our proposed Bayesian GAN (henceforth called BayesGAN) on five benchmarks (synthetic, MNIST, CIFAR-10, SVHN, and CelebA), each with four different numbers of labelled examples. We consider multiple alternatives, including: the DCGAN [9], the recent Wasserstein GAN (W-DCGAN) [1], an ensemble of ten DCGANs (DCGAN-10) which are formed by 10 random subsets 80% the size of the training set, and a fully supervised convolutional neural network. We also compare to the reported MNIST result for the LFVI-GAN, briefly mentioned in a recent work [11], where they use fully supervised modelling on the whole dataset with a variational approximation.
We interpret many of the results from MNIST in detail in Section 4.2, and find that these observations carry forward to our CIFAR-10, SVHN, and CelebA experiments.

For all real experiments except MNIST we use a 6-layer Bayesian deconvolutional GAN (BayesGAN) for the generative model G(z; θ_g) (see Radford et al. [9] for further details about structure). The corresponding discriminator is a 6-layer 2-class DCGAN for the unsupervised GAN and a 6-layer, K + 1 class DCGAN for a semi-supervised GAN performing classification over K classes. The connectivity structure of the unsupervised and semi-supervised DCGANs was the same as for the BayesGAN. As recommended by [10], we used feature matching for all models on semi-supervised experiments. For MNIST we found that 4 layers for all networks worked slightly better across the board, due to the added simplicity of the dataset. Note that the structure of the networks in Radford et al. [9] is slightly different from [10] (e.g. there are 4 hidden layers and fewer filters per layer), so one cannot directly compare the results here with those in Salimans et al. [10]. Our exact architecture specification is also given in our codebase. The performance of all methods could be improved through further calibrating architecture design for each individual benchmark. For the Bayesian GAN we place an N(0, 10I) prior on both the generator and discriminator weights and approximately integrate out z using simple Monte Carlo samples. We run Algorithm 1 for 5000 iterations and then collect weight samples every 1000 iterations and record out-of-sample predictive accuracy using Bayesian model averaging (see Eq. 5). For Algorithm 1 we set J_g = 10, J_d = 1, M = 2, and n_d = n_g = 64.
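The Bayesian model averaging of Eq. (5) simply averages the predicted class probabilities over the collected posterior samples of θ_d. A minimal sketch with a hypothetical linear softmax classifier standing in for the discriminator:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bma_predict(x, theta_samples):
    # Eq. (5): p(y* | x*, D) ~= (1/T) * sum_k p(y* | x*, theta_d^(k)),
    # averaging class probabilities over T posterior samples of theta_d.
    return np.mean([softmax(x @ th) for th in theta_samples], axis=0)

rng = np.random.default_rng(0)
T, D, K = 8, 5, 10
theta_samples = [rng.normal(size=(D, K + 1)) for _ in range(T)]  # stand-in posterior samples
x = rng.normal(size=(3, D))                                      # three test inputs
p = bma_predict(x, theta_samples)
print(p.shape, p.sum(axis=1))
```

Averaging probabilities (rather than logits or hard votes) keeps the result a valid distribution over the K+1 classes.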
All experiments were performed on a single TitanX GPU for consistency, but BayesGAN and DCGAN-10 could be sped up to approximately the same runtime as DCGAN through multi-GPU parallelization.

In Table 1 we summarize the semi-supervised results, where we see consistently improved performance over the alternatives. All runs are averaged over 10 random subsets of labeled examples for estimating error bars on performance (Table 1 shows mean and 2 standard deviations). We also qualitatively illustrate the ability of the Bayesian GAN to produce complementary sets of data samples, corresponding to different representations of the generator produced by sampling from the posterior over the generator weights (Figures 1, 2, 5). The supplement also contains additional plots of accuracy per epoch for the semi-supervised experiments.

4.1 Synthetic Dataset

We present experiments on a multi-modal synthetic dataset to test the ability to infer a multi-modal posterior p(θ_g | D). This ability not only helps avoid the collapse of the generator to a couple of training examples, an instance of overfitting in regular GAN training, but also allows one to explore a set of generators with different complementary properties, harmonizing together to encapsulate a rich data distribution. We generate D-dimensional synthetic data as follows:

z ∼ N(0, 10 · I_d),  A ∼ N(0, I_{D×d}),  x = Az + ε,  ε ∼ N(0, 0.01 · I_D),  d ≪ D

We then fit both a regular GAN and a Bayesian GAN to such a dataset with D = 100 and d = 2. The generator for both models is a two-layer neural network: 10-1000-100, fully connected, with ReLU activations. We set the dimensionality of z to be 10 in order for the DCGAN to converge (it does not converge when d = 2, despite the inherent dimensionality being 2!). Consistently, the discriminator network has the following structure: 100-1000-1, fully connected, with ReLU activations.
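The synthetic data generation above is a few lines of numpy; the sketch below follows the stated distributions (the function name and defaults are our own, not from the paper's codebase).

```python
import numpy as np

def make_synthetic(n, D=100, d=2, rng=None):
    # Section 4.1 generative process:
    #   z ~ N(0, 10 * I_d),  A ~ N(0, I_{D x d}),  x = A z + eps,  eps ~ N(0, 0.01 * I_D)
    if rng is None:
        rng = np.random.default_rng()
    A = rng.normal(size=(D, d))
    z = rng.normal(scale=np.sqrt(10.0), size=(n, d))
    eps = rng.normal(scale=np.sqrt(0.01), size=(n, D))
    return z @ A.T + eps

X = make_synthetic(1000, D=100, d=2, rng=np.random.default_rng(0))
print(X.shape)
```

Because d ≪ D and the noise variance is small, the data lie near a d-dimensional subspace, which is why two principal components suffice for the visualizations in Figure 1.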
For this dataset we place an N(0, I) prior on the weights of the Bayesian GAN and approximately integrate out z using J = 100 Monte Carlo samples. Figure 1 shows that the Bayesian GAN does a much better job qualitatively in generating samples (for which we show the first two principal components), and quantitatively in terms of Jensen-Shannon divergence (JSD) to the true distribution (determined through kernel density estimates). In fact, the DCGAN (labelled ML GAN as per Section 2) eventually begins to increase in test JSD after a certain number of training iterations, which is reminiscent of over-fitting. When D = 500, we still see good performance with the Bayesian GAN. We also see, with multidimensional scaling [2], that samples from the posterior over Bayesian generator weights clearly form multiple distinct clusters, indicating that the SGHMC sampling is exploring multiple distinct modes, thus capturing multimodality in weight space as well as in data space.

4.2 MNIST

MNIST is a well-understood benchmark dataset consisting of 60k (50k train, 10k test) labeled images of hand-written digits. Salimans et al. [10] showed excellent out-of-sample performance using only a small number of labeled inputs, convincingly demonstrating the importance of good generative modelling for semi-supervised learning. Here, we follow their experimental setup for MNIST. We evaluate the Bayesian DCGAN for semi-supervised learning using N_s = {20, 50, 100, 200} labelled training examples. We see in Table 1 that the Bayesian GAN has improved accuracy over the DCGAN, the Wasserstein GAN, and even an ensemble of 10 DCGANs! Moreover, it is remarkable that the Bayesian GAN with only 100 labelled training examples (0.2% of the training data) is able to achieve 99.3% testing accuracy, which is comparable with a state-of-the-art fully supervised method using all 50,000 training examples!
We show a fully supervised model using N_s samples to generally highlight the practical utility of semi-supervised learning.

Moreover, Tran et al. [11] showed that a fully supervised LFVI-GAN, on the whole MNIST training set (50,000 labelled examples), produces 140 classification errors – almost twice the error of our proposed Bayesian GAN approach using only N_s = 100 (0.2%) labelled examples! We suspect this difference largely comes from (1) the simple practical formulation of the Bayesian GAN in Section 2, (2) marginalizing z via simple Monte Carlo, and (3) exploring a broad multimodal posterior distribution over the generator weights with SGHMC with our approach, versus a variational approximation (prone to over-compact representations) centred on a single mode.

Figure 1: Left: Samples drawn from p_data(x) and visualized in 2-D after applying PCA. Right 2 columns: Samples drawn from p_MLGAN(x) and p_BGAN(x) visualized in 2-D after applying PCA. The data is inherently 2-dimensional, so PCA can explain most of the variance using 2 principal components. It is clear that the Bayesian GAN is capturing all the modes in the data whereas the regular GAN is unable to do so. Right: (Top 2) Jensen-Shannon divergence between p_data(x) and p(x; θ) as a function of the number of iterations of GAN training for D = 100 (top) and D = 500 (bottom). The divergence is computed using kernel density estimates of large sample datasets drawn from p_data(x) and p(x; θ), after applying dimensionality reduction to 2-D (the inherent dimensionality of the data). In both cases, the Bayesian GAN is far more effective at minimizing the Jensen-Shannon divergence, reaching convergence towards the true distribution, by exploring a full distribution over generator weights, which is not possible with a maximum likelihood GAN (no matter how many iterations). (Bottom) The sample set {θ_g^k} after convergence viewed in 2-D using Multidimensional Scaling (using a Euclidean distance metric between weight samples) [2]. One can clearly see several clusters, meaning that the SGHMC sampling has discovered pronounced modes in the posterior over the weights.

We can also see qualitative differences in the unsupervised data samples from our Bayesian DCGAN and the standard DCGAN in Figure 2. The top row shows sample images produced from six generators corresponding to six samples from the posterior over the generator weights, and the bottom row shows sample data images from a DCGAN. We can see that each of the six panels in the top row has qualitative differences, almost as if a different person were writing the digits in each panel. Panel 1 (top left), for example, is quite crisp, while panel 3 is fairly thick, and panel 6 (top right) has thin and fainter strokes. In other words, the Bayesian GAN is learning different complementary generative hypotheses to explain the data. By contrast, all of the data samples on the bottom row from the DCGAN are homogeneous. In effect, each posterior weight sample in the Bayesian GAN corresponds to a different style, while in the standard DCGAN the style is fixed. This difference is further illustrated for all datasets in Figure 5 (supplement). Figure 3 (supplement) also further emphasizes the utility of Bayesian marginalization versus optimization, even with vague priors. However, we do not necessarily expect high-fidelity images from any arbitrary generator sampled from the posterior over generators; in fact, such a generator would probably have less posterior probability than the DCGAN, which we show in Section 2 can be viewed as a maximum likelihood analogue of our approach.
The advantage in the Bayesian approach comes from representing a whole space of generators alongside their posterior probabilities.
Practically speaking, we also stress that for reasonable sample generation from the maximum-likelihood DCGAN we had to resort to tricks including minibatch discrimination, feature normalization and the addition of Gaussian noise to each layer of the discriminator. The Bayesian DCGAN needed none of these tricks. This robustness arises from the Gaussian prior over the weights, which provides a useful inductive bias, and from the MCMC sampling procedure, which alleviates the risk of collapse and helps explore multiple modes (and uncertainty within each mode).

Table 1: Detailed supervised and semi-supervised learning results for all datasets. In almost all experiments BayesGAN substantially outperforms DCGAN and W-DCGAN, and typically even outperforms ensembles of DCGANs. The runtimes, per epoch, in minutes, are provided in the rows containing the dataset name. While all experiments were performed on a single GPU, note that the DCGAN-10 and BayesGAN methods can be sped up straightforwardly using multiple GPUs to obtain a similar runtime to DCGAN. Note also that BayesGAN is generally much more efficient per epoch than the alternatives, as per Figure 4 (supplement). Results are averaged over 10 random supervised subsets ± 2 stdev. Standard train/test splits are used for MNIST, CIFAR-10 and SVHN. For CelebA we use a test set of size 10k. Test error rates are across the entire test set. No. of misclassifications for MNIST; test error rate (%) for the others.

MNIST (N=50k, D=(28,28)); runtime per epoch (min): DCGAN 19, W-DCGAN 16, DCGAN-10 112, BayesGAN 39
Ns      Supervised    DCGAN         W-DCGAN       DCGAN-10      BayesGAN
20      –             1623 ± 325    1618 ± 388    1453 ± 532    1402 ± 422
50      –             412 ± 199     432 ± 187     329 ± 139     321 ± 194
100     2134 ± 525    134 ± 28      121 ± 18      102 ± 11      98 ± 13
200     1389 ± 438    91 ± 10       95 ± 7        88 ± 6        82 ± 5

CIFAR-10 (N=50k, D=(32,32,3)); runtime per epoch (min): DCGAN 38, W-DCGAN 34, DCGAN-10 217, BayesGAN 102
Ns      Supervised    DCGAN         W-DCGAN       DCGAN-10      BayesGAN
1000    63.4 ± 2.6    46.1 ± 3.6    48.6 ± 3.4    39.6 ± 2.8    41.3 ± 5.1
2000    56.1 ± 2.1    35.8 ± 3.8    34.1 ± 4.1    32.4 ± 2.9    31.4 ± 3.6
4000    51.4 ± 2.9    31.1 ± 4.7    30.8 ± 4.6    27.4 ± 3.2    25.9 ± 3.7
8000    47.2 ± 2.2    24.4 ± 5.5    25.1 ± 3.3    22.6 ± 2.2    23.1 ± 3.9

SVHN (N=75k, D=(32,32,3)); runtime per epoch (min): DCGAN 34, W-DCGAN 31, DCGAN-10 286, BayesGAN 107
Ns      Supervised    DCGAN         W-DCGAN       DCGAN-10      BayesGAN
500     53.5 ± 2.5    36.1 ± 4.2    38.2 ± 3.1    31.8 ± 4.1    32.8 ± 4.4
1000    37.3 ± 3.1    22.1 ± 4.8    23.6 ± 4.6    19.8 ± 2.1    21.9 ± 3.5
2000    26.3 ± 2.1    21.0 ± 1.3    21.2 ± 3.1    17.1 ± 2.3    16.3 ± 2.4
4000    20.8 ± 1.8    17.1 ± 1.2    18.2 ± 1.7    13.0 ± 1.9    12.7 ± 1.4

CelebA (N=100k, D=(50,50,3)); runtime per epoch (min): DCGAN 117, W-DCGAN 109, DCGAN-10 767, BayesGAN 387
Ns      Supervised    DCGAN         W-DCGAN       DCGAN-10      BayesGAN
1000    53.8 ± 4.2    45.5 ± 5.9    48.1 ± 4.8    43.3 ± 5.3    42.4 ± 6.7
2000    36.7 ± 3.2    30.1 ± 3.3    31.1 ± 3.2    28.2 ± 1.3    26.8 ± 4.2
4000    34.3 ± 3.8    26.0 ± 2.1    28.3 ± 3.2    21.3 ± 1.2    22.6 ± 3.7
8000    31.1 ± 4.2    21.0 ± 1.9    22.5 ± 1.5    20.1 ± 1.4    19.4 ± 3.4
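The BayesGAN results above are produced by Bayesian model averaging: class probabilities are averaged over posterior weight samples rather than taken from a single optimized network. A minimal sketch of this averaging step, with hypothetical softmax outputs standing in for actual discriminator forward passes:

```python
import numpy as np

def bayes_predict(probs_per_sample):
    """Average class probabilities over K posterior weight samples
    (equally weighted, as for samples drawn by SGHMC)."""
    return np.mean(probs_per_sample, axis=0)

# Hypothetical softmax outputs from K=3 posterior samples of the
# discriminator, for one input, over 4 classes.
p = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
])

avg = bayes_predict(p)        # averaged class probabilities
pred = int(np.argmax(avg))    # predicted class label
```

Note that the averaged prediction follows the majority of the posterior samples even though one sample favours a different class; marginalizing over weight samples in this way smooths out the idiosyncrasies of any single generator or discriminator.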
To be balanced, we also stress that in practice the risk of collapse is not fully eliminated – indeed, some samples from p(θg|D) still produce generators that create data samples with too little entropy. Sampling is also not immune to becoming trapped in sharply peaked modes. We leave further analysis for future work.

Figure 2: Top: Data samples from six different generators corresponding to six samples from the posterior over θg. The data samples show that each explored setting of the weights θg produces generators with complementary high-fidelity samples, corresponding to different styles. The amount of variety in the samples emerges naturally using the Bayesian approach. Bottom: Data samples from a standard DCGAN (trained six times). By contrast, these samples are homogeneous in style.

4.3 CIFAR-10

CIFAR-10 is also a popular benchmark dataset [7], with 50k training and 10k test images, which is harder to model than MNIST since the data are 32×32 RGB images of real objects. Figure 5 shows datasets produced from four different generators corresponding to samples from the posterior over the generator weights. As with MNIST, we see meaningful qualitative variation between the panels. In Table 1 we also see again (but on this more challenging dataset) that using Bayesian GANs as a generative model within the semi-supervised learning setup significantly decreases test set error over the alternatives, especially when ns ≪ n.

4.4 SVHN

The StreetView House Numbers (SVHN) dataset consists of RGB images of house numbers taken by StreetView vehicles. Unlike MNIST, the digits significantly differ in shape and appearance. The experimental procedure closely followed that for CIFAR-10. There are approximately 75k training and 25k test images. We see in Table 1 a particularly pronounced difference in performance between BayesGAN and the alternatives.
Data samples are shown in Figure 5.

4.5 CelebA

The large CelebA dataset contains 120k celebrity faces amongst a variety of backgrounds (100k training, 20k test images). To reduce background variations we used a standard face detector [12] to crop the faces to a standard 50 × 50 size. Figure 5 shows data samples from the trained Bayesian GAN. In order to assess performance for semi-supervised learning, we created a 32-class classification task by predicting a 5-bit vector indicating whether or not the face: is blond, has glasses, is male, is pale and is young. Table 1 shows the same pattern of promising performance for CelebA.

5 Discussion

By exploring rich multimodal distributions over the weight parameters of the generator, the Bayesian GAN can capture a diverse set of complementary and interpretable representations of data. We have shown that such representations can enable state-of-the-art performance on semi-supervised problems, using a simple inference procedure.
Effective semi-supervised learning of natural high-dimensional data is crucial for reducing the dependency of deep learning on large labelled datasets. Often labelling data is not an option, or it comes at a high cost – be it through human labour or through expensive instrumentation (such as LIDAR for autonomous driving). Moreover, semi-supervised learning provides a practical and quantifiable mechanism to benchmark the many recent advances in unsupervised learning.
Although we use MCMC, in recent years variational approximations have been favoured for inference in Bayesian neural networks. However, the likelihood of a deep neural network can be broad, with many shallow local optima. This is exactly the type of density which is amenable to a sampling-based approach that can explore a full posterior.
Variational methods, by contrast, typically centre their approximation on a single mode and also provide an overly compact representation of that mode. In the future, therefore, we may see broad advantages in following a sampling-based approach in Bayesian deep learning. Aside from sampling, one could try to better accommodate the likelihood functions common to deep learning by using more general divergence measures (for example, based on the family of α-divergences) in place of the KL divergence in variational methods, alongside more flexible proposal distributions.
One could also estimate the marginal likelihood of a probabilistic GAN, having integrated away distributions over the parameters. The marginal likelihood provides a natural utility function for automatically learning hyperparameters, and for performing principled, quantifiable model comparison between different GAN architectures. It would also be interesting to consider the Bayesian GAN in conjunction with a non-parametric Bayesian deep learning framework, such as deep kernel learning [13, 14]. We hope that our work will help inspire continued exploration into Bayesian deep learning.

Acknowledgements We thank Pavel Izmailov and Ben Athiwaratkun for helping to create a tutorial for the codebase, and for helpful comments and validation. We also thank Soumith Chintala for helpful advice. We thank NSF IIS-1563887 for support.

References

[1] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

[2] Borg, I. and Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Springer Science & Business Media.

[3] Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In Proc. International Conference on Machine Learning.

[4] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014).
Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

[5] Karaletsos, T. (2016). Adversarial message passing for graphical models. arXiv preprint arXiv:1612.05048.

[6] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[7] Krizhevsky, A., Nair, V., and Hinton, G. (2010). CIFAR-10 (Canadian Institute for Advanced Research).

[8] Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279.

[9] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

[10] Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training GANs. arXiv preprint arXiv:1606.03498.

[11] Tran, D., Ranganath, R., and Blei, D. (2017). Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, pages 5529–5539.

[12] Viola, P. and Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154.

[13] Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. (2016a). Deep kernel learning. In Artificial Intelligence and Statistics.

[14] Wilson, A. G., Hu, Z., Salakhutdinov, R. R., and Xing, E. P. (2016b). Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594.