{"title": "Triple Generative Adversarial Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 4088, "page_last": 4098, "abstract": "Generative Adversarial Nets (GANs) have shown promise in image generation and semi-supervised learning (SSL). However, existing GANs in SSL have two problems: (1) the generator and the discriminator (i.e. the classifier) may not be optimal at the same time; and (2) the generator cannot control the semantics of the generated samples. The problems essentially arise from the two-player formulation, where a single discriminator shares incompatible roles of identifying fake samples and predicting labels and it only estimates the data without considering the labels. To address the problems, we present triple generative adversarial net (Triple-GAN), which consists of three players---a generator, a discriminator and a classifier. The generator and the classifier characterize the conditional distributions between images and labels, and the discriminator solely focuses on identifying fake image-label pairs. We design compatible utilities to ensure that the distributions characterized by the classifier and the generator both converge to the data distribution. Our results on various datasets demonstrate that Triple-GAN as a unified model can simultaneously (1) achieve the state-of-the-art classification results among deep generative models, and (2) disentangle the classes and styles of the input and transfer smoothly in the data space via interpolation in the latent space class-conditionally.", "full_text": "Triple Generative Adversarial Nets\n\nChongxuan Li, Kun Xu, Jun Zhu\u2217, Bo Zhang\n\nDept. of Comp. Sci. & Tech., TNList Lab, State Key Lab of Intell. Tech. 
& Sys.,\n\nCenter for Bio-Inspired Computing Research, Tsinghua University, Beijing, 100084, China\n\n{licx14, xu-k16}@mails.tsinghua.edu.cn, {dcszj, dcszb}@mail.tsinghua.edu.cn\n\nAbstract\n\nGenerative Adversarial Nets (GANs) have shown promise in image generation\nand semi-supervised learning (SSL). However, existing GANs in SSL have two\nproblems: (1) the generator and the discriminator (i.e.\nthe classi\ufb01er) may not\nbe optimal at the same time; and (2) the generator cannot control the semantics\nof the generated samples. The problems essentially arise from the two-player\nformulation, where a single discriminator shares incompatible roles of identifying\nfake samples and predicting labels and it only estimates the data without considering\nthe labels. To address the problems, we present triple generative adversarial\nnet (Triple-GAN), which consists of three players\u2014a generator, a discriminator\nand a classi\ufb01er. The generator and the classi\ufb01er characterize the conditional\ndistributions between images and labels, and the discriminator solely focuses on\nidentifying fake image-label pairs. We design compatible utilities to ensure that\nthe distributions characterized by the classi\ufb01er and the generator both converge to\nthe data distribution. Our results on various datasets demonstrate that Triple-GAN\nas a uni\ufb01ed model can simultaneously (1) achieve the state-of-the-art classi\ufb01cation\nresults among deep generative models, and (2) disentangle the classes and styles\nof the input and transfer smoothly in the data space via interpolation in the latent\nspace class-conditionally.\n\n1\n\nIntroduction\n\nDeep generative models (DGMs) can capture the underlying distributions of the data and synthesize\nnew samples. Recently, signi\ufb01cant progress has been made on generating realistic images based on\nGenerative Adversarial Nets (GANs) [7, 3, 22]. 
GAN is formulated as a two-player game, where the generator G takes a random noise z as input and produces a sample G(z) in the data space, while the discriminator D identifies whether a certain sample comes from the true data distribution p(x) or from the generator. Both G and D are parameterized as deep neural networks, and the training procedure is to solve a minimax problem:

min_G max_D U(D, G) = E_{x~p(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))],

where p_z(z) is a simple distribution (e.g., uniform or normal) and U(·) denotes the utilities. Given a generator and the distribution pg it defines, the optimal discriminator is D(x) = p(x)/(pg(x) + p(x)) in the nonparametric setting, and the global equilibrium of this game is achieved if and only if pg(x) = p(x) [7], which is desired in terms of image generation.
GANs and DGMs in general have also proven effective in semi-supervised learning (SSL) [11], while retaining the generative capability. Under the same two-player game framework, Cat-GAN [26] generalizes GANs with a categorical discriminative network and an objective function that minimizes the conditional entropy of the predictions given the real data while maximizing the conditional entropy

*J. Zhu is the corresponding author.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: An illustration of Triple-GAN (best viewed in color). The utilities of D, C and G are colored in blue, green and yellow respectively, with "R" denoting rejection, "A" denoting acceptance and "CE" denoting the cross-entropy loss for supervised learning. 
"A"s and "R"s are the adversarial losses and "CE"s are unbiased regularizations that ensure the consistency between pg, pc and p, which are the distributions defined by the generator, the classifier and the true data generating process, respectively.

of the predictions given the generated samples. Odena [20] and Salimans et al. [25] augment the categorical discriminator with one more class, corresponding to the fake data generated by the generator. There are two main problems in existing GANs for SSL: (1) the generator and the discriminator (i.e., the classifier) may not be optimal at the same time [25]; and (2) the generator cannot control the semantics of the generated samples.
For the first problem, as an instance, Salimans et al. [25] propose two alternative training objectives that work well for either classification or image generation in SSL, but not both. The feature matching objective works well in classification but fails to generate indistinguishable samples (See Sec. 5.2 for examples), while the minibatch discrimination objective is good at realistic image generation but cannot predict labels accurately. These phenomena are not analyzed in depth in [25]; here we argue that they essentially arise from the two-player formulation, in which a single discriminator has to play two incompatible roles: identifying fake samples and predicting labels. Specifically, assume that G is optimal, i.e., p(x) = pg(x), and consider a sample x ~ pg(x). On one hand, as a discriminator, the optimal D should identify x as a fake sample with non-zero probability (See [7] for the proof). On the other hand, as a classifier, the optimal D should always predict the correct class of x confidently since x ~ p(x). The conflict arises because D has two incompatible convergence points, which indicates that G and D may not be optimal at the same time. 
Moreover, the issue remains even given an imperfect G, as long as pg(x) and p(x) overlap, as in most real cases. Given a sample from the overlapping region, the two roles of D still compete by treating the sample differently, leading to a poor classifier2. Namely, the learning capacity of existing two-player models is restricted, which should be addressed to advance current SSL results.
For the second problem, disentangling meaningful physical factors like the object category from the latent representations with limited supervision is of general interest [30, 2]. However, to the best of our knowledge, none of the existing GANs can learn disentangled representations in SSL, though some work [22, 5, 21] can learn such representations given full labels. Again, we believe that the problem is caused by their two-player formulation. Specifically, the discriminators in [26, 25] take a single data point instead of a data-label pair as input, and the label information is totally ignored when judging whether a sample is real or fake. Therefore, the generators will not receive any learning signal regarding the label information from the discriminators, and hence such models cannot control the semantics of the generated samples, which is not satisfactory.
To address these problems, we present Triple-GAN, a flexible game-theoretical framework for both classification and class-conditional image generation in SSL, where we have a partially labeled dataset. We introduce two conditional networks, a classifier and a generator, to generate pseudo labels given real data and pseudo data given real labels, respectively. To jointly justify the quality of the samples from the conditional networks, we define a single discriminator network which has the sole role of distinguishing whether a data-label pair is from the real labeled dataset or not. 
The resulting model is called Triple-GAN because not only are there three networks, but we also consider three joint distributions, i.e., the true data-label distribution and the distributions defined by the conditional networks (See Figure 1 for an illustration of Triple-GAN). Directly motivated by the desirable equilibrium in which both the classifier and the conditional generator are optimal, we carefully design compatible utilities including adversarial losses and unbiased regularizations (See Sec. 3), which lead to an effective solution to the challenging SSL task, justified both in theory and practice.

2The results of the minibatch discrimination approach in [25] well support our analysis.

In particular, theoretically, instead of competing as stated in the first problem, a good classifier will result in a good generator and vice versa in Triple-GAN (See Sec. 3.2 for the proof). Furthermore, the discriminator can access the label information of the unlabeled data from the classifier and then force the generator to generate correct image-label pairs, which addresses the second problem. Empirically, we evaluate our model on the widely adopted MNIST [14], SVHN [19] and CIFAR10 [12] datasets. The results (See Sec. 
5) demonstrate that Triple-GAN can simultaneously learn a good classifier and a good conditional generator, which agrees with our motivation and theoretical results.
Overall, our main contributions are twofold: (1) we analyze the problems in existing SSL GANs [26, 25] and propose a novel game-theoretical Triple-GAN framework to address them with carefully designed compatible objectives; and (2) we show that on three datasets with incomplete labels, Triple-GAN can advance the state-of-the-art classification results of DGMs substantially and, at the same time, disentangle classes and styles and perform class-conditional interpolation.

2 Related Work

Recently, various approaches have been developed to learn directed DGMs, including Variational Autoencoders (VAEs) [10, 24], Generative Moment Matching Networks (GMMNs) [16, 6] and Generative Adversarial Nets (GANs) [7]. These criteria are systematically compared in [28].
One primary goal of DGMs is to generate realistic samples, for which GANs have proven effective. Specifically, LAP-GAN [3] leverages a series of GANs to upscale the generated samples to high-resolution images through the Laplacian pyramid framework [1]. DCGAN [22] adopts (fractionally) strided convolution layers and batch normalization [8] in GANs and generates realistic natural images.
Recent work has introduced inference networks in GANs. For instance, InfoGAN [2] learns explainable latent codes from unlabeled data by regularizing the original GANs via variational mutual information maximization. In ALI [5, 4], the inference network approximates the posterior distribution of latent variables given true data in an unsupervised manner. 
Triple-GAN also has an inference network (the classifier) as in ALI, but there are two important differences between them in the global equilibria and utilities: (1) Triple-GAN matches both the distributions defined by the generator and by the classifier to the true data distribution, while ALI only ensures that the distributions defined by the generator and the inference network are the same; (2) the discriminator rejects the samples from the classifier in Triple-GAN, while the discriminator accepts the samples from the inference network in ALI, which leads to different update rules for the discriminator and the inference network. These differences naturally arise because Triple-GAN is proposed to solve the existing problems in SSL GANs, as stated in the introduction. Indeed, ALI [5] uses the same approach as [25] to deal with partially labeled data and hence it still suffers from those problems. In addition, Triple-GAN outperforms ALI significantly in the semi-supervised classification task (See comparison in Table 1).
To handle partially labeled data, the conditional VAE [11] treats the missing labels as latent variables and infers them for unlabeled data. ADGM [17] introduces auxiliary variables to build a more expressive variational distribution and improve the predictive performance. The Ladder Network [23] employs lateral connections between a variation of denoising autoencoders and obtains excellent SSL results. Cat-GAN [26] generalizes GANs with a categorical discriminator and an objective function. Salimans et al. [25] propose empirical techniques to stabilize the training of GANs and improve the performance on SSL and image generation under incompatible learning criteria. 
Triple-GAN differs\nsigni\ufb01cantly from these methods, as stated in the introduction.\n\n3 Method\n\nWe consider learning DGMs in the semi-supervised setting,3 where we have a partially labeled dataset\nwith x denoting the input data and y denoting the output label. The goal is to predict the labels y\nfor unlabeled data as well as to generate new samples x conditioned on y. This is different from the\nunsupervised setting for pure generation, where the only goal is to sample data x from a generator\nto fool a discriminator; thus a two-player game is suf\ufb01cient to describe the process as in GANs.\n\n3Supervised learning is an extreme case, where the training set is fully labeled.\n\n3\n\n\fIn our setting, as the label information y is incomplete (thus uncertain), our density model should\ncharacterize the uncertainty of both x and y, therefore a joint distribution p(x, y) of input-label pairs.\nA straightforward application of the two-player GAN is infeasible because of the missing values on\ny. Unlike the previous work [26, 25], which is restricted to the two-player framework and can lead\nto incompatible objectives, we build our game-theoretic objective based on the insight that the joint\ndistribution can be factorized in two ways, namely, p(x, y) = p(x)p(y|x) and p(x, y) = p(y)p(x|y),\nand that the conditional distributions p(y|x) and p(x|y) are of interest for classi\ufb01cation and class-\nconditional generation, respectively. To jointly estimate these conditional distributions, which are\ncharacterized by a classi\ufb01er network and a class-conditional generator network, we de\ufb01ne a single\ndiscriminator network which has the sole role of distinguishing whether a sample is from the true data\ndistribution or the models. 
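The two factorizations of the joint, p(x, y) = p(x)p(y|x) and p(x, y) = p(y)p(x|y), are easy to verify numerically. Below is a small sketch with a toy discrete joint distribution (the toy numbers are our own illustration, not from the paper), checking that both factorizations recover the same joint:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy joint distribution over 4 inputs x and 3 labels y.
p_xy = rng.random((4, 3))
p_xy /= p_xy.sum()

p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y = p_xy.sum(axis=0)                 # marginal p(y)
p_y_given_x = p_xy / p_x[:, None]      # p(y|x): the classifier's target
p_x_given_y = p_xy / p_y[None, :]      # p(x|y): the generator's target

# Both factorizations reconstruct the same joint distribution.
joint_c = p_x[:, None] * p_y_given_x   # p(x) p(y|x)
joint_g = p_y[None, :] * p_x_given_y   # p(y) p(x|y)
```

Either conditional therefore suffices to pin down the joint once its paired marginal is known, which is why the classifier and the generator can characterize the same target distribution from two directions.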
Hence, we naturally extend GANs to Triple-GAN, a three-player game to\ncharacterize the process of classi\ufb01cation and class-conditional generation in SSL, as detailed below.\n\n3.1 A Game with Three Players\n\nTriple-GAN consists of three components: (1) a classi\ufb01er C that (approximately) characterizes the\nconditional distribution pc(y|x) \u2248 p(y|x); (2) a class-conditional generator G that (approximately)\ncharacterizes the conditional distribution in the other direction pg(x|y) \u2248 p(x|y); and (3) a discrim-\ninator D that distinguishes whether a pair of data (x, y) comes from the true distribution p(x, y).\nAll the components are parameterized as neural networks. Our desired equilibrium is that the joint\ndistributions de\ufb01ned by the classi\ufb01er and the generator both converge to the true data distribution. To\nthis end, we design a game with compatible utilities for the three players as follows.\nWe make the mild assumption that the samples from both p(x) and p(y) can be easily obtained.4\nIn the game, after a sample x is drawn from p(x), C produces a pseudo label y given x following\nthe conditional distribution pc(y|x). Hence, the pseudo input-label pair is a sample from the joint\ndistribution pc(x, y) = p(x)pc(y|x). Similarly, a pseudo input-label pair can be sampled from\nG by \ufb01rst drawing y \u223c p(y) and then drawing x|y \u223c pg(x|y); hence from the joint distribution\npg(x, y) = p(y)pg(x|y). For pg(x|y), we assume that x is transformed by the latent style variables z\ngiven the label y, namely, x = G(y, z), z \u223c pz(z), where pz(z) is a simple distribution (e.g., uniform\nor standard normal). Then, the pseudo input-label pairs (x, y) generated by both C and G are sent to\nthe single discriminator D for judgement. D can also access the input-label pairs from the true data\ndistribution as positive samples. 
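The sampling process for the two pseudo joints can be sketched as follows. The helper names and the toy stand-ins for C and G are hypothetical illustrations, not the paper's code; p(y) is taken to be uniform, as assumed in the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of classes

def sample_classifier_joint(x_batch, pc_probs):
    """Pseudo pairs from p_c(x, y) = p(x) p_c(y|x):
    keep the real x and draw a pseudo label y ~ p_c(y|x)."""
    ys = np.array([rng.choice(K, p=row) for row in pc_probs])
    return x_batch, ys

def sample_generator_joint(G, batch_size, z_dim=100):
    """Pseudo pairs from p_g(x, y) = p(y) p_g(x|y):
    draw y ~ p(y) (uniform here), z ~ p_z(z), then x = G(y, z)."""
    ys = rng.integers(0, K, size=batch_size)
    zs = rng.uniform(-1.0, 1.0, size=(batch_size, z_dim))
    return G(ys, zs), ys

# Toy stand-ins for the networks (purely illustrative).
x_real = rng.normal(size=(4, 8))
pc_probs = np.full((4, K), 1.0 / K)         # a maximally uncertain classifier
toy_G = lambda y, z: z[:, :8] + y[:, None]  # a linear "generator"

xc, yc = sample_classifier_joint(x_real, pc_probs)
xg, yg = sample_generator_joint(toy_G, batch_size=4)
```

Both (xc, yc) and (xg, yg) are then fed to the same discriminator, alongside real labeled pairs as positives.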
We refer to the utilities in the process as adversarial losses, which can be formulated as a minimax game:

min_{C,G} max_D U(C, G, D) = E_{(x,y)~p(x,y)}[log D(x, y)] + α E_{(x,y)~pc(x,y)}[log(1 - D(x, y))] + (1 - α) E_{(x,y)~pg(x,y)}[log(1 - D(G(y, z), y))],   (1)

where α ∈ (0, 1) is a constant that controls the relative importance of generation and classification; we focus on the balanced case by fixing it as 1/2 throughout the paper.
The game defined in Eqn. (1) achieves its equilibrium if and only if p(x, y) = (1 - α)pg(x, y) + αpc(x, y) (See details in Sec. 3.2). The equilibrium indicates that if one of C and G tends to the data distribution, the other will also go towards the data distribution, which addresses the competing problem. However, unfortunately, it cannot guarantee that p(x, y) = pg(x, y) = pc(x, y) is the unique global optimum, which is not desirable. To address this problem, we introduce the standard supervised loss (i.e., cross-entropy loss) for C, RL = E_{(x,y)~p(x,y)}[- log pc(y|x)], which is equivalent to the KL-divergence between pc(x, y) and p(x, y). 
Consequently, we define the game as:

min_{C,G} max_D Ũ(C, G, D) = E_{(x,y)~p(x,y)}[log D(x, y)] + α E_{(x,y)~pc(x,y)}[log(1 - D(x, y))] + (1 - α) E_{(x,y)~pg(x,y)}[log(1 - D(G(y, z), y))] + RL.   (2)

It will be proven that the game with utilities Ũ has the unique global optimum for C and G.

3.2 Theoretical Analysis and Pseudo Discriminative Loss

4In semi-supervised learning, p(x) is the empirical distribution of inputs and p(y) is assumed to be the same as the distribution of labels on the labeled data, which is uniform in our experiments.

Algorithm 1 Minibatch stochastic gradient descent training of Triple-GAN in SSL.
for number of training iterations do
  • Sample a batch of pairs (xg, yg) ~ pg(x, y) of size mg, a batch of pairs (xc, yc) ~ pc(x, y) of size mc and a batch of labeled data (xd, yd) ~ p(x, y) of size md.
  • Update D by ascending along its stochastic gradient:
      ∇_θd [ (1/md) Σ_{(xd,yd)} log D(xd, yd) + (α/mc) Σ_{(xc,yc)} log(1 - D(xc, yc)) + ((1 - α)/mg) Σ_{(xg,yg)} log(1 - D(xg, yg)) ].
  • Compute the unbiased estimators R̃L and R̃P of RL and RP respectively.
  • Update C by descending along its stochastic gradient:
      ∇_θc [ (α/mc) Σ_{(xc,yc)} pc(yc|xc) log(1 - D(xc, yc)) + R̃L + αP R̃P ].
  • Update G by descending along its stochastic gradient:
      ∇_θg [ ((1 - α)/mg) Σ_{(xg,yg)} log(1 - D(xg, yg)) ].
end for

We now provide a formal theoretical analysis of Triple-GAN under nonparametric assumptions and introduce the pseudo discriminative loss, which is an unbiased regularization motivated by the global equilibrium. 
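The scalar objectives computed in one iteration of Algorithm 1 can be sketched as follows. This is a minimal illustration: plain arrays stand in for the network forward passes, and the function name and its inputs are our own assumptions, not the paper's implementation (a real implementation would differentiate these objectives with respect to θd, θc and θg):

```python
import numpy as np

def step_objectives(d_real, d_c, d_g, pc_label, pc_c, pc_on_g,
                    alpha=0.5, alpha_p=0.03):
    """Scalar objectives of one Triple-GAN minibatch step (a sketch).

    d_real, d_c, d_g : D's outputs on real / classifier / generator pairs
    pc_label         : p_c(y_d|x_d) on the labeled batch (for R_L)
    pc_c             : p_c(y_c|x_c) on the classifier's own pseudo pairs
    pc_on_g          : p_c(y_g|x_g) on generated pairs (for R_P)
    """
    d_c, d_g, pc_c = map(np.asarray, (d_c, d_g, pc_c))
    d_obj = (np.mean(np.log(d_real))
             + alpha * np.mean(np.log(1 - d_c))
             + (1 - alpha) * np.mean(np.log(1 - d_g)))       # D ascends this
    r_l = -np.mean(np.log(pc_label))                          # cross-entropy R_L
    r_p = -np.mean(np.log(pc_on_g))                           # pseudo disc. loss R_P
    c_obj = (alpha * np.mean(pc_c * np.log(1 - d_c))
             + r_l + alpha_p * r_p)                           # C descends this
    g_obj = (1 - alpha) * np.mean(np.log(1 - d_g))            # G descends this
    return d_obj, c_obj, g_obj

d_obj, c_obj, g_obj = step_objectives(
    d_real=[0.9, 0.8], d_c=[0.3, 0.2], d_g=[0.4, 0.1],
    pc_label=[0.7, 0.9], pc_c=[0.6, 0.8], pc_on_g=[0.5, 0.9])
```

Note that only D ascends its objective; C and G descend theirs, matching the three gradient steps of Algorithm 1.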
For clarity of the main text, we defer the proof details to Appendix A.
First, we can show that the optimal D balances between the true data distribution and the mixture distribution defined by C and G, as summarized in Lemma 3.1.

Lemma 3.1 For any fixed C and G, the optimal D of the game defined by the utility function U(C, G, D) is:

D*_{C,G}(x, y) = p(x, y) / (p(x, y) + p_α(x, y)),   (3)

where p_α(x, y) := (1 - α)pg(x, y) + αpc(x, y) is a mixture distribution for α ∈ (0, 1).
Given D*_{C,G}, we can omit D and reformulate the minimax game with value function U as V(C, G) = max_D U(C, G, D), whose optimal point is summarized in Lemma 3.2.

Lemma 3.2 The global minimum of V(C, G) is achieved if and only if p(x, y) = p_α(x, y).

We can further show that C and G can at least capture the marginal distributions of the data, especially for pg(x), even though there may exist multiple global equilibria, as summarized in Corollary 3.2.1.

Corollary 3.2.1 Given p(x, y) = p_α(x, y), the marginal distributions are the same for p, pc and pg, i.e., p(x) = pg(x) = pc(x) and p(y) = pg(y) = pc(y).

Given the above result that p(x, y) = p_α(x, y), C and G do not compete as in the two-player formulation, and it is easy to verify that p(x, y) = pc(x, y) = pg(x, y) is a global equilibrium point. However, it may not be unique, and we should minimize an additional objective to ensure the uniqueness. 
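Lemma 3.1 follows from a pointwise maximization: at each (x, y), the integrand p log d + p_α log(1 - d) is maximized at d* = p/(p + p_α). This can be checked numerically with a small sketch (the density values below are toy numbers of our own choosing, not from the paper):

```python
import numpy as np

def pointwise_utility(d, p, p_alpha):
    # Integrand of U at a single (x, y): p * log D + p_alpha * log(1 - D)
    return p * np.log(d) + p_alpha * np.log(1.0 - d)

# Toy density values at one point (x, y).
p, pg, pc, alpha = 0.3, 0.1, 0.4, 0.5
p_alpha = (1 - alpha) * pg + alpha * pc   # the mixture from Lemma 3.1
d_star = p / (p + p_alpha)                # claimed optimum, Eqn. (3)

# Brute-force search over D's output confirms the closed form.
grid = np.linspace(1e-3, 1 - 1e-3, 999)
d_best = grid[np.argmax(pointwise_utility(grid, p, p_alpha))]
```

The grid maximizer agrees with the closed form up to the grid resolution, mirroring the nonparametric argument in [7] with pg replaced by the mixture p_α.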
In fact, this is true for the utility function Ũ(C, G, D) in problem (2), as stated below.

Theorem 3.3 The equilibrium of Ũ(C, G, D) is achieved if and only if p(x, y) = pg(x, y) = pc(x, y).

The conclusion essentially motivates our design of Triple-GAN, as we can ensure that both C and G will converge to the true data distribution if the model has been trained to achieve the optimum. We can further show another nice property of Ũ, which allows us to regularize our model for stable and better convergence in practice without introducing bias, as summarized below.

Corollary 3.3.1 Adding any divergence (e.g., the KL divergence) between any two of the joint distributions, the conditional distributions or the marginal distributions to Ũ as an additional regularization to be minimized will not change the global equilibrium of Ũ.

Because label information is extremely insufficient in SSL, we propose the pseudo discriminative loss RP = E_{pg}[- log pc(y|x)], which optimizes C on the samples generated by G in a supervised manner. Intuitively, a good G can provide meaningful labeled data beyond the training set as extra side information for C, which will boost the predictive performance (See Sec. 5.1 for the empirical evidence). Indeed, minimizing the pseudo discriminative loss with respect to C is equivalent to minimizing DKL(pg(x, y)||pc(x, y)) (See Appendix A for the proof) and hence the global equilibrium remains, following Corollary 3.3.1. Also note that directly minimizing DKL(pg(x, y)||pc(x, y)) is infeasible since its computation involves the unknown likelihood ratio pg(x, y)/pc(x, y). The pseudo discriminative loss is weighted by a hyperparameter αP. 
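The stated equivalence can be illustrated on a discrete toy example: since pc(x, y) = p(x)pc(y|x) and p(x) does not depend on C, DKL(pg(x, y)||pc(x, y)) - RP is a constant independent of the classifier. The sketch below uses toy distributions of our own (with pc sharing pg's input marginal purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete toy: 3 inputs, 2 labels. p_g is a fixed joint distribution and
# p_c(x, y) = p(x) p_c(y|x) shares the input marginal p(x).
pg = rng.random((3, 2))
pg /= pg.sum()
px = pg.sum(axis=1)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def pseudo_disc_loss(pc_y_given_x):
    # R_P = E_{p_g}[-log p_c(y|x)]
    return float(-np.sum(pg * np.log(pc_y_given_x)))

# For any two classifiers, KL(p_g || p_c) - R_P is the same constant,
# so minimizing R_P over C also minimizes the KL divergence.
c1 = rng.random((3, 2)); c1 /= c1.sum(axis=1, keepdims=True)
c2 = rng.random((3, 2)); c2 /= c2.sum(axis=1, keepdims=True)
gap1 = kl(pg, px[:, None] * c1) - pseudo_disc_loss(c1)
gap2 = kl(pg, px[:, None] * c2) - pseudo_disc_loss(c2)
```

Unlike the KL divergence itself, RP never needs the intractable ratio pg(x, y)/pc(x, y); it only requires samples from G and the classifier's log-probabilities.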
See Algorithm 1 for the whole training procedure, where θc, θd and θg are the trainable parameters in C, D and G respectively.

4 Practical Techniques

In this section we introduce several practical techniques used in the implementation of Triple-GAN, which may lead to a theoretically biased solution but work well for challenging SSL tasks empirically.
One crucial problem of SSL is the small size of the labeled data. In Triple-GAN, D may memorize the empirical distribution of the labeled data and reject other types of samples from the true data distribution. Consequently, G may collapse to these modes. To this end, we generate pseudo labels through C for some unlabeled data and use these pairs as positive samples for D. The cost is that some bias is introduced into the target distribution of D, which becomes a mixture of pc and p instead of the pure p. However, this is acceptable as C converges quickly and pc and p are close (See results in Sec. 5).
Since properly leveraging the unlabeled data is key to success in SSL, it is necessary to regularize C heuristically, as in many existing methods [23, 26, 13, 15], to make more accurate predictions. We consider two alternative losses on the unlabeled data. The confidence loss [26] minimizes the conditional entropy of pc(y|x) and the cross-entropy between p(y) and pc(y), weighted by a hyperparameter αB, as RU = H_{pc}(y|x) + αB E_p[- log pc(y)], which encourages C to make predictions confidently and to be balanced on the unlabeled data. The consistency loss [13] penalizes the network if it predicts the same unlabeled data inconsistently given different noise ε, e.g., dropout masks, as RU = E_{x~p(x)} ||pc(y|x, ε) - pc(y|x, ε′)||², where ||·||² is the square of the l2-norm. We use the confidence loss by default except on the CIFAR10 dataset (See details in Sec. 
5).
Another consideration is how to compute the gradients of E_{x~p(x), y~pc(y|x)}[log(1 - D(x, y))] with respect to the parameters θc in C, which involves a summation over the discrete random variable y, i.e., the class label. On one hand, integrating out the class label is time consuming. On the other hand, directly sampling one label to approximate the expectation via the Monte Carlo method makes the feedback of the discriminator non-differentiable with respect to θc. As the REINFORCE algorithm [29] can deal with such cases involving discrete variables, we use a variant of it for the end-to-end training of our classifier. The gradients in the original REINFORCE algorithm would be E_{x~p(x)} E_{y~pc(y|x)}[∇_θc log pc(y|x) log(1 - D(x, y))]. In our experiments, we find the best strategy is to use the most probable y instead of sampling one to approximate the expectation over y. The bias is small as the prediction of C is typically rather confident.

5 Experiments

We now present results on the widely adopted MNIST [14], SVHN [19] and CIFAR10 [12] datasets. MNIST consists of 50,000 training samples, 10,000 validation samples and 10,000 testing samples of handwritten digits of size 28 × 28. SVHN consists of 73,257 training samples and 26,032 testing samples, each a colored image of size 32 × 32 containing a sequence of digits with various backgrounds. CIFAR10 consists of colored images distributed across 10 general classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. There are 50,000 training samples and 10,000 testing samples of size 32 × 32 in CIFAR10. We split 5,000 training data of SVHN and

Table 1: Error rates (%) on partially labeled MNIST, SVHN and CIFAR10 datasets, averaged over 10 runs. 
The results with † are trained with more than 500,000 extra unlabeled data on SVHN.

Algorithm | MNIST n = 100 | SVHN n = 1000 | CIFAR10 n = 4000
M1+M2 [11] | 3.33 (±0.14) | 36.02 (±0.10) | -
VAT [18] | 2.33 | 24.63 | -
Ladder [23] | 1.06 (±0.37) | - | 20.40 (±0.47)
Conv-Ladder [23] | 0.89 (±0.50) | - | -
ADGM [17] | 0.96 (±0.02) | 22.86† | -
SDGM [17] | 1.32 (±0.07) | 16.61 (±0.24)† | -
MMCVA [15] | 1.24 (±0.54) | 4.95 (±0.18)† | -
CatGAN [26] | 1.39 (±0.28) | - | 19.58 (±0.58)
Improved-GAN [25] | 0.93 (±0.07) | 8.11 (±1.3) | 18.63 (±2.32)
ALI [5] | - | 7.3 | 18.3
Triple-GAN (ours) | 0.91 (±0.58) | 5.77 (±0.17) | 16.99 (±0.36)

Table 2: Error rates (%) on MNIST with different numbers of labels, averaged over 10 runs.

Algorithm | n = 20 | n = 50 | n = 200
Improved-GAN [25] | 16.77 (±4.52) | 2.21 (±1.36) | 0.90 (±0.04)
Triple-GAN (ours) | 4.81 (±4.95) | 1.56 (±0.72) | 0.67 (±0.16)

CIFAR10 for validation if needed. On CIFAR10, we follow [13] to perform ZCA for the input of C but still generate and estimate the raw images using G and D.
We implement our method based on Theano [27] and here we briefly summarize our experimental settings.5 Though we have an additional network, the generator and classifier of Triple-GAN have comparable architectures to those of the baselines [26, 25] (See details in Appendix F). The pseudo discriminative loss is not applied until the number of epochs reaches a threshold at which the generator can generate meaningful data. We only search the threshold in {200, 300}, αP in {0.1, 0.03} and the global learning rate in {0.0003, 0.001} based on the validation performance on each dataset. All of the other hyperparameters, including relative weights and parameters in Adam [9], are fixed according to [25, 15] across all of the experiments. 
Further, in our experiments, we \ufb01nd that the training\ntechniques for the original two-player GANs [3, 25] are suf\ufb01cient to stabilize the optimization of\nTriple-GAN.\n5.1 Classi\ufb01cation\nFor fair comparison, all the results of the baselines are from the corresponding papers and we average\nTriple-GAN over 10 runs with different random initialization and splits of the training data and report\nthe mean error rates with the standard deviations following [25].\nFirstly, we compare our method with a large body of approaches in the widely used settings on MNIST,\nSVHN and CIFAR10 datasets given 100, 1,000 and 4,000 labels6, respectively. Table 1 summarizes\nthe quantitative results. On all of the three datasets, Triple-GAN achieves the state-of-the-art results\nconsistently and it substantially outperforms the strongest competitors (e.g., Improved-GAN) on more\nchallenging SVHN and CIFAR10 datasets, which demonstrate the bene\ufb01t of compatible learning\nobjectives proposed in Triple-GAN. Note that for a fair comparison with previous GANs, we do not\nleverage the extra unlabeled data on SVHN, while some baselines [17, 15] do.\nSecondly, we evaluate our method with 20, 50 and 200 labeled samples on MNIST for a systematical\ncomparison with our main baseline Improved-GAN [25], as shown in Table 2. Triple-GAN consis-\ntently outperforms Improved-GAN with a substantial margin, which again demonstrates the bene\ufb01t\nof Triple-GAN. Besides, we can see that Triple-GAN achieves more signi\ufb01cant improvement as the\nnumber of labeled data decreases, suggesting the effectiveness of the pseudo discriminative loss.\nFinally, we investigate the reasons for the outstanding performance of Triple-GAN. We train a single\nC without G and D on SVHN as the baseline and get more than 10% error rate, which shows that G\nis important for SSL even though C can leverage unlabeled data directly. 
5 Our source code is available at https://github.com/zhenxuan00/triple-gan
6 We use these amounts of labels as default settings throughout the paper if not specified.

Figure 2: (a-b) Comparison between samples from Improved-GAN trained with feature matching and Triple-GAN on SVHN. (c-d) Samples of Triple-GAN in specific classes (automobile and horse) on CIFAR10.

Figure 3: (a) and (c) are randomly selected labeled data from SVHN and CIFAR10, respectively. (b) and (d) are samples from Triple-GAN, where each row shares the same label and each column shares the same latent variables.

Figure 4: Class-conditional latent space interpolation on SVHN (a) and CIFAR10 (b). We first sample two random vectors in the latent space and interpolate linearly from one to another. Then, we map these vectors to the data level given a fixed label for each class. In total, 20 images are shown for each class. We select two endpoints with clear semantics on CIFAR10 for better illustration.

On CIFAR10, the baseline (a simple version of the Π model [13]) achieves a 17.7% error rate. The smaller improvement is reasonable as CIFAR10 is more complex and hence G is not as good as on SVHN. In addition, we evaluate Triple-GAN without the pseudo discriminative loss on SVHN and it achieves about a 7.8% error rate, which shows the advantages of the compatible objectives (better than the 8.11% error rate of Improved-GAN) and the importance of the pseudo discriminative loss (worse than the complete Triple-GAN by 2%). Furthermore, Triple-GAN has a convergence speed comparable to that of Improved-GAN [25], as shown in Appendix E.

5.2 Generation
We demonstrate that Triple-GAN can learn good G and C simultaneously by generating samples in various ways with the exact models used in Sec. 5.1.
For a fair comparison, the generative model and the number of labels are the same as in the previous method [25].
In Fig. 2 (a-b), we first compare the quality of images generated on SVHN by Triple-GAN and by the Improved-GAN with feature matching [25],7 which works well for semi-supervised classification. We can see that Triple-GAN outperforms the baseline by generating fewer meaningless samples and clearer digits. Further, the baseline generates the same strange sample four times, marked with red rectangles in Fig. 2. The comparison on MNIST and CIFAR10 is presented in Appendix B. We also evaluate the samples on CIFAR10 quantitatively via the inception score following [25]. The value of Triple-GAN is 5.08 ± 0.09, while that of the Improved-GAN trained without minibatch discrimination [25] is 3.87 ± 0.03, which agrees with the visual comparison. We then illustrate images generated from two specific classes on CIFAR10 in Fig. 2 (c-d); see more in Appendix C. In most cases, Triple-GAN is able to generate meaningful images with correct semantics.
Further, we show the ability of Triple-GAN to disentangle classes and styles in Fig. 3. It can be seen that Triple-GAN can generate realistic data in a specific class, and the latent factors encode meaningful physical factors such as scale, intensity, orientation and color. Some GANs [22, 5, 21] can generate data class-conditionally given full labels, while Triple-GAN can do so given much less label information.
Finally, we demonstrate the generalization capability of our Triple-GAN on class-conditional latent space interpolation, as in Fig. 4.

7 Though the Improved-GAN trained with minibatch discrimination [25] can generate good samples, it fails to predict labels accurately.
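The inception score used in the quantitative comparison above is exp(E_x[KL(p(y|x) || p(y))]), computed from the class probabilities of a pretrained classifier [25]. A minimal sketch, assuming the probabilities are already collected in an (N, K) array:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) class probabilities p(y|x) predicted by a pretrained
    classifier for N generated images. Returns exp(mean_x KL(p(y|x) || p(y))),
    where p(y) is the marginal over the generated set."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal label distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

A generator that is both confident (peaked p(y|x)) and diverse (uniform marginal p(y)) pushes the score toward the number of classes, while a collapsed or uncertain generator pushes it toward 1.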
Triple-GAN can transition smoothly from one sample to another with totally different visual factors without losing label semantics, which shows that Triple-GAN learns meaningful latent spaces class-conditionally instead of overfitting to the training data, especially the labeled data. See these results on MNIST in Appendix D.
Overall, these results confirm that Triple-GAN avoids the competition between C and G and can lead to a situation where both generation and classification are good in semi-supervised learning.

6 Conclusions

We present triple generative adversarial networks (Triple-GAN), a unified game-theoretical framework with three players (a generator, a discriminator and a classifier) for semi-supervised learning with compatible utilities. With such utilities, Triple-GAN addresses two main problems of existing methods [26, 25]. Specifically, Triple-GAN ensures that both the classifier and the generator can achieve their own optima from the perspective of game theory, and it enables the generator to sample data in a specific class. Our empirical results on the MNIST, SVHN and CIFAR10 datasets demonstrate that, as a unified model, Triple-GAN can simultaneously achieve the state-of-the-art classification results among deep generative models, disentangle styles and classes, and transfer smoothly on the data level via interpolation in the latent space.

Acknowledgments

The work is supported by the National NSF of China (Nos. 61620106010, 61621136008, 61332007), the MIIT Grant of Int. Man. Comp. Stan (No. 2016ZXFB00001), the Youth Top-notch Talent Support Program, Tsinghua Tiangong Institute for Intelligent Computing, the NVIDIA NVAIL Program and a Project from Siemens.

References

[1] Peter Burt and Edward Adelson. The Laplacian pyramid as a compact image code.
IEEE Transactions on Communications, 1983.

[2] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.

[3] Emily L Denton, Soumith Chintala, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.

[4] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[5] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[6] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.

[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[9] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[10] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[11] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.

[12] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Citeseer, 2009.

[13] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning.
arXiv preprint arXiv:1610.02242, 2016.

[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[15] Chongxuan Li, Jun Zhu, and Bo Zhang. Max-margin deep generative models for (semi-)supervised learning. arXiv preprint arXiv:1611.07119, 2016.

[16] Yujia Li, Kevin Swersky, and Richard S Zemel. Generative moment matching networks. In ICML, 2015.

[17] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

[18] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.

[19] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011.

[20] Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.

[21] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.

[22] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[23] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.

[24] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models.
arXiv preprint arXiv:1401.4082, 2014.

[25] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.

[26] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

[27] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

[28] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[29] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

[30] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In NIPS, 2015.