{"title": "Triangle Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5247, "page_last": 5256, "abstract": "A Triangle Generative Adversarial Network ($\\Delta$-GAN) is developed for semi-supervised cross-domain joint distribution matching, where the training data consists of samples from each domain, and supervision of domain correspondence is provided by only a few paired samples. $\\Delta$-GAN consists of four neural networks, two generators and two discriminators. The generators are designed to learn the two-way conditional distributions between the two domains, while the discriminators implicitly define a ternary discriminative function, which is trained to distinguish real data pairs and two kinds of fake data pairs. The generators and discriminators are trained together using adversarial learning. Under mild assumptions, in theory the joint distributions characterized by the two generators concentrate to the data distribution. In experiments, three different kinds of domain pairs are considered, image-label, image-image and image-attribute pairs. Experiments on semi-supervised image classification, image-to-image translation and attribute-based image generation demonstrate the superiority of the proposed approach.", "full_text": "Triangle Generative Adversarial Networks\n\nZhe Gan\u2217, Liqun Chen\u2217, Weiyao Wang, Yunchen Pu, Yizhe Zhang,\n\nHao Liu, Chunyuan Li, Lawrence Carin\n\nDuke University\n\nzhe.gan@duke.edu\n\nAbstract\n\nA Triangle Generative Adversarial Network (\u2206-GAN) is developed for semi-\nsupervised cross-domain joint distribution matching, where the training data con-\nsists of samples from each domain, and supervision of domain correspondence\nis provided by only a few paired samples. \u2206-GAN consists of four neural net-\nworks, two generators and two discriminators. 
The generators are designed to\nlearn the two-way conditional distributions between the two domains, while the\ndiscriminators implicitly de\ufb01ne a ternary discriminative function, which is trained\nto distinguish real data pairs and two kinds of fake data pairs. The generators\nand discriminators are trained together using adversarial learning. Under mild\nassumptions, in theory the joint distributions characterized by the two generators\nconcentrate to the data distribution. In experiments, three different kinds of do-\nmain pairs are considered, image-label, image-image and image-attribute pairs.\nExperiments on semi-supervised image classi\ufb01cation, image-to-image translation\nand attribute-based image generation demonstrate the superiority of the proposed\napproach.\n\nIntroduction\n\n1\nGenerative adversarial networks (GANs) [1] have emerged as a powerful framework for learning\ngenerative models of arbitrarily complex data distributions. When trained on datasets of natural\nimages, signi\ufb01cant progress has been made on generating realistic and sharp-looking images [2, 3].\nThe original GAN formulation was designed to learn the data distribution in one domain. In practice,\none may also be interested in matching two joint distributions. This is an important task, since\nmapping data samples from one domain to another has a wide range of applications. For instance,\nmatching the joint distribution of image-text pairs allows simultaneous image captioning and text-\nconditional image generation [4], while image-to-image translation [5] is another challenging problem\nthat requires matching the joint distribution of image-image pairs.\nIn this work, we are interested in designing a GAN framework to match joint distributions. 
If paired\ndata are available, a simple approach to achieve this is to train a conditional GAN model [4, 6],\nfrom which a joint distribution is readily manifested and can be matched to the empirical joint\ndistribution provided by the paired data. However, fully supervised data are often dif\ufb01cult to acquire.\nSeveral methods have been proposed to achieve unsupervised joint distribution matching without\nany paired data, including DiscoGAN [7], CycleGAN [8] and DualGAN [9]. Adversarially Learned\nInference (ALI) [10] and Bidirectional GAN (BiGAN) [11] can be readily adapted to this case as\nwell. Though empirically achieving great success, in principle, there exist in\ufb01nitely many possible\nmapping functions that satisfy the requirement to map a sample from one domain to another. In\norder to alleviate this nonidenti\ufb01ability issue, paired data are needed to provide proper supervision to\ninform the model the kind of joint distributions that are desired.\nThis motivates the proposed Triangle Generative Adversarial Network (\u2206-GAN), a GAN frame-\nwork that allows semi-supervised joint distribution matching, where the supervision of domain\n\n\u2217 Equal contribution.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Illustration of the Triangle Generative Adversarial Network (\u2206-GAN).\n\ncorrespondence is provided by a few paired samples. \u2206-GAN consists of two generators and two\ndiscriminators. The generators are designed to learn the bidirectional mappings between domains,\nwhile the discriminators are trained to distinguish real data pairs and two kinds of fake data pairs.\nBoth the generators and discriminators are trained together via adversarial learning.\n\u2206-GAN bears close resemblance to Triple GAN [12], a recently proposed method that can also be\nutilized for semi-supervised joint distribution mapping. 
However, there exist several key differences\nthat make our work unique. First, \u2206-GAN uses two discriminators in total, which implicitly de\ufb01nes\na ternary discriminative function, instead of a binary discriminator as used in Triple GAN. Second,\n\u2206-GAN can be considered as a combination of conditional GAN and ALI, while Triple GAN\nconsists of two conditional GANs. Third, the distributions characterized by the two generators in\nboth \u2206-GAN and Triple GAN concentrate to the data distribution in theory. However, when the\ndiscriminator is optimal, the objective of \u2206-GAN becomes the Jensen-Shannon divergence (JSD)\namong three distributions, which is symmetric; the objective of Triple GAN consists of a JSD term\nplus a Kullback-Leibler (KL) divergence term. The asymmetry of the KL term makes Triple GAN\nmore prone to generating fake-looking samples [13]. Lastly, the calculation of the additional KL\nterm in Triple GAN is equivalent to calculating a supervised loss, which requires the explicit density\nform of the conditional distributions, which may not be desirable. On the other hand, \u2206-GAN is\na fully adversarial approach that does not require that the conditional densities can be computed;\n\u2206-GAN only require that the conditional densities can be sampled from in a way that allows gradient\nbackpropagation.\n\u2206-GAN is a general framework, and can be used to match any joint distributions. In experiments,\nin order to demonstrate the versatility of the proposed model, we consider three domain pairs:\nimage-label, image-image and image-attribute pairs, and use them for semi-supervised classi\ufb01cation,\nimage-to-image translation and attribute-based image editing, respectively. 
In order to demonstrate the scalability of the model to large and complex datasets, we also present attribute-conditional image generation on the COCO dataset [14].

2 Model

2.1 Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) [1] consist of a generator G and a discriminator D that compete in a two-player minimax game, where the generator is learned to map samples from an arbitrary latent distribution to data, while the discriminator tries to distinguish between real and generated samples. The goal of the generator is to “fool” the discriminator by producing samples that are as close to real data as possible. Specifically, D and G are learned as

min_G max_D V(D, G) = Ex∼p(x)[log D(x)] + Ez∼pz(z)[log(1 − D(G(z)))] ,   (1)

where p(x) is the true data distribution, and pz(z) is usually defined to be a simple distribution, such as the standard normal distribution. The generator G implicitly defines a probability distribution pg(x) as the distribution of the samples G(z) obtained when z ∼ pz(z). For any fixed generator G, the optimal discriminator is D(x) = p(x) / (p(x) + pg(x)). When the discriminator is optimal, solving this adversarial game is equivalent to minimizing the Jensen-Shannon divergence (JSD) between p(x) and pg(x) [1]. The global equilibrium is achieved if and only if p(x) = pg(x).

2.2 Triangle Generative Adversarial Networks (Δ-GANs)

We now extend GAN to Δ-GAN for joint distribution matching. We first consider Δ-GAN in the supervised setting, and then discuss semi-supervised learning in Section 2.4. Consider two related domains, with x and y being the data samples for each domain.
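As a quick numerical sanity check (not part of the paper), the optimal-discriminator identity D*(x) = p(x)/(p(x) + pg(x)) and its connection to the JSD can be verified on discrete distributions; the two distributions below are arbitrary illustrative choices:

```python
import numpy as np

# Two arbitrary discrete distributions standing in for p(x) and p_g(x).
p = np.array([0.5, 0.3, 0.2])
pg = np.array([0.2, 0.3, 0.5])

# Pointwise optimal discriminator D*(x) = p(x) / (p(x) + p_g(x)).
d_star = p / (p + pg)

# Value of the inner maximization: E_p[log D*] + E_pg[log(1 - D*)].
value = np.sum(p * np.log(d_star)) + np.sum(pg * np.log(1.0 - d_star))

# This should equal -2 log 2 + 2 * JSD(p, pg), where
# JSD(p, pg) = 0.5 KL(p || m) + 0.5 KL(pg || m), with m = (p + pg) / 2.
m = 0.5 * (p + pg)
kl = lambda a, b: np.sum(a * np.log(a / b))
jsd = 0.5 * kl(p, m) + 0.5 * kl(pg, m)
assert np.isclose(value, -2.0 * np.log(2.0) + 2.0 * jsd)
```

The same algebra underlies the three-distribution generalization used for Δ-GAN in Section 2.3.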
We have fully-paired data samples\nthat are characterized by the joint distribution p(x, y), which also implies that samples from both the\nmarginal p(x) and p(y) can be easily obtained.\n\u2206-GAN consists of two generators: (i) a generator Gx(y) that de\ufb01nes the conditional distribution\npx(x|y), and (ii) a generator Gy(x) that characterizes the conditional distribution in the other\ndirection py(y|x). Gx(y) and Gy(x) may also implicitly contain a random latent variable z as input,\ni.e., Gx(y, z) and Gy(x, z). In the \u2206-GAN game, after a sample x is drawn from p(x), the generator\nGy produces a pseudo sample \u02dcy following the conditional distribution py(y|x). Hence, the fake data\npair (x, \u02dcy) is a sample from the joint distribution py(x, y) = py(y|x)p(x). Similarly, a fake data\npair (\u02dcx, y) can be sampled from the generator Gx by \ufb01rst drawing y from p(y) and then drawing\n\u02dcx from px(x|y); hence (\u02dcx, y) is sampled from the joint distribution px(x, y) = px(x|y)p(y). As\nsuch, the generative process between px(x, y) and py(x, y) is reversed.\nThe objective of \u2206-GAN is to match the three joint distributions: p(x, y), px(x, y) and py(x, y). If\nthis is achieved, we are ensured that we have learned a bidirectional mapping px(x|y) and py(y|x)\nthat guarantees the generated fake data pairs (\u02dcx, y) and (x, \u02dcy) are indistinguishable from the true\ndata pairs (x, y). In order to match the joint distributions, an adversarial game is played. 
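Concretely, the sampling of the three kinds of joint pairs can be sketched in a few lines; the linear "generators", dimensions, and noise scales below are arbitrary stand-ins for illustration, not the paper's architectures:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Placeholder conditional generators G_x(y, z) and G_y(x, z); in the paper
# these are neural networks, here they are arbitrary linear maps.
def G_x(y, z):  # draws x~ from p_x(x | y)
    return 0.8 * y + 0.1 * z

def G_y(x, z):  # draws y~ from p_y(y | x)
    return -0.8 * x + 0.1 * z

# Real paired data (x, y) ~ p(x, y): here a toy correlated Gaussian.
x = rng.normal(size=n)
y = -0.8 * x + 0.05 * rng.normal(size=n)
real_pairs = np.stack([x, y], axis=1)

# Fake pair (x, y~): draw x ~ p(x), then y~ ~ p_y(y | x).
pair_x_ytilde = np.stack([x, G_y(x, rng.normal(size=n))], axis=1)

# Fake pair (x~, y): draw y ~ p(y), then x~ ~ p_x(x | y).
pair_xtilde_y = np.stack([G_x(y, rng.normal(size=n)), y], axis=1)

# The two discriminators receive these three batches of pairs.
```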
Joint pairs are drawn from three distributions: p(x, y), px(x, y) or py(x, y), and two discriminator networks are learned to discriminate among the three, while the two conditional generator networks are trained to fool the discriminators.
The value function describing the game is given by

min_{Gx,Gy} max_{D1,D2} V(Gx, Gy, D1, D2) = E(x,y)∼p(x,y)[log D1(x, y)]
  + Ey∼p(y),x̃∼px(x|y)[log((1 − D1(x̃, y)) · D2(x̃, y))]
  + Ex∼p(x),ỹ∼py(y|x)[log((1 − D1(x, ỹ)) · (1 − D2(x, ỹ)))] .   (2)

The discriminator D1 is used to distinguish whether a sample pair is from p(x, y) or not; if a sample pair is not from p(x, y), another discriminator D2 is used to distinguish whether it is from px(x, y) or py(x, y). D1 and D2 work cooperatively, and the use of both implicitly defines a ternary discriminative function D that distinguishes sample pairs in three ways. See Figure 1 for an illustration of the adversarial game and Appendix B for an algorithmic description of the training procedure.

2.3 Theoretical analysis

Δ-GAN shares many of the theoretical properties of GANs [1]. We first consider the optimal discriminators D1 and D2 for any given generators Gx and Gy. These optimal discriminators then allow reformulation of objective (2), which reduces to the Jensen-Shannon divergence among the joint distributions p(x, y), px(x, y) and py(x, y).
Proposition 1. For any fixed generators Gx and Gy, the optimal discriminators D1 and D2 of the game defined by V(Gx, Gy, D1, D2) are

D∗1(x, y) = p(x, y) / (p(x, y) + px(x, y) + py(x, y)) ,   D∗2(x, y) = px(x, y) / (px(x, y) + py(x, y)) .

Proof. The proof is a straightforward extension of the proof in [1]. See Appendix A for details.
Proposition 2.
The equilibrium of V(Gx, Gy, D1, D2) is achieved if and only if p(x, y) = px(x, y) = py(x, y) with D∗1(x, y) = 1/3 and D∗2(x, y) = 1/2, and the optimum value is −3 log 3.

Proof. Given the optimal D∗1(x, y) and D∗2(x, y), the minimax game can be reformulated as:

C(Gx, Gy) = max_{D1,D2} V(Gx, Gy, D1, D2)   (3)
          = −3 log 3 + 3 · JSD(p(x, y), px(x, y), py(x, y)) ≥ −3 log 3 ,   (4)

where JSD denotes the Jensen-Shannon divergence (JSD) among three distributions. See Appendix A for details.

Since p(x, y) = px(x, y) = py(x, y) can be achieved in theory, it can be readily seen that the learned conditional generators can reveal the true conditional distributions underlying the data, i.e., px(x|y) = p(x|y) and py(y|x) = p(y|x).

2.4 Semi-supervised learning

In order to further understand Δ-GAN, we write (2) as

V = Ep(x,y)[log D1(x, y)] + Epx(x̃,y)[log(1 − D1(x̃, y))] + Epy(x,ỹ)[log(1 − D1(x, ỹ))]   (conditional GAN)   (5)
  + Epx(x̃,y)[log D2(x̃, y)] + Epy(x,ỹ)[log(1 − D2(x, ỹ))] .   (BiGAN/ALI)   (6)

The objective of Δ-GAN is a combination of the objectives of conditional GAN and BiGAN. The BiGAN part matches two joint distributions: px(x, y) and py(x, y), while the conditional GAN part provides the supervision signal to notify the BiGAN part what joint distribution to match. Therefore, Δ-GAN provides a natural way to perform semi-supervised learning, since the conditional GAN part and the BiGAN part can be used to account for paired and unpaired data, respectively.
However, when doing semi-supervised learning, there is also one potential problem that we need to be cautious about.
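The claims of Propositions 1 and 2 can be checked numerically on small discrete joint distributions; the three distributions below are arbitrary illustrative stand-ins for p(x, y), px(x, y) and py(x, y):

```python
import numpy as np

# Three arbitrary discrete joint distributions over a 4-point (x, y) space.
p = np.array([0.4, 0.1, 0.1, 0.4])
px = np.array([0.3, 0.2, 0.2, 0.3])
py = np.array([0.25, 0.25, 0.25, 0.25])

def value(p, px, py, d1, d2):
    """Objective (2) for pointwise discriminators d1, d2."""
    return (np.sum(p * np.log(d1))
            + np.sum(px * np.log((1 - d1) * d2))
            + np.sum(py * np.log((1 - d1) * (1 - d2))))

# Optimal discriminators from Proposition 1.
d1_star = p / (p + px + py)
d2_star = px / (px + py)
v_star = value(p, px, py, d1_star, d2_star)

# Perturbing the optimal discriminators never increases the value.
for eps in (0.05, -0.05):
    d1_pert = np.clip(d1_star + eps, 1e-6, 1 - 1e-6)
    assert value(p, px, py, d1_pert, d2_star) <= v_star + 1e-12

# Proposition 2: at p = p_x = p_y, with D1* = 1/3 and D2* = 1/2,
# the value attains the optimum -3 log 3.
v_eq = value(p, p, p, np.full(4, 1 / 3), np.full(4, 1 / 2))
assert np.isclose(v_eq, -3 * np.log(3))
```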
The theoretical analysis in Section 2.3 is based on the assumption that the dataset is fully supervised, i.e., we have the ground-truth joint distribution p(x, y) and marginal distributions p(x) and p(y). In the semi-supervised setting, p(x) and p(y) are still available but p(x, y) is not. We can only obtain the joint distribution pl(x, y) characterized by the few paired data samples. Hence, in the semi-supervised setting, px(x, y) and py(x, y) will try to concentrate to the empirical distribution pl(x, y). We make the assumption that pl(x, y) ≈ p(x, y), i.e., the paired data can roughly characterize the whole dataset. For example, in the semi-supervised classification problem, one usually strives to make sure that labels are equally distributed among the labeled dataset.

2.5 Relation to Triple GAN

Δ-GAN is closely related to Triple GAN [12]. Below we review Triple GAN and then discuss the main differences. The value function of Triple GAN is defined as follows:

V = Ep(x,y)[log D(x, y)] + (1 − α)Epx(x̃,y)[log(1 − D(x̃, y))] + αEpy(x,ỹ)[log(1 − D(x, ỹ))] + Ep(x,y)[− log py(y|x)] ,   (7)

where α ∈ (0, 1) is a constant that controls the relative importance of the two generators. Let Triple GAN-s denote a simplified Triple GAN model with only the first three terms. As can be seen, Triple GAN-s can be considered as a combination of two conditional GANs, with the importance of each conditional GAN weighted by α. It can be proven that Triple GAN-s achieves equilibrium if and only if p(x, y) = (1 − α)px(x, y) + αpy(x, y), which is not desirable. To address this problem, in Triple GAN a standard supervised loss RL = Ep(x,y)[− log py(y|x)] is added.
As a result, when the discriminator is optimal, the cost function in Triple GAN becomes:

2 JSD( p(x, y) || ((1 − α)px(x, y) + αpy(x, y)) ) + KL(p(x, y) || py(x, y)) + const.   (8)

This cost function has the good property that it has a unique minimum at p(x, y) = px(x, y) = py(x, y). However, the objective becomes asymmetrical. The second KL term pays low cost for generating fake-looking samples [13]. By contrast, Δ-GAN directly optimizes the symmetric Jensen-Shannon divergence among three distributions. More importantly, the calculation of Ep(x,y)[− log py(y|x)] in Triple GAN also implies that the explicit density form of py(y|x) should be provided, which may not be desirable. On the other hand, Δ-GAN only requires that py(y|x) can be sampled from. For example, if we assume py(y|x) = ∫ δ(y − Gy(x, z))p(z)dz, and δ(·) is the Dirac delta function, we can sample y through sampling z; however, the density function of py(y|x) is not explicitly available.

2.6 Applications

Δ-GAN is a general framework that can be used for any joint distribution matching. Besides the semi-supervised image classification task considered in [12], we also conduct experiments on image-to-image translation and attribute-conditional image generation. When modeling image pairs, both px(x|y) and py(y|x) are implemented without introducing additional latent variables, i.e., px(x|y) = δ(x − Gx(y)), py(y|x) = δ(y − Gy(x)).
A different strategy is adopted when modeling the image-label/attribute pairs. Specifically, let x denote samples in the image domain and y denote samples in the label/attribute domain; y is a one-hot vector or a binary vector when representing labels and attributes, respectively. When modeling px(x|y), we assume that x is transformed by the latent style variables z given the label or attribute vector y, i.e., px(x|y) = ∫ δ(x − Gx(y, z))p(z)dz, where p(z) is chosen to be a simple distribution (e.g., uniform or standard normal). When learning py(y|x), py(y|x) is assumed to be a standard multi-class or multi-label classifier without latent variables z. In order to allow the training signal to be backpropagated from D1 and D2 to Gy, we adopt the REINFORCE algorithm as in [12], and use the label with the maximum probability to approximate the expectation over y, or use the output of the sigmoid function as the predicted attribute vector.

3 Related work

The proposed framework focuses on designing GANs for joint-distribution matching. Conditional GANs can be used for this task if supervised data are available. Various conditional GANs have been proposed to condition the image generation on class labels [6], attributes [15], texts [4, 16] and images [5, 17]. Unsupervised learning methods have also been developed for this task. BiGAN [11] and ALI [10] jointly learn a generation network and an inference network via adversarial learning. Though originally designed for learning the two-way transition between stochastic latent variables and real data samples, BiGAN and ALI can be directly adapted to learn the joint distribution of two real domains. Another method is DiscoGAN [7], in which two generators are used to model the bidirectional mapping between domains, and another two discriminators are used to decide whether a generated sample is fake or not in each individual domain. Further, additional reconstruction losses are introduced to make the two generators strongly coupled and also to alleviate the problem of mode collapse. Similar work includes CycleGAN [8], DualGAN [9] and DTN [18].
Additional weight-sharing constraints are introduced in CoGAN [19]\nand UNIT [20].\nOur work differs from the above work in that we aim at semi-supervised joint distribution matching.\nThe only work that we are aware of that also achieves this goal is Triple GAN. However, our model is\ndistinct from Triple GAN in important ways (see Section 2.5). Further, Triple GAN only focuses on\nimage classi\ufb01cation, while \u2206-GAN has been shown to be applicable to a wide range of applications.\nVarious methods and model architectures have been proposed to improve and stabilize the training\nof GAN, such as feature matching [21, 22, 23], Wasserstein GAN [24], energy-based GAN [25],\nand unrolled GAN [26] among many other related works. Our work is orthogonal to these methods,\nwhich could also be used to improve the training of \u2206-GAN. Instead of using adversarial loss, there\nalso exists work that uses supervised learning [27] for joint-distribution matching, and variational\nautoencoders for semi-supervised learning [28, 29]. Lastly, our work is also closely related to the\nrecent work of [30, 31, 32], which treats one of the domains as latent variables.\n4 Experiments\nWe present results on three tasks: (i) semi-supervised classi\ufb01cation on CIFAR10 [33]; (ii) image-\nto-image translation on MNIST [34] and the edges2shoes dataset [5]; and (iii) attribute-to-image\ngeneration on CelebA [35] and COCO [14]. We also conduct a toy data experiment to further\ndemonstrate the differences between \u2206-GAN and Triple GAN. We implement \u2206-GAN without\nintroducing additional regularization unless explicitly stated. All the network architectures are\nprovided in the Appendix.\n\n5\n\n\fFigure 2: Toy data experiment on \u2206-GAN and Triple GAN. (a) the joint distribution p(x, y) of real data. 
For (b) and (c), the left and right figures show the learned joint distributions px(x, y) and py(x, y), respectively.

Table 1: Error rates (%) on the partially labeled CIFAR10 dataset.

Algorithm            n = 4000
CatGAN [36]          19.58 ± 0.58
Improved GAN [21]    18.63 ± 2.32
ALI [10]             17.99 ± 1.62
Triple GAN [12]      16.99 ± 0.36
Δ-GAN (ours)         16.80 ± 0.42

Table 2: Classification accuracy (%) on the MNIST-to-MNIST-transpose dataset.

Algorithm     n = 100         n = 1000        All
DiscoGAN      −               −               15.00 ± 0.20
Triple GAN    63.79 ± 0.85    84.93 ± 1.63    86.70 ± 1.52
Δ-GAN         83.20 ± 1.88    88.98 ± 1.50    93.34 ± 1.46

4.1 Toy data experiment
We first compare our method with Triple GAN on a toy dataset. We synthesize data by drawing (x, y) ∼ (1/4)N(µ1, Σ1) + (1/4)N(µ2, Σ2) + (1/4)N(µ3, Σ3) + (1/4)N(µ4, Σ4), where µ1 = [0, 1.5]⊤, µ2 = [−1.5, 0]⊤, µ3 = [1.5, 0]⊤, µ4 = [0, −1.5]⊤, Σ1 = Σ4 = [3, 0; 0, 0.025] and Σ2 = Σ3 = [0.025, 0; 0, 3]. We generate 5000 (x, y) pairs for each mixture component. In order to implement Δ-GAN and Triple GAN-s, we model px(x|y) and py(y|x) as px(x|y) = ∫ δ(x − Gx(y, z))p(z)dz and py(y|x) = ∫ δ(y − Gy(x, z))p(z)dz, where both Gx and Gy are modeled as a 4-hidden-layer multilayer perceptron (MLP) with 500 hidden units in each layer, and p(z) is a bivariate standard Gaussian distribution. Triple GAN can be implemented by specifying both px(x|y) and py(y|x) to be distributions with explicit density form, e.g., Gaussian distributions. However, the performance can be bad since it fails to capture the multi-modality of px(x|y) and py(y|x). Hence, only Triple GAN-s is implemented. Results are shown in Figure 2.
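The toy mixture described above can be reproduced with a few lines of NumPy (the random seed is an arbitrary choice; means, covariances, and sample counts follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four mixture components from Section 4.1.
mus = [np.array([0.0, 1.5]), np.array([-1.5, 0.0]),
       np.array([1.5, 0.0]), np.array([0.0, -1.5])]
sigma_a = np.diag([3.0, 0.025])   # Sigma_1 = Sigma_4
sigma_b = np.diag([0.025, 3.0])   # Sigma_2 = Sigma_3
sigmas = [sigma_a, sigma_b, sigma_b, sigma_a]

# 5000 (x, y) pairs per component, 20000 pairs in total.
data = np.concatenate([rng.multivariate_normal(mu, sig, size=5000)
                       for mu, sig in zip(mus, sigmas)])
```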
The joint distributions px(x, y) and py(x, y) learned by Δ-GAN successfully match the true joint distribution p(x, y). Triple GAN-s cannot achieve this, and can only guarantee that (1/2)(px(x, y) + py(x, y)) matches p(x, y). Although this experiment is limited due to its simplicity, the results clearly support the advantage of our proposed model over Triple GAN.
4.2 Semi-supervised classification
We evaluate semi-supervised classification on the CIFAR10 dataset with 4000 labels. The labeled data is distributed equally across classes, and the results are averaged over 10 runs with different random splits of the training data. For fair comparison, we follow the publicly available code of Triple GAN and use the same regularization terms and hyperparameter settings. Results are summarized in Table 1. Our Δ-GAN achieves the best performance among all the competing methods. We also show the ability of Δ-GAN to disentangle classes and styles in Figure 3. Δ-GAN can generate realistic data in a specific class, and the injected noise vector encodes meaningful style patterns like background and color.
4.3 Image-to-image translation
We first evaluate image-to-image translation on the edges2shoes dataset. Results are shown in Figure 4 (bottom). Though DiscoGAN is an unsupervised learning method, it achieves impressive results. However, with supervision provided by 10% paired data, Δ-GAN generally generates more accurate edge details of the shoes. In order to provide quantitative evaluation of translating shoes to edges, we use mean squared error (MSE) as our metric.
The MSE of using DiscoGAN is 140.1; with 10%, 20% and 100% paired data, the MSE of using Δ-GAN is 125.3, 113.0 and 66.4, respectively.

Figure 3: Generated CIFAR10 samples, where each row shares the same label and each column uses the same noise.
Figure 4: Image-to-image translation experiments on the MNIST-to-MNIST-transpose and edges2shoes datasets.
Figure 5: Results on the face-to-attribute-to-face experiment. The 1st row is the input images; the 2nd row is the predicted attributes given the input images; the 3rd row is the generated images given the predicted attributes.

Table 3: Results of P@10 and nDCG@10 for attribute prediction on CelebA and COCO.

Dataset        CelebA                                      COCO
Method         1%            10%           100%            10%           50%           100%
Triple GAN     40.97/50.74   62.13/73.56   70.12/79.37     32.64/35.91   34.00/37.76   35.35/39.60
Δ-GAN          53.21/58.39   63.68/75.22   70.37/81.47     34.38/37.91   36.72/40.39   39.05/42.86

To further demonstrate the importance of providing supervision of domain correspondence, we created a new dataset based on MNIST [34], where the two image domains are the MNIST images and their corresponding transposed ones. As can be seen in Figure 4 (top), Δ-GAN matches images between domains well, while DiscoGAN fails in this task. For supporting quantitative evaluation, we have trained a classifier on the MNIST dataset; the classification accuracy of this classifier on the test set approaches 99.4%, and it is therefore trustworthy as an evaluation metric. Given an input MNIST image x, we first generate a transposed image y using the learned generator, then manually transpose it back to normal digits yT, and finally send this new image yT to the classifier. Results are summarized in Table 2, which are averages over 5 runs with different random splits of the training data.
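The evaluation protocol just described can be sketched as follows; `generator` and `classifier` are placeholders for the trained models, and the tiny stand-in models at the bottom exist only to make the sketch runnable:

```python
import numpy as np

def evaluate_transpose_translation(images, generator, classifier, labels):
    """Accuracy of domain translation, judged by a pretrained digit classifier.

    images: batch of normal MNIST digits, shape (n, 28, 28)
    generator: maps a normal digit to its (hopefully) transposed version
    classifier: maps a batch of normal digits to predicted labels
    """
    translated = generator(images)                     # y = G_y(x)
    back = np.transpose(translated, (0, 2, 1))         # manually transpose back: y^T
    return float(np.mean(classifier(back) == labels))  # classify y^T

# Smoke test with stand-in models: a "perfect" generator that transposes,
# and a "classifier" that reads a label planted in pixel (0, 0).
imgs = np.zeros((4, 28, 28))
labels = np.array([0, 1, 2, 3])
imgs[np.arange(4), 0, 0] = labels
acc = evaluate_transpose_translation(
    imgs,
    generator=lambda b: np.transpose(b, (0, 2, 1)),
    classifier=lambda b: b[:, 0, 0].astype(int),
    labels=labels,
)
assert acc == 1.0
```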
Δ-GAN achieves significantly better performance than Triple GAN and DiscoGAN.

4.4 Attribute-conditional image generation
We apply our method to face images from the CelebA dataset. This dataset consists of 202,599 images annotated with 40 binary attributes. We scale and crop the images to 64 × 64 pixels. In order to qualitatively evaluate the learned attribute-conditional image generator and the multi-label classifier, given an input face image, we first use the classifier to predict attributes, and then use the image generator to produce images based on the predicted attributes. Figure 5 shows example results. Both the learned attribute predictor and the image generator provide good results. We further show another set of image editing experiments in Figure 6. For each subfigure, we use the same set of attributes with different noise vectors to generate images.

Figure 6: Results on the image editing experiment.
Figure 7: Results on the image-to-attribute-to-image experiment.

For example, for the top-right subfigure, all the images in the 1st row were generated based on the following attributes: black hair, female, attractive, and we then added the attribute of “sunglasses” when generating the images in the 2nd row. It is interesting to see that Δ-GAN has great
flexibility to adjust the generated images by changing certain input attributes. For instance, by switching on the wearing hat attribute, one can edit the face image to have a hat on the head.
In order to demonstrate the scalability of our model to large and complex datasets, we also present results on the COCO dataset. Following [37], we first select a set of 1000 attributes from the caption text in the training set, which includes the most frequent nouns, verbs and adjectives. The images in COCO are scaled and cropped to 64 × 64 pixels. Unlike the case of CelebA face images, the networks need to learn how to handle multiple objects and diverse backgrounds. Results are provided in Figure 7. We can generate reasonably good images based on the predicted attributes. The input and generated images also clearly share the same set of attributes. We also observe diversity in the samples by simply drawing multiple noise vectors and using the same predicted attributes.
Precision (P) and normalized Discounted Cumulative Gain (nDCG) are two popular evaluation metrics for multi-label classification problems. Table 3 provides the quantitative results of P@10 and nDCG@10 on CelebA and COCO, where @k means at rank k (see the Appendix for definitions). For fair comparison, we use the same network architectures for both Triple GAN and Δ-GAN. Δ-GAN consistently provides better results than Triple GAN. On the COCO dataset, our semi-supervised learning approach with 50% labeled data achieves better performance than the results of Triple GAN using the full dataset, demonstrating the effectiveness of our approach for semi-supervised joint distribution matching. More results for the above experiments are provided in the Appendix.
5 Conclusion
We have presented the Triangle Generative Adversarial Network (Δ-GAN), a new GAN framework that can be used for semi-supervised joint distribution matching.
Our approach learns the bidirectional mappings between two domains with a few paired samples. We have demonstrated that Δ-GAN may be employed for a wide range of applications. One possible future direction is to combine Δ-GAN with sequence GAN [38] or textGAN [23] to model the joint distribution of image-caption pairs.
Acknowledgements This research was supported in part by ARO, DARPA, DOE, NGA and ONR.

References
[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[2] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.
[3] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks.
In ICLR, 2016.

[4] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.

[5] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.

[7] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.

[8] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

[9] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.

[10] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.

[11] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.

[12] Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. In NIPS, 2017.

[13] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.

[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[15] Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, and Jose M. Álvarez. Invertible conditional GANs for image editing. arXiv:1611.06355, 2016.

[16] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.

[17] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.

[18] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.

[19] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NIPS, 2016.

[20] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.

[21] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.

[22] Yizhe Zhang, Zhe Gan, and Lawrence Carin. Generating text via adversarial training. In NIPS Workshop on Adversarial Training, 2016.

[23] Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.

[24] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.

[25] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. In ICLR, 2017.

[26] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.

[27] Yingce Xia, Tao Qin, Wei Chen, Jiang Bian, Nenghai Yu, and Tie-Yan Liu. Dual supervised learning. In ICML, 2017.

[28] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.

[29] Yunchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin.
VAE learning via Stein variational gradient descent. In NIPS, 2017.

[30] Chunyuan Li, Hao Liu, Changyou Chen, Yunchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In NIPS, 2017.

[31] Yunchen Pu, Weiyao Wang, Ricardo Henao, Liqun Chen, Zhe Gan, Chunyuan Li, and Lawrence Carin. Adversarial symmetric variational autoencoder. In NIPS, 2017.

[32] Yunchen Pu, Liqun Chen, Shuyang Dai, Weiyao Wang, Chunyuan Li, and Lawrence Carin. Symmetric variational autoencoder and connections to adversarial learning. In NIPS, 2017.

[33] Alex Krizhevsky. Learning multiple layers of features from tiny images. Citeseer, 2009.

[34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[35] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.

[36] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv:1511.06390, 2015.

[37] Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.

[38] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient.
In AAAI, 2017.