{"title": "Unsupervised Image-to-Image Translation Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 700, "page_last": 708, "abstract": "Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Since there exists an infinite set of joint distributions that can arrive the given marginal distributions, one could infer nothing about the joint distribution from the marginal distributions without additional assumptions. To address the problem, we make a shared-latent space assumption and propose an unsupervised image-to-image translation framework based on Coupled GANs. We compare the proposed framework with competing approaches and present high quality image translation results on various challenging unsupervised image translation tasks, including street scene image translation, animal image translation, and face image translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on benchmark datasets. Code and additional results are available in https://github.com/mingyuliutw/unit.", "full_text": "Unsupervised Image-to-Image Translation Networks\n\nMing-Yu Liu, Thomas Breuel,\n\nJan Kautz\n\n{mingyul,tbreuel,jkautz}@nvidia.com\n\nNVIDIA\n\nAbstract\n\nUnsupervised image-to-image translation aims at learning a joint distribution of\nimages in different domains by using images from the marginal distributions in\nindividual domains. Since there exists an in\ufb01nite set of joint distributions that\ncan arrive the given marginal distributions, one could infer nothing about the joint\ndistribution from the marginal distributions without additional assumptions. 
To\naddress the problem, we make a shared-latent space assumption and propose an\nunsupervised image-to-image translation framework based on Coupled GANs.\nWe compare the proposed framework with competing approaches and present\nhigh quality image translation results on various challenging unsupervised image\ntranslation tasks, including street scene image translation, animal image translation,\nand face image translation. We also apply the proposed framework to domain\nadaptation and achieve state-of-the-art performance on benchmark datasets. Code\nand additional results are available at https://github.com/mingyuliutw/unit.\n\n1 Introduction\n\nMany computer vision problems can be posed as an image-to-image translation problem, mapping\nan image in one domain to a corresponding image in another domain. For example, super-resolution\ncan be considered as a problem of mapping a low-resolution image to a corresponding high-resolution\nimage; colorization can be considered as a problem of mapping a gray-scale image to a corresponding\ncolor image. The problem can be studied in supervised and unsupervised learning settings. In the\nsupervised setting, pairs of corresponding images in different domains are available [8, 15]. In the\nunsupervised setting, we only have two independent sets of images where one consists of images\nin one domain and the other consists of images in another domain; there exist no paired examples\nshowing how an image could be translated to a corresponding image in another domain. Due to the\nlack of corresponding images, the UNsupervised Image-to-image Translation (UNIT) problem is\nconsidered harder, but it is more applicable since training data collection is easier.\nWhen analyzing the image translation problem from a probabilistic modeling perspective, the key\nchallenge is to learn a joint distribution of images in different domains. 
In the unsupervised setting,\nthe two sets consist of images from two marginal distributions in two different domains, and the task is\nto infer the joint distribution using these images. The coupling theory [16] states that there exists an infinite\nset of joint distributions that can arrive at the given marginal distributions in general. Hence, inferring\nthe joint distribution from the marginal distributions is a highly ill-posed problem. To address the\nill-posed problem, we need additional assumptions on the structure of the joint distribution.\nTo this end we make a shared-latent space assumption, which assumes a pair of corresponding images\nin different domains can be mapped to the same latent representation in a shared-latent space. Based on\nthe assumption, we propose a UNIT framework that is based on generative adversarial networks\n(GANs) and variational autoencoders (VAEs). We model each image domain using a VAE-GAN. The\nadversarial training objective interacts with a weight-sharing constraint, which enforces a shared-latent\nspace, to generate corresponding images in two domains, while the variational autoencoders\nrelate translated images with input images in the respective domains. We applied the proposed\nframework to various unsupervised image-to-image translation problems and achieved high quality\nimage translation results. We also applied it to the domain adaptation problem and achieved state-of-the-art\naccuracies on benchmark datasets. The shared-latent space assumption was used in Coupled\nGAN [17] for joint distribution learning. Here, we extend the Coupled GAN work for the UNIT\nproblem. We also note that several contemporary works propose the cycle-consistency constraint\nassumption [29, 10], which hypothesizes the existence of a cycle-consistency mapping so that an\nimage in the source domain can be mapped to an image in the target domain and this translated image\nin the target domain can be mapped back to the original image in the source domain. In the paper, we\nshow that the shared-latent space constraint implies the cycle-consistency constraint.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: (a) The shared latent space assumption. We assume a pair of corresponding images (x1, x2) in\ntwo different domains X1 and X2 can be mapped to the same latent code z in a shared-latent space Z. E1 and\nE2 are two encoding functions, mapping images to latent codes. G1 and G2 are two generation functions,\nmapping latent codes to images. (b) The proposed UNIT framework. We represent E1, E2, G1, and G2 using\nCNNs and implement the shared-latent space assumption using a weight sharing constraint where the connection\nweights of the last few layers (high-level layers) in E1 and E2 are tied (illustrated using dashed lines) and the\nconnection weights of the first few layers (high-level layers) in G1 and G2 are tied. Here, $\\tilde{x}_1^{1 \\to 1}$ and $\\tilde{x}_2^{2 \\to 2}$\nare self-reconstructed images, and $\\tilde{x}_1^{1 \\to 2}$ and $\\tilde{x}_2^{2 \\to 1}$ are domain-translated images. D1 and D2 are adversarial\ndiscriminators for the respective domains, in charge of evaluating whether the translated images are realistic.\n\nTable 1: Interpretation of the roles of the subnetworks in the proposed framework.\nNetworks | Roles\n{E1, G1} | VAE for X1\n{E1, G2} | Image Translator X1 $\\to$ X2\n{G1, D1} | GAN for X1\n{E1, G1, D1} | VAE-GAN [14]\n{G1, G2, D1, D2} | CoGAN [17]\n\n2 Assumptions\n\nLet X1 and X2 be two image domains. In supervised image-to-image translation, we are given\nsamples (x1, x2) drawn from a joint distribution $P_{X_1,X_2}(x_1, x_2)$. 
In unsupervised image-to-image\ntranslation, we are given samples drawn from the marginal distributions $P_{X_1}(x_1)$ and $P_{X_2}(x_2)$. Since\nan infinite set of possible joint distributions can yield the given marginal distributions, we could infer\nnothing about the joint distribution from the marginal samples without additional assumptions.\nWe make the shared-latent space assumption. As shown in Figure 1, we assume for any given pair\nof images x1 and x2, there exists a shared latent code z in a shared-latent space, such that we\ncan recover both images from this code, and we can compute this code from each of the two\nimages. That is, we postulate there exist functions $E_1^*$, $E_2^*$, $G_1^*$, and $G_2^*$ such that, given a pair of\ncorresponding images (x1, x2) from the joint distribution, we have $z = E_1^*(x_1) = E_2^*(x_2)$ and\nconversely $x_1 = G_1^*(z)$ and $x_2 = G_2^*(z)$. Within this model, the function $x_2 = F_{1 \\to 2}^*(x_1)$ that\nmaps from X1 to X2 can be represented by the composition $F_{1 \\to 2}^*(x_1) = G_2^*(E_1^*(x_1))$. Similarly,\n$x_1 = F_{2 \\to 1}^*(x_2) = G_1^*(E_2^*(x_2))$. The UNIT problem then becomes a problem of learning $F_{1 \\to 2}^*$\nand $F_{2 \\to 1}^*$. We note that a necessary condition for $F_{1 \\to 2}^*$ and $F_{2 \\to 1}^*$ to exist is the cycle-consistency\nconstraint [29, 10]: $x_1 = F_{2 \\to 1}^*(F_{1 \\to 2}^*(x_1))$ and $x_2 = F_{1 \\to 2}^*(F_{2 \\to 1}^*(x_2))$. We can reconstruct\nthe input image by translating back the translated input image. In other words, the proposed\nshared-latent space assumption implies the cycle-consistency assumption (but not vice versa).\nTo implement the shared-latent space assumption, we further assume a shared intermediate representation\nh such that the process of generating a pair of corresponding images admits a form of\n\n$$z \\to h \\begin{array}{l} \\nearrow x_1 \\\\ \\searrow x_2 \\end{array} \\qquad (1)$$\n\nConsequently, we have $G_1^* \\equiv G_{L,1}^* \\circ G_H^*$ and $G_2^* \\equiv G_{L,2}^* \\circ G_H^*$ where $G_H^*$ is a common high-level\ngeneration function that maps z to h and $G_{L,1}^*$ and $G_{L,2}^*$ are low-level generation functions that map\nh to x1 and x2, respectively. In the case of multi-domain image translation (e.g., sunny and rainy\nimage translation), z can be regarded as the compact, high-level representation of a scene (\"car in\nfront, trees in back\"), and h can be considered a particular realization of z through $G_H^*$ (\"car/tree\noccupy the following pixels\"), and $G_{L,1}^*$ and $G_{L,2}^*$ would be the actual image formation functions\nin each modality (\"tree is lush green in the sunny domain, but dark green in the rainy domain\").\nAssuming h also allows us to represent $E_1^*$ and $E_2^*$ by $E_1^* \\equiv E_H^* \\circ E_{L,1}^*$ and $E_2^* \\equiv E_H^* \\circ E_{L,2}^*$.\nIn the next section, we discuss how we realize the above ideas in the proposed UNIT framework.\n\n3 Framework\n\nOur framework, as illustrated in Figure 1, is based on variational autoencoders (VAEs) [13, 22, 14]\nand generative adversarial networks (GANs) [6, 17]. It consists of 6 subnetworks: two\ndomain image encoders E1 and E2, two domain image generators G1 and G2, and two domain\nadversarial discriminators D1 and D2. Several ways exist to interpret the roles of the subnetworks,\nwhich we summarize in Table 1. Our framework learns translation in both directions in one shot.\nVAE. The encoder-generator pair {E1, G1} constitutes a VAE for the X1 domain, termed VAE1. For\nan input image $x_1 \\in X_1$, the VAE1 first maps x1 to a code in a latent space Z via the encoder E1 and\nthen decodes a randomly perturbed version of the code to reconstruct the input image via the generator\nG1. We assume the components in the latent space Z are conditionally independent and Gaussian with\nunit variance. In our formulation, the encoder outputs a mean vector $E_{\\mu,1}(x_1)$ and the distribution\nof the latent code z1 is given by $q_1(z_1|x_1) \\equiv \\mathcal{N}(z_1 | E_{\\mu,1}(x_1), I)$ where I is an identity matrix. The\nreconstructed image is $\\tilde{x}_1^{1 \\to 1} = G_1(z_1 \\sim q_1(z_1|x_1))$. Note that here we abused the notation since\nwe treated the distribution $q_1(z_1|x_1)$ as a random vector of $\\mathcal{N}(E_{\\mu,1}(x_1), I)$ and sampled from it.\nSimilarly, {E2, G2} constitutes a VAE for X2, termed VAE2, where the encoder E2 outputs a mean vector\n$E_{\\mu,2}(x_2)$ and the distribution of the latent code z2 is given by $q_2(z_2|x_2) \\equiv \\mathcal{N}(z_2 | E_{\\mu,2}(x_2), I)$. The\nreconstructed image is $\\tilde{x}_2^{2 \\to 2} = G_2(z_2 \\sim q_2(z_2|x_2))$.\nUtilizing the reparameterization trick [13], the non-differentiable sampling operation can be reparameterized\nas a differentiable operation using auxiliary random variables. This reparameterization trick\nallows us to train VAEs using back-prop. Let $\\eta$ be a random vector with a multi-variate Gaussian\ndistribution: $\\eta \\sim \\mathcal{N}(\\eta | 0, I)$. The sampling operations of $z_1 \\sim q_1(z_1|x_1)$ and $z_2 \\sim q_2(z_2|x_2)$ can be\nimplemented via $z_1 = E_{\\mu,1}(x_1) + \\eta$ and $z_2 = E_{\\mu,2}(x_2) + \\eta$, respectively.\nWeight-sharing. Based on the shared-latent space assumption discussed in Section 2, we enforce a\nweight-sharing constraint to relate the two VAEs. Specifically, we share the weights of the last few\nlayers of E1 and E2 that are responsible for extracting high-level representations of the input images\nin the two domains. Similarly, we share the weights of the first few layers of G1 and G2 responsible\nfor decoding high-level representations for reconstructing the input images.\nNote that the weight-sharing constraint alone does not guarantee that corresponding images in two\ndomains will have the same latent code. In the unsupervised setting, no pair of corresponding images\nin the two domains exists to train the network to output the same latent code. 
The extracted latent\ncodes for a pair of corresponding images are different in general. Even if they are the same, the same\nlatent component may have different semantic meanings in different domains. Hence, the same latent\ncode could still be decoded to output two unrelated images. However, we will show that through\nadversarial training, a pair of corresponding images in the two domains can be mapped to a common\nlatent code by E1 and E2, respectively, and a latent code will be mapped to a pair of corresponding\nimages in the two domains by G1 and G2, respectively.\nThe shared-latent space assumption allows us to perform image-to-image translation. We can\ntranslate an image x1 in X1 to an image in X2 through applying $G_2(z_1 \\sim q_1(z_1|x_1))$. We term such\nan information processing stream as the image translation stream. Two image translation streams exist\nin the proposed framework: $X_1 \\to X_2$ and $X_2 \\to X_1$. The two streams are trained jointly with the\ntwo image reconstruction streams from the VAEs. Once we could ensure that a pair of corresponding\nimages are mapped to the same latent code and the same latent code is decoded to a pair of corresponding\nimages, $(x_1, G_2(z_1 \\sim q_1(z_1|x_1)))$ would form a pair of corresponding images. In other words, the\ncomposition of E1 and G2 functions approximates $F_{1 \\to 2}^*$ for unsupervised image-to-image translation\ndiscussed in Section 2, and the composition of E2 and G1 functions approximates $F_{2 \\to 1}^*$.\nGANs. Our framework has two generative adversarial networks: GAN1 = {D1, G1} and GAN2 =\n{D2, G2}. In GAN1, for real images sampled from the first domain, D1 should output true, while\nfor images generated by G1, it should output false. G1 can generate two types of images: 1) images\nfrom the reconstruction stream $\\tilde{x}_1^{1 \\to 1} = G_1(z_1 \\sim q_1(z_1|x_1))$ and 2) images from the translation\nstream $\\tilde{x}_2^{2 \\to 1} = G_1(z_2 \\sim q_2(z_2|x_2))$. Since the reconstruction stream can be trained with supervision, it\nsuffices to apply adversarial training only to images from the translation stream, $\\tilde{x}_2^{2 \\to 1}$. We\napply similar processing to GAN2, where D2 is trained to output true for real images sampled from\nthe second domain dataset and false for images generated from G2.\nCycle-consistency (CC). Since the shared-latent space assumption implies the cycle-consistency\nconstraint (see Section 2), we could also enforce the cycle-consistency constraint in the proposed\nframework to further regularize the ill-posed unsupervised image-to-image translation problem. The\nresulting information processing stream is called the cycle-reconstruction stream.\nLearning. We jointly solve the learning problems of the VAE1, VAE2, GAN1 and GAN2 for the\nimage reconstruction streams, the image translation streams, and the cycle-reconstruction streams:\n\n$$\\min_{E_1,E_2,G_1,G_2} \\max_{D_1,D_2} \\; \\mathcal{L}_{\\mathrm{VAE}_1}(E_1, G_1) + \\mathcal{L}_{\\mathrm{GAN}_1}(E_1, G_1, D_1) + \\mathcal{L}_{\\mathrm{CC}_1}(E_1, G_1, E_2, G_2) + \\mathcal{L}_{\\mathrm{VAE}_2}(E_2, G_2) + \\mathcal{L}_{\\mathrm{GAN}_2}(E_2, G_2, D_2) + \\mathcal{L}_{\\mathrm{CC}_2}(E_2, G_2, E_1, G_1). \\qquad (2)$$\n\nVAE training aims to minimize a variational upper bound. In (2), the VAE objectives are\n\n$$\\mathcal{L}_{\\mathrm{VAE}_1}(E_1, G_1) = \\lambda_1 \\mathrm{KL}(q_1(z_1|x_1) \\| p_\\eta(z)) - \\lambda_2 \\mathbb{E}_{z_1 \\sim q_1(z_1|x_1)}[\\log p_{G_1}(x_1|z_1)] \\qquad (3)$$\n$$\\mathcal{L}_{\\mathrm{VAE}_2}(E_2, G_2) = \\lambda_1 \\mathrm{KL}(q_2(z_2|x_2) \\| p_\\eta(z)) - \\lambda_2 \\mathbb{E}_{z_2 \\sim q_2(z_2|x_2)}[\\log p_{G_2}(x_2|z_2)] \\qquad (4)$$\n\nwhere the hyper-parameters $\\lambda_1$ and $\\lambda_2$ control the weights of the objective terms and the KL\ndivergence terms penalize deviation of the distribution of the latent code from the prior distribution.\nThe regularization allows an easy way to sample from the latent space [13]. We model $p_{G_1}$ and $p_{G_2}$\nusing Laplacian distributions, respectively. Hence, minimizing the negative log-likelihood term is\nequivalent to minimizing the absolute distance between the image and the reconstructed image. The\nprior distribution is a zero mean Gaussian $p_\\eta(z) = \\mathcal{N}(z | 0, I)$.\nIn (2), the GAN objective functions are given by\n\n$$\\mathcal{L}_{\\mathrm{GAN}_1}(E_1, G_1, D_1) = \\lambda_0 \\mathbb{E}_{x_1 \\sim P_{X_1}}[\\log D_1(x_1)] + \\lambda_0 \\mathbb{E}_{z_2 \\sim q_2(z_2|x_2)}[\\log(1 - D_1(G_1(z_2)))] \\qquad (5)$$\n$$\\mathcal{L}_{\\mathrm{GAN}_2}(E_2, G_2, D_2) = \\lambda_0 \\mathbb{E}_{x_2 \\sim P_{X_2}}[\\log D_2(x_2)] + \\lambda_0 \\mathbb{E}_{z_1 \\sim q_1(z_1|x_1)}[\\log(1 - D_2(G_2(z_1)))] \\qquad (6)$$\n\nThe objective functions in (5) and (6) are conditional GAN objective functions. They are used to\nensure the translated images resemble images in the target domains, respectively. The hyper-parameter\n$\\lambda_0$ controls the impact of the GAN objective functions.\nWe use a VAE-like objective function to model the cycle-consistency constraint, which is given by\n\n$$\\mathcal{L}_{\\mathrm{CC}_1}(E_1, G_1, E_2, G_2) = \\lambda_3 \\mathrm{KL}(q_1(z_1|x_1) \\| p_\\eta(z)) + \\lambda_3 \\mathrm{KL}(q_2(z_2|x_1^{1 \\to 2}) \\| p_\\eta(z)) - \\lambda_4 \\mathbb{E}_{z_2 \\sim q_2(z_2|x_1^{1 \\to 2})}[\\log p_{G_1}(x_1|z_2)] \\qquad (7)$$\n$$\\mathcal{L}_{\\mathrm{CC}_2}(E_2, G_2, E_1, G_1) = \\lambda_3 \\mathrm{KL}(q_2(z_2|x_2) \\| p_\\eta(z)) + \\lambda_3 \\mathrm{KL}(q_1(z_1|x_2^{2 \\to 1}) \\| p_\\eta(z)) - \\lambda_4 \\mathbb{E}_{z_1 \\sim q_1(z_1|x_2^{2 \\to 1})}[\\log p_{G_2}(x_2|z_1)] \\qquad (8)$$\n\nwhere the negative log-likelihood objective term ensures a twice translated image resembles the\ninput one and the KL terms penalize the latent codes deviating from the prior distribution in the\ncycle-reconstruction stream (therefore, there are two KL terms). The hyper-parameters $\\lambda_3$ and $\\lambda_4$\ncontrol the weights of the two different objective terms.\nInheriting from GAN, training of the proposed framework results in solving a mini-max problem\nwhere the optimization aims to find a saddle point. It can be seen as a two player zero-sum game.\nThe first player is a team consisting of the encoders and generators. The second player is a team\nconsisting of the adversarial discriminators. In addition to defeating the second player, the first player\nhas to minimize the VAE losses and the cycle-consistency losses. 
We apply an alternating gradient\nupdate scheme similar to the one described in [6] to solve (2). Specifically, we first apply a gradient\nascent step to update D1 and D2 with E1, E2, G1, and G2 fixed. We then apply a gradient descent\nstep to update E1, E2, G1, and G2 with D1 and D2 fixed.\nTranslation: After learning, we obtain two image translation functions by assembling a subset of the\nsubnetworks. We have $F_{1 \\to 2}(x_1) = G_2(z_1 \\sim q_1(z_1|x_1))$ for translating images from X1 to X2 and\n$F_{2 \\to 1}(x_2) = G_1(z_2 \\sim q_2(z_2|x_2))$ for translating images from X2 to X1.\n\nFigure 2: (a) Illustration of the Map dataset. Left: satellite image. Right: map. We translate holdout satellite\nimages to maps and measure the accuracy achieved by various configurations of the proposed framework.\n(b) Translation accuracy versus different network architectures (x-axis: # of shared layers in gen.; curves labeled 3 Dis, 4 Dis, 5 Dis, and 6 Dis).\n(c) Translation accuracy versus different\nhyper-parameter values (x-axis: $\\lambda_2$; curves: $\\lambda_1$ = 10, 1, 0.1, and 0.01). (d) Impact of weight-sharing and cycle-consistency constraints on translation accuracy:\nMethod | Accuracy\nWeight Sharing | 0.569\u00b10.029\nCycle Consistency | 0.568\u00b10.010\nProposed | 0.600\u00b10.015\n\n4 Experiments\n\nWe first analyze various components of the proposed framework. We then present visual results on\nchallenging translation tasks. Finally, we apply our framework to the domain adaptation tasks.\nPerformance Analysis. We used ADAM [11] for training where the learning rate was set to 0.0001\nand momentums were set to 0.5 and 0.999. Each mini-batch consisted of one image from the first\ndomain and one image from the second domain. Our framework had several hyper-parameters. 
The\ndefault values were $\\lambda_0 = 10$, $\\lambda_3 = \\lambda_1 = 0.1$, and $\\lambda_4 = \\lambda_2 = 100$. For the network architecture,\nour encoders consisted of 3 convolutional layers as the front-end and 4 basic residual blocks [7] as\nthe back-end. The generators consisted of 4 basic residual blocks as the front-end and 3 transposed\nconvolutional layers as the back-end. The discriminators consisted of stacks of convolutional layers.\nWe used LeakyReLU for nonlinearity. The details of the networks are given in Appendix A.\nWe used the map dataset [8] (visualized in Figure 2), which contained corresponding pairs of images\nin two domains (satellite image and map) useful for quantitative evaluation. Here, the goal was to\nlearn to translate between satellite images and maps. We operated in an unsupervised setting where\nwe used the 1096 satellite images from the training set as the first domain and 1098 maps from the\nvalidation set as the second domain. We trained for 100K iterations and used the final model to\ntranslate 1098 satellite images in the test set. We then compared each translated\nsatellite image (supposed to be a map) with the corresponding ground truth map pixel-wise. A pixel\ntranslation was counted correct if the color difference was within 16 of the ground truth color value.\nWe used the average pixel accuracy over the images in the test set as the performance metric. We\ncould use color difference for measuring translation accuracy since the target translation function\nwas unimodal. We did not evaluate the translation from maps to images since the translation was\nmulti-modal, which made it difficult to construct a proper evaluation metric.\nIn one experiment, we varied the number of weight-sharing layers in the VAEs and paired each\nconfiguration with discriminator architectures of different depths during training. We changed the\nnumber of weight-sharing layers from 1 to 4. 
(Sharing 1 layer in VAEs means sharing 1 layer for\nE1 and E2 and, at the same time, sharing 1 layer for G1 and G2.) The results were reported in\nFigure 2(b). Each curve corresponded to a discriminator architecture of a different depth. The x-axis\ndenoted the number of weight-sharing layers in the VAEs. We found that the shallowest discriminator\narchitecture led to the worst performance. We also found that the number of weight-sharing layers\nhad little impact. This was due to the use of the residual blocks. Tying the weights of one layer\neffectively constrained the other layers since the residual blocks only updated the residual information.\nIn the rest of the experiments, we used VAEs with 1 sharing layer and discriminators of 5 layers.\nWe analyzed the impact of the hyper-parameter values on the translation accuracy. For different weight\nvalues on the negative log likelihood terms (i.e., $\\lambda_2$, $\\lambda_4$), we computed the achieved translation\naccuracy over different weight values on the KL terms (i.e., $\\lambda_1$, $\\lambda_3$). The results were reported in\nFigure 2(c). We found that, in general, a larger weight value on the negative log likelihood terms\nyielded a better translation accuracy. We also found setting the weights of the KL terms to 0.1 resulted\nin consistently good performance. We hence set $\\lambda_1 = \\lambda_3 = 0.1$ and $\\lambda_2 = \\lambda_4 = 100$.\nWe performed an ablation study measuring the impact of the weight-sharing and cycle-consistency\nconstraints on the translation performance and showed the results in Figure 2(d). We reported average\naccuracy over 5 trials (trained with differently initialized weights). We note that when we removed\nthe weight-sharing constraint (as a consequence, we also removed the reconstruction streams in the\nframework), the framework was reduced to the CycleGAN architecture [29, 10]. We found the model\nachieved an average pixel accuracy of 0.569. 
When we removed the cycle-consistency constraint\nand only used the weight-sharing constraint (see footnote 1), it achieved 0.568 average pixel accuracy. But when we\nused the full model, it achieved the best performance of 0.600 average pixel accuracy. This echoed\nour point that for the ill-posed joint distribution recovery problem, more constraints are beneficial.\nQualitative results. Figures 3 to 6 showed results of the proposed framework on various UNIT tasks.\nStreet images. We applied the proposed framework to several unsupervised street scene image\ntranslation tasks including sunny to rainy, day to night, summery to snowy, and vice versa. For each\ntask, we used a set of images extracted from driving videos recorded on different days and in different cities. The\nnumbers of the images in the sunny/day, rainy, night, summery, and snowy sets are 86165, 28915,\n36280, 6838, and 6044. We trained the network to translate street scene images of size 640\u00d7480. In\nFigure 3, we showed several example translation results. We found that our method could generate\nrealistic translated images. We also found that one translation was usually harder than the other.\nSpecifically, the translation that required adding more details to the image was usually harder (e.g.\nnight to day). Additional results are available at https://github.com/mingyuliutw/unit.\nSynthetic to real. In Figure 3, we showed several example results achieved by applying the proposed\nframework to translate images between the synthetic images in the SYNTHIA dataset [23] and the\nreal images in the Cityscape dataset [2]. For the real to synthetic translation, we found our method\nmade the cityscape images cartoon like. For the synthetic to real translation, our method achieved\nbetter results in the building, sky, road, and car regions than in the human regions.\nDog breed conversion. 
We used the images of Husky, German Shepherd, Corgi, Samoyed, and Old\nEnglish Sheep dogs in the ImageNet dataset to learn to translate dog images between different breeds.\nWe only used the head regions, which were extracted by a template matching algorithm. Several\nexample results were shown in Figure 4. We found our method translated a dog to a different breed.\nCat species conversion. We also used the images of house cat, tiger, lion, cougar, leopard, jaguar,\nand cheetah in the ImageNet dataset to learn to translate cat images between different species. We\nonly used the head regions, which again were extracted by a template matching algorithm. Several\nexample results were shown in Figure 5. We found our method translated a cat to a different species.\nFace attribute. We used the CelebA dataset [18] for attribute-based face image translation. Each face\nimage in the dataset had several attributes, including blond hair, smiling, goatee, and eyeglasses. The\nface images with an attribute constituted the 1st domain, while those without the attribute constituted\nthe 2nd domain. In Figure 6, we visualized the results where we translated several images that do not\nhave blond hair, eyeglasses, goatee, and smiling to corresponding images with each of the individual\nattributes. We found that the translated face images were realistic.\nDomain Adaptation. We applied the proposed framework to the problem of adapting a classifier\ntrained using labeled samples in one domain (source domain) to classify samples in a new domain\n(target domain) where labeled samples in the new domain are unavailable during training. Early\nworks have explored ideas from subspace learning [4] to deep feature learning [5, 17, 26].\nWe performed multi-task learning where we trained the framework to 1) translate images between\nthe source and target domains and 2) classify samples in the source domain using the features\nextracted by the discriminator in the source domain. 
Here, we tied the weights of the high-level\nlayers of D1 and D2. This allows us to adapt a classifier trained in the source domain to the target\ndomain. Also, for a pair of generated images in different domains, we minimized the L1 distance\nbetween the features extracted by the highest layer of the discriminators, which further encouraged\nD1 and D2 to interpret a pair of corresponding images in the same way.\n\nFootnote 1: We used this architecture in an earlier version of the paper.\n\nFigure 3: Street scene image translation results. For each pair, left is input and right is the translated image.\n\nFigure 4: Dog breed translation results.\n\nFigure 5: Cat species translation results.\n\nFigure 6: Attribute-based face translation results. Columns: Input, +Blond Hair, +Eyeglasses, +Goatee, +Smiling.\n\nTable 2: Unsupervised domain adaptation performance. The reported numbers are classification accuracies.\nMethod | SA [4] | DANN [5] | DTN [26] | CoGAN | UNIT (proposed)\nSVHN$\\to$MNIST | 0.5932 | 0.7385 | 0.8488 | - | 0.9053\nMNIST$\\to$USPS | - | - | - | 0.9565 | 0.9597\nUSPS$\\to$MNIST | - | - | - | 0.9315 | 0.9358\n\nWe applied the approach to\nseveral tasks including adapting from the Street View House Number (SVHN) dataset [20] to the\nMNIST dataset and adapting between the MNIST and USPS datasets. Table 2 reported the achieved\nperformance with comparison to the competing approaches. We found that our method achieved a\n0.9053 accuracy for the SVHN$\\to$MNIST task, which was much better than 0.8488 achieved by the\nprevious state-of-the-art method [26]. We also achieved better performance for the MNIST$\\leftrightarrow$SVHN\ntask than the Coupled GAN approach, which was the state-of-the-art. The digit images had a small\nresolution. 
Hence, we used a small network. We also found that the cycle-consistency constraint was\nnot necessary for this task. More details about the experiments are available in Appendix B.\n\n5 Related Work\n\nSeveral deep generative models were recently proposed for image generation including GANs [6],\nVAEs [13, 22], and PixelCNN [27]. The proposed framework was based on GANs and VAEs but it\nwas designed for the unsupervised image-to-image translation task, which could be considered as a\nconditional image generation model. In the following, we first review several recent GAN and VAE\nworks and then discuss related image translation works.\nGAN learning works by staging a zero-sum game between the generator and discriminator. The quality\nof GAN-generated images has improved dramatically since their introduction. LapGAN [3] proposed\na Laplacian pyramid implementation of GANs. DCGAN [21] used a deeper convolutional network.\nSeveral GAN training tricks were proposed in [24]. WGAN [1] used the Wasserstein distance.\nVAEs optimize a variational bound. By improving the variational approximation, better image\ngeneration results were achieved [19, 12]. In [14], a VAE-GAN architecture was proposed to improve\nimage generation quality of VAEs. VAEs were applied to translate face image attributes in [28].\nConditional generative modeling is a popular approach for mapping an image from one domain to\nanother. Most of the existing works were based on supervised learning [15, 8, 9]. Our work differs\nfrom the previous works in that we do not need corresponding images. Recently, [26] proposed the\ndomain transformation network (DTN) and achieved promising results on translating small resolution\nface and digit images. In addition to faces and digits, we demonstrated that the proposed framework\ncan translate large resolution natural images. It also achieved a better performance in the unsupervised\ndomain adaptation task. 
In [25], a conditional generative adversarial network-based approach was\nproposed to translate a rendered image to a real image for gaze estimation. To ensure that\nthe generated real image was similar to the original rendered image, the L1 distance between\nthe generated and original images was minimized. We note that two contemporary papers [29, 10]\nindependently introduced the cycle-consistency constraint for unsupervised image translation.\nWe showed that the cycle-consistency constraint is a natural consequence of the proposed\nshared-latent space assumption. In our experiments, we found that the cycle-consistency and the\nweight-sharing (a realization of the shared-latent space assumption) constraints rendered comparable\nperformance. When the two constraints were jointly used, the best performance was achieved.\n\n6 Conclusion and Future Work\n\nWe presented a general framework for unsupervised image-to-image translation. We showed that it\nlearned to translate an image from one domain to another without any corresponding images in the two\ndomains in the training dataset. The current framework has two limitations. First, the translation\nmodel is unimodal due to the Gaussian latent space assumption. Second, training could be unstable\ndue to the saddle point searching problem. We plan to address these issues in future work.\n\nReferences\n[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.\n[2] M. Cordts, M. Omran, S. Ramos, T. Scharw\u00e4chter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset. Conference on Computer Vision and Pattern Recognition Workshop, 2015.\n[3] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. Advances in Neural Information Processing Systems, 2015.\n[4] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. 
Unsupervised visual domain adaptation using subspace alignment. International Conference on Computer Vision, 2013.\n[5] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 2016.\n[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 2014.\n[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition, 2016.\n[8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. Conference on Computer Vision and Pattern Recognition, 2017.\n[9] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. European Conference on Computer Vision, 2016.\n[10] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. International Conference on Machine Learning, 2017.\n[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.\n[12] D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 2016.\n[13] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.\n[14] A. B. L. Larsen, S. K. S\u00f8nderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. International Conference on Machine Learning, 2016.\n[15] C. Ledig, L. Theis, F. Husz\u00e1r, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. 
Photo-realistic single image super-resolution using a generative adversarial network. Conference on Computer Vision and Pattern Recognition, 2017.\n[16] T. Lindvall. Lectures on the coupling method. Courier Corporation, 2002.\n[17] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. Advances in Neural Information Processing Systems, 2016.\n[18] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. International Conference on Computer Vision, 2015.\n[19] L. Maal\u00f8e, C. K. S\u00f8nderby, S. K. S\u00f8nderby, and O. Winther. Auxiliary deep generative models. International Conference on Machine Learning, 2016.\n[20] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. Advances in Neural Information Processing Systems workshop, 2011.\n[21] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations, 2016.\n[22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent Gaussian models. International Conference on Machine Learning, 2014.\n[23] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. Conference on Computer Vision and Pattern Recognition, 2016.\n[24] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 2016.\n[25] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. Conference on Computer Vision and Pattern Recognition, 2017.\n[26] Y. Taigman, A. Polyak, and L. Wolf. 
Unsupervised cross-domain image generation. International Conference on Learning Representations, 2017.\n[27] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems, 2016.\n[28] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. European Conference on Computer Vision, 2016.\n[29] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. International Conference on Computer Vision, 2017.", "award": [], "sourceid": 469, "authors": [{"given_name": "Ming-Yu", "family_name": "Liu", "institution": "NVIDIA"}, {"given_name": "Thomas", "family_name": "Breuel", "institution": null}, {"given_name": "Jan", "family_name": "Kautz", "institution": "NVIDIA"}]}