{"title": "Toward Multimodal Image-to-Image Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 476, "abstract": "Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a distribution of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.", "full_text": "Toward Multimodal Image-to-Image Translation\n\nJun-Yan Zhu\nUC Berkeley\n\nRichard Zhang\n\nUC Berkeley\n\nDeepak Pathak\nUC Berkeley\n\nTrevor Darrell\nUC Berkeley\n\nAlexei A. Efros\nUC Berkeley\n\nOliver Wang\nAdobe Research\n\nEli Shechtman\nAdobe Research\n\nAbstract\n\nMany image-to-image translation problems are ambiguous, as a single input image\nmay correspond to multiple possible outputs. In this work, we aim to model\na distribution of possible outputs in a conditional generative modeling setting.\nThe ambiguity of the mapping is distilled in a low-dimensional latent vector,\nwhich can be randomly sampled at test time. A generator learns to map the given\ninput, combined with this latent code, to the output. 
We explicitly encourage the\nconnection between output and the latent code to be invertible. This helps prevent\na many-to-one mapping from the latent code to the output during training, also\nknown as the problem of mode collapse, and produces more diverse results. We\nexplore several variants of this approach by employing different training objectives,\nnetwork architectures, and methods of injecting the latent code. Our proposed\nmethod encourages bijective consistency between the latent encoding and output\nmodes. We present a systematic comparison of our method and other variants on\nboth perceptual realism and diversity.\n\n1\n\nIntroduction\n\nDeep learning techniques have made rapid progress in conditional image generation. For example,\nnetworks have been used to inpaint missing image regions [20, 34, 47], add color to grayscale\nimages [19, 20, 27, 50], and generate photorealistic images from sketches [20, 40]. However, most\ntechniques in this space have focused on generating a single result. In this work, we model a\ndistribution of potential results, as many of these problems may be multimodal in nature. For\nexample, as seen in Figure 1, an image captured at night may look very different in the day, depending\non cloud patterns and lighting conditions. We pursue two main goals: producing results which are (1)\nperceptually realistic and (2) diverse, all while remaining faithful to the input.\nMapping from a high-dimensional input to a high-dimensional output distribution is challenging. A\ncommon approach to representing multimodality is learning a low-dimensional latent code, which\nshould represent aspects of the possible outputs not contained in the input image. At inference time,\na deterministic generator uses the input image, along with stochastically sampled latent codes, to\nproduce randomly sampled outputs. A common problem in existing methods is mode collapse [14],\nwhere only a small number of real samples get represented in the output. 
We systematically study a\nfamily of solutions to this problem.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Multimodal image-to-image translation using our proposed method: given an input image from\none domain (night image of a scene), we aim to model a distribution of potential outputs in the target domain\n(corresponding day images), producing both realistic and diverse results.\n\nWe start with the pix2pix framework [20], which has previously been shown to produce high-quality\nresults for various image-to-image translation tasks. The method trains a generator network,\nconditioned on the input image, with two losses: (1) a regression loss to produce similar output\nto the known paired ground truth image and (2) a learned discriminator loss to encourage realism.\nThe authors note that trivially appending a randomly drawn latent code did not produce diverse\nresults. Instead, we propose encouraging a bijection between the output and the latent space. We not\nonly perform the direct task of mapping the latent code (along with the input) to the output but also\njointly learn an encoder from the output back to the latent space. This discourages two different latent\ncodes from generating the same output (a non-injective mapping). During training, the learned encoder\nattempts to pass enough information to the generator to resolve any ambiguities regarding the output\nmode. For example, when generating a day image from a night image, the latent vector may encode\ninformation about the sky color, lighting effects on the ground, and cloud patterns. Composing the\nencoder and generator sequentially should result in the same image being recovered. 
The opposite\nshould produce the same latent code.\nIn this work, we instantiate this idea by exploring several objective functions, inspired by literature in\nunconditional generative modeling:\n\u2022 cVAE-GAN (Conditional Variational Autoencoder GAN): One approach is \ufb01rst encoding the\nground truth image into the latent space, giving the generator a noisy \u201cpeek\" into the desired\noutput. Using this, along with the input image, the generator should be able to reconstruct the\nspeci\ufb01c output image. To ensure that random sampling can be used during inference time, the latent\ndistribution is regularized using KL-divergence to be close to a standard normal distribution. This\napproach has been popularized in the unconditional setting by VAEs [23] and VAE-GANs [26].\n\n\u2022 cLR-GAN (Conditional Latent Regressor GAN): Another approach is to \ufb01rst provide a randomly\ndrawn latent vector to the generator. In this case, the produced output may not necessarily look like\nthe ground truth image, but it should look realistic. An encoder then attempts to recover the latent\nvector from the output image. This method could be seen as a conditional formulation of the \u201clatent\nregressor\" model [8, 10] and also related to InfoGAN [4].\n\n\u2022 BicycleGAN: Finally, we combine both these approaches to enforce the connection between latent\nencoding and output in both directions jointly and achieve improved performance. We show that\nour method can produce both diverse and visually appealing results across a wide range of image-\nto-image translation problems, signi\ufb01cantly more diverse than other baselines, including naively\nadding noise in the pix2pix framework. 
In addition to the loss function, we study the performance\nwith respect to several encoder networks, as well as different ways of injecting the latent code into\nthe generator network.\n\nWe perform a systematic evaluation of these variants by using humans to judge photorealism and\na perceptual distance metric [52] to assess output diversity. Code and data are available at https://github.com/junyanz/BicycleGAN.\n\nFigure 2: Overview: (a) Test time usage of all the methods. To produce a sample output, a latent code z is\nfirst randomly sampled from a known distribution (e.g., a standard normal distribution). A generator G maps an\ninput image A (blue) and the latent sample z to produce an output sample ˆB (yellow). (b) pix2pix+noise [20]\nbaseline, with an additional ground truth image B (brown) that corresponds to A. (c) cVAE-GAN (and cAE-GAN)\nstarts from a ground truth target image B and encodes it into the latent space. The generator then attempts to map\nthe input image A along with a sampled z back into the original image B. (d) cLR-GAN randomly samples a\nlatent code from a known distribution, uses it to map A into the output ˆB, and then tries to reconstruct the latent\ncode from the output. (e) Our hybrid BicycleGAN method combines constraints in both directions.\n\n2 Related Work\n\nGenerative modeling Parametric modeling of the natural image distribution is a challenging\nproblem. Classically, this problem has been tackled using restricted Boltzmann machines [41] and\nautoencoders [18, 43]. Variational autoencoders [23] provide an effective approach for modeling\nstochasticity within the network by reparametrization of a latent distribution at training time. A\ndifferent approach is autoregressive models [11, 32, 33], which are effective at modeling natural\nimage statistics but are slow at inference time due to their sequential predictive nature. 
Generative\nadversarial networks [15] overcome this issue by mapping random values from an easy-to-sample\ndistribution (e.g., a low-dimensional Gaussian) to output images in a single feedforward pass of a\nnetwork. During training, the samples are judged using a discriminator network, which distinguishes\nbetween samples from the target distribution and the generator network. GANs have recently been\nvery successful [1, 4, 6, 8, 10, 35, 36, 49, 53, 54]. Our method builds on conditional versions of the\nVAE [23] and the InfoGAN [4] / latent regressor [8, 10] models by jointly optimizing their objectives.\nWe revisit this connection in Section 3.4.\n\nConditional image generation All of the methods defined above can be easily conditioned. While\nconditional VAEs [42] and autoregressive models [32, 33] have shown promise [16, 44, 46], image-to-image\nconditional GANs have led to a substantial boost in the quality of the results. However, the\nquality has been attained at the expense of multimodality, as the generator learns to largely ignore the\nrandom noise vector when conditioned on a relevant context [20, 34, 40, 45, 47, 55]. In fact, it has\neven been shown that ignoring the noise leads to more stable training [20, 29, 34].\n\nExplicitly-encoded multimodality One way to express multiple modes is to explicitly encode\nthem and provide them as an additional input alongside the input image. For example, color\nand shape scribbles and other interfaces were used as conditioning in iGAN [54], pix2pix [20],\nScribbler [40], and interactive colorization [51]. An effective option explored by concurrent work [2,\n3, 13] is to use a mixture of models. Though able to produce multiple discrete answers, these\nmethods are unable to produce continuous changes. 
While there has been some degree of success\nfor generating multimodal outputs in unconditional and text-conditional setups [7, 15, 26, 31, 36],\nconditional image-to-image generation is still far from achieving the same results, unless explicitly\nencoded as discussed above. In this work, we learn conditional image generation models for modeling\nmultiple modes of output by enforcing tight connections between the latent and image spaces.\n\n3 Multimodal Image-to-Image Translation\n\nOur goal is to learn a multi-modal mapping between two image domains, for example, edges and\nphotographs, or night and day images, etc. Consider the input domain A ⊂ R^{H×W×3}, which is to be\nmapped to an output domain B ⊂ R^{H×W×3}. During training, we are given a dataset of paired instances\nfrom these domains, {(A ∈ A, B ∈ B)}, which is representative of a joint distribution p(A, B). It is\nimportant to note that there could be multiple plausible paired instances B that would correspond to\nan input instance A, but the training dataset usually contains only one such pair. However, given a\nnew instance A during test time, our model should be able to generate a diverse set of outputs ˆB,\ncorresponding to different modes in the distribution p(B|A).\nWhile conditional GANs have achieved success in image-to-image translation tasks [20, 34, 40, 45,\n47, 55], they are primarily limited to generating a deterministic output ˆB given the input image A.\nOn the other hand, we would like to learn a mapping that could sample the output ˆB from the true\nconditional distribution given A, and produce results which are both diverse and realistic. To do so,\nwe learn a low-dimensional latent space z ∈ R^Z, which encapsulates the ambiguous aspects of the\noutput mode which are not present in the input image. For example, a sketch of a shoe could map\nto a variety of colors and textures, which could get compressed in this latent code. We then learn\na deterministic mapping G : (A, z) → B to the output. To enable stochastic sampling, we desire\nthe latent code vector z to be drawn from some prior distribution p(z); we use a standard Gaussian\ndistribution N(0, I) in this work.\nWe first discuss a simple extension of existing methods and its strengths and weaknesses,\nmotivating the development of our proposed approach in the subsequent subsections.\n\n3.1 Baseline: pix2pix+noise (z → ˆB)\n\nThe recently proposed pix2pix model [20] has shown high-quality results in the image-to-image\ntranslation setting. It uses conditional adversarial networks [15, 30] to help produce perceptually\nrealistic results. GANs train a generator G and discriminator D by formulating their objective as an\nadversarial game. The discriminator attempts to differentiate between real images from the dataset\nand fake samples produced by the generator. Randomly drawn noise z is added to attempt to induce\nstochasticity. 
We illustrate the formulation in Figure 2(b) and describe it below.\n\nL_GAN(G, D) = E_{A,B∼p(A,B)}[log(D(A, B))] + E_{A∼p(A), z∼p(z)}[log(1 − D(A, G(A, z)))]   (1)\n\nTo encourage the output of the generator to match the ground truth, as well as to stabilize the training,\nwe use an ℓ1 loss between the output and the ground truth image.\n\nL^image_1(G) = E_{A,B∼p(A,B), z∼p(z)} ||B − G(A, z)||_1   (2)\n\nThe final loss function uses the GAN and ℓ1 terms, balanced by λ.\n\nG* = arg min_G max_D L_GAN(G, D) + λ L^image_1(G)   (3)\n\nIn this scenario, there is little incentive for the generator to make use of the noise vector, which\nencodes random information. Isola et al. [20] note that the noise was ignored by the generator in\npreliminary experiments and was removed from the final experiments. This was consistent with\nobservations made in the conditional settings by [29, 34], as well as the mode collapse phenomenon\nobserved in unconditional cases [14, 39]. In this paper, we explore different ways to explicitly enforce\nthe latent coding to capture relevant information.\n\n3.2 Conditional Variational Autoencoder GAN: cVAE-GAN (B → z → ˆB)\n\nOne way to force the latent code z to be “useful\" is to directly map the ground truth B to it\nusing an encoding function E. The generator G then uses both the latent code and the input\nimage A to synthesize the desired output ˆB. The overall model can be easily understood as the\nreconstruction of B, with latent encoding z concatenated with the paired A in the middle – similar to\nan autoencoder [18]. This interpretation is better shown in Figure 2(c).\n\nThis approach has been successfully investigated in Variational Autoencoders [23] in the unconditional\nscenario without the adversarial objective. 
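As a toy illustration of the reconstruction view just described, a cVAE-GAN-style forward pass can be sketched in a few lines of plain Python. E, G, and all numbers here are stand-ins of our own (two-element "images"), not the paper's networks:

```python
import math

# Toy cVAE-GAN forward pass (Figure 2(c)): encode the ground truth B into z,
# then reconstruct B from the input A and z. E and G are illustrative
# stand-ins of our own, not the paper's networks.

def E(b):
    # Encoder predicts a Gaussian over z: (mean, log-variance).
    return sum(b) / len(b), 0.0

def G(a, z):
    # Generator combines the input A with the latent code z.
    return [x + z for x in a]

A = [0.1, 0.3]                 # input image (two "pixels")
B = [0.6, 0.8]                 # paired ground-truth output

mu, log_var = E(B)             # Q(z|B): a noisy "peek" at the target
eps = 0.0                      # eps ~ N(0, 1); fixed at 0 here for determinism
z = mu + math.exp(0.5 * log_var) * eps   # re-parameterization trick
B_hat = G(A, z)
loss_l1 = sum(abs(b - bh) for b, bh in zip(B, B_hat))   # the L1 term
```

With eps fixed at 0, the sample z is just the predicted mean; during training eps would be drawn from N(0, 1), so gradients flow through mu and log_var.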
Extending it to the conditional scenario, we model the distribution Q(z|B) of the latent code z\nusing the encoder E with a Gaussian assumption, Q(z|B) ≜ E(B). To reflect this, Equation 1 is\nmodified to sample z ∼ E(B), using the re-parameterization trick to allow direct back-propagation [23].\n\nL^VAE_GAN(G, D, E) = E_{A,B∼p(A,B)}[log(D(A, B))] + E_{A,B∼p(A,B), z∼E(B)}[log(1 − D(A, G(A, z)))]   (4)\n\nWe make the corresponding change in the ℓ1 loss term in Equation 2 as well to obtain L^VAE_1(G, E) =\nE_{A,B∼p(A,B), z∼E(B)} ||B − G(A, z)||_1. Further, the latent distribution encoded by E(B) is\nencouraged to be close to a random Gaussian to enable sampling at inference time, when B is not known.\n\nL_KL(E) = E_{B∼p(B)}[D_KL(E(B) || N(0, I))],   (5)\n\nwhere D_KL(p||q) = ∫ p(z) log (p(z)/q(z)) dz. This forms our cVAE-GAN objective, a conditional\nversion of the VAE-GAN [26], as\n\nG*, E* = arg min_{G,E} max_D L^VAE_GAN(G, D, E) + λ L^VAE_1(G, E) + λ_KL L_KL(E).   (6)\n\nAs a baseline, we also consider the deterministic version of this approach, i.e., dropping the KL-divergence\nand encoding z = E(B). We call it cAE-GAN and show a comparison in the experiments.\nThere is no guarantee in cAE-GAN on the distribution of the latent space z, which makes the test-time\nsampling of z difficult.\n\n3.3 Conditional Latent Regressor GAN: cLR-GAN (z → ˆB → ˆz)\n\nWe explore another method of enforcing the generator network to utilize the latent code embedding z,\nwhile staying close to the actual test time distribution p(z), but from the latent code’s perspective.\nAs shown in Figure 2(d), we start from a randomly drawn latent code z and attempt to recover it\nwith ˆz = E(G(A, z)). Note that the encoder E here is producing a point estimate for ˆz, whereas the\nencoder in the previous section was predicting a Gaussian distribution.\n\nL^latent_1(G, E) = E_{A∼p(A), z∼p(z)} ||z − E(G(A, z))||_1   (7)\n\nWe also include the discriminator loss L_GAN(G, D) (Equation 1) on ˆB to encourage the network to\ngenerate realistic results, and the full loss can be written as:\n\nG*, E* = arg min_{G,E} max_D L_GAN(G, D) + λ_latent L^latent_1(G, E)   (8)\n\nThe ℓ1 loss for the ground truth image B is not used. Since the noise vector is randomly drawn, the\npredicted ˆB does not necessarily need to be close to the ground truth but does need to be realistic.\nThe above objective bears similarity to the “latent regressor\" model [4, 8, 10], where the generated\nsample ˆB is encoded to generate a latent vector.\n\n3.4 Our Hybrid Model: BicycleGAN\n\nWe combine the cVAE-GAN and cLR-GAN objectives in a hybrid model. For cVAE-GAN, the encoding\nis learned from real data, but a random latent code may not yield realistic images at test time – the\nKL loss may not be well optimized. Perhaps more importantly, the adversarial classifier D does not\nhave a chance to see results sampled from the prior during training. In cLR-GAN, the latent space is\neasily sampled from a simple distribution, but the generator is trained without the benefit of seeing\nground truth input-output pairs. 
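The latent-regression cycle of Equation 7 can likewise be sketched with toy stand-ins (a G and E of our own, deliberately constructed to be invertible; not the paper's networks):

```python
# Toy latent-regression cycle of Equation 7:
#   z -> B_hat = G(A, z) -> z_hat = E(B_hat).
# G and E here are illustrative stand-ins of our own, chosen to be exactly
# invertible; they are not the paper's networks.

def G(a, z):
    # Toy generator: the latent code is written into the output,
    # so an encoder can read it back (the invertibility we want).
    return z + [x + z[0] for x in a]

def E(b):
    # Toy encoder: point estimate of z, recovered from the output.
    return [b[0]]

A = [0.2, 0.6]
z = [0.25]                     # z ~ p(z), drawn once
B_hat = G(A, z)
z_hat = E(B_hat)
loss_latent = sum(abs(zi - zhi) for zi, zhi in zip(z, z_hat))  # Equation 7
assert loss_latent == 0.0      # a perfectly invertible G/E pair drives Eq. 7 to zero
```

A mode-collapsed generator that ignored z would make z_hat independent of z, and this loss could not be driven to zero.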
We propose to train with constraints in both directions, aiming to\ntake advantage of both cycles (B → z → ˆB and z → ˆB → ˆz), hence the name BicycleGAN.\n\nG*, E* = arg min_{G,E} max_D L^VAE_GAN(G, D, E) + λ L^VAE_1(G, E)\n         + L_GAN(G, D) + λ_latent L^latent_1(G, E) + λ_KL L_KL(E),   (9)\n\nwhere the hyper-parameters λ, λ_latent, and λ_KL control the relative importance of each term.\n\nFigure 3: Alternatives for injecting z into the generator. Latent code z is injected by spatial replication and\nconcatenation into the generator network. We tried two alternatives: (left) injecting z into the input layer and\n(right) injecting z into every intermediate layer in the encoder.\n\nIn the unconditional GAN setting, Larsen et al. [26] observe that using samples from both the prior\nN(0, I) and encoded E(B) distributions further improves results. Hence, we also report one variant\nwhich is the full objective shown above (Equation 9), but without the reconstruction loss on the latent\nspace L^latent_1. We call it cVAE-GAN++, as it is based on cVAE-GAN with an additional loss L_GAN(G, D),\nwhich allows the discriminator to see randomly drawn samples from the prior.\n\n4 Implementation Details\n\nThe code and additional results are publicly available at https://github.com/junyanz/BicycleGAN.\nPlease refer to our website for more details about the datasets, architectures, and\ntraining procedures.\n\nNetwork architecture For generator G, we use the U-Net [37], which contains an encoder-decoder\narchitecture with symmetric skip connections. The architecture has been shown to produce strong\nresults in the unimodal image prediction setting when there is a spatial correspondence between\ninput and output pairs. For discriminator D, we use two PatchGAN discriminators [20] at different\nscales, which aim to predict real vs. 
fake for 70 × 70 and 140 × 140 overlapping image patches.\nFor the encoder E, we experiment with two networks: (1) E_CNN: a CNN with a few convolutional and\ndownsampling layers, and (2) E_ResNet: a classifier with several residual blocks [17].\n\nTraining details We build our model on the Least Squares GANs (LSGANs) variant [28], which\nuses a least-squares objective instead of a cross-entropy loss. LSGANs produce high-quality results\nwith stable training. We also find that not conditioning the discriminator D on input A leads to\nbetter results (also discussed in [34]), and hence choose to do the same for all methods. We set the\nparameters λ_image = 10, λ_latent = 0.5, and λ_KL = 0.01 in all our experiments. We tie the weights\nfor the generators and encoders in the cVAE-GAN and cLR-GAN models. For the encoder, only the\npredicted mean is used in cLR-GAN. We observe that using two separate discriminators yields slightly\nbetter visual results compared to sharing weights. We only update G for the ℓ1 loss L^latent_1(G, E) on\nthe latent code (Equation 7), while keeping E fixed. We found that optimizing G and E simultaneously\nfor this loss would encourage G and E to hide the information of the latent code without learning\nmeaningful modes. We train our networks from scratch using Adam [22] with a batch size of 1 and\na learning rate of 0.0002. We choose latent dimension |z| = 8 across all the datasets.\n\nInjecting the latent code z into the generator. 
We explore two ways of propagating the latent code z\nto the output, as shown in Figure 3: (1) add_to_input: we spatially replicate a Z-dimensional\nlatent code z to an H × W × Z tensor and concatenate it with the H × W × 3 input image, and\n(2) add_to_all: we add z to each intermediate layer of the network G, after spatial replication to\nthe appropriate sizes.\n\n5 Experiments\n\nDatasets We test our method on several image-to-image translation problems from prior work,\nincluding edges → photos [48, 54], Google maps → satellites [20], labels → images [5], and outdoor\nnight → day images [25]. These problems are all one-to-many mappings. We train all the models on\n256 × 256 images.\n\nMethods We evaluate the following models described in Section 3: pix2pix+noise, cAE-GAN,\ncVAE-GAN, cVAE-GAN++, cLR-GAN, and our hybrid model BicycleGAN.\n\nFigure 4: Example results. We show example results of our hybrid model BicycleGAN. The left column\nshows the input. The second shows the ground truth output. The final four columns show randomly generated\nsamples. We show results of our method on night→day, edges→shoes, edges→handbags, and maps→satellites.\nModels and additional examples are available at https://junyanz.github.io/BicycleGAN.\n\nFigure 5: Qualitative method comparison. We compare results on the labels → facades dataset across different\nmethods. 
The BicycleGAN method produces results which are both realistic and diverse.\n\nMethod                Realism (AMT Fooling Rate [%])   Diversity (LPIPS Distance)\nRandom real images    50.0%                            .265±.007\npix2pix+noise [20]    27.93±2.40%                      .013±.000\ncAE-GAN               13.64±1.80%                      .200±.002\ncVAE-GAN              24.93±2.27%                      .095±.001\ncVAE-GAN++            29.19±2.43%                      .099±.002\ncLR-GAN               29.23±2.48%                      .089±.002 (a)\nBicycleGAN            34.33±2.69%                      .111±.002\n\n(a) We found that cLR-GAN resulted in severe mode collapse, resulting in ∼15% of the images producing the\nsame result. Those images were omitted from this calculation.\n\nFigure 6: Realism vs. diversity. We measure diversity using average LPIPS distance [52], and realism using a\nreal vs. fake Amazon Mechanical Turk test on the Google maps → satellites task. The pix2pix+noise baseline\nproduces little diversity. Using only the cAE-GAN method produces large artifacts during sampling. The hybrid\nBicycleGAN method, which combines cVAE-GAN and cLR-GAN, produces results which have higher realism\nwhile maintaining diversity.\n\n5.1 Qualitative Evaluation\n\nWe show qualitative comparison results in Figure 5. We observe that pix2pix+noise typically\nproduces a single realistic output, but does not produce any meaningful variation. cAE-GAN adds\nvariation to the output, but typically at a large cost to result quality. An example on facades is shown\nin Figure 4.\nWe observe more variation in the cVAE-GAN, as the latent space is encouraged to encode information\nabout ground truth outputs. However, the space is not densely populated, so drawing random\nsamples may cause artifacts in the output. The cLR-GAN shows less variation in the output, and\nsometimes suffers from mode collapse. 
When combining these methods, however, in the hybrid\nmethod BicycleGAN, we observe results which are both diverse and realistic. Please see our website\nfor a full set of results.\n\n5.2 Quantitative Evaluation\n\nWe perform a quantitative analysis of the diversity, realism, and latent space distribution on our six\nvariants and baselines. We quantitatively test on the Google maps → satellites dataset.\n\nDiversity We compute the average distance of random samples in deep feature space. Pretrained\nnetworks have been used as a “perceptual loss\" in image generation applications [9, 12, 21], as well\nas a held-out “validation\" score in generative modeling, for example, assessing the semantic quality\nand diversity of a generative model [39] or the semantic accuracy of a grayscale colorization [50].\nIn Figure 6, we show the diversity score using the LPIPS metric proposed by [52]1. For each\nmethod, we compute the average distance between 1900 pairs of randomly generated output ˆB\nimages (sampled from 100 input A images). Random pairs of ground truth real images in the B ∈ B\ndomain produce an average variation of .265. As we are measuring samples ˆB which correspond to a\nspecific input A, a system which stays faithful to the input should definitely not exceed this score.\nThe pix2pix system [20] produces a single point estimate. Adding noise to the system, as in\npix2pix+noise, produces a small diversity score, confirming the finding in [20] that adding noise\ndoes not produce large variation. Using the cAE-GAN model to encode a ground truth image B into a\nlatent code z does increase the variation. The cVAE-GAN, cVAE-GAN++, and BicycleGAN models all\nplace explicit constraints on the latent space, and the cLR-GAN model places an implicit constraint\nthrough sampling. These four methods all produce similar diversity scores. 
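The bookkeeping behind this diversity score can be sketched as follows; for simplicity the sketch uses raw pixel L1 distance in place of the LPIPS feature distance, and all sample values are made up:

```python
import itertools

def avg_pairwise_distance(samples):
    """Mean L1 distance over all pairs of generated samples.

    The paper measures this distance in LPIPS (AlexNet) feature space;
    raw pixel distance is used here only to illustrate the bookkeeping.
    """
    pairs = list(itertools.combinations(samples, 2))
    total = 0.0
    for b1, b2 in pairs:
        total += sum(abs(x - y) for x, y in zip(b1, b2)) / len(b1)
    return total / len(pairs)

# A mode-collapsed model (identical samples) scores 0; a diverse one does not.
collapsed = [[0.5, 0.5]] * 4
diverse = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
assert avg_pairwise_distance(collapsed) == 0.0
assert avg_pairwise_distance(diverse) > 0.0
```

Because the samples all correspond to one input A, this score is bounded in practice by the variation between random real images from the target domain (.265 above).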
We note that high\ndiversity scores may also indicate that unnatural images are being generated, causing meaningless\nvariations. Next, we investigate the visual realism of our samples.\n\n1 The Learned Perceptual Image Patch Similarity (LPIPS) metric computes distance in AlexNet [24] feature\nspace (conv1-5, pretrained on ImageNet [38]), with linear weights to better match human perceptual judgments.\n\nEncoder          E_CNN           E_CNN           E_ResNet        E_ResNet\nInjecting z      add_to_input    add_to_all      add_to_input    add_to_all\nlabel→photo      0.339 ± 0.069   0.326 ± 0.066   0.292 ± 0.054   0.292 ± 0.058\nmap→satellite    0.272 ± 0.069   0.287 ± 0.067   0.266 ± 0.068   0.268 ± 0.070\n\nTable 1: The encoding performance with respect to the different encoder architectures and methods\nof injecting z. Here we report the reconstruction loss ||B − G(A, E(B))||_1.\n\nFigure 7: Different label → facades results trained with varying lengths of the latent code |z| ∈ {2, 8, 256}.\n\nPerceptual Realism To judge the visual realism of our results, we use human judgments, as proposed\nin [50] and later used in [20, 55]. The test sequentially presents a real and generated image to a human\nfor 1 second each, in a random order, asks them to identify the fake, and measures the “fooling\"\nrate. Figure 6(left) shows the realism across methods. The pix2pix+noise model achieves a high\nrealism score, but without large diversity, as discussed in the previous section. The cAE-GAN helps\nproduce diversity, but this comes at a large cost to the visual realism. 
Because the distribution of\nthe learned latent space is unclear, random samples may be from unpopulated regions of the space.\nAdding the KL-divergence loss in the latent space, used in the cVAE-GAN model, recovers the visual\nrealism. Furthermore, as expected, checking randomly drawn z vectors in the cVAE-GAN++ model\nslightly increases realism. The cLR-GAN, which draws z vectors randomly from the predefined\ndistribution, produces similar realism and diversity scores. However, the cLR-GAN model resulted in\nlarge mode collapse: approximately 15% of the outputs produced the same result, independent of\nthe input image. The full hybrid BicycleGAN gets the best of both worlds, as it does not suffer from\nmode collapse and also has the highest realism score by a significant margin.\n\nEncoder architecture In pix2pix, Isola et al. [20] conduct extensive ablation studies on discriminators\nand generators. Here we focus on the performance of two encoder architectures, E_CNN and\nE_ResNet, for our applications on the maps and facades datasets. We find that E_ResNet better encodes the\noutput image, as measured by the image reconstruction loss ||B − G(A, E(B))||_1 on validation datasets,\nas shown in Table 1. We use E_ResNet in our final model.\n\nMethods of injecting latent code We evaluate two ways of injecting the latent code z: add_to_input\nand add_to_all (Section 4), using the same reconstruction loss ||B − G(A, E(B))||_1. Table 1\nshows that the two methods give similar performance. This indicates that the U-Net [37] can already\npropagate the information well to the output without the additional skip connections from z. We use\nthe add_to_all method to inject noise in our final model.\n\nLatent code length We study the BicycleGAN model results with respect to the varying number of\ndimensions of latent codes {2, 8, 256} in Figure 7. A very low-dimensional latent code may limit\nthe amount of diversity that can be expressed. 
Conversely, a very high-dimensional latent code can potentially encode more information about an output image, at the cost of making sampling difficult. The optimal length of z largely depends on the individual dataset and application, and on how much ambiguity there is in the output.

6 Conclusions
We have evaluated several methods for combating the problem of mode collapse in the conditional image generation setting. We find that by combining multiple objectives that encourage a bijective mapping between the latent and output spaces, we obtain results which are more realistic and diverse. We see many interesting avenues of future work, including directly enforcing a distribution in the latent space that encodes semantically meaningful attributes, to allow for image-to-image transformations with user-controllable parameters.

Acknowledgments We thank Phillip Isola and Tinghui Zhou for helpful discussions. This work was supported in part by Adobe Inc., DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1633310, IIS-1427425, IIS-1212798, the Berkeley Artificial Intelligence Research (BAIR) Lab, and hardware donations from NVIDIA. JYZ is supported by a Facebook Graduate Fellowship, RZ by an Adobe Research Fellowship, and DP by an NVIDIA Graduate Fellowship.

References

[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.

[2] A. Bansal, Y. Sheikh, and D. Ramanan. PixelNN: Example-based image synthesis. arXiv preprint arXiv:1708.05349, 2017.

[3] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.

[4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.

[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M.
Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

[6] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.

[7] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In ICLR, 2017.

[8] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2016.

[9] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.

[10] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. In ICLR, 2016.

[11] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999.

[12] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

[13] A. Ghosh, V. Kulharia, V. Namboodiri, P. H. Torr, and P. K. Dokania. Multi-agent diverse generative adversarial networks. arXiv preprint arXiv:1704.02906, 2017.

[14] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

[16] S. Guadarrama, R. Dahl, D. Bieber, M. Norouzi, J. Shlens, and K. Murphy. PixColor: Pixel recursive colorization. In BMVC, 2017.

[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[18] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[19] S. Iizuka, E. Simo-Serra, and H. Ishikawa.
Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. SIGGRAPH, 35(4), 2016.

[20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

[21] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

[22] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[23] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[24] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

[25] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. SIGGRAPH, 2014.

[26] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.

[27] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.

[28] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.

[29] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.

[30] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[31] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.

[32] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.

[33] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders.
In NIPS, 2016.

[34] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.

[35] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[36] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.

[37] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.

[39] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.

[40] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In CVPR, 2017.

[41] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.

[42] K. Sohn, X. Yan, and H. Lee. Learning structured output representation using deep conditional generative models. In NIPS, 2015.

[43] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.

[44] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.

[45] W. Xian, P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. TextureGAN: Controlling deep image synthesis with texture patches. arXiv preprint arXiv:1706.02823, 2017.

[46] T. Xue, J. Wu, K. Bouman, and B. Freeman.
Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.

[47] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR, 2017.

[48] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.

[49] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.

[50] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.

[51] R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros. Real-time user-guided image colorization with learned deep priors. SIGGRAPH, 2017.

[52] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924, 2018.

[53] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.

[54] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.

[55] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks.
In ICCV, 2017.