{"title": "Unsupervised Image-to-Image Translation Using Domain-Specific Variational Information Bound", "book": "Advances in Neural Information Processing Systems", "page_first": 10348, "page_last": 10358, "abstract": "Unsupervised image-to-image translation is a class of computer vision problems which aims at modeling conditional distribution of images in the target domain, given a set of unpaired images in the source and target domains. An image in the source domain might have multiple representations in the target domain. Therefore, ambiguity in modeling of the conditional distribution arises, specially when the images in the source and target domains come from different modalities. Current approaches mostly rely on simplifying assumptions to map both domains into a shared-latent space. Consequently, they are only able to model the domain-invariant information between the two modalities. These approaches cannot model domain-specific information which has no representation in the target domain. In this work, we propose an unsupervised image-to-image translation framework which maximizes a domain-specific variational information bound and learns the target domain-invariant representation of the two domain. The proposed framework makes it possible to map a single source image into multiple images in the target domain, utilizing several target domain-specific codes sampled randomly from the prior distribution, or extracted from reference images.", "full_text": "Unsupervised Image-to-Image Translation Using\nDomain-Speci\ufb01c Variational Information Bound\n\nHadi Kazemi\n\nSobhan Soleymani\n\nhakazemi@mix.wvu.edu\n\nssoleyma@mix.wvu.edu\n\nFariborz Taherkhani\n\nSeyed Mehdi Iranmanesh\n\nfariborztaherkhani@gmail.com\n\nseiranmanesh@mix.wvu.edu\n\nNasser M. 
Nasrabadi\n\nnasser.nasrabadi@mail.wvu.edu\n\nWest Virginia University\nMorgantown, WV 26505\n\nAbstract\n\nUnsupervised image-to-image translation is a class of computer vision problems\nwhich aims at modeling conditional distribution of images in the target domain,\ngiven a set of unpaired images in the source and target domains. An image in the\nsource domain might have multiple representations in the target domain. Therefore,\nambiguity in modeling of the conditional distribution arises, specially when the\nimages in the source and target domains come from different modalities. Current\napproaches mostly rely on simplifying assumptions to map both domains into a\nshared-latent space. Consequently, they are only able to model the domain-invariant\ninformation between the two modalities. These approaches usually fail to model\ndomain-speci\ufb01c information which has no representation in the target domain.\nIn this work, we propose an unsupervised image-to-image translation framework\nwhich maximizes a domain-speci\ufb01c variational information bound and learns the\ntarget domain-invariant representation of the two domain. The proposed framework\nmakes it possible to map a single source image into multiple images in the target\ndomain, utilizing several target domain-speci\ufb01c codes sampled randomly from the\nprior distribution, or extracted from reference images.\n\n1\n\nIntroduction\n\nImage-to-image translation is the major goal for many computer vision problems, such as sketch\nto photo-realistic image translation [25], style transfer [13], inpainting missing image regions [12],\ncolorization of grayscale images [11, 32], and super-resolution [18]. If corresponding image pairs\nare available in both source and target domains, these problems can be studied in a supervised\nsetting. For years, researchers [22] have made great efforts to solve this problem employing classical\nmethods, such as superpixel-based segmentation [39]. 
More recently, frameworks such as conditional Generative Adversarial Networks (cGAN) [12], the Style and Structure Generative Adversarial Network (S2-GAN) [30], and VAE-GAN [17] have been proposed to address the problem of supervised image-to-image translation. However, in many real-world applications, collecting paired training data is laborious and expensive [37]. Therefore, in many applications, only a few paired images are available, or none at all. In this case, only independent sets of images in each domain, with no correspondence in the other domain, should be deployed to learn the cross-domain image translation task.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: (a) The photo-realistic image. (b) Translated image in the edge domain, using CycleGAN. (c) Generated edges after histogram equalization, illustrating how photo-specific information is encoded to satisfy cycle consistency.

Unsupervised image-to-image translation is difficult, since there are no paired samples guiding how an image should be translated into a corresponding image in the other domain. It is nevertheless preferable to the supervised setting, given the scarcity of paired images and the convenience of collecting two independent image sets. As a result, in this paper, we focus on the design of a framework for unsupervised image-to-image translation.

The key challenge in cross-domain image translation is learning the conditional distribution of images in the target domain. In the unsupervised setting, this conditional distribution should be learned using two independent image sets. Previous works in the literature mostly consider a shared-latent space, in which they assume that images from the two domains can be mapped into a low-dimensional shared-latent space [37, 20].
However, this assumption does not hold when the two domains represent different modalities, since some information in one modality might have no representation in the other. For example, in the case of sketch to photo-realistic image translation, color and texture information have no interpretable meaning in the sketch domain. In other words, each sketch can be mapped into several photo-realistic images. Accordingly, learning a single domain-invariant latent space under the aforementioned assumption [37, 20, 24] prevents the model from capturing domain-specific information. Therefore, a sketch can only be mapped into one of its corresponding photo-realistic images. In addition, since the current unsupervised techniques are built mainly on "cycle consistency" [20, 37], the translated image in the target domain may encode domain-specific information of the source domain (Figure 1). The encoded information can then be utilized to recover the source image again. This encoding can effectively degrade the performance and stability of the training process.

To address this problem, we remove the shared-latent space assumption and learn a domain-specific space jointly with a domain-invariant space. Our proposed framework is based on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), and models the conditional distribution of the target domain using VAE-GAN. Broadly speaking, two encoders map a source image into a pair of domain-invariant and source domain-specific codes. The domain-invariant code, in combination with a target domain-specific code sampled from a desired distribution, is fed to a generator which translates them into the corresponding target domain image.
To reconstruct the source image at the end of the cycle, the extracted source domain-specific code is passed through a domain-specific path to the backward path from the translated target domain image.

To separate the shared and domain-specific information, we train the network to extract two distinct codes: a domain-specific code and a domain-invariant code. The former is learned by maximizing its mutual information with the source domain while simultaneously minimizing its mutual information with the translated image in the target domain. This mutual information maximization may also lead the domain-specific code to form an interpretable representation of the domain-specific information [6]. These loss terms are crucial in the unsupervised framework, since domain-invariant information may otherwise pass through the domain-specific path to satisfy the cycle consistency in the backward path.

In this paper, we extend CycleGAN [37] to learn a domain-specific code for each modality, through domain-specific variational information bound maximization, in addition to a domain-invariant code. Then, based on the proposed domain-specific learning scheme, we introduce a framework for one-to-many cross-domain image-to-image translation in an unsupervised setting.

Figure 2: Proposed framework for unsupervised image-to-image translation. (a) X → Y → X cycle. (b) Y → X → Y cycle.

2 Related Works

In the computer vision literature, the image generation problem has been tackled using autoregressive models [21, 29], restricted Boltzmann machines [26], and autoencoders [10]. Recently, generative techniques have been proposed for image translation tasks. Models such as GANs [7, 34] and VAEs [23, 15] achieve impressive results in image generation. They are also utilized in a conditional setting [12, 38] to address the image-to-image translation problem.
However, in the prior research, relatively little attention has been given to the unsupervised setting [20, 37, 4].

Many state-of-the-art unsupervised image-to-image translation frameworks are built on the cycle-consistency constraint [37]. Liu et al. [20] showed that learning a shared-latent space between the images in the source and target domains implies cycle consistency. The cycle-consistency constraint assumes that the source image can be reconstructed from the generated image in the target domain without any extra domain-specific information [20, 37]. From our experience, this assumption severely constrains the network and degrades the performance and stability of the training process when learning a translation between different modalities. In addition, this assumption limits the diversity of the images generated by the framework, i.e., the network associates a single target image with each source image. To tackle this problem, some prior works attempt to map a single image into multiple images in the target domain in a supervised setting [5, 3]. This problem is also addressed in [2] in an unsupervised setting; however, no mechanism is considered there to force the auxiliary latent variables to represent only the domain-specific information. In this work, in contrast, we aim to learn distinct domain-specific and domain-invariant latent spaces in an unsupervised setting. The learned domain-specific code is supposed to represent the properties of the source image which have no representation in the target domain. To this end, we train our network by maximizing a domain-specific variational information bound to learn a domain-specific space.

3 Framework and Formulation

Our framework, as illustrated in Figure 2, is developed based on GAN [30] and VAE-GAN [17], and includes two generative adversarial networks, {Gx, Dx} and {Gy, Dy}.
The encoder-generators, {Exd, Gx} and {Eyd, Gy}, also constitute two VAEs. Inspired by the CycleGAN model [37], we train our network in two cycles, X → Y → X and Y → X → Y, where X and Y represent the source and target domains, respectively.¹ Each cycle consists of forward and backward paths. In each forward path, we translate an image from the input domain into its corresponding image in the output domain. In the backward path, we remap the generated image into the input domain and reconstruct the input image. In our formulation, rather than learning a single shared-latent space between the two domains, we propose to decompose the latent code, z, into two parts: c, which is the domain-invariant code between the two domains, and vi, i ∈ {x, y}, which is the domain-specific code.

During the forward path in the X → Y → X cycle, we simultaneously train two encoders, Exc and Exd, to map data samples from the input domain, X, into a latent representation, z. The input domain-invariant encoder, Exc, maps the input image, x ∈ X, into the input domain-invariant code, c1. The input domain-specific encoder, Exd, maps x into the input domain-specific code, vx1. Then, the domain-invariant code, c1, and a randomly sampled output domain-specific code, vy1, are fed to the output generator (decoder), Gy, to generate the corresponding representation of the input image, yg = Gy(c1, vy1), in the output domain Y. Since in the X → Y → X cycle the output domain-specific information is not available during the training phase, a prior, p(vy), is imposed over the domain-specific distribution, selected as the unit normal distribution N(0, I). Here, index 1 in the codes' subscripts refers to the first cycle, X → Y → X.

¹For simplicity, in the remainder of the paper, for each cycle, we use the terms input domain and output domain.
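As a toy illustration of this forward path, the sketch below uses random linear maps as stand-ins for Exc and Gy (the actual networks are convolutional; all dimensions here are hypothetical). Feeding the same domain-invariant code with two different samples of vy1 ~ N(0, I) yields two different translations of the same input, which is the one-to-many behavior the framework targets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the networks (illustrative only;
# the paper's encoders and generators are convolutional).
W_exc = rng.normal(size=(8, 16))      # domain-invariant encoder E_xc
W_gy = rng.normal(size=(16, 8 + 4))   # output generator G_y acting on [c_1; v_y1]

def translate(x, v_y1):
    """Forward path of the X -> Y -> X cycle: y_g = G_y(c_1, v_y1)."""
    c_1 = W_exc @ x                           # domain-invariant code c_1
    return W_gy @ np.concatenate([c_1, v_y1]) # translated "image" y_g

x = rng.normal(size=16)                       # a toy source-domain input
y_a = translate(x, rng.standard_normal(4))    # v_y1 sampled from N(0, I)
y_b = translate(x, rng.standard_normal(4))    # a second sample gives a
                                              # different translation of the same x
```

Because only vy1 changes between the two calls, any difference in the outputs is driven entirely by the sampled domain-specific code.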
We use the same notation for all the latent codes in the remainder of the paper.

The output discriminator, Dy, is employed to enforce that the translated images, yg, resemble images in the output domain Y. The translated images should not be distinguishable from the real samples in Y. Therefore, we apply the adversarial loss [30], which is given by:

L¹_GAN = E_{y∼p(y)} log[Dy(y)] + E_{(c1,vy1)∼p(c1,vy1)} log[1 − Dy(Gy(c1, vy1))].   (1)

Note that the domain-specific encoder Exd outputs mean and variance vectors, (μ_vx1, σ²_vx1) = Exd(x), which represent the distribution of the domain-specific code vx1, given by qx(vx1|x) = N(vx1 | μ_vx1, diag(σ²_vx1)). Similar to previous works on VAEs [15], we assume that the domain-specific components of vx are conditionally independent and Gaussian with unit variance. We utilize the reparametrization trick [15] to train the VAE-GAN using back-propagation. We define the variational loss for the domain-specific VAE as follows:

L¹_VAE = −D_KL[qx(vx1|x), p(vx)] + E_{vx1∼q(vx1|x)}[log p(x|vx1)],   (2)

where the Kullback–Leibler (D_KL) divergence term measures how far the distribution of the domain-specific code, vx, diverges from the prior distribution. The conditional distribution p(x|vx1) is modeled as a Laplacian distribution; therefore, minimizing the negative log-likelihood term is equivalent to minimizing the absolute distance between the input and its reconstruction.

In the backward path, the output domain-invariant encoder, Eyc, and the output domain-specific encoder, Eyd, map the generated image into the reconstructed domain-invariant code, ĉ1, and the reconstructed domain-specific code, v̂y1, respectively. The domain-specific encoder, Eyd, outputs mean and variance vectors, (μ_vy1, σ²_vy1) = Eyd(Gy(c1, vy1)), which represent the distribution of the domain-specific code, vy1, given by qy(vy1|y) = N(vy1 | μ_vy1, diag(σ²_vy1)). Finally, the reconstructed input, x̂, is generated by the output generator, Gx, with ĉ1 and vx1 as its inputs. Here, vx1 is sampled from its distribution, N(μ_vx1, diag(σ²_vx1)), where (μ_vx1, σ²_vx1) is the output of Exd in the forward path. We enforce a reconstruction criterion to force ĉ1, v̂y1, and x̂ to be the reconstructions of c1, vy1, and x, respectively. To this end, the reconstruction loss is defined as follows:

L¹_r = E_{x∼p(x), vy1∼N(0,I)}[λ1 ||x̂ − x||² + λ2 ||v̂y1 − vy1||² + λ3 ||ĉ1 − c1||²],   (3)

where λ1, λ2, and λ3 are hyper-parameters that control the weight of each term in the loss function.

4 Domain-Specific Variational Information Bound

In the proposed model, we decompose the latent space, z, into the domain-invariant and domain-specific codes. As mentioned in the previous section, the domain-invariant code should only capture the information shared between the two modalities, while the domain-specific code represents the information which has no interpretation in the output domain. Otherwise, all the information can go through the domain-specific path and satisfy the cycle-consistency property of the network (E_{x∼p(x)} ||x̂ − x||² → 0 and E_{y∼p(y)} ||ŷ − y||² → 0). In this trivial solution, the generator, Gy, can translate an input domain image into an output domain image that does not correspond to the input image, while satisfying the discriminator Dy in terms of resembling the images in Y. Figure 7 (second row) presents images generated by this trivial solution.

Here, we propose an unsupervised method to learn the domain-specific information of the source data distribution which has minimum information about the target domain.
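For reference, the Gaussian reparametrization used to sample the domain-specific codes, together with the closed-form KL divergence against the unit-normal prior (the D_KL term in (2)), can be sketched in NumPy as follows; the mean and log-variance values here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def reparameterize(mu, log_var):
    """Sample v ~ N(mu, diag(sigma^2)) as mu + sigma * eps with eps ~ N(0, I),
    so the sample is differentiable w.r.t. the encoder outputs mu, log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL[ N(mu, diag(sigma^2)) || N(0, I) ]."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Hypothetical encoder outputs for a 2-dimensional domain-specific code.
mu = np.array([0.5, -0.3])
log_var = np.array([0.0, 0.2])
v = reparameterize(mu, log_var)
kl = kl_to_standard_normal(mu, log_var)  # zero iff mu = 0 and sigma = 1
```

The KL term vanishes exactly when the posterior matches the N(0, I) prior, which is what pushes the domain-specific code toward the prior used for sampling at translation time.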
To learn the source domain-specific code, vx, we propose to minimize the mutual information between vx and the target domain distribution while simultaneously maximizing the mutual information between vx and the source domain distribution. Similarly, the target domain-specific code vy is learned for the target domain Y. In other words, to learn the source and target domain-specific codes vx and vy, we should minimize the following loss function:

L_int = (I(y, vx; θ) − β I(x, vx; θ)) + (I(x, vy; θ) − β I(y, vy; θ)),   (4)

where θ represents the model parameters. To translate L_int into an implementable loss function, we define the following two loss functions:

L¹_int = I(x, v̂y1; θ) − β I(x, vx1; θ),   L²_int = I(y, v̂x2; θ) − β I(y, vy2; θ),   (5)

where L¹_int and L²_int are implemented in the cycles X → Y → X and Y → X → Y, respectively.

Instead of minimizing L¹_int, or similarly L²_int, we minimize their variational upper bounds, which we refer to as domain-specific variational information bounds. Zhao et al. [35] illustrated that using the KL divergence in VAEs results in the information preference problem, in which the mutual information between the latent code and the input becomes vanishingly small, while training the network using only the reconstruction loss, with no KL divergence term, maximizes the mutual information. However, some other types of divergences, such as MMD and Stein Variational Gradient, do not suffer from this problem. Consequently, in this paper, for L¹_int, to maximize I(x, vx1; θ) we can replace the first term in (2) with the Maximum Mean Discrepancy (MMD) [35], which always prefers to maximize the mutual information between x and vx1. The MMD is a framework which utilizes all of the moments to quantify the distance between two distributions. It can be implemented using the kernel trick as follows:

MMD[p(z) ‖ q(z)] = E_{p(z),p(z′)}[k(z, z′)] + E_{q(z),q(z′)}[k(z, z′)] − 2 E_{p(z),q(z′)}[k(z, z′)],   (6)

where k(z, z′) is any universal positive definite kernel, such as the Gaussian kernel k(z, z′) = e^{−||z−z′||²/(2σ²)}. Consequently, we rewrite the VAE objective in Equation (2) as follows:

L¹_VAE = MMD[p(vx1) ‖ q(vx1)] + E_{vx1∼q(vx1|x)}[log p(x|vx1)].   (7)

Following the method described in [1], to minimize the first term of L¹_int in (5), we define an upper bound for it as:

I(x, v̂y1; θ) ≤ ∫ dv̂y1 dx p(x) p(v̂y1|x) log [p(v̂y1|x) / r(v̂y1)] = L¹.   (8)

Since p(v̂y1) is difficult to compute, we define a variational approximation to it, r(v̂y1). Similar to [1], r(z) is defined as a fixed dim-dimensional spherical Gaussian, r(z) = N(z | 0, I), where dim is the dimension of vy1. This upper bound, in combination with the MMD, forms a domain-specific variational information bound. Note that the MMD does not optimize an upper bound on the negative log-likelihood directly, but it guarantees the mutual information to be maximized and we can expect a high log-likelihood performance [35].
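A sample-based estimate of the MMD in (6) with the Gaussian kernel can be sketched as follows (an illustrative NumPy version, not the paper's implementation; the sample sizes, dimensions, and kernel bandwidth are arbitrary):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(z, z') = exp(-||z - z'||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd(p_samples, q_samples, sigma=1.0):
    """Sample estimate of MMD[p || q] from Equation (6):
    E_pp[k] + E_qq[k] - 2 E_pq[k], averaged over all sample pairs."""
    k_pp = gaussian_kernel(p_samples, p_samples, sigma).mean()
    k_qq = gaussian_kernel(q_samples, q_samples, sigma).mean()
    k_pq = gaussian_kernel(p_samples, q_samples, sigma).mean()
    return k_pp + k_qq - 2.0 * k_pq

rng = np.random.default_rng(2)
z_prior = rng.standard_normal((256, 4))          # samples from the prior p(v)
z_posterior = rng.standard_normal((256, 4)) + 2  # hypothetical shifted code samples
# The estimate is ~0 when the two sample sets come from the same
# distribution and grows as the distributions diverge.
```

Because every moment enters through the kernel, driving this estimate to zero matches the aggregate code distribution to the prior without collapsing the per-sample mutual information, which is the property exploited in (7).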
To translate this upper bound, L¹, into an implementable loss function in the model, we use the following empirical approximation of the data distribution:

p(x) ≈ (1/N) Σ_{n=1}^{N} δ_{xn}(x).   (9)

Therefore, the upper bound can be approximated as:

L¹ ≈ (1/N) Σ_{n=1}^{N} ∫ dv̂y1 p(v̂y1|xn) log [p(v̂y1|xn) / r(v̂y1)].   (10)

Since v̂y1 = f(x, vy1) and vy1 ∼ N(0, I), the implementable upper bound can be approximated as follows:

L¹ ≈ (1/N) Σ_{n=1}^{N} E_{vy1∼N(0,I)} D_KL[p(v̂y1|xn) || r(v̂y1)].   (11)

As illustrated in Figure 2b, we train the Y → X → Y cycle starting from an image y ∈ Y. All the components in this cycle share weights with the corresponding components in the X → Y → X cycle. Similar losses, L², L²_r, L²_VAE, and L²_GAN, can be defined for this cycle. The overall loss for the network is defined as:

Loss = Σ_{i=1}^{2} α^i_1 L^i + α^i_2 L^i_r + α^i_3 L^i_GAN + α^i_4 L^i_VAE.   (12)

Figure 3: Qualitative comparison of our proposed method with BicycleGAN, CycleGAN, and UNIT on (a) Edges↔Handbags and (b) Edges↔Shoes. The proposed framework is able to generate diverse realistic outputs, yet it does not require any supervision during its training phase.

5 Implementation

We adopt the architecture for our common latent encoder, generator, and discriminator networks from Zhu and Park et al. [37]. The domain-invariant encoders include two stride-2 convolutions and three residual blocks [8]. The generators consist of three residual blocks and two stride-2 transposed convolutions. The domain-specific encoders share the first two convolution layers with their corresponding domain-invariant encoders, followed by five stride-2 convolutions.
Since the spatial size of the domain-specific codes does not match that of the corresponding domain-invariant codes, we tile them to the same size as the domain-invariant codes and then concatenate them to create the generators' inputs. For the discriminator networks, we use 30 × 30 PatchGAN networks [19, 12], which classify whether 30 × 30 overlapping image patches are real or fake. We use the Adam optimizer [14] with a learning rate of 0.0002. For the reconstruction loss in (3), we set λ1 = 10 and λ2 = λ3 = 1. The values of α2 and α3 in (12) are set to 1, and α4/α1 = β = 1. Finally, regarding the kernel parameter σ in (6), as discussed in [35], the MMD is fairly robust to this parameter selection, and 2/dim is a practical value in most scenarios, where dim is the dimension of vx1.

6 Experiments

Our experiments aim to show that an interpretable representation can be learned by domain-specific variational information bound maximization. Visual results on translation tasks show how the domain-specific code can alter the style of generated images in a new domain. We compare our method against baselines both qualitatively and quantitatively.

6.1 Qualitative Evaluation

We use two datasets for qualitative comparison, edges ↔ handbags [36] and edges ↔ shoes [31]. Figures 3a and 3b present the comparison between the proposed framework and baseline image-to-image translation algorithms: CycleGAN [37], UNIT [20], and BicycleGAN [38]. Our framework, similar to BicycleGAN, can be utilized to generate multiple realistic images for a single input, while requiring no supervision. In contrast, CycleGAN and UNIT learn one-to-one mappings, as they learn only one domain-invariant latent code between the two modalities.
From our experience, training CycleGAN and UNIT on the edges ↔ photos datasets is very unstable and sensitive to the parameters. Figure 1 illustrates how CycleGAN encodes information about textures and colors in the generated image in the edge domain. This information encoding enables the discriminator to easily distinguish the fake generated samples from the real ones, which results in instability in the training of the generators.

Two other datasets, namely architectural labels ↔ photos from the CMP Facade database [28] and the CUHK Face Sketch Dataset (CUFS) [27], are employed for further qualitative evaluation. The image-to-image translation results of the proposed framework on these datasets are presented in Figures 4d and 4c, respectively. Our method successfully captures domain-specific properties of the target domain. Therefore, we are able to generate diverse images from a single input sample. More results for the edges ↔ shoes and edges ↔ handbags datasets are presented in Figures 4a and 4b, respectively. These figures present one-to-many image translation when different domain-specific codes are deployed. The results for the backward path for edges ↔ handbags and edges ↔ shoes are also presented in Figure 4e. Since there is no extra information in the edge domain, the generated edges are quite similar to each other regardless of the value of the edge domain-specific code.

Figure 4: The results of our framework on different datasets: (a) Edges↔Shoes, (b) Edges↔Handbags, (c) Sketch↔Photo-realistic, (d) Label↔Facade photo, (e) Photos↔Edges.

Figure 5: Failure cases, where some domain-specific codes do not result in well-defined styles.

Using the learned domain-specific code, we can transfer domain-specific properties from a reference image in the output domain to the generated image.
To this end, instead of sampling from the distribution of the output domain-specific code, we use a domain-specific code extracted from a reference image in the output domain: the reference image is fed to the output domain-specific encoder, and the extracted code is then used for image translation guided by the reference image. Figure 6 shows the results of using domain-specific codes extracted from multiple reference images to translate edges into realistic photos. Finally, Figure 5 illustrates some failure cases, where some domain-specific codes do not result in well-defined styles.

6.2 Quantitative Evaluation

Table 1 presents the quantitative comparison between the proposed framework and three state-of-the-art models. Similar to BicycleGAN [38], we perform a quantitative analysis of diversity using the Learned Perceptual Image Patch Similarity (LPIPS) metric [33]. The LPIPS distance is calculated as the average distance between 2000 pairs of randomly generated output images in the deep feature space of a pre-trained AlexNet [16]. Diversity scores for the different techniques under the LPIPS metric are summarized in Table 1. Note that the diversity score is not defined for one-to-one frameworks, e.g., CycleGAN and UNIT: previous findings showed that these models are not able to generate large output variation, even with noise injection [12, 38]. The diversity scores of our proposed framework are close to those of BicycleGAN, while we do not use any supervision during the training phase.

Generating unnatural images usually results in a high diversity score. Therefore, to investigate whether the variation of the generated images is meaningful, we need to evaluate the visual realism of the generated samples as well.
As proposed in [32, 37], the "fooling" rate of human subjects is considered as the visual realism score of each framework. We sequentially presented a real and a generated image to human subjects for 1 second each, in a random order, asked them to identify the fake, and measured the fooling rate. We also used the Fréchet Inception Distance (FID) [9] to evaluate the quality of the generated images. It directly measures the distance between the synthetic data distribution and the real data distribution. To calculate the FID, images are encoded with visual features from a pre-trained Inception model. Note that a lower FID value indicates a smaller distance between the synthetic and real data distributions. Table 1 shows that the FID results confirm the fooling-rate results. We calculate the FID over 10k randomly generated samples.

Figure 6: Using domain-specific information from a reference image to transform an input image into the output domain.

Figure 7: Generated images with (first row) and without (second row) mutual information minimization between the target domain-specific code and the source domain.

Table 1: Diversity measure for generated images using average LPIPS distance, realism score using human fooling rate, and FID score on the Edges↔Shoes and Edges↔Handbags tasks.

Method       | Edges↔Shoes                  | Edges↔Handbags
             | LPIPS   Fooling Rate   FID   | LPIPS   Fooling Rate   FID
Real Images  | 0.290   -              -     | 0.369   -              -
UNIT         | -       22.0         84.36   | -       19.2         90.32
CycleGAN     | -       24.3         81.22   | -       25.9         86.54
BicycleGAN   | 0.113   38.0         37.79   | 0.134   34.9         43.18
Ours         | 0.121   36.0         40.84   | 0.129   33.2         48.36

6.3 Discussion and Ablation Study

Our framework learns a disentangled representation of content and style, which provides users more control over the image translation outputs.
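The diversity protocol above can be illustrated with a simplified stand-in: the real LPIPS metric averages distances in the deep feature space of a pre-trained AlexNet over 2000 random pairs, whereas the sketch below uses plain Euclidean distance on placeholder feature vectors. It still shows why a mode-collapsed generator scores near zero while a diverse one does not:

```python
import numpy as np

def diversity_score(features, num_pairs=2000, seed=0):
    """Average pairwise distance over randomly drawn pairs of outputs.

    Simplified stand-in for the LPIPS protocol: the actual metric measures
    distances in deep AlexNet features, not raw Euclidean space.
    """
    rng = np.random.default_rng(seed)
    n = len(features)
    i = rng.integers(0, n, size=num_pairs)
    j = rng.integers(0, n, size=num_pairs)
    return np.linalg.norm(features[i] - features[j], axis=1).mean()

rng = np.random.default_rng(3)
# Hypothetical "feature vectors" of 100 generated outputs each.
collapsed = np.tile(rng.standard_normal(16), (100, 1))  # mode-collapsed generator
diverse = rng.standard_normal((100, 16))                # varied outputs
# A mode-collapsed generator scores ~0; a diverse one scores well above it.
```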
This framework is not only suitable for image-to-image translation; it can also be used to transfer style between images of a single domain. Compared with other unsupervised one-to-one image-to-image translation frameworks, i.e., CycleGAN and UNIT, our method handles translation between significantly different domains. In contrast, CycleGAN encodes the domain-specific codes to satisfy the cycle consistency (see Figure 1), and UNIT fails completely, as it cannot find a shared representation in these cases.

Neglecting the minimization of the mutual information between the target domain-specific information and the source domain may result in capturing attributes with high variation in the target despite their common nature in both domains. For example, as illustrated in Figure 7, the domain-specific code can then alter attributes such as gender or face structure, although these attributes are domain-invariant properties of the two modalities. In addition, removing the domain-specific code cycle-consistency criterion (i.e., vy1 = v̂y1) results in a partial mode collapse in the model, with many outputs being almost identical, which reduces the LPIPS score (see Table 2). Without the domain-invariant code cycle-consistency criterion (i.e., c1 = ĉ1), the image quality is unsatisfactory. A possible reason for this quality degradation is that c1 can include the domain-specific information, as there is no constraint on it to represent shared information exclusively. That results in the same issue as explained in Figure 1.

Table 2: Average LPIPS distances with and without domain-specific code cycle-consistency on the Edges↔Shoes and Edges↔Handbags tasks.

        | Edges↔Shoes     | Edges↔Handbags
        | w/      w/o     | w/      w/o
LPIPS   | 0.121   0.095   | 0.129   0.113

Very small values of β cause the second term of L¹_int in (5) to be neglected.
Therefore, the domain-specific code, v_x1, will be irrelevant to the loss minimization and the learned domain-specific code could be meaningless. In contrast, with very large values of β, y_g carries the domain-specific information of x as well.

7 Conclusion

In this paper, we introduced a framework for one-to-many cross-domain image-to-image translation in an unsupervised setting. In contrast to previous works, our approach learns a distinct domain-specific code for each of the two modalities by maximizing a domain-specific variational information bound. In addition, it learns a domain-invariant code. During the training phase, a unit normal distribution is imposed over the domain-specific latent distribution, which lets us control the domain-specific properties of the generated image in the output domain. To generate diverse target-domain images, we extract domain-specific codes from reference images, or sample them from a prior distribution. These domain-specific codes, combined with the learned domain-invariant code, result in target-domain images with different target domain-specific properties.

References

[1] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

[2] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151, 2018.

[3] A. Bansal, Y. Sheikh, and D. Ramanan. PixelNN: Example-based image synthesis. arXiv preprint arXiv:1708.05349, 2017.

[4] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. 2017.

[5] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. 2017.

[6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P.
Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[10] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[11] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.

[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[13] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[16] A. Krizhevsky.
One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

[17] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.

[18] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.

[19] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.

[20] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.

[21] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[22] C. Peng, X. Gao, N. Wang, and J. Li. Superpixel-based face sketch–photo synthesis. IEEE Transactions on Circuits and Systems for Video Technology, 27(2):288–299, 2017.

[23] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent Gaussian models. In International Conference on Machine Learning, volume 2, 2014.

[24] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Moressi, F. Cole, and K. Murphy. XGAN: Unsupervised image-to-image translation for many-to-many mappings. arXiv preprint arXiv:1711.05139, 2017.

[25] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.

[26] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory.
Technical report, Colorado University at Boulder, Department of Computer Science, 1986.

[27] X. Tang and X. Wang. Face sketch synthesis and recognition. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 687–694, 2003.

[28] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pages 364–374. Springer, 2013.

[29] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

[30] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335. Springer, 2016.

[31] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 192–199, 2014.

[32] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.

[33] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924, 2018.

[34] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

[35] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.

[36] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

[37] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros.
Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

[38] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.

[39] F. Zohrizadeh, M. Kheirandishfard, and F. Kamangar. Image segmentation using sparse subset selection. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1470–1479, March 2018.