{"title": "Unsupervised Attention-guided Image-to-Image Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 3693, "page_last": 3703, "abstract": "Current unsupervised image-to-image translation techniques struggle to focus their attention on individual objects without altering the background or the way multiple objects interact within a scene. Motivated by the important role of attention in human perception, we tackle this limitation by introducing unsupervised attention mechanisms which are jointly adversarially trained with the generators and discriminators. We empirically demonstrate that our approach is able to attend to relevant regions in the image without requiring any additional supervision, and that by doing so it achieves more realistic mappings compared to recent approaches.", "full_text": "Unsupervised Attention-guided\n\nImage-to-Image Translation\n\nYoussef A. Mejjati\nUniversity of Bath\nyam28@bath.ac.uk\n\nChristian Richardt\nUniversity of Bath\n\nchristian@richardt.name\n\nJames Tompkin\nBrown University\n\njames_tompkin@brown.edu\n\nDarren Cosker\nUniversity of Bath\n\nD.P.Cosker@bath.ac.uk\n\nKwang In Kim\nUniversity of Bath\nk.kim@bath.ac.uk\n\nAbstract\n\nCurrent unsupervised image-to-image translation techniques struggle to focus their\nattention on individual objects without altering the background or the way multiple\nobjects interact within a scene. Motivated by the important role of attention\nin human perception, we tackle this limitation by introducing unsupervised\nattention mechanisms that are jointly adversarially trained with the generators and\ndiscriminators. 
We demonstrate qualitatively and quantitatively that our approach\nattends to relevant regions in the image without requiring supervision, which\ncreates more realistic mappings when compared to those of recent approaches.\n\n[Figure 1 panels, left to right: Input, Ours, CycleGAN [1], RA [2], DiscoGAN [3], UNIT [4], DualGAN [5].]\n\nFigure 1: By explicitly modeling attention, our algorithm is able to better alter the object of interest\nin unsupervised image-to-image translation tasks, without changing the background at the same time.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nImage-to-image translation is the task of mapping an image from a source domain to a target domain.\nApplications include image colorization [6], image super-resolution [7, 8], style transfer [9], domain\nadaptation [10] and data augmentation [11]. Many approaches require data from each domain to be\npaired or under alignment, e.g., when translating satellite images to topographic maps, which restricts\napplications and may not even be possible for some domains. Unsupervised approaches, such as\nDiscoGAN [3] and CycleGAN [1], overcome this problem with cyclic losses which encourage the\ntranslated domain to be faithfully reconstructed when mapped back to the original domain.\n\nExisting algorithms feed an input image to an encoder\u2013decoder-like neural network architecture\ncalled the generator, which tries to translate the image. Then, this output is fed to a discriminator\nwhich attempts to classify if the output image has indeed been translated. 
In these generative adversarial\nnetworks (GANs), the quality of the generated images improves as the generator and discriminator\ncompete to reach the Nash equilibrium expressed by the minimax loss of the training procedure [12].\nHowever, these approaches are limited by the system\u2019s inability to attend only to specific\nscene objects.\nIn the unsupervised case, where images are not paired or aligned, the network\nmust additionally learn which parts of the scene are intended to be translated. For instance, in\nFigure 1, a convincing translation between the horse and zebra domains requires the network to\nattend to each animal and change only those parts of the image. This is challenging for existing\napproaches, even if they use a localized loss like PatchGAN [13], as the network itself has no\nexplicit attention mechanism. Instead, they typically aim to minimize the divergence between the\nunderlying data-generating distribution for the entire image in the source and target domains. To\novercome this limitation, we propose to minimize the divergence between only the relevant parts\nof the data-generating distributions for the source and target domains. For this, we find inspiration\nfrom attentional mechanisms in human perception [14], and their successful application in machine\nlearning [2, 15]. We add an attention network to each generator in the CycleGAN setup. These are\njointly trained to produce attention maps for regions that the discriminator \u2018considers\u2019 to be the most\ndiscriminative between the source and target domains. Then, these maps are applied to the input of the\ngenerator to constrain it to relevant image regions. The whole network is trained end-to-end with no\nadditional supervision. We qualitatively and quantitatively show that explicitly incorporating attention\ninto image translation networks significantly improves the quality of translated images (see Figure 1).\n\n2 Related work\n\nImage-to-image translation. 
Contemporary image-to-image translation approaches leverage the\npowerful ability of deep neural networks to build meaningful representations. Speci\ufb01cally, GANs\nhave proven to be the gold standard in achieving appealing image-to-image translation results.\nFor instance, Isola et al.\u2019s pix2pix algorithm [9] uses a GAN conditioned on the source image\nand imposes an L1 loss between the generated image and its ground-truth map. This requires the\nexistence of ground-truth paired images from each of the source and target domains. Zhu et al.\u2019s\nunpaired image-to-image translation network [1] builds upon pix2pix and removes the paired input\ndata burden by imposing that each image should be reconstructed correctly when translated twice,\ni.e., when mapped from source to target to source. These maps must conserve the overall structure\nand content of the image. DiscoGAN [3] and DualGAN [5] use the same principle, but with different\nlosses, making them more or less robust to changes in shape.\n\nSome unsupervised translation approaches assume the existence of a shared latent space between\nsource and target domains. Liu and Tuzel\u2019s Coupled GAN (CoGAN) [16] learns an estimate of\nthe joint data-generating distribution using samples from the marginals, by enforcing source and\ntarget discriminators and generators to share parameters in low-level layers. Liu et al.\u2019s unsupervised\nimage-to-image translation networks (UNIT) [4] build upon Coupled GAN by assuming the existence\nof a shared low-dimensional latent space between the source and target domains. Once the image is\nmapped to its latent representation, then a generator decodes it into its target domain version. Huang\net al.\u2019s multi-modal UNIT (MUNIT) [17] framework extends this idea to multi-modal image-to-image\ntranslation by assuming two latent representations: one for \u2018style\u2019 and one for \u2018content\u2019. 
Then, the\ncross-domain image translation is performed by combining different content and style representations.\nGiven input images depicting objects at multiple scales, the aforementioned approaches are\nsometimes able to translate the foreground. However, they generally also affect the background in\nunwanted ways, leading to unrealistic translations. We demonstrate that our algorithm is able to\novercome this limitation by incorporating attention into the image translation framework.\n\nAttending to specific regions within image translation has recently been explored by Ma et al.\n[18], who attempt to decouple local textures from holistic shapes by attending to local objects\nof interest (e.g., eyes, nose, and mouth in a face); this is manifested through attention maps as\nindividual square image regions. This limits the approach, as (1) it assumes that all objects are\nthe same size, corresponding to the sizes of the square attention maps, and (2) it involves tuning\nhyper-parameters for the number and size of the square regions. As a consequence, this approach\ncannot straightforwardly deal with image translation without altering the background.\n\nFigure 2: Data-flow diagram from the source domain S to the target domain T during training. The\nroles of S and T are symmetric in our network, so that data also flows in the opposite direction T \u2192 S.\n\nAttention learning. Attention learning has benefited from advances in deep learning. Contemporary\napproaches use convolution-deconvolution networks trained on ground-truth masks [19],\nand combine these architectures with recurrent attention models. Specifically, Kuen et al.\u2019s saliency\ndetection [20] uses Recurrent Neural Networks (RNN) to adaptively select a sequence of local\nregions in the input image for saliency estimation. Then, these local estimates are combined into\na global estimate. 
Such approaches cannot be applied in our setting, since they require supervision.\nUnsupervised attention learning includes Mnih et al.\u2019s recurrent model of visual attention [15],\nwhich uses only a few learned square regions of the image trained from classification labels. This\napproach is not differentiable and requires training with reinforcement learning, which is not\nstraightforward to apply in our problem. More recently, attention has been enforced on activation functions\nto select only task-relevant features [2, 21]. However, we show in experiments that our approach\nof enforcing attention on the input image provides better results for image-to-image translation.\n\nLearning attention also encourages the generation of more realistic images compared to classic\nvanilla GANs. For example, Zhang et al.\u2019s self-attention GANs [22] constrain the generator to\ngradually consider non-local relationships in the feature space by using unsupervised attention,\nwhich produces globally realistic images. Yang et al.\u2019s recursive approach [23] generates images\nby decoupling the generation of the foreground and background in a sequential manner; however,\nits extension to image-to-image translation is not straightforward as in that case we only care about\nmodifying the foreground. Attention has also been used for video generation [24], where a binary\nmask is learned to distinguish between dynamic and static regions in each frame of a generated video.\nThe generated masks are trained to detect unrealistic motions and patterns in the generated frames,\nwhereas our attention network is trained to find the most discriminative regions which characterize\na given image domain. 
Finally, Chen et al.\u2019s contemporaneous work shares our goal of learning an\nattention map for image translation [25]; we will discuss the differences between our methods after\nexplaining our approach (see Section 4).\n\n3 Our approach\n\nThe goal of image translation is to estimate a map FS\u2192T from a source image domain S to a target\nimage domain T based on independently sampled data instances XS and XT, such that the distribution\nof the mapped instances FS\u2192T (XS) matches the probability distribution PT of the target. Our\nstarting point is Zhu et al.\u2019s CycleGAN approach [1], which also learns a domain inverse FT\u2192S to enforce\ncycle consistency: FT\u2192S(FS\u2192T (XS)) \u2248 XS. The training of the transfer network FS\u2192T requires a\ndiscriminator DT to try to detect the translated outputs from the observed instances XT. For cycle\nconsistency, the inverse map FT\u2192S and the corresponding discriminator DS are simultaneously trained.\n\nSolving this problem requires solving two equally important tasks: (1) locating the areas to\ntranslate in each image, and (2) applying the right translation to the located areas. We achieve this\nby adding two attention networks AS and AT, which select areas to translate by maximizing the\nprobability that the discriminator makes a mistake. We denote AS : S \u2192 Sa and AT : T \u2192 Ta, where\nSa and Ta are the attention maps induced from S and T, respectively. Each attention map contains\nper-pixel [0,1] estimates. After feeding the input image to the generator, we apply the learned mask\nto the generated image using an element-wise product \u2018\u2299\u2019, and then add the background using the\ninverse of the mask applied to the input image. 
As such, AS and AT are trained in tandem with\nthe generators; Figure 2 visualizes this process.\n\nHenceforth, we will describe only the map FS\u2192T; the inverse map FT\u2192S is defined similarly.\n\n3.1 Attention-guided generator\n\nFirst, we feed the input image s \u2208 S into the generator FS\u2192T, which maps s to the target domain T.\nThen, the same input is fed to the attention network AS, resulting in the attention map sa = AS(s).\nTo create the \u2018foreground\u2019 object sf \u2208 T, we apply sa to FS\u2192T (s) via an element-wise product\non each RGB channel: sf = sa \u2299 FS\u2192T (s) (Figure 2 shows an example). Finally, we create the\n\u2018background\u2019 image sb = (1\u2212sa) \u2299 s, and add it to the masked output of the generator FS\u2192T. Thus,\nthe mapped image s\u2032 is obtained by:\n\ns' = \\underbrace{s_a \\odot F_{S \\to T}(s)}_{\\text{Foreground}} + \\underbrace{(1 - s_a) \\odot s}_{\\text{Background}}. \\qquad (1)\n\nAttention map intuition. The attention network AS plays a key role in Equation 1. If the attention\nmap sa were replaced by all ones, to mark the entire image as relevant, then we would obtain CycleGAN\nas a special case of our approach. If sa were all zeros, then the generated image would be identical\nto the input image due to the background term in Equation 1, and the discriminator would never\nbe fooled by the generator. If sa attends to an image region without a relevant foreground instance\nto translate, then the result s\u2032 will preserve its source domain class (i.e., a horse will remain a horse).\nIn other words, the image parts which most describe the domain will remain unchanged, which\nmakes it straightforward for the discriminator DT to detect the image as a fake. 
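
The special cases of Equation 1 discussed above can be checked with a minimal sketch. This is an illustration under stated assumptions rather than the authors' implementation: images are tiny single-channel nested lists, and `translate` and `compose` are hypothetical names, with `translate` standing in for the generator.

```python
# Sketch of Equation 1 on a tiny single-channel image stored as nested lists.
# `translate` is a hypothetical stand-in for the generator F_{S->T}.

def translate(image):
    # Placeholder generator: inverts intensities in [0, 1].
    return [[1.0 - p for p in row] for row in image]

def compose(image, attention):
    # s' = s_a * F(s) + (1 - s_a) * s, applied per pixel.
    generated = translate(image)
    return [
        [a * g + (1.0 - a) * p for p, g, a in zip(img_row, gen_row, att_row)]
        for img_row, gen_row, att_row in zip(image, generated, attention)
    ]

s = [[0.2, 0.8], [0.5, 0.1]]
all_ones = [[1.0, 1.0], [1.0, 1.0]]
all_zeros = [[0.0, 0.0], [0.0, 0.0]]

fully_attended = compose(s, all_ones)   # reduces to the plain generator output
unattended = compose(s, all_zeros)      # returns the input image unchanged
```

With an all-ones map the composite equals the generator output (the CycleGAN special case), and with an all-zeros map it equals the input, matching the two limiting cases in the text.
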
Therefore, the\nonly way to find an equilibrium between generator FS\u2192T, attention map AS, and discriminator DT\nis for AS to focus on the objects or areas that the corresponding discriminator thinks are the most\ndescriptive within its domain (i.e., the horses). The discriminator mechanism which makes GAN\ngenerators produce realistic images also makes our attention networks find the domain-descriptive\nobjects in the images.\n\nThe attention map is continuous between [0,1], i.e., it is a matte rather than a segmentation mask.\nThis is valuable for three reasons: (1) it makes estimating the attention maps differentiable, and\nso able to train at all, (2) it allows the network to be uncertain about attention during the training\nprocess, which allows convergence, and (3) it allows the network to learn how to compose edges,\nwhich otherwise might make the foreground object look \u2018stuck on\u2019 or produce fringing artifacts.\n\nLoss function. This process is governed by the adversarial energy:\n\n\\mathcal{L}^s_{adv}(F_{S \\to T}, A_S, D_T) = \\mathbb{E}_{t \\sim P_T(t)}\\left[\\log(D_T(t))\\right] + \\mathbb{E}_{s \\sim P_S(s)}\\left[\\log(1 - D_T(s'))\\right]. \\qquad (2)\n\nIn addition, and similarly to CycleGAN, we add a cycle-consistency loss to the overall framework\nby enforcing a one-to-one mapping between s and the output of its inverse mapping s\u2032\u2032:\n\n\\mathcal{L}^s_{cyc}(s, s'') = \\| s - s'' \\|_1, \\qquad (3)\n\nwhere s\u2032\u2032 is obtained from s\u2032 via FT\u2192S and AT, similarly to Equation 1.\n\nThis added loss makes our framework more robust in two ways: (1) it enforces the attended\nregions in the generated image to conserve content (e.g., pose), and (2) it encourages the attention\nmaps to be sharp (converging towards a binary map), as the cycle-consistency loss of unattended\nareas will always be zero. Further, when computing s\u2032\u2032, we use the attention map extracted from\nAT (s\u2032). 
This adds another consistency requirement, as the generated attention maps produced by\nAS and AT for s and s\u2032, respectively, should match to minimize Equation 3.\n\nWe obtain the final energy to optimize by combining the adversarial and cycle-consistency losses\nfor both source and target domains:\n\n\\mathcal{L}(F_{S \\to T}, F_{T \\to S}, A_S, A_T, D_S, D_T) = \\mathcal{L}^s_{adv} + \\mathcal{L}^t_{adv} + \\lambda_{cyc} \\left( \\mathcal{L}^s_{cyc} + \\mathcal{L}^t_{cyc} \\right), \\qquad (4)\n\nwhere we use the loss hyper-parameter \u03bbcyc = 10 throughout our experiments. The optimal\nparameters of L are obtained by solving the minimax optimization problem:\n\nF^*_{S \\to T}, F^*_{T \\to S}, A^*_S, A^*_T, D^*_S, D^*_T = \\operatorname*{argmin}_{F_{S \\to T}, F_{T \\to S}, A_S, A_T} \\left( \\operatorname*{argmax}_{D_S, D_T} \\mathcal{L}(F_{S \\to T}, F_{T \\to S}, A_S, A_T, D_S, D_T) \\right). \\qquad (5)\n\n3.2 Attention-guided discriminator\n\nEquation 1 constrains the generators to act only on attended regions: as the attention networks\ntrain to become more accurate at finding the foreground, the generator improves in translating just\nthe object of interest between domains, e.g., from horse to zebra. However, there is a tension: the\nwhole-image discriminators look (implicitly) at the distribution of backgrounds with respect to the\ntranslated foregrounds. For instance, one observes that the translated horse now looks correctly\nlike a zebra, but also that the overall scene is fake, because the background still shows where horses\nlive\u2014in meadows\u2014and not where zebras live\u2014in savannas. In this sense, we really are trying to\nmake a \u2018fake\u2019 image which does not match either underlying probability distribution PS or PT.\n\nThis tension manifests itself in two behaviors: (1) the generator FS\u2192T tries to \u2018paint\u2019 background\ndirectly into the attended regions, and (2) the attention map slowly includes more and more\nbackground, converging towards a fully attended map (all values in the map converge to 1). Our\nsupplemental material provides example cases (last column in Figure 2; ablation studies Ours\u2013D\nand Ours\u2013D\u2013A in Figure 5).\nTo overcome this, we train the discriminator such that it only considers attended regions.\nSimply using sa \u2299 s is problematic, as real samples fed to the discriminator now depend on the\ninitially-untrained attention map sa. This leads to mode collapse if all networks in the GAN are\ntrained jointly. To overcome this issue, we first train the discriminators on full images for 30 epochs,\nand then switch to masked images once the attention networks AS and AT have developed.\n\nFurther, with a continuous attention map, the discriminator may receive \u2018fractional\u2019 pixel values,\nwhich may be close to zero early in training. While the generator benefits from being able to\nblend pixels at object boundaries, multiplying real images by these fractional values causes the\ndiscriminator to learn that mid gray is \u2018real\u2019 (i.e., we push the answer towards the midpoint 0 of the\nnormalized [\u22121,1] pixel space). Thus, we threshold the learned attention map for the discriminator:\n\nt_{new} = \\begin{cases} t & \\text{if } A_T(t) > \\tau \\\\\\\\ 0 & \\text{otherwise} \\end{cases} \\quad \\text{and} \\quad s'_{new} = \\begin{cases} F_{S \\to T}(s) & \\text{if } A_S(s) > \\tau \\\\\\\\ 0 & \\text{otherwise} \\end{cases} \\qquad (6)\n\nwhere tnew and s\u2032new are masked versions of target sample t and translated source sample s\u2032, which\nonly contain pixels exceeding a user-defined attention threshold \u03c4, which we set to 0.1 (Figure 3\nin the supplemental material justifies this choice). Moreover, we find that removing instance\nnormalization from the discriminator at that stage is helpful as we do not want its final prediction\nto be influenced by zero values coming from the background.\n\nThus, we update the adversarial energy Ladv of Equation 2 to:\n\n\\mathcal{L}^s_{adv}(F_{S \\to T}, A_S, D_T) = \\mathbb{E}_{t \\sim P_T(t)}\\left[\\log(D_T(t_{new}))\\right] + \\mathbb{E}_{s \\sim P_S(s)}\\left[\\log(1 - D_T(s'_{new}))\\right]. \\qquad (7)\n\nAlgorithm 1 summarizes the training procedure for learning FS\u2192T; training FT\u2192S is similar. Our\nsupplemental material provides details of the individual network configurations.\n\nAlgorithm 1 Training procedure for the source-to-target map FS\u2192T.\nInput: XS, XT, K (number of epochs), \u03bbcyc (cycle-consistency weight), \u03b1 (ADAM learning rate).\n 1: for c = 0 to K\u22121 do\n 2:   for i = 0 to |XS|\u22121 do\n 3:     Sample a data point s from XS and a data point t from XT.\n 4:     if c < 30 then\n 5:       Compute s\u2032 using Equation 1.\n 6:       Update parameters of FS\u2192T, DT, and AS using Equation 4 with learning rate \u03b1.\n 7:     else\n 8:       Compute s\u2032new and tnew using Equation 6.\n 9:       Update parameters of FS\u2192T and DT using Equations 4 and 7 with learning rate \u03b1.\n10:     end if\n11:   end for\n12: end for\nOutput: Trained networks F\u2217_{S\u2192T}, A\u2217_S and D\u2217_T.\n\nWhen optimizing the objective in Equation 7 beyond 30 epochs, real image inputs to the\ndiscriminator are now also dependent on the learned attention maps. This can lead to mode collapse\nif the training is not performed carefully. For instance, if the mask returned by the attention network\nis always zero, then the generator will always create \u2018real\u2019 images from the point of view of the\ndiscriminator, as the masked sample tnew in Equation 7 would be all black. We avoid this situation\nby stopping the training of both AS and AT after 30 epochs (Figure 2 in the supplementary material\njustifies this hyper-parameter choice).\n\nFigure 3: Input source images (top row) and their corresponding estimated attention maps (below).\nThese reflect the discriminative areas between the source and target domains. The right side of the\nfigure shows source and target attention maps, trained on horses and zebras, respectively, when applied to\nimages without horse or zebra. The lack of attention suggests appropriate attention network behavior.\n\n[Figure 4 panels, left to right: Input, Our attention, Ours, CycleGAN [1], RA [2], DiscoGAN [3], UNIT [4], DualGAN [5].]\n\nFigure 4: Image translation results for mapping apples to oranges and our learned attention.\n\n4 Experiments\n\nBaselines. We compare to DiscoGAN [3] and CycleGAN [1], which are similar, but which use\ndifferent losses: DiscoGAN uses a standard GAN loss [12], and CycleGAN uses a least-squares\nGAN loss [26]. We also compare with DualGAN [5], which is similar to CycleGAN but uses a\nWasserstein GAN loss [27]. Additionally, we compare with Liu et al.\u2019s UNIT algorithm [4], which\nleverages the latent space assumption between each pair of source/target images. Finally, we compare\nwith Wang et al.\u2019s attention module [2] by incorporating it after the first layer of our generators;\nwe refer to this implementation as \u201cRA\u201d.\n\nDatasets. We use the \u2018Apple to Orange\u2019 (A \u2194 O) and \u2018Horse to Zebra\u2019 (H \u2194 Z) datasets provided\nby Zhu et al. [1], and the \u2018Lion to Tiger\u2019 (L \u2194 T) dataset obtained from the corresponding classes\nin the Animals With Attributes (AWA) dataset [28]. 
These datasets contain objects at different scales\nacross different backgrounds, which make the image-to-image translation setting more challenging.\nNote that for the mapping Lion to Tiger we do not find it necessary to apply the attention-guided\ndiscriminator part.\n\nQualitative results. Observing our learned attention maps, we can see that our approach is able\nto learn relevant image regions and ignore the background (Figure 3). When an input image does not\ncontain any elements of the source domain, our approach does not attend to it, and so successfully\nleaves the image unedited. Holistic image translation approaches, on the other hand, are misled\nby irrelevant background content and so incorrectly hallucinate texture patterns of the target objects\n(last two rows of Figure 5).\n\n[Figure 5 panels, left to right: Input, Ours, CycleGAN [1], RA [2], DiscoGAN [3], UNIT [4], DualGAN [5].]\n\nFigure 5: Translation results. From top to bottom: Z \u2192 H, Z \u2192 H, H \u2192 Z, H \u2192 Z, A \u2192 O, O \u2192 A,\nL \u2192 T, and T \u2192 L. Below line: image translation in the absence of the source domain class (Z \u2192 H).\n\nTable 1: Kernel Inception Distance \u00d7 100 \u00b1 std. \u00d7 100 for different image translation algorithms.\nLower is better. Abbreviations: (A)pple, (O)range, (H)orse, (Z)ebra, (T)iger, (L)ion.\n\nAlgorithm      A\u2192O            O\u2192A            Z\u2192H            H\u2192Z            L\u2192T            T\u2192L\nDiscoGAN [3]   18.34 \u00b1 0.75   21.56 \u00b1 0.80   16.60 \u00b1 0.50   13.68 \u00b1 0.28   16.10 \u00b1 0.55   19.97 \u00b1 0.09\nRA [2]         12.75 \u00b1 0.49   13.84 \u00b1 0.78   10.97 \u00b1 0.26   10.16 \u00b1 0.12    9.98 \u00b1 0.13   12.68 \u00b1 0.07\nDualGAN [5]    13.04 \u00b1 0.72   12.42 \u00b1 0.88   12.86 \u00b1 0.50   10.38 \u00b1 0.31   10.18 \u00b1 0.15   10.44 \u00b1 0.04\nUNIT [4]       11.68 \u00b1 0.43   11.76 \u00b1 0.51   13.63 \u00b1 0.34   11.22 \u00b1 0.24   11.00 \u00b1 0.09   10.23 \u00b1 0.03\nCycleGAN [1]    8.48 \u00b1 0.53    9.82 \u00b1 0.51   11.44 \u00b1 0.38   10.25 \u00b1 0.25   10.15 \u00b1 0.08   10.97 \u00b1 0.04\nOurs            6.44 \u00b1 0.69    5.32 \u00b1 0.48    8.87 \u00b1 0.26    6.93 \u00b1 0.27    8.56 \u00b1 0.16    9.17 \u00b1 0.07\n\nAmong competing approaches, DiscoGAN struggles to separate the background and foreground\ncontent (see Figures 1, 4 and 5). We believe this is partly because their cycle-consistency energy\nis given the same weight as the GAN\u2019s adversarial energy. DualGAN produces slightly better results,\nalthough the background is still heavily altered. For example, the first row of Figure 1 contains\nundesirable zebra patterns in the background. CycleGAN produces more visually appealing results\nwith its least-squares GAN and appropriate weighting between the adversarial and cycle-consistency\nlosses, even though some elements of the background are still altered. For instance, CycleGAN alters\nthe writing on the chalkboard in the last row of Figure 4, and generates a blue-grey lion in the first\nrow of Figure 5 when asked to translate the zebra pinned down by the lion. The UNIT algorithm\nuses the shared latent space assumption between source and target domains to be robust to changes\nin geometric shape. For example, in the 7th row of Figure 5, we can see that the face of the lion\ncub is mapped to a tiger; however, the overall image is not realistic. 
Finally, incorporating residual\nattention (RA) modules into the image translation framework does not improve the generated image\nquality, which validates our choice of incorporating attention into images instead of on activation\nfunctions. This is particularly noticeable when the input source image does not contain any relevant\nobject, as in Figure 5 (bottom). In this case, existing algorithms are misled by irrelevant background\ncontent and incorrectly hallucinate texture patterns of the target objects. By learning attention maps,\nour algorithm successfully ignores background contents and reproduces the input images.\n\nOne limitation of our approach is visible in the third-to-last row of Figure 5, which contains an\nalbino tiger. In this challenging case of an object with outlier appearance within its domain, our\nattention network fails to identify the tiger as foreground, and so our network changes the background\nimage content, too. However, overall, our approach of learning attention maps within unsupervised\nimage-to-image translation obtains more realistic results, particularly for datasets containing objects\nat multiple scales and with different backgrounds.\n\nQuantitative results. We use the recently proposed Kernel Inception Distance (KID) [29] to\nquantitatively evaluate our image translation framework. KID computes the squared maximum mean\ndiscrepancy (MMD) between feature representations of real and generated images. Such feature\nrepresentations are extracted from the Inception network architecture [30]. In contrast to the Fr\u00e9chet Inception\nDistance [31], KID has an unbiased estimator, which makes it more reliable, especially when there\nare fewer test images than the dimensionality of the inception features. While KID is not bounded, the\nlower its value, the more shared visual similarities there are between real and generated images. 
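
As a rough illustration of the estimator described above, the following sketch computes the unbiased squared MMD between two small feature sets. It assumes the cubic polynomial kernel k(x, y) = (x.y / d + 1)^3 associated with KID in [29]; the function names are hypothetical, and the hand-written lists stand in for Inception features, so this is not the evaluation code used for the tables.

```python
# Sketch of the unbiased squared-MMD estimator behind KID, in pure Python.
# Feature vectors are plain lists here; real use would pass Inception features.

def poly_kernel(x, y):
    # Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3, with d the feature size
    # (assumed kernel choice, see the lead-in).
    d = len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (dot / d + 1.0) ** 3

def mmd2_unbiased(real, fake):
    # Unbiased estimate: within-set sums skip the i == j terms.
    m, n = len(real), len(fake)
    k_rr = sum(poly_kernel(real[i], real[j]) for i in range(m) for j in range(m) if i != j)
    k_ff = sum(poly_kernel(fake[i], fake[j]) for i in range(n) for j in range(n) if i != j)
    k_rf = sum(poly_kernel(x, y) for x in real for y in fake)
    return k_rr / (m * (m - 1)) + k_ff / (n * (n - 1)) - 2.0 * k_rf / (m * n)

real_feats = [[1.0, 0.0], [1.0, 0.0]]
fake_feats = [[0.0, 1.0], [0.0, 1.0]]
kid_value = mmd2_unbiased(real_feats, fake_feats)
```

Because the within-set sums exclude the diagonal, the estimate is unbiased but can be negative, particularly for small or very similar samples.
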
As\nwe wish the foreground of mapped images to be in the target domain T and the background to remain\nin the source domain S, a good mapping should have a low KID value when computed using both the\ntarget and the source domains. Therefore, we report the mean KID value computed between generated\nsamples using both source and target domains in Table 1. Further, to ensure consistency, the mean KID\nvalues reported are averaged over 10 different splits of size 50, randomly sampled from each domain.\nOur approach achieves the lowest KID score in all the mappings, with CycleGAN as the next\nbest performing approach. UNIT achieves the second-lowest KID score, which suggests that the\nlatent space assumption is useful in our setting. Using Wasserstein GAN allows DualGAN to follow\nclosely behind. The CycleGAN variant using residual attention modules (RA) produces worse results\nthan regular CycleGAN but comparable to UNIT, which suggests that applying attention on the\nfeature space does not considerably improve performance. Finally, by giving the same weight to\nthe adversarial and cyclic energy, DiscoGAN achieves the worst performance in terms of mean KID\nvalues, which is consistent with our qualitative results.\n\nAblation Study. First, we evaluate the cycle-consistency loss governed by Equation 3. This is\nmotivated by using attention to constrain the mapping between only relevant instances, which can\nbe considered as a weak form of cycle consistency. The cycle-consistency loss plays an important\nrole in making attention maps sharp; without it, we notice an onset of mode collapse in GAN\ntraining. As a result, we obtain a model (\u2018Ours\u2013cycle\u2019) with very high KID (Table 2).\n\nTable 2: Kernel Inception Distance \u00d7 100 \u00b1 std. \u00d7 100 for ablations of our algorithm. Lower is better.\nAbbreviations: (H)orse, (Z)ebra.\n\nAlgorithm       Z\u2192H            H\u2192Z\nOurs\u2013cycle      64.55 \u00b1 0.34   41.48 \u00b1 0.34\nOurs\u2013cycleAtt    9.46 \u00b1 0.38    7.79 \u00b1 0.23\nOurs\u2013As         10.90 \u00b1 0.25    7.62 \u00b1 0.25\nOurs\u2013At          9.30 \u00b1 0.45    7.80 \u00b1 0.21\nOurs\u2013D           9.26 \u00b1 0.22    7.77 \u00b1 0.35\nOurs\u2013D\u2013A         9.86 \u00b1 0.32    8.28 \u00b1 0.34\nOurs             8.87 \u00b1 0.26    6.93 \u00b1 0.27\n\nNext, we test the effect of computing attention on the inverse mapping. Instead of computing\na new attention map AT (s\u2032), we use the formerly computed AS(s). This model (\u2018Ours\u2013cycleAtt\u2019)\nperforms worse, because computing attention on both the mapping and its inverse indirectly enforces\nsimilarity between both attention maps AT (s\u2032) and AS(s).\n\nFurther, we evaluate behavior with only a single attention network: \u2018Ours\u2013As\u2019 and \u2018Ours\u2013At\u2019\ncorresponding to AS and AT, respectively. These approaches are the best performing after our\nfinal implementation: AS acts on s, but also on t\u2032 via the inverse mapping, which influences the\ngenerators to still only translate relevant regions. Moreover, we measure the importance of our\nattention-guided discriminator by replacing it with a whole-image discriminator while stopping the\ntraining of the attention networks (\u2018Ours\u2013D\u2019). For this model, mean KID values are higher than our\nfinal formulation because the generator tries to paint elements of the background onto the foreground\nto compensate for the variance between foreground and background in the source and target domains.\nFinally, we consider the contemporaneous Attention GAN of Chen et al. [25], which also learns\nan attention map for image translation through a cyclic loss. 
We compare against their approach using an\nablated version of our software implementation, as we await a code release from the authors for\na direct results comparison. Our approach differs in two ways: first, we feed the holistic image to\nthe discriminator for the first 30 epochs, and afterwards show it only the masked image; second,\nwe stop the training of the attention networks after 30 epochs to prevent them from focusing on the\nbackground as well. These two differences reduce errors caused by spurious image additions from\nF, and remove the need for the optional supervision introduced by Chen et al. to help remove\nbackground artifacts and better \u2018focus\u2019 the attention map on the foreground. Table 2 demonstrates\nthis quantitatively: the ablated model (\u2018Ours\u2013D\u2013A\u2019) has higher KID scores than our final implementation.\nPlease see the supplemental document for visual examples.\n\n5 Conclusion\n\nWhile recent unsupervised image-to-image translation techniques are able to map relevant image\nregions, they also inadvertently map irrelevant regions. As a result, the generated images\nfail to look realistic, as the background and foreground are generally not blended properly. By\nincorporating an attention mechanism into unsupervised image-to-image translation, we demonstrate\nsignificant improvements in the quality of generated images. Our simple algorithm leverages the\ndiscriminator to learn accurate attention maps with no additional supervision. This suggests that\nour learned attention maps reflect where the discriminator looks before deciding whether an image\nis real or fake, making them an appropriate tool for investigating the behavior of adversarial networks.\n\nFuture work. 
Although our approach can produce appealing translation results in the presence\nof multi-scale objects and varying backgrounds, the overall approach is still not robust to shape\nchanges between domains, e.g., making Pegasus by translating a horse into a bird. Our transfer\nmust happen within attended regions in the image, but shape change typically requires altering\nparts outside these regions. In the supplementary material, we provide an example of this\nlimitation via the zebra-to-lion mapping. Our code is released in the following GitHub repository:\nhttps://github.com/AlamiMejjati/Unsupervised-Attention-guided-Image-to-Image-Translation.\n\nAcknowledgements: Youssef A. Mejjati thanks the Marie Sklodowska-Curie grant agreement\nNo 665992, and the UK\u2019s EPSRC Centre for Doctoral Training in Digital Entertainment (CDE),\nEP/L016540/1. Kwang In Kim, Christian Richardt, and Darren Cosker thank RCUK EP/M023281/1.\n\nReferences\n\n[1] J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.\n\n[2] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.\n\n[3] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. JMLR, 2017.\n\n[4] M. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.\n\n[5] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.\n\n[6] Y. Cao, Z. Zhou, W. Zhang, and Y. Yu. Unsupervised diverse colorization via generative adversarial networks. In ECML-PKDD, 2017.\n\n[7] C. Ledig, L. Theis, F. Husz\u00e1r, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. 
Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.\n\n[8] B. Wu, H. Duan, Z. Liu, and G. Sun. SRPGAN: Perceptual generative adversarial network for single image super resolution. arXiv preprint arXiv:1712.05927, 2017.\n\n[9] P. Isola, J. Zhu, T. Zhou, and A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.\n\n[10] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In CVPR, 2018.\n\n[11] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi. BAGAN: Data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655, 2018.\n\n[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.\n\n[13] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, 2016.\n\n[14] R. Rensink. The dynamic representation of scenes. Visual Cognition, 7(1\u20133):17\u201342, 2000.\n\n[15] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.\n\n[16] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.\n\n[17] X. Huang, M. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.\n\n[18] S. Ma, J. Fu, C. Wen Chen, and T. Mei. DA-GAN: Instance-level image translation by deep attention generative adversarial networks. In CVPR, 2018.\n\n[19] N. Liu, J. Han, and M.-H. Yang. PiCANet: Learning pixel-wise contextual attention for saliency detection. In CVPR, 2018.\n\n[20] J. Kuen, Z. Wang, and G. Wang. Recurrent attentional networks for saliency detection. In CVPR, 2016.\n\n[21] S. Jetley, N. Lord, N. Lee, and P. Torr. Learn to pay attention. In ICLR, 2018.\n\n[22] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. 
Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.\n\n[23] J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. In ICLR, 2017.\n\n[24] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.\n\n[25] X. Chen, C. Xu, X. Yang, and D. Tao. Attention-GAN for object transfiguration in wild images. In ECCV, 2018.\n\n[26] X. Mao, Q. Li, H. Xie, R. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.\n\n[27] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.\n\n[28] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.\n\n[29] M. Bi\u0144kowski, D. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In ICLR, 2018.\n\n[30] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.\n\n[31] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Klambauer. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In NIPS, 2017.\n", "award": [], "sourceid": 1862, "authors": [{"given_name": "Youssef", "family_name": "Alami Mejjati", "institution": "University of Bath"}, {"given_name": "Christian", "family_name": "Richardt", "institution": "University of Bath"}, {"given_name": "James", "family_name": "Tompkin", "institution": "Brown University"}, {"given_name": "Darren", "family_name": "Cosker", "institution": "University of Bath"}, {"given_name": "Kwang In", "family_name": "Kim", "institution": "University of Bath"}]}