{"title": "Controllable Text-to-Image Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 2065, "page_last": 2075, "abstract": "In this paper, we propose a novel controllable text-to-image generative adversarial network (ControlGAN), which can effectively synthesise high-quality images and also control parts of the image generation according to natural language descriptions. To achieve this, we introduce a word-level spatial and channel-wise attention-driven generator that can disentangle different visual attributes, and allow the model to focus on generating and manipulating subregions corresponding to the most relevant words. Also, a word-level discriminator is proposed to provide fine-grained supervisory feedback by correlating words with image regions, facilitating training an effective generator which is able to manipulate specific visual attributes without affecting the generation of other content. Furthermore, perceptual loss is adopted to reduce the randomness involved in the image generation, and to encourage the generator to manipulate specific attributes required in the modified text. Extensive experiments on benchmark datasets demonstrate that our method outperforms existing state of the art, and is able to effectively manipulate synthetic images using natural language descriptions.", "full_text": "Controllable Text-to-Image Generation\n\nBowen Li, Xiaojuan Qi, Thomas Lukasiewicz, Philip H. S. Torr\n\nUniversity of Oxford\n\n{bowen.li, thomas.lukasiewicz}@cs.ox.ac.uk\n\n{xiaojuan.qi, philip.torr}@eng.ox.ac.uk\n\nAbstract\n\nIn this paper, we propose a novel controllable text-to-image generative adversar-\nial network (ControlGAN), which can effectively synthesise high-quality images\nand also control parts of the image generation according to natural language de-\nscriptions. To achieve this, we introduce a word-level spatial and channel-wise\nattention-driven generator that can disentangle different visual attributes, and allow\nthe model to focus on generating and manipulating subregions corresponding to\nthe most relevant words. Also, a word-level discriminator is proposed to pro-\nvide \ufb01ne-grained supervisory feedback by correlating words with image regions,\nfacilitating training an effective generator which is able to manipulate speci\ufb01c\nvisual attributes without affecting the generation of other content. Furthermore,\nperceptual loss is adopted to reduce the randomness involved in the image gen-\neration, and to encourage the generator to manipulate speci\ufb01c attributes required\nin the modi\ufb01ed text. Extensive experiments on benchmark datasets demonstrate\nthat our method outperforms existing state of the art, and is able to effectively\nmanipulate synthetic images using natural language descriptions. Code is available\nat https://github.com/mrlibw/ControlGAN.\n\n1\n\nIntroduction\n\nGenerating realistic images that semantically match given text descriptions is a challenging problem\nand has tremendous potential applications, such as image editing, video games, and computer-aided\ndesign. 
Recently, thanks to the success of generative adversarial networks (GANs) [4, 6, 15] in\ngenerating realistic images, text-to-image generation has made remarkable progress [16, 25, 27] by\nimplementing conditional GANs (cGANs) [5, 16, 17], which are able to generate realistic images\nconditioned on given text descriptions.\nHowever, current generative networks are typically uncontrollable, which means that if users change\nsome words of a sentence, the synthetic image would be signi\ufb01cantly different from the one generated\nfrom the original text as shown in Fig. 1. When the given text description (e.g., colour) is changed,\ncorresponding visual attributes of the bird are modi\ufb01ed, but other unrelated attributes (e.g., the pose\nand position) are changed as well. This is typically undesirable in real-world applications, when a\nuser wants to further modify the synthetic image to satisfy her preferences.\nThe goal of this paper is to generate images from text, and also allow the user to manipulate synthetic\nimages using natural language descriptions, in one framework. In particular, we focus on modifying\nvisual attributes (e.g., category, texture, and colour) of objects in the generated images by changing\ngiven text descriptions. To achieve this, we propose a novel controllable text-to-image generative\nadversarial network (ControlGAN), which can synthesise high-quality images, and also allow the\nuser to manipulate objects\u2019 attributes, without affecting the generation of other content.\nOur ControlGAN contains three novel components. The \ufb01rst component is the word-level spatial\nand channel-wise attention-driven generator, where an attention mechanism is exploited to allow the\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThis bird has a yellow back and rump, gray\nouter rectrices, and a light gray breast.\n(original text)\n\nThis bird has a red back and rump, yellow\nouter rectrices, and a light white breast.\n(modi\ufb01ed text)\n\nText\n\n[27]\n\n[25]\n\nOurs\n\nOriginal\n\nFigure 1: Examples of modifying synthetic images using a natural language description. The current\nstate of the art methods generate realistic images, but fail to generate plausible images when we\nslightly change the text. In contrast, our method allows parts of the image to be manipulated in\ncorrespondence to the modi\ufb01ed text description while preserving other unrelated content.\n\ngenerator to synthesise subregions corresponding to the most relevant words. Our generator follows\na multi-stage architecture [25, 28] that synthesises images from coarse to \ufb01ne, and progressively\nimproves the quality. The second component is a word-level discriminator, where the correlation\nbetween words and image subregions is explored to disentangle different visual attributes, which\ncan provide the generator with \ufb01ne-grained training signals related to visual attributes. The third\ncomponent is the adoption of the perceptual loss [7] in text-to-image generation, which can reduce\nthe randomness involved in the generation, and enforce the generator to preserve visual appearance\nrelated to the unmodi\ufb01ed text.\nTo this end, an extensive analysis is performed, which demonstrates that our method can effectively\ndisentangle different attributes and accurately manipulate parts of the synthetic image without losing\ndiversity. 
Also, experimental results on the CUB [23] and COCO [10] datasets show that our method\noutperforms existing state of the art both qualitatively and quantitatively.\n\n2 Related Work\n\nText-to-image Generation. Recently, there has been a lot of work and interest in text-to-image\ngeneration. Mansimov et al. [11] proposed the AlignDRAW model that used an attention mechanism\nover words of a caption to draw image patches in multiple stages. Nguyen et al. [13] introduced an\napproximate Langevin approach to synthesise images from text. Reed et al. [16] \ufb01rst applied the\ncGAN to generate plausible images conditioned on text descriptions. Zhang et al. [27] decomposed\ntext-to-image generation into several stages generating image from coarse to \ufb01ne. However, all above\napproaches mainly focus on generating a new high-quality image from a given text, and cannot allow\nthe user to manipulate the generation of speci\ufb01c visual attributes using natural language descriptions.\n\nImage-to-image translation. Our work is also closely related to conditional image manipulation\nmethods. Cheng et al. [3] produced high-quality image parsing results from verbal commands. Zhu\net al. [31] proposed to change the colour and shape of an object by manipulating latent vectors.\nBrock et al. [2] introduced a hybrid model using VAEs [9] and GANs, which achieved an accurate\nreconstruction without loss of image quality. Recently, Nam et al. [12] built a model for multi-modal\nlearning on both text descriptions and input images, and proposed a text-adaptive discriminator\nwhich utilised word-level text-image matching scores as supervision. However, they adopted a global\npooling layer to extract image features, which may lose important \ufb01ne-grained spatial information.\nMoreover, the above approaches focus only on image-to-image translation instead of text-to-image\ngeneration, which is probably more challenging.\n\nAttention. The attention mechanism has shown its ef\ufb01ciency in various research \ufb01elds including\nimage captioning [24, 30], machine translation [1], object detection [14, 29], and visual question\nanswering [26]. It can effectively capture task-relevant information and reduce the interference from\nless important one. Recently, Xu et al. [25] built the AttnGAN model that designed a word-level\nspatial attention to guide the generator to focus on subregions corresponding to the most relevant\nword. However, spatial attention only correlates words with partial regions without taking channel\n\n2\n\n\fFigure 2: The architecture of our proposed ControlGAN. In (b), Lcorre is the correlation loss discussed\nin Sec. 3.3. In (c), Lper is the perceptual loss discussed in Sec. 3.4.\n\ninformation into account. Also, different channels of features in CNNs may have different purposes,\nand it is crucial to avoid treating all channels without distinction, such that the most relevant channels\nin the visual features can be fully exploited.\n\n3 Controllable Generative Adversarial Networks\n\nGiven a sentence S, we aim to synthesise a realistic image I(cid:48) that semantically aligns with S (see\nFig. 2), and also make this generation process controllable \u2013 if S is modi\ufb01ed to be Sm, the synthetic\nresult \u02dcI(cid:48) should semantically match Sm while preserving irrelevant content existing in I(cid:48) (shown in\nFig. 4). 
To achieve this, we propose three novel components: 1) a channel-wise attention module, 2)\na word-level discriminator, and 3) the adoption of the perceptual loss in the text-to-image generation.\nWe elaborate our model as follows.\n\n3.1 Architecture\n\nWe adopt the multi-stage AttnGAN [25] as our backbone architecture (see Fig. 2). Given a sentence\nS, the text encoder \u2013 a pre-trained bidirectional RNN [25] \u2013 encodes the sentence S into a sentence\nfeature s \u2208 RD with dimension D describing the whole sentence, and word features w \u2208 RD\u00d7L\nwith length L (i.e., number of words) and dimension D. Following [27], we also apply conditioning\naugmentation (CA) to s. The augmented sentence feature \u02dcs is further concatenated with a random\nvector z to serve as the input to the \ufb01rst stage. The overall framework generates an image from coarse-\nto \ufb01ne-scale in multiple stages, and, in each stage, the network produces a hidden visual feature vi,\nwhich is the input to the corresponding generator Gi to produce a synthetic image. Spatial attention\n[25] and our proposed channel-wise attention modules take w and vi as inputs, and output attentive\nword-context features. These attentive features are further concatenated with the hidden feature vi\nand then serve as input for the next stage.\nThe generator exploits the attention mechanism via incorporating a spatial attention module [25] and\nthe proposed channel-wise attention module. The spatial attention module [25] can only correlate\nwords with individual spatial locations without taking channel information into account. Thus, we\nintroduce a channel-wise attention module (see Sec. 3.2) to exploit the connection between words\nand channels. We experimentally \ufb01nd that the channel-wise attention module highly correlates\nsemantically meaningful parts with corresponding words, while the spatial attention focuses on colour\ndescriptions (see Fig. 6). Therefore, our proposed channel-wise attention module, together with the\nspatial attention, can help the generator disentangle different visual attributes, and allow it to focus\nonly on the most relevant subregions and channels.\n\n3\n\ncorreLper(a) word-level spatial and channel-wise attention-driven generator word featuressentence featureCAG3z\u223cN(0,1)spatial attentionchannel-wise attentionA bird with a red breast, red eyebrow and a red crown.G2G1D1D2D3Image EncoderText Encoderword-leveldiscriminatormatchedmismatched(b) discriminatorreal/fakerealfakenwSText Encoderfeaturesfunctionlayerfakespatial attentionchannel-wise attentionlossLVGG-16relu2_2(c) perceptual loss netwroksw\u02dcsv1v2v3perceptual lossreal\fFigure 3: The architecture of proposed channel-wise attention module and word-level discriminator.\n\n3.2 Channel-Wise Attention\n\nAt the kth stage, the channel-wise attention module (see Fig. 3 (a)) takes two inputs: the word features\nw and hidden visual features vk \u2208 RC\u00d7(Hk\u2217Wk), where Hk and Wk denote the height and width of\nthe feature map at stage k. The word features w are \ufb01rst mapped into the same semantic space as the\nvisual features vk via a perception layer Fk, producing \u02dcwk = Fkw, where Fk \u2208 R(Hk\u2217Wk)\u00d7D.\nThen, we calculate the channel-wise attention matrix mk \u2208 RC\u00d7L by multiplying the converted word\nfeatures \u02dcwk and visual features vk, denoted as mk = vk \u02dcwk. Thus, mk aggregates correlation values\nbetween channels and words across all spatial locations. 
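For concreteness, the complete channel-wise attention module, including the softmax normalisation of Eq. (1) below and the subsequent word-feature weighting, can be sketched in a few lines of PyTorch-style code; the tensor shapes follow the notation above, while the class and variable names are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseAttention(nn.Module):
    # Channel-wise attention at stage k (illustrative sketch, not the released code).
    # w: (B, D, L) word features; v: (B, C, Hk*Wk) hidden visual features.
    def __init__(self, word_dim, num_pixels):
        super().__init__()
        # Perception layer F_k maps word features into the (Hk*Wk)-dimensional
        # semantic space of the visual features.
        self.perception = nn.Linear(word_dim, num_pixels, bias=False)

    def forward(self, w, v):
        w_tilde = self.perception(w.transpose(1, 2)).transpose(1, 2)  # (B, Hk*Wk, L)
        m = torch.bmm(v, w_tilde)                            # m_k = v_k w_tilde_k, (B, C, L)
        alpha = F.softmax(m, dim=2)                           # Eq. (1): normalise over the L words
        f_alpha = torch.bmm(alpha, w_tilde.transpose(1, 2))   # f_k = alpha_k (w_tilde_k)^T, (B, C, Hk*Wk)
        return f_alpha

# Example call with D = 256 and L = 18 (the text-encoder dimensions reported in Sec. 4.2);
# the channel count and spatial size here are illustrative:
# f = ChannelWiseAttention(256, 64 * 64)(torch.randn(2, 256, 18), torch.randn(2, 32, 64 * 64))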
Next, mk is normalised by the softmax\nfunction to generate the normalised channel-wise attention matrix \u03b1k as\n\n(cid:80)L\u22121\n\nexp(mk\ni,j)\nl=0 exp(mk\n\ni,l)\n\n\u03b1k\n\ni,j =\n\n.\n\n(1)\n\ni,j represents the correlation between the ith channel in the visual features vk\n\nThe attention weight \u03b1k\nand the jth word in the sentence S, and higher value means larger correlation.\nEquipped with the channel-wise attention matrix \u03b1k, we obtain the \ufb01nal channel-wise attention\nfeatures f \u03b1\nk is a dynamic rep-\nresentation weighted by the correlation between words and corresponding channels in the visual\nfeatures. Thus, channels with high correlation values are enhanced resulting in a high response to\ncorresponding words, which can facilitate disentangling word attributes into different channels, and\nalso reduce the in\ufb02uence from irrelevant channels by assigning a lower correlation.\n\nk \u2208 RC\u00d7(Hk\u2217Wk), denoted as f \u03b1\n\nk = \u03b1k( \u02dcwk)T . Each channel in f \u03b1\n\n3.3 Word-Level Discriminator\n\nTo encourage the generator to modify only parts of the image according to the text, the discriminator\nshould provide the generator with \ufb01ne-grained training feedback, which can guide the generation\nof subregions corresponding to the most relevant words. Actually, the text-adaptive discriminator\n[12] also explores the word-level information in the discriminator, but it adopts a global average\npooling layer to output a 1D vector as image feature, and then calculates the correlation between\nimage feature and each word. By doing this, the image feature may lose important spatial information,\nwhich provides crucial cues for disentangling different visual attributes. To address the issue, we\npropose a novel word-level discriminator inspired by [12] to explore the correlation between image\nsubregions and each word; see Fig. 3 (b).\nOur word-level discriminator takes two inputs: 1) word features w, w(cid:48) encoded from the text encoder,\nwhich follows the same architecture as the one (see Fig. 2 (a)) used in the generator, where w and\nw(cid:48) denote word features encoded from the original text S and a randomly sampled mismatched\ntext, respectively, and 2) visual features nreal, nfake, both encoded by a GoogleNet-based [22] image\nencoder from the real image I and generated images I(cid:48), respectively.\n\n4\n\ncorreLwvkf\u03b1kD\u00d7L\u02dcwk:(Hk*Wk)\u00d7LC\u00d7(Hk*Wk)mk:C\u00d7LL\u00d7(Hk*Wk)(a) channel-wise attention (b) word-level discriminatornwMatMulSoftmaxMatMulRepeatMul.CosineSimilaritySigmoid\u2211D\u00d7Lb:D\u00d7LC\u00d7(H*W)L\u00d7Dm:L\u00d7(H*W)(H*W)\u00d7L\u02dcn:D\u00d7(H*W)\u03b3\u2032:D\u00d7L\u02dcb:D\u00d7LTrans.Trans.Trans.MatMulSoftmaxMatMulTrans.TransposeMatMulHadamard Product SummationMul.RepeatAlong Row DirectionMatrix MultiplicationFkF\u2032\u03b1k:C\u00d7LC\u00d7(Hk*Wk)\u03b2:L\u00d7(H*W)Word-Level Self Attention[12]\u03b3:1\u00d7LFk,F\u2032Perception Layer\u2211\fFor simplicity, in the following, we use n \u2208 RC\u00d7(H\u2217W ) to represent visual features nreal and nfake,\nand use w \u2208 RD\u00d7L for both original and mismatched word features. The word-level discriminator\ncontains a perception layer F (cid:48) that is used to align the channel dimension of visual feature n and\nword feature w, denoted as \u02dcn = F (cid:48)n, where F (cid:48) \u2208 RD\u00d7C is a weight matrix to learn. 
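The remaining steps, formalised in Eqs. (2) and (3) below, can be sketched in the same PyTorch style; this is an assumed implementation in which the word-level self-attention of [12] is stood in for by a simple learned word-scoring layer, so the module is illustrative rather than the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelDiscriminator(nn.Module):
    # Computes the word-level correlation score L_corre(I, S) of Sec. 3.3 (sketch).
    # w: (B, D, L) word features; n: (B, C, H*W) visual features from the image encoder.
    def __init__(self, word_dim, visual_channels):
        super().__init__()
        self.perception = nn.Linear(visual_channels, word_dim, bias=False)  # F'
        self.word_score = nn.Linear(word_dim, 1)  # stand-in for the self-attention of [12]

    def forward(self, w, n):
        n_tilde = self.perception(n.transpose(1, 2)).transpose(1, 2)  # n~ = F' n, (B, D, H*W)
        m = torch.bmm(w.transpose(1, 2), n_tilde)                     # m = w^T n~, (B, L, H*W)
        beta = F.softmax(m, dim=2)                     # Eq. (2): normalise over image subregions
        b = torch.bmm(n_tilde, beta.transpose(1, 2))                  # b = n~ beta^T, (B, D, L)
        gamma = F.softmax(self.word_score(w.transpose(1, 2)), dim=1)  # word importance, (B, L, 1)
        b_tilde = b * gamma.transpose(1, 2)            # b~ = b element-wise-times gamma'
        r = torch.sigmoid(F.cosine_similarity(b_tilde, w, dim=1))     # Eq. (3), per-word score, (B, L)
        return r.sum(dim=1)                            # L_corre(I, S): sum over all words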
Then, the\nword-context correlation matrix m \u2208 RL\u00d7(H\u2217W ) can be derived via m = wT \u02dcn, and is further\nnormalised by the softmax function to get a correlation matrix \u03b2:\n\n\u03b2i,j =\n\n,\n\n(2)\n\n(cid:80)(H\u2217W )\u22121\n\nl=0\n\nexp(mi,j)\n\nexp(mi,l)\n\nwhere \u03b2i,j represents the correlation value between the ith word and the jth subregion of the image.\nThen, the image subregion-aware word features b \u2208 RD\u00d7L can be obtained by b = \u02dcn\u03b2T , which\naggregates all spatial information weighted by the word-context correlation matrix \u03b2.\nAdditionally, to further reduce the negative impact of less important words, we adopt the word-level\nself-attention [12] to derive a 1D vector \u03b3 with length L re\ufb02ecting the relative importance of each\nword. Then, we repeat \u03b3 by D times to produce \u03b3(cid:48), which has the same size as b. Next, b is further\nreweighted by \u03b3(cid:48) to get \u02dcb, denoted as \u02dcb = b (cid:12) \u03b3(cid:48), where (cid:12) represents element-wise multiplication.\nFinally, we derive the correlation between the ith word and the whole image as Eq. (3):\n\nri = \u03c3(\n\n(\u02dcbi)T wi\n||\u02dcbi|| ||wi|| ),\nall word-context correlations, denoted as Lcorre(I, S) =(cid:80)L\u22121\n\nwhere \u03c3 is the sigmoid function, ri evaluates the correlation between the ith word and the image, and\n\u02dcbi and wi represent the ith column of b and w, respectively.\nTherefore, the \ufb01nal correlation value Lcorre between image I and sentence S is calculated by summing\ni=0 ri. By doing so, the generator can\nreceive \ufb01ne-grained feedback from the word-level discriminator for each visual attribute, which can\nfurther help supervise the generation and manipulation of each subregion independently.\n\n(3)\n\n3.4 Perceptual Loss\n\nWithout adding any constraint on text-irrelevant regions (e.g., backgrounds), the generated results can\nbe highly random, and may also fail to be semantically consistent with other content. To mitigate this\nrandomness, we adopt the perceptual loss [7] based on a 16-layer VGG network [21] pre-trained on\nthe ImageNet dataset [18]. The network is used to extract semantic features from both the generated\nimage I(cid:48) and the real image I, and the perceptual loss is de\ufb01ned as\n\nLper(I(cid:48), I) =\n\n1\n\nCiHiWi\n\n(cid:107)\u03c6i(I(cid:48)) \u2212 \u03c6i(I)(cid:107)2\n2 ,\n\n(4)\n\nwhere \u03c6i(I) is the activation of the ith layer of the VGG network, and Hi and Wi are the height and\nwidth of the feature map, respectively.\nTo our knowledge, we are the \ufb01rst to apply the perceptual loss [7] in controllable text-to-image\ngeneration, which can reduce the randomness involved in the image generation by matching feature\nspace.\n\n3.5 Objective Functions\nThe generator and discriminator are trained alternatively by minimising both the generator loss LG\nand discriminator loss LD.\nGenerator objective. The generator loss LG as Eq. 
(5) contains an adversarial loss LGk, a text-\nimage correlation loss Lcorre, a perceptual loss Lper, and a text-image matching loss LDAMSM [25].\n\nLG =\n\n(LGk + \u03bb2Lper(Ik\n\n(cid:48), Ik) + \u03bb3 log(1 \u2212 Lcorre(I(cid:48)\n\nk, S))) + \u03bb4LDAMSM,\n\n(5)\n\nk=1\n\nwhere K is the number of stages, Ik is the real image sampled from the true image distribution Pdata\n(cid:48) is the generated image at the kth stage sampled from the model distribution P Gk,\nat stage k, Ik\n\n5\n\nK(cid:88)\n\n\f\u03bb2, \u03bb3, \u03bb4 are hyper-parameters controlling different losses, Lper is the perceptual loss described in\nSec. 3.4, which puts constraint on the generation process to reduce the randomness, the LDAMSM\n[25] is used to measure text-image matching score based on the cosine similarity, and Lcorre re\ufb02ects\nthe correlation between the generated image and the given text description considering spatial\ninformation.\nThe adversarial loss LGk is composed of the unconditional and conditional adversarial losses shown\nin Eq. (6): the unconditional adversarial loss is applied to make the synthetic image be real, and the\nconditional adversarial loss is utilised to make the generated image match the given text S.\n\n(cid:124)\nLGk = \u2212 1\n2\n\nEIk\n\n(cid:48)\u223cP Gk\n\n(cid:2)log(Dk(Ik\n(cid:123)(cid:122)\n\n(cid:48)))(cid:3)\n(cid:125)\n\n\u2212 1\n2\n\n(cid:124)\n\n(cid:2)log(Dk(Ik\n(cid:123)(cid:122)\n\n(cid:48), S))(cid:3)\n(cid:125)\n\nEIk\n\n(cid:48)\u223cP Gk\n\n.\n\n(6)\n\nunconditional adversarial loss\n\nconditional adversarial loss\n\nDiscriminator objective. The \ufb01nal loss function for training the discriminator D is de\ufb01ned as:\n\nLD =\n\n(LDk + \u03bb1(log(1 \u2212 Lcorre(Ik, S)) + log Lcorre(Ik, S(cid:48)))),\n\n(7)\n\nK(cid:88)\n\nk=1\n\nwhere Lcorre is the correlation loss determining whether word-related visual attributes exist in the\nimage (see Sec. 3.3), S(cid:48) is a mismatched text description that is randomly sampled from the dataset\nand is irrelevant to Ik, and \u03bb1 is a hyper-parameter controlling the importance of additional losses.\nThe adversarial loss LDk contains two components: the unconditional adversarial loss determines\nwhether the image is real, and the conditional adversarial loss determines whether the given image\nmatches the text description S:\n\nEIk\n\n(cid:123)(cid:122)\nEIk\u223cPdata [log(Dk(Ik))] \u2212 1\n2\n(cid:123)(cid:122)\nEIk\u223cPdata [log(Dk(Ik, S))] \u2212 1\n2\n\nunconditional adversarial loss\n\n(cid:48)\u223cP Gk\n\nEIk\n\n(cid:48)\u223cP Gk\n\nconditional adversarial loss\n\n(cid:2)log(1 \u2212 Dk(Ik\n(cid:48)))(cid:3)\n(cid:125)\n(cid:48), S))(cid:3)\n(cid:2)log(1 \u2212 Dk(Ik\n(cid:125)\n\n(8)\n\n.\n\n(cid:124)\nLDk =\u2212 1\n2\n(cid:124)\n\n\u2212 1\n2\n\n4 Experiments\n\nTo evaluate the effectiveness of our approach, we conduct extensive experiments on the CUB bird\n[23] and the MS COCO [10] datasets. We compare with two state of the art GAN methods on\ntext-to-image generation, StackGAN++ [28] and AttnGAN [25]. Results for the state of the art are\nreproduced based on the code released by the authors.\n\n4.1 Datasets\n\nOur method is evaluated on the CUB bird [23] and the MS COCO [10] datasets. The CUB dataset\ncontains 8,855 training images and 2,933 test images, and each image has 10 corresponding text\ndescriptions. As for the COCO dataset, it contains 82,783 training images and 40,504 validation\nimages, and each image has 5 corresponding text descriptions. 
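As a concrete illustration of the perceptual loss in Eq. (4), the sketch below evaluates it at the relu2_2 layer of VGG-16, the layer reported in the implementation details of Sec. 4.2; the torchvision-based layer indexing and input conventions are assumptions of this sketch rather than the authors' code.

import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    # Perceptual loss of Eq. (4) at a single VGG-16 layer (illustrative sketch).
    def __init__(self):
        super().__init__()
        # relu2_2 is assumed to correspond to the first nine modules of
        # torchvision's vgg16().features; the extractor is frozen.
        vgg = models.vgg16(pretrained=True).features[:9]
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()

    def forward(self, fake, real):
        # fake, real: (B, 3, H, W) images, assumed already normalised as VGG expects.
        phi_fake, phi_real = self.vgg(fake), self.vgg(real)
        c, h, w = phi_real.shape[1:]
        # (1 / (C_i H_i W_i)) * || phi_i(I') - phi_i(I) ||_2^2, averaged over the batch
        return ((phi_fake - phi_real) ** 2).sum(dim=(1, 2, 3)).mean() / (c * h * w)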
We preprocess these two datasets\nbased on the methods introduced in [27].\n\n4.2\n\nImplementation\n\nThere are three stages (K = 3) in our ControlGAN generator following [25]. The three scales are\n64\u00d7 64, 128\u00d7 128, and 256\u00d7 256, and spatial and channel-wise attentions are applied at the stages 2\nand 3. The text encoder is a pre-trained bidirectional LSTM [20] to encode the given text description\ninto a sentence feature with dimension 256 and word features with length 18 and dimension 256.\nIn the perceptual loss, we compute the content loss at layer relu2_2 of VGG-16 [21] pre-trained on\nthe ImageNet [18]. The whole network is trained using the Adam optimiser [8] with the learning\nrate 0.0002. The hyper-parameters \u03bb1, \u03bb2, \u03bb3, and \u03bb4 are set to 0.5, 1, 1, and 5 for both datasets,\nrespectively.\n\n6\n\n\fTable 1: Quantitative comparison: Inception Score, R-precision, and L2 reconstruction error of state\nof the art and ControlGAN on the CUB and COCO datasets.\n\nCUB\n\nCOCO\n\nMethod\nStackGAN++\nAttnGAN\nOurs\n\nIS\n4.04 \u00b1 .05\n4.36 \u00b1 .03\n4.58 \u00b1 .09\n\nTop-1 Acc(%) L2 error\n45.28 \u00b1 3.72\n67.82 \u00b1 4.43\n69.33 \u00b1 3.23\n\n0.29\n0.26\n0.18\n\nIS\n8.30 \u00b1 .10\n25.89 \u00b1 .47\n24.06 \u00b1 .60\n\nTop-1 Acc(%) L2 error\n72.83 \u00b1 3.17\n85.47 \u00b1 3.69\n82.43 \u00b1 2.43\n\n0.32\n0.40\n0.17\n\n4.3 Comparison with State of the Art\n\nQuantitative results. We adopt the Inception Score [19] to evaluate the quality and diversity of the\ngenerated images. However, as the Inception Score cannot re\ufb02ect the relevance between an image\nand a text description, we utilise R-precision [25] to measure the correlation between a generated\nimage and its corresponding text. We compare the top-1 text-to-image retrieval accuracy (Top-1 Acc)\non the CUB and COCO datasets following [12].\nQuantitative results are shown in Table 1, our method achieves better IS and R-precision values on\nthe CUB dataset compared with the state of the art, and has a competitive performance on the COCO\ndataset. This indicates that our method can generate higher-quality images with better diversity,\nwhich semantically align with the text descriptions.\nTo further evaluate whether the model can generate controllable results, we compute the L2 recon-\nstruction error [12] between the image generated from the original text and the one from the modi\ufb01ed\ntext shown in Table 1. Compared with other methods, ControlGAN achieves a signi\ufb01cantly lower\nreconstruction error, which demonstrates that our method can better preserve content in the image\ngenerated from the original text.\nQualitative results. We show qualitative comparisons in Fig. 4. As we can see, according to\nmodifying given text descriptions, our approach can successfully manipulate speci\ufb01c visual attributes\naccurately. Also, our method can even handle out-of-distribution queries, e.g., red zebra on a river\nshown in the last two columns of Fig. 4. All the above indicates that our approach can manipulate\ndifferent visual attributes independently, which demonstrates the effectiveness of our approach in\ndisentangling visual attributes for text-to-image generation.\nFig. 5 shows the visual comparison between ControlGAN, AttnGAN [25], and StackGAN++ [28].\nIt can be observed that when the text is modi\ufb01ed, the two compared approaches are more likely to\ngenerate new content, or change some visual attributes that are not relevant to the modi\ufb01ed text. 
For\ninstance, as shown in the \ufb01rst two columns, when we modify the colour attributes, StackGAN++\nchanges the pose of the bird, and AttnGAN generates new background. In contrast, our approach is\nable to accurately manipulate parts of the image generation corresponding to the modi\ufb01ed text, while\npreserving the visual attributes related to unchanged text.\nIn the COCO dataset, our model again achieves much better results compared with others shown in\nFig. 5. For example, as shown in the last four columns, the compared approaches cannot preserve\nthe shape of objects and even fail to generate reasonable images. Generally speaking, the results\non COCO are not as good as on the CUB dataset. We attribute this to the few text-image pairs and\nmore abstract captions in the dataset. Although there are a lot of categories in COCO, each category\nonly has a few number of examples, and captions focus mainly on the category of objects rather than\ndetailed descriptions, which makes text-to-image generation more challenging.\n\n4.4 Component Analysis\n\nEffectiveness of channel-wise attention. Our model implements channel-wise attention in the\ngenerator, together the spatial attention, to generate realistic images. To better understand the\neffectiveness of attention mechanisms, we visualise the intermediate results and corresponding\nattention maps at different stages.\nWe experimentally \ufb01nd that the channel-wise attention correlates closely with semantic parts of\nobjects, while the spatial attention focuses mainly on colour descriptions. Fig. 6 shows several\n\n7\n\n\fThis bird is\nyellow with\nblack and has\na very short\n\nbeak.\n\nThis bird is\norange with\ngrey and has a\n\nvery short\n\nbeak.\n\nThe small bird\n\nhas a dark\nbrown head\n\nand light\n\nbrown body.\n\nThe small bird\nhas a dark tan\nhead and light\ngrey body.\n\nA large group\nof cows on a\n\nfarm.\n\nA large group\nof white cows\n\non a farm.\n\nA crowd of\n\npeople \ufb02y kites\n\non a hill.\n\nA crowd of\n\npeople \ufb02y kites\n\non a park.\n\nA group of\nzebras on a\ngrassy \ufb01eld\nwith trees in\nbackground.\n\nA group of red\nzebras on a\nriver with\ntrees in\n\nbackground.\n\nFigure 4: Qualitative results on the CUB and COCO datasets. Odd-numbered columns show the\noriginal text and even-numbered ones the modi\ufb01ed text. The last two are an out-of-distribution case.\n\nA giraffe is\n\nstanding on the\n\ndirt in an\nenclosure.\n\nA zebra stands\non a pathway\nnear grass.\n\nA sheep stands\non a pathway\nnear grass.\n\nA giraffe is\n\nstanding on the\n\ndirt.\n\nThis bird has a\nwhite neck and\nbreast with a\nturquoise\ncrown and\n\nfeathers a small\n\nshort beak.\n\nThis bird has a\ngrey neck and\nbreast with a\nblue crown and\nfeathers a small\n\nshort beak.\n\nThis bird has\nwings that are\nyellow and has a\nbrown body.\n\nThis bird has\nwings that are\nblack and has a\n\nred body.\n\nInput\n\nStackGAN++\n[28]\n\nAttnGAN [25]\n\nOurs\n\nFigure 5: Qualitative comparison of three methods on the CUB and COCO datasets. Odd-numbered\ncolumns show the original text and even-numbered ones the modi\ufb01ed text.\n\nFigure 6: Top: visualisation of feature channels at stage 3. The number at the top-right corner is the\nchannel number, and the word that has the highest correlation \u03b1i,j in Eq. 1 with the channel is shown\nunder the image. 
Bottom: spatial attention produced in stage 3.\n\nchannels of feature maps that correlate with different semantics, and our channel-wise attention\nmodule assigns large correlation values to channels that are semantically related to the word describing\nparts of a bird. This phenomenon is further veri\ufb01ed by the ablation study shown in Fig. 7 (left side).\nWithout channel-wise attention, our model fails to generate controllable results when we modify the\ntext related to parts of a bird. In contrast, our model with channel-wise attention can generate better\ncontrollable results.\nEffectiveness of word-level discriminator. To verify the effectiveness of the word-level discrimina-\ntor, we \ufb01rst conduct an ablation study: our model is trained without word-level discriminator, shown\nin Fig. 7 (right side), and then we construct a baseline model by replacing our discriminator with\n\n8\n\nheadthisGenerated Image10222908birdwingscrownbirdInput Text: A yellow bird, with green \ufb02ank, a white belly, a yellow crown, and a black bill.Generated Imagebillwith25300723Input Text: This white and black bird with a red head, brown wings and a white belly.\fThis yellow\nbird has grey\n\nand white\nwings and a\nred head.\n\nThis yellow\nbird has grey\n\nand white\nwings and a\nred belly.\n\nThe bird is\nsmall and\nround with\nwhite belly\nand blue\nwings.\n\nThe bird is\nsmall and\nround with\nwhite head\nand blue\nwings.\n\nInput\n\nOurs without\nchannel-wise\n\nattention\n\nOurs\n\nThis bird\u2019s\nwing bar is\nbrown and\n\nyellow, and its\n\nbelly is\nyellow.\n\nThis bird\u2019s\nwing bar is\nbrown and\n\nyellow, and its\nbelly is white.\n\nA small bird\nwith a brown\ncolouring and\nwhite belly.\n\nA small bird\nwith a brown\ncolouring and\nwhite head.\n\nInput\n\nOurs without\nword-level\ndiscriminator\n\nOurs\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\n\nFigure 7: Left: ablation study of channel-wise attention; right: ablation study of the word-level\ndiscriminator.\n\nThe bird is\nsmall with a\npointed bill,\nhas black\neyes, and a\nyellow crown.\n\nThe bird is\nsmall with a\npointed bill,\nhas black\neyes, and an\norange crown.\n\nA bird with a\nwhite belly\nand metallic\nblue wings\nwith a small\n\nbeak.\n\nA bird with a\nwhite head\nand metallic\nblue wings\nwith a small\n\nbeak.\n\nInput\n\nOurs without\nperceptual\n\nloss\n\nOurs\n\nA songbird is\nyellow with\nblue and\n\ngreen feathers\n\nand black\n\neyes.\n\nA tiny bird,\nwith green\n\ufb02ank, white\nbelly, yellow\ncrown, and a\npointy bill.\n\nA tiny bird,\nwith green\n\ufb02ank, grey\nbelly, blue\ncrown, and a\npointy bill.\n\nA songbird is\nwhite with\nblue feathers\n\nand black\n\neyes.\n\nInput\n\nOurs with\ntext-adaptive\ndiscriminator\n\nOurs\n\nFigure 8: Left: ablation study of the perceptual loss [7]; right: comparison between our word-level\ndiscriminator and text-adaptive discriminator [12].\n\na text-adaptive discriminator [12], which also explores the correlation between image features and\nwords. Visual comparisons are shown in Fig. 8 (right side). We can easily observe that the compared\nbaseline fails to manipulate the synthetic images. For example, as shown in the \ufb01rst two columns,\nthe bird generated from the modi\ufb01ed text has a totally different shape, and the background has been\nchanged as well. 
This is due to the fact that the text-adaptive discriminator [12] uses a global pooling\nlayer to extract image features, which may lose important spatial information.\nEffectiveness of perceptual loss. Furthermore, we conduct an ablation study: our model is trained\nwithout the perceptual loss, shown in Fig. 8 (left side). Without perceptual loss, images generated\nfrom modi\ufb01ed text are hard to preserve content that are related to unmodi\ufb01ed text, which indicates\nthat the perceptual loss can potentially introduce a stricter semantic constraint on the image generation\nand help reduce the involved randomness.\n\n5 Conclusion\n\nWe have proposed a controllable generative adversarial network (ControlGAN), which can generate\nand manipulate the generation of images based on natural language descriptions. Our ControlGAN\ncan successfully disentangle different visual attributes and allow parts of the synthetic image to be\nmanipulated accurately, while preserving the generation of other content. Three novel components\nare introduced in our model: 1) the word-level spatial and channel-wise attention-driven generator\ncan effectively disentangle different visual attributes, 2) the word-level discriminator provides the\ngenerator with \ufb01ne-grained training signals related to each visual attribute, and 3) the adoption\nof perceptual loss reduces the randomness involved in the generation, and enforces the generator\nto reconstruct content related to unmodi\ufb01ed text. Extensive experimental results demonstrate the\neffectiveness and superiority of our method on two benchmark datasets.\n\nAcknowledgements. This work was supported by the Alan Turing Institute under the UK EPSRC grant\nEP/N510129/1, the AXA Research Fund, the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant\nSeebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to acknowledge the\nRoyal Academy of Engineering and FiveAI.\n\n9\n\n\fReferences\n[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.\n\narXiv preprint arXiv:1409.0473, 2014.\n\n[2] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial\n\nnetworks. arXiv preprint arXiv:1609.07093, 2016.\n\n[3] M.-M. Cheng, S. Zheng, W.-Y. Lin, V. Vineet, P. Sturgess, N. Crook, N. J. Mitra, and P. Torr. Imagespirit:\n\nVerbal guided image parsing. ACM Transactions on Graphics (TOG), 34(1):3, 2014.\n\n[4] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian\npyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486\u20131494,\n2015.\n\n[5] H. Dong, S. Yu, C. Wu, and Y. Guo. Semantic image synthesis via adversarial learning. In Proceedings of\n\nthe IEEE International Conference on Computer Vision, pages 5706\u20135714, 2017.\n\n[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\nGenerative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672\u20132680,\n2014.\n\n[7] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In\n\nProceedings of European Conference on Computer Vision, pages 694\u2013711. Springer, 2016.\n\n[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,\n\n2014.\n\n[9] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. 
Semi-supervised learning with deep generative\n\nmodels. In Advances in Neural Information Processing Systems, pages 3581\u20133589, 2014.\n\n[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft\nCOCO: Common objects in context. In Proceedings of European Conference on Computer Vision, pages\n740\u2013755. Springer, 2014.\n\n[11] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from captions with attention.\n\narXiv preprint arXiv:1511.02793, 2015.\n\n[12] S. Nam, Y. Kim, and S. J. Kim. Text-adaptive generative adversarial networks: manipulating images with\n\nnatural language. In Advances in Neural Information Processing Systems, pages 42\u201351, 2018.\n\n[13] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & play generative networks:\nConditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on\nComputer Vision and Pattern Recognition, pages 4467\u20134477, 2017.\n\n[14] A. Oliva, A. Torralba, M. S. Castelhano, and J. M. Henderson. Top-down control of visual attention in\nobject detection. In Proceedings of International Conference on Image Processing (Cat. No. 03CH37429),\nvolume 1, pages 253\u2013256. IEEE, 2003.\n\n[15] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional\n\ngenerative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\n[16] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image\n\nsynthesis. arXiv preprint arXiv:1605.05396, 2016.\n\n[17] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In\n\nAdvances in Neural Information Processing Systems, pages 217\u2013225, 2016.\n\n[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,\nM. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International\nJournal of Computer Vision, 115(3):211\u2013252, 2015.\n\n[19] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for\n\ntraining gans. In Advances in Neural Information Processing Systems, pages 2234\u20132242, 2016.\n\n[20] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal\n\nProcessing, 45(11):2673\u20132681, 1997.\n\n[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\n\narXiv preprint arXiv:1409.1556, 2014.\n\n10\n\n\f[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.\nGoing deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 1\u20139, 2015.\n\n[23] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-Ucsd Birds-200-2011 dataset.\n\n2011.\n\n[24] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend\nand tell: Neural image caption generation with visual attention. In International Conference on Machine\nLearning, pages 2048\u20132057, 2015.\n\n[25] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image\ngeneration with attentional generative adversarial networks. In Proceedings of the IEEE Conference on\nComputer Vision and Pattern Recognition, pages 1316\u20131324, 2018.\n\n[26] Z. Yang, X. He, J. Gao, L. 
Deng, and A. Smola. Stacked attention networks for image question answering.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21\u201329, 2016.\n\n[27] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN: Text to photo-realistic\nimage synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International\nConference on Computer Vision, pages 5907\u20135915, 2017.\n\n[28] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN++: Realistic\nimage synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and\nMachine Intelligence, 41(8):1947\u20131962, 2018.\n\n[29] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang. Progressive attention guided recurrent network for salient\nobject detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 714\u2013722, 2018.\n\n[30] Z. Zhang, Y. Xie, F. Xing, M. McGough, and L. Yang. MDNet: A semantically and visually interpretable\nmedical image diagnosis network. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 6428\u20136436, 2017.\n\n[31] J.-Y. Zhu, P. Kr\u00e4henb\u00fchl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural\nimage manifold. In Proceedings of European Conference on Computer Vision, pages 597\u2013613. Springer,\n2016.\n", "award": [], "sourceid": 1221, "authors": [{"given_name": "Bowen", "family_name": "Li", "institution": "University of Oxford"}, {"given_name": "Xiaojuan", "family_name": "Qi", "institution": "University of Oxford"}, {"given_name": "Thomas", "family_name": "Lukasiewicz", "institution": "University of Oxford"}, {"given_name": "Philip", "family_name": "Torr", "institution": "University of Oxford"}]}