{"title": "Learning What and Where to Draw", "book": "Advances in Neural Information Processing Systems", "page_first": 217, "page_last": 225, "abstract": "Generative Adversarial Networks (GANs) have recently demonstrated the capability to synthesize compelling real-world images, such as room interiors, album covers, manga, faces, birds, and flowers. While existing models can synthesize images based on global constraints such as a class label or caption, they do not provide control over pose or object location. We propose a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location. We show high-quality 128 \u00d7 128 image synthesis on the Caltech-UCSD Birds dataset, conditioned on both informal text descriptions and also object location. Our system exposes control over both the bounding box around the bird and its constituent parts. By modeling the conditional distributions over part locations, our system also enables conditioning on arbitrary subsets of parts (e.g. only the beak and tail), yielding an efficient interface for picking part locations.", "full_text": "Learning What and Where to Draw\n\nScott Reed1,\u2217\n\nreedscot@google.com\n\nZeynep Akata2\n\nakata@mpi-inf.mpg.de\n\nSantosh Mohan1\n\nsantoshm@umich.edu\n\nSamuel Tenka1\n\nsamtenka@umich.edu\n\nBernt Schiele2\n\nschiele@mpi-inf.mpg.de\n\nHonglak Lee1\n\nhonglak@umich.edu\n\n1University of Michigan, Ann Arbor, USA\n\n2Max Planck Institute for Informatics, Saarbr\u00fccken, Germany\n\nAbstract\n\nGenerative Adversarial Networks (GANs) have recently demonstrated the capa-\nbility to synthesize compelling real-world images, such as room interiors, album\ncovers, manga, faces, birds, and \ufb02owers. While existing models can synthesize\nimages based on global constraints such as a class label or caption, they do not\nprovide control over pose or object location. 
We propose a new model, the Gen-\nerative Adversarial What-Where Network (GAWWN), that synthesizes images\ngiven instructions describing what content to draw in which location. We show\nhigh-quality 128 \u00d7 128 image synthesis on the Caltech-UCSD Birds dataset, con-\nditioned on both informal text descriptions and also object location. Our system\nexposes control over both the bounding box around the bird and its constituent\nparts. By modeling the conditional distributions over part locations, our system\nalso enables conditioning on arbitrary subsets of parts (e.g. only the beak and tail),\nyielding an ef\ufb01cient interface for picking part locations.\n\nIntroduction\n\n1\nGenerating realistic images from informal descriptions would have a wide range of applications.\nModern computer graphics can already generate remarkably realistic scenes, but it still requires the\nsubstantial effort of human designers and developers to bridge the gap between high-level concepts\nand the end product of pixel-level details. Fully automating this creative process is currently out of\nreach, but deep networks have shown a rapidly-improving ability for controllable image synthesis.\nIn order for the image-generating system to be useful, it should support high-level control over the\ncontents of the scene to be generated. For example, a user might provide the category of image to be\ngenerated, e.g. \u201cbird\u201d. In the more general case, the user could provide a textual description like \u201ca\nyellow bird with a black head\u201d.\nCompelling image synthesis with this level of control has already been demonstrated using convo-\nlutional Generative Adversarial Networks (GANs) [Goodfellow et al., 2014, Radford et al., 2016].\nVariational Autoencoders also show some promise for conditional image synthesis, in particular\nrecurrent versions such as DRAW [Gregor et al., 2015, Mansimov et al., 2016]. 
However, current\napproaches have so far only used simple conditioning variables such as a class label or a non-localized\ncaption [Reed et al., 2016b], and did not allow for controlling where objects appear in the scene.\nTo generate more realistic and complex scenes, image synthesis models can bene\ufb01t from incorporating\na notion of localizable objects. The same types of objects can appear in many locations in different\nscales, poses and con\ufb01gurations. This fact can be exploited by separating the questions of \u201cwhat\u201d\n\u2217Majority of this work was done while \ufb01rst author was at U. Michigan, but completed while at DeepMind.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Text-to-image examples. Locations can\nbe speci\ufb01ed by keypoint or bounding box.\n\nand \u201cwhere\u201d to modify the image at each step of computation. In addition to parameter ef\ufb01ciency,\nthis yields the bene\ufb01t of more interpretable image samples, in the sense that we can track what the\nnetwork was meant to depict at each location.\nFor many image datasets, we have not only\nglobal annotations such as a class label but\nalso localized annotations, such as bird part\nkeypoints in Caltech-UCSD birds (CUB) [Wah\net al., 2011] and human joint locations in the\nMPII Human Pose dataset (MHP) [Andriluka\net al., 2014]. For CUB, there are associated\ntext captions, and for MHP we collected a new\ndataset of 3 captions per image.\nOur proposed model learns to perform location-\nand content-controllable image synthesis on the\nabove datasets. We demonstrate two ways to\nencode spatial constraints (though there could\nbe many more). First, we show how to condi-\ntion on the coarse location of a bird by incor-\nporating spatial masking and cropping modules\ninto a text-conditional GAN, implemented using spatial transformers. 
Second, we can condition\non part locations of birds and humans in the form of a set of normalized (x,y) coordinates, e.g.\nbeak@(0.23,0.15). In the second case, the generator and discriminator use a multiplicative gating\nmechanism to attend to the relevant part locations.\nThe main contributions are as follows: (1) a novel architecture for text- and location-controllable\nimage synthesis, yielding more realistic and higher-resolution CUB samples, (2) a text-conditional\nobject part completion model enabling a streamlined user interface for specifying part locations, and\n(3) exploratory results and a new dataset for pose-conditional text to human image synthesis.\n2 Related Work\nIn addition to recognizing patterns within images, deep convolutional networks have shown remark-\nable capability to generate images. Dosovitskiy et al. [2015] trained a deconvolutional network to\ngenerate 3D chair renderings conditioned on a set of graphics codes indicating shape, position and\nlighting. Yang et al. [2015] followed with a recurrent convolutional encoder-decoder that learned\nto apply incremental 3D rotations to generate sequences of rotated chair and face images. Oh et al.\n[2015] used a similar approach in order to predict action-conditional future frames of Atari games.\nReed et al. [2015] trained a network to generate images that solved visual analogy problems.\nThe above models were all deterministic (i.e. conventional feed-forward and recurrent neural\nnetworks), trained to learn one-to-one mappings from the latent space to pixel space. Other recent\nworks take the approach of learning probabilistic models with variational autoencoders [Kingma and\nWelling, 2014, Rezende et al., 2014]. Kulkarni et al. [2015] developed a convolutional variational\nautoencoder in which the latent space was \u201cdisentangled\u201d into separate blocks of units corresponding\nto graphics codes. Gregor et al. 
[2015] created a recurrent variational autoencoder with attention\nmechanisms for reading and writing portions of the image canvas at each time step (DRAW).\nIn addition to VAE-based image generation models, simple and effective Generative Adversarial\nNetworks [Goodfellow et al., 2014] have been increasingly popular. In general, GAN image samples\nare notable for their relative sharpness compared to samples from the contemporary VAE models.\nLater, class-conditional GAN [Denton et al., 2015] incorporated a Laplacian pyramid of residual\nimages into the generator network to achieve a signi\ufb01cant qualitative improvement. Radford et al.\n[2016] proposed ways to stabilize deep convolutional GAN training and synthesize compelling\nimages of faces and room interiors.\nSpatial Transformer Networks (STN) [Jaderberg et al., 2015] have proven to be an effective visual\nattention mechanism, and have already been incorporated into the latest deep generative models.\nEslami et al. [2016] incorporate STNs into a form of recurrent VAE called Attend, Infer, Repeat (AIR),\nthat uses an image-dependent number of inference steps, learning to generate simple multi-object\n2D and 3D scenes. Rezende et al. [2016] build STNs into a DRAW-like recurrent network with\nimpressive sample complexity and visual generalization properties.\n\n\fLarochelle and Murray [2011] proposed the Neural Autoregressive Density Estimator (NADE) to\ntractably model distributions over image pixels as a product of conditionals. Recently proposed\nspatial grid-structured recurrent networks [Theis and Bethge, 2015, van den Oord et al., 2016] have\nshown encouraging image synthesis results. 
We use GANs in our approach, but the same principle of\nseparating \u201cwhat\u201d and \u201cwhere\u201d conditioning variables can be applied to these types of models.\n3 Preliminaries\n3.1 Generative Adversarial Networks\nGenerative adversarial networks (GANs) consist of a generator G and a discriminator D that compete\nin a two-player minimax game. The discriminator\u2019s objective is to correctly classify its inputs as\neither real or synthetic. The generator\u2019s objective is to synthesize images that the discriminator will\nclassify as real. D and G play the following game with value function V (D, G):\n\nmin_G max_D V (D, G) = E_{x\u223cpdata(x)}[log D(x)] + E_{z\u223cpz(z)}[log(1 \u2212 D(G(z)))]\n\nwhere z is a noise vector drawn from e.g. a Gaussian or uniform distribution. Goodfellow et al.\n[2014] showed that this minimax game has a global optimum precisely when pg = pdata, and that\nwhen G and D have enough capacity, pg converges to pdata.\nTo train a conditional GAN, one can simply provide both the generator and discriminator with the\nadditional input c as in [Denton et al., 2015, Radford et al., 2016] yielding G(z, c) and D(x, c). For\nan input tuple (x, c) to be interpreted as \u201creal\u201d, the image x must not only look realistic but also match\nits context c. In practice G is trained to maximize log D(G(z, c)).
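The value function above can be evaluated concretely with a small numerical sketch. This is an illustrative toy, not the paper's convolutional networks: the sigmoid discriminator, sign-flipping generator, and the unused conditioning placeholder c are assumptions made here purely for demonstration.

```python
import numpy as np

def gan_value(D, G, x_real, z, c):
    """V(D, G) = E[log D(x, c)] + E[log(1 - D(G(z, c), c))]."""
    real_term = np.mean(np.log(D(x_real, c)))
    fake_term = np.mean(np.log(1.0 - D(G(z, c), c)))
    return real_term + fake_term

# Toy stand-ins: D scores samples by the sign of their sum; G emits
# negative-sum samples, so this D separates real from fake well.
D = lambda x, c: np.clip(1.0 / (1.0 + np.exp(-x.sum(axis=1))), 1e-6, 1 - 1e-6)
G = lambda z, c: -np.abs(z)

x_real = np.ones((4, 2))                  # "real" samples have positive sum
z = np.random.RandomState(0).randn(4, 2)  # noise vector z
c = None                                  # conditioning variable (unused here)
v = gan_value(D, G, x_real, z, c)
```

Since both log terms are log-probabilities, V is always negative; training alternates D ascent and G descent on this quantity (with the non-saturating G objective noted in the text).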
Sentence embeddings are learned by\noptimizing the following structured loss:\n\n(1/N) \u2211_{n=1}^{N} [\u2206(yn, fv(vn)) + \u2206(yn, ft(tn))]\n\n(1)\n\nwhere {(vn, tn, yn), n = 1, ..., N} is the training data set, \u2206 is the 0-1 loss, vn are the images, tn\nare the corresponding text descriptions, and yn are the class labels. fv and ft are de\ufb01ned as\n\nfv(v) = arg max_{y\u2208Y} Et\u223cT (y)[\u03c6(v)T \u03d5(t)], ft(t) = arg max_{y\u2208Y} Ev\u223cV(y)[\u03c6(v)T \u03d5(t)]\n\n(2)\n\nwhere \u03c6 is the image encoder (e.g. a deep convolutional network), \u03d5 is the text encoder, T (y) is\nthe set of text descriptions of class y and likewise V(y) for images. Intuitively, the text encoder\nlearns to produce a higher compatibility score with images of the corresponding class compared to\nany other class, and vice-versa. To train the text encoder we minimize a surrogate loss related to\nEquation 1 (see Akata et al. [2015] for details). We modify the approach of Reed et al. [2016a] in a\nfew ways: using a char-CNN-GRU [Cho et al., 2014] instead of char-CNN-RNN, and estimating the\nexpectations in Equation 2 using the average of 4 sampled captions per image instead of 1.\n4 Generative Adversarial What-Where Networks (GAWWN)\nIn the following sections we describe the bounding-box- and keypoint-conditional GAWWN models.\n4.1 Bounding-box-conditional text-to-image model\nFigure 2 shows a sketch of the model, which can be understood by starting from input noise z \u2208 RZ\nand text embedding t \u2208 RT (extracted from the caption by a pre-trained2 encoder \u03d5(t)) and following\nthe arrows. Below we walk through each step.\nFirst, the text embedding (shown in green) is replicated spatially to form an M \u00d7 M \u00d7 T feature\nmap, and then warped spatially to \ufb01t into the normalized bounding box coordinates. 
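A minimal sketch of the classifiers in Equation 2, assuming the image embeddings \u03c6(v) and text embeddings \u03d5(t) are precomputed vectors; the expectations over T(y) and V(y) are estimated by sample averages, as the paper does with 4 sampled captions. The two toy classes below are invented for illustration.

```python
import numpy as np

def f_v(phi_v, text_embs_by_class):
    # Image classifier: mean compatibility phi(v)^T vphi(t) over each
    # class's caption embeddings, then argmax over classes y.
    scores = [np.mean(T @ phi_v) for T in text_embs_by_class]
    return int(np.argmax(scores))

def f_t(vphi_t, img_embs_by_class):
    # Text classifier: symmetric, averaging over each class's images.
    scores = [np.mean(V @ vphi_t) for V in img_embs_by_class]
    return int(np.argmax(scores))

# Two well-separated toy classes, 4 "captions"/"images" each.
rng = np.random.RandomState(0)
c0, c1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
text_by_class = [c0 + 0.01 * rng.randn(4, 2), c1 + 0.01 * rng.randn(4, 2)]
img_by_class = [c0 + 0.01 * rng.randn(4, 2), c1 + 0.01 * rng.randn(4, 2)]
```

The 0-1 loss in Equation 1 then simply counts how often these argmax predictions disagree with the true label yn.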
The feature map\n\n2Both \u03c6 and \u03d5 could be trained jointly with the GAN, but pre-training allows us to use the best available\n\nimage features from higher resolution images (224 \u00d7 224) and speeds up GAN training.\n\n\fentries outside the box are all zeros.3 The diagram shows a single object, but in the case of multiple\nlocalized captions, these feature maps are averaged. Then, convolution and pooling operations are\napplied to reduce the spatial dimension back to 1 \u00d7 1. Intuitively, this feature vector encodes the\ncoarse spatial structure in the image, and we concatenate this with the noise vector z.\n\nFigure 2: GAWWN with bounding box location control.\n\nIn the next stage, the generator branches into local and global processing stages. The global pathway\nis just a series of stride-2 deconvolutions to increase spatial dimension from 1 \u00d7 1 to M \u00d7 M. In\nthe local pathway, upon reaching spatial dimension M \u00d7 M, a masking operation is applied so that\nregions outside the object bounding box are set to 0. Finally, the local and global pathways are\nmerged by depth concatenation. A \ufb01nal series of deconvolution layers is used to reach the \ufb01nal\nspatial dimension. In the \ufb01nal layer we apply a Tanh nonlinearity to constrain the outputs to [\u22121, 1].\nIn the discriminator, the text is similarly replicated spatially to form an M \u00d7 M \u00d7 T tensor. Meanwhile\nthe image is processed in local and global pathways. In the local pathway, the image is fed through\nstride-2 convolutions down to the M \u00d7 M spatial dimension, at which point it is depth-concatenated\nwith the text embedding tensor. The resulting tensor is spatially cropped to within the bounding box\ncoordinates, and further processed convolutionally until the spatial dimension is 1 \u00d7 1. The global\npathway consists simply of convolutions down to a vector, with additive contribution of the original\ntext embedding t. 
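The first generator stage (replicate the text embedding t spatially, then suppress everything outside the normalized bounding box) can be sketched as below. Note the paper performs this with a spatial transformer warp; the hard binary mask here is a simplified stand-in, and M = 16, T = 8 are illustrative sizes rather than the paper's exact configuration.

```python
import numpy as np

def replicate_and_mask(t, bbox, M=16):
    """Tile t into an M x M x T map and zero entries outside bbox."""
    x0, y0, x1, y1 = bbox                    # normalized [0, 1] coordinates
    fmap = np.tile(t, (M, M, 1))             # spatial replication
    grid = np.arange(M) / M                  # cell positions in [0, 1)
    inside = ((grid[:, None] >= y0) & (grid[:, None] < y1) &
              (grid[None, :] >= x0) & (grid[None, :] < x1))
    return fmap * inside[:, :, None]         # outside-the-box entries -> 0

t = np.ones(8)                               # stand-in text embedding (T = 8)
fmap = replicate_and_mask(t, bbox=(0.25, 0.25, 0.75, 0.75))
```

With multiple localized captions, the per-caption feature maps would simply be averaged before the convolution-and-pooling reduction to 1 x 1.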
Finally, the local and global pathway output vectors are combined additively and\nfed into the \ufb01nal layer producing the scalar discriminator score.\n4.2 Keypoint-conditional text-to-image model\nFigure 3 shows the keypoint-conditional version of the GAWWN, described in detail below.\n\nFigure 3: Text and keypoint-conditional GAWWN. Keypoint grids are shown as 4 \u00d7 4 for clarity of\npresentation, but in our experiments we used 16 \u00d7 16.\nThe location keypoints are encoded into an M \u00d7 M \u00d7 K spatial feature map in which the channels\ncorrespond to the part; i.e. head in channel 1, left foot in channel 2, and so on. The keypoint\ntensor is fed into several stages of the network. First, it is fed through stride-2 convolutions to produce\na vector that is concatenated with noise z and text embedding t. The resulting vector provides\ncoarse information about content and part locations. Second, the keypoint tensor is \ufb02attened into a\nbinary matrix with a 1 indicating presence of any part at a particular spatial location, then replicated\ndepth-wise into a tensor of size M \u00d7 M \u00d7 H.\nIn the local and global pathways, the noise-text-keypoint vector is fed through deconvolutions to\nproduce another M \u00d7 M \u00d7 H tensor. The local pathway activations are gated by pointwise multipli-\ncation with the keypoint tensor of the same size. 
Finally, the original M \u00d7 M \u00d7 K keypoint tensor is\n\n3For details of how to apply this warping see equation 3 in [Jaderberg et al., 2015]\n\n\fdepth-concatenated with the local and global tensors, and processed with further deconvolutions to\nproduce the \ufb01nal image. Again a Tanh nonlinearity is applied.\nIn the discriminator, the text embedding t is fed into two stages. First, it is combined additively with\nthe global pathway that processes the image convolutionally producing a vector output. Second, it\nis spatially replicated to M \u00d7 M and then depth-concatenated with another M \u00d7 M feature map\nin the local pathway. This local tensor is then multiplicatively gated with the binary keypoint mask\nexactly as in the generator, and the resulting tensor is depth-concatenated with the M \u00d7 M \u00d7 T\nkeypoints. The local pathway is fed into further stride-2 convolutions to produce a vector, which\nis then additively combined with the global pathway output vector, and then into the \ufb01nal layer\nproducing the scalar discriminator score.\n4.3 Conditional keypoint generation model\nFrom a user-experience perspective, it is not optimal to require users to enter every single keypoint of\nthe parts of the object they wish to be drawn (e.g. for birds our model would require 15). 
Therefore,\nit would be very useful to have access to all of the conditional distributions of unobserved keypoints\ngiven a subset of observed keypoints and the text description. A similar problem occurs in data\nimputation, e.g. \ufb01lling in missing records or inpainting image occlusions. However, in our case we\nwant to draw convincing samples rather than just \ufb01ll in the most likely values.\nConditioned on e.g. only the position of a bird\u2019s beak, there could be several very different plausible\nposes that satisfy the constraint. Therefore, a simple approach such as training a sparse autoencoder\nover keypoints would not suf\ufb01ce. A DBM [Salakhutdinov and Hinton, 2009] or variational autoen-\ncoder [Rezende et al., 2014] could in theory work, but for simplicity we demonstrate the results\nachieved by applying the same generic GAN framework to this problem.\nThe basic idea is to use the assignment of each object part as observed (i.e. conditioning variable) or\nunobserved as a gating mechanism. Denote the keypoints for a single image as ki := {xi, yi, vi}, i =\n1, ..., K, where x and y indicate the row and column position, respectively, and v is a bit set to 1 if the\npart is visible and 0 otherwise. If the part is not visible, x and y are also set to 0. Let k \u2208 [0, 1]^(K\u00d73)\nencode the keypoints into a matrix. Let the conditioning variables (e.g. a beak position speci\ufb01ed\nby the user) be encoded into a vector of switch units s \u2208 {0, 1}^K, with the i-th entry set to 1 if the\ni-th part is a conditioning variable and 0 otherwise. We can formulate the generator network over\nkeypoints Gk, conditioned on text t and a subset of keypoints k, s, as follows:\n\nGk(z, t, k, s) := s \u2299 k + (1 \u2212 s) \u2299 f (z, t, k)\n\n(3)\nwhere \u2299 denotes pointwise multiplication and f : R^(Z+T+3K) \u2192 R^(3K) is an MLP. 
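Equation 3 can be sketched directly. Here f is a randomly initialized stand-in for the trained 3-layer MLP (so its outputs are meaningless), but the gating behavior is the point: observed keypoints (s_i = 1) pass through unchanged, while unobserved ones are filled in by f.

```python
import numpy as np

K = 15                                    # number of bird parts
rng = np.random.RandomState(0)

def f(z, t, k):
    # Placeholder for the paper's 3-layer MLP over [z; t; flatten(k)]:
    # a single random linear map, for demonstration only.
    h = np.concatenate([z, t, k.ravel()])
    W = rng.randn(K * 3, h.size) * 0.1
    return (W @ h).reshape(K, 3)

def G_k(z, t, k, s):
    # Eq. 3: s (.) k + (1 - s) (.) f(z, t, k), gating each (x, y, v) triple.
    gate = s[:, None]
    return gate * k + (1.0 - gate) * f(z, t, k)

k = rng.rand(K, 3)                        # ground-truth keypoints (x, y, v)
s = np.zeros(K)
s[0] = 1.0                                # condition on a single part, e.g. the beak
out = G_k(np.zeros(4), np.zeros(4), k, s)
```

During training the switch vector s is resampled per mini-batch (each part "on" with probability 0.1, per the text), so the one network learns all conditional distributions over keypoint subsets.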
In practice we\nconcatenated z, t and \ufb02attened k and chose f to be a 3-layer fully-connected network.\nThe discriminator Dk learns to distinguish real keypoints and text (kreal, treal) from synthetic.\nIn order for Gk to capture all of the conditional distributions over keypoints, during training we\nrandomly sample switch units s in each mini-batch. Since we would like to usually specify 1 or 2\nkeypoints, in our experiments we set the \u201con\u201d probability to 0.1. That is, each of the 15 bird parts\nonly had a 10% chance of acting as a conditioning variable for a given training image.\n\n5 Experiments\nIn this section we describe our experiments on generating images from text descriptions on the\nCaltech-UCSD Birds (CUB) and MPII Human Pose (MHP) datasets.\nCUB [Wah et al., 2011] has 11,788 images of birds belonging to one of 200 different species. We also\nuse the text dataset from Reed et al. [2016a] including 10 single-sentence descriptions per bird image.\nEach image also includes the bird location via its bounding box, and keypoint (x,y) coordinates for\neach of 15 bird parts. Since not all parts are visible in each image, the keypoint data also provides an\nadditional bit per part indicating whether the part can be seen.\nMHP Andriluka et al. [2014] has 25K images with 410 different common activities. For each image,\nwe collected 3 single-sentence text descriptions using Mechanical Turk. We asked the workers to\ndescribe the most distinctive aspects of the person and the activity they are engaged in, e.g. \u201ca man\nin a yellow shirt preparing to swing a golf club\u201d. Each image has potentially multiple sets of (x,y)\nkeypoints for each of the 16 joints. 
During training we \ufb01ltered out images with multiple people, and\nfor the remaining 19K images we cropped the image to the person\u2019s bounding box.\n\n5\n\n\fWe encoded the captions using a pre-trained char-CNN-GRU as described in [Reed et al., 2016a].\nDuring training, the 1024-dimensional text embedding for a given image was taken to be the average\nof four randomly-sampled caption encodings corresponding to that image. Sampling multiple captions\nper image provides further information required to draw the object. At test time one can average\ntogether any number of description embeddings, including a single caption.\nFor both CUB and MHP, we trained our GAWWN using the ADAM solver with batch size 16 and\nlearning rate 0.0002 (See Alg. 1 in [Reed et al., 2016b] for the conditional GAN training algorithm).\nThe models were trained on all categories and we show samples on a set of held out captions. For the\nspatial transformer module, we used a Torch implementation provided by Oquab [2016]. Our GAN\nimplementation is loosely based on dcgan.torch4.\nIn experiments we analyze how accurately the GAWWN samples re\ufb02ect the text and location\nconstraints. First we control the location of the bird by interpolation via bounding boxes and\nkeypoints. We consider both the case of (1) ground-truth keypoints from the data set, and (2) synthetic\nkeypoints generated by our model, conditioned on the text. Case (2) is advantageous because it\nrequires less effort from a hypothetical user (i.e. entering 15 keypoint locations). We then compare\nour CUB results to representative samples from the previous work. Finally, we show samples on text-\nand pose-conditional generation of images of human actions.\n5.1 Controlling bird location via bounding boxes\nWe \ufb01rst demonstrate sampling from the text-conditional model while varying the bird location. Since\nlocation is speci\ufb01ed via bounding box coordinates, we can also control the size and aspect ratio of\nthe bird. 
This is shown in Figure 4 by interpolating the bounding box coordinates while at the same\ntime \ufb01xing the text and noise conditioning variables.\n\nFigure 4: Controlling the bird\u2019s position using bounding box coordinates and previously-unseen text.\nWith the noise vector z \ufb01xed in every set of three frames, the background is usually similar but not\nperfectly invariant. Interestingly, as the bounding box coordinates are changed, the direction the bird\nfaces does not change. This suggests that the model learns to use the noise distribution to capture\nsome aspects of the background and also non-controllable aspects of \u201cwhere\u201d such as direction.\n5.2 Controlling individual part locations via keypoints\nIn this section we study the case of text-conditional image generation with keypoints \ufb01xed to the\nground-truth. This can give a sense of the performance upper bound for the text to image pipeline,\nbecause synthetic keypoints can be no more realistic than the ground-truth. We take a real image and\nits keypoint annotations from the CUB dataset, and a held-out text description, and draw samples\nconditioned on this information.\n\nFigure 5: Bird generation conditioned on \ufb01xed groundtruth keypoints (overlaid in blue) and previously\nunseen text. 
Each sample uses a different random noise vector.\n\n4https://github.com/soumith/dcgan.torch\n\n\fFigure 5 shows several image samples that accurately re\ufb02ect the text and keypoint constraints. More\nexamples including success and failure are included in the supplement. We observe that the bird pose\nrespects the keypoints and is invariant across the samples. The background and other small details,\nsuch as thickness of the tree branch or the background color palette do change with the noise.\n\nFigure 6: Controlling the bird\u2019s position using keypoint coordinates. Here we only interpolated the\nbeak and tail positions, and sampled the rest conditioned on these two.\n\nThe GAWWN model can also use keypoints to shrink, translate and stretch objects, as shown\nin Figure 6. 
We chose to specify beak and tail positions, because in most cases these de\ufb01ne an\napproximate bounding box around the bird.\nUnlike in the case of bounding boxes, we can now control which way the bird is pointing; note that\nhere all birds face left, whereas when we use bounding boxes (Figure 4) the orientation is random.\nElements of the scene, even outside of the controllable location, adjust in order to be coherent with\nthe bird\u2019s position in each frame although in each set of three frames we use the same noise vector z.\n5.3 Generating both bird keypoints and images from text alone\nAlthough ground truth keypoint locations lead to visually plausible results as shown in the previous\nsections, the keypoints are costly to obtain. In Figure 7, we provide examples of accurate samples\nusing generated keypoints. Compared to ground-truth keypoints, on average we did not observe\ndegradation in quality. More examples for each regime are provided in the supplement.\n\nFigure 7: Keypoint- and text-conditional bird generation in which the keypoints are generated\nconditioned on unseen text. The small blue boxes indicate the generated keypoint locations.\n5.4 Comparison to previous work\nIn this section we compare our results with previous text-to-image results on CUB. In Figure 8 we\nshow several representative examples that we cropped from the supplementary material of [Reed et al.,\n2016b]. We compare against the actual ground-truth and several variants of GAWWN. We observe\nthat the 64 \u00d7 64 samples from [Reed et al., 2016b] mostly re\ufb02ect the text description, but in some\ncases lack clearly de\ufb01ned parts such as a beak. When the keypoints are zeroed during training, our\nGAWWN architecture actually fails to generate any plausible images. This suggests that providing\nadditional conditioning variables in the form of location constraints is helpful for learning to generate\nhigh-resolution images. 
Overall, the sharpest and most accurate results can be seen in the 128 \u00d7 128\nsamples from our GAWWN with real or synthetic keypoints (bottom two rows).\n\n5.5 Beyond birds: generating images of humans\nHere we apply our model to generating images of humans conditioned on a description of their\nappearance and activity, and also on their approximate pose. This is a much more challenging task\nthan generating images of birds due to the larger variety of scenes and pose con\ufb01gurations.\n\n\fFigure 8: Comparison of GAWWN to GAN-INT-CLS from Reed et al. [2016b] and also the ground-\ntruth images. For the ground-truth row, the \ufb01rst entry corresponds directly to the caption, and the\nsecond two entries are sampled from the same species.\n\nFigure 9: Generating humans. Both the keypoints and the image are generated from unseen text.\n\nThe human image samples shown in Figure 9 tend to be much blurrier compared to the bird images,\nbut in many cases bear a clear resemblance to the text query and the pose constraints. 
Simple captions\ninvolving skiing, golf and yoga tend to work, but complex descriptions and unusual poses (e.g.\nupside-down person on a trampoline) remain especially challenging. We also generate videos by\n(1) extracting pose keypoints from a pre-trained pose estimator from several YouTube clips, and\n(2) combining these keypoint trajectories with a text query, \ufb01xing the noise vector z over time and\nconcatenating the samples (see supplement).\n\n6 Discussion\nIn this work we showed how to generate images conditioned on both informal text descriptions and\nobject locations. Locations can be accurately controlled by either bounding box or a set of part\nkeypoints. On CUB, the addition of a location constraint allowed us to accurately generate compelling\n128 \u00d7 128 images, whereas previous models could only generate 64 \u00d7 64. Furthermore, this location\nconditioning does not constrain us during test time, because we can also learn a text-conditional\ngenerative model of part locations, and simply generate them at test time.\nAn important lesson here is that decomposing the problem into easier subproblems can help generate\nrealistic high-resolution images. In addition to making the overall text to image pipeline easier to\ntrain with a GAN, it also yields additional ways to control image synthesis. In future work, it may\nbe promising to learn the object or part locations in an unsupervised or weakly supervised way. In\naddition, we show the \ufb01rst text-to-human image synthesis results, but performance on this task is\nclearly far from saturated and further architectural advances will be required to solve it.\n\nAcknowledgements This work was supported in part by NSF CAREER IIS-1453651, ONR\nN00014-13-1-0762, and a Sloan Research Fellowship.\n\n\fReferences\nZ. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of Output Embeddings for Fine-Grained Image\n\nClassi\ufb01cation. In CVPR, 2015.\n\nM. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of\n\nthe art analysis. In CVPR, June 2014.\n\nK. Cho, B. van Merri\u00ebnboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation:\n\nEncoder\u2013decoder approaches. Syntax, Semantics and Structure in Statistical Translation, 2014.\n\nE. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of\n\nadversarial networks. In NIPS, 2015.\n\nA. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural\n\nnetworks. In CVPR, 2015.\n\nS. Eslami, N. Heess, T. Weber, Y. Tassa, K. Kavukcuoglu, and G. E. Hinton. Attend, infer, repeat: Fast scene\n\nunderstanding with generative models. In NIPS, 2016.\n\nI. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\n\nGenerative adversarial nets. In NIPS, 2014.\n\nK. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. 
Draw: A recurrent neural network for image\n\ngeneration. In ICML, 2015.\n\nM. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.\n\nD. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.\n\nR. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural\n\nlanguage models. In ACL, 2014.\n\nT. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In\n\nNIPS, 2015.\n\nH. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.\n\nE. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from captions with attention. In\n\nICLR, 2016.\n\nJ. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in\n\natari games. In NIPS, 2015.\n\nQ. Oquab. Modules for spatial transformer networks. github.com/qassemoquab/stnbhwd, 2016.\n\nA. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative\n\nadversarial networks. In ICLR, 2016.\n\nS. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In NIPS, 2015.\n\nS. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations for \ufb01ne-grained visual descriptions. In\n\nCVPR, 2016a.\n\nS. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis.\n\nIn ICML, 2016b.\n\nD. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep\n\ngenerative models. In ICML, 2014.\n\nD. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot generalization in deep generative\n\nmodels. In ICML, 2016.\n\nR. Salakhutdinov and G. E. Hinton. Deep boltzmann machines. In AISTATS, 2009.\n\nL. Theis and M. Bethge. Generative image modeling using spatial lstms. In NIPS, 2015.\n\nA. van den Oord, N. Kalchbrenner, and K. 
Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.\n\nC. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.\n\nJ. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for\n\n3d view synthesis. In NIPS, 2015.\n\n9\n\n\f", "award": [], "sourceid": 151, "authors": [{"given_name": "Scott", "family_name": "Reed", "institution": "University of Michigan"}, {"given_name": "Zeynep", "family_name": "Akata", "institution": "Max Planck Institute for Informatics"}, {"given_name": "Santosh", "family_name": "Mohan", "institution": "University of MIchigan"}, {"given_name": "Samuel", "family_name": "Tenka", "institution": "University of MIchigan"}, {"given_name": "Bernt", "family_name": "Schiele", "institution": "Max Planck Institute for Informatics"}, {"given_name": "Honglak", "family_name": "Lee", "institution": "University of Michigan"}]}
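The decomposed test-time pipeline described in the Discussion — sample part keypoints from text, then an image from text plus keypoints, and for video fix the noise vector z while varying only the keypoints (Sec. 5.5) — can be sketched as follows. This is a minimal data-flow illustration with toy linear "generators" and hypothetical dimensions (only the 15 CUB part keypoints come from the paper), not the authors' implementation:

```python
import numpy as np

# Toy sketch of GAWWN's two-stage sampling: (1) a text-conditional keypoint
# generator, (2) an image generator conditioned on text + keypoints + noise.
# The linear maps stand in for the real networks; all sizes except N_PARTS
# are illustrative assumptions.

N_PARTS = 15              # CUB annotates 15 part keypoints per bird
IMG = 16                  # toy resolution; the paper synthesizes 128 x 128
TXT_DIM, Z_DIM = 64, 16   # hypothetical text-embedding / noise sizes

rng = np.random.default_rng(0)
W_kp = rng.normal(size=(TXT_DIM + Z_DIM, N_PARTS * 3)) * 0.1
W_im = rng.normal(size=(TXT_DIM + N_PARTS * 3 + Z_DIM, IMG * IMG * 3)) * 0.1

def sample_keypoints(text_emb, z):
    """Stage 1: sample (x, y, visibility) for each part, given text."""
    out = np.tanh(np.concatenate([text_emb, z]) @ W_kp)
    return out.reshape(N_PARTS, 3)

def sample_image(text_emb, keypoints, z):
    """Stage 2: generate an image given text, keypoints, and noise."""
    cond = np.concatenate([text_emb, keypoints.ravel(), z])
    return np.tanh(cond @ W_im).reshape(IMG, IMG, 3)

def generate_video(text_emb, keypoint_track, z):
    """Video synthesis as in Sec. 5.5: fix z, vary only the keypoints."""
    return np.stack([sample_image(text_emb, kp, z) for kp in keypoint_track])

text = rng.normal(size=TXT_DIM)
kps = [sample_keypoints(text, rng.normal(size=Z_DIM)) for _ in range(3)]
clip = generate_video(text, kps, z=rng.normal(size=Z_DIM))
print(clip.shape)  # (3, 16, 16, 3)
```

Holding z fixed across frames keeps the appearance (identity, colors, background) constant so that only the pose, driven by the keypoint trajectory, changes over time.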