{"title": "Fader Networks:Manipulating Images by Sliding Attributes", "book": "Advances in Neural Information Processing Systems", "page_first": 5967, "page_last": 5976, "abstract": "This paper introduces a new encoder-decoder architecture that is trained to reconstruct images by disentangling the salient information of the image and the values of attributes directly in the latent space. As a result, after training, our model can generate different realistic versions of an input image by varying the attribute values. By using continuous attribute values, we can choose how much a specific attribute is perceivable in the generated image. This property could allow for applications where users can modify an image using sliding knobs, like faders on a mixing console, to change the facial expression of a portrait, or to update the color of some objects. Compared to the state-of-the-art which mostly relies on training adversarial networks in pixel space by altering attribute values at train time, our approach results in much simpler training schemes and nicely scales to multiple attributes. We present evidence that our model can significantly change the perceived value of the attributes while preserving the naturalness of images.", "full_text": "Fader Networks:\n\nManipulating Images by Sliding Attributes\n\nGuillaume Lample1,2, Neil Zeghidour1,3, Nicolas Usunier1,\nAntoine Bordes1, Ludovic Denoyer2, Marc\u2019Aurelio Ranzato1\n\n{gl,neilz,usunier,abordes,ranzato}@fb.com\n\nludovic.denoyer@lip6.fr\n\nAbstract\n\nThis paper introduces a new encoder-decoder architecture that is trained to re-\nconstruct images by disentangling the salient information of the image and the\nvalues of attributes directly in the latent space. As a result, after training, our\nmodel can generate different realistic versions of an input image by varying the\nattribute values. 
By using continuous attribute values, we can choose how much a\nspeci\ufb01c attribute is perceivable in the generated image. This property could allow\nfor applications where users can modify an image using sliding knobs, like faders\non a mixing console, to change the facial expression of a portrait, or to update\nthe color of some objects. Compared to the state-of-the-art which mostly relies\non training adversarial networks in pixel space by altering attribute values at train\ntime, our approach results in much simpler training schemes and nicely scales to\nmultiple attributes. We present evidence that our model can signi\ufb01cantly change\nthe perceived value of the attributes while preserving the naturalness of images.\n\n1\n\nIntroduction\n\nWe are interested in the problem of manipulating natural images by controlling some attributes\nof interest. For example, given a photograph of the face of a person described by their gender,\nage, and expression, we want to generate a realistic version of this same person looking older\nor happier, or an image of a hypothetical twin of the opposite gender. This task and the related\nproblem of unsupervised domain transfer recently received a lot of interest [18, 25, 10, 27, 22, 24],\nas a case study for conditional generative models but also for applications like automatic image\nedition. The key challenge is that the transformations are ill-de\ufb01ned and training is unsupervised: the\ntraining set contains images annotated with the attributes of interest, but there is no example of the\ntransformation: In many cases such as the \u201cgender swapping\u201d example above, there are no pairs of\nimages representing the same person as a male or as a female. 
In other cases, collecting examples\nrequires a costly annotation process, like taking pictures of the same person with and without glasses.\nOur approach relies on an encoder-decoder architecture where, given an input image x with its\nattributes y, the encoder maps x to a latent representation z, and the decoder is trained to reconstruct\nx given (z, y). At inference time, a test image is encoded in the latent space, and the user chooses\nthe attribute values y that are fed to the decoder. Even with binary attribute values at train time,\neach attribute can be considered as a continuous variable during inference to control how much it is\nperceived in the \ufb01nal image. We call our architecture Fader Networks, in analogy to the sliders of an\naudio mixing console, since the user can choose how much of each attribute they want to incorporate.\n\n1Facebook AI Research\n2Sorbonne Universit\u00e9s, UPMC Univ Paris 06, UMR 7606, LIP6\n3LSCP, ENS, EHESS, CNRS, PSL Research University, INRIA\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Interpolation between different attributes (Zoom in for better resolution). Each line shows\nreconstructions of the same face with different attribute values, where each attribute is controlled as a\ncontinuous variable. It is then possible to make an old person look older or younger, a man look more\nmanly or to imagine his female version. Left images are the originals.\n\nThe fundamental feature of our approach is to constrain the latent space to be invariant to the attributes\nof interest. Concretely, it means that the distribution over images of the latent representations should\nbe identical for all possible attribute values. This invariance is obtained by using a procedure similar\nto domain-adversarial training (see e.g., [21, 6, 15]). 
In this process, a classi\ufb01er learns to predict the\nattributes y given the latent representation z during training while the encoder-decoder is trained\nbased on two objectives at the same time. The \ufb01rst objective is the reconstruction error of the decoder,\ni.e., the latent representation z must contain enough information to allow for the reconstruction of the\ninput. The second objective consists in fooling the attribute classi\ufb01er, i.e., the latent representation\nmust prevent it from predicting the correct attribute values. In this model, achieving invariance is a\nmeans to \ufb01lter out, or hide, the properties of the image that are related to the attributes of interest.\nA single latent representation thus corresponds to different images that share a common structure\nbut with different attribute values. The reconstruction objective then forces the decoder to use the\nattribute values to choose, from the latent representation, the intended image.\nOur motivation is to learn a disentangled latent space in which we have explicit control on some\nattributes of interest, without supervision of the intended result of modifying attribute values. With\na similar motivation, several approaches have been tested on the same tasks [18, 25], on related\nimage-to-image translation problems [10, 27], or for more speci\ufb01c applications like the creation of\nparametrized avatars [24]. In addition to a reconstruction loss, the vast majority of these works rely\non adversarial training in pixel space, which compares during training images generated with an\nintentional change of attributes from genuine images for the target attribute values. Our approach is\ndifferent both because we use adversarial training for the latent space instead of the output, but also\nbecause adversarial training aims at learning invariance to attributes. 
The assumption underlying our\nwork is that a high \ufb01delity to the input image is less con\ufb02icting with the invariance criterion, than\nwith a criterion that forces the hallucinated image to match images from the training set.\nAs a consequence of this principle, our approach results in much simpler training pipelines than those\nbased on adversarial training in pixel space, and is readily amenable to controlling multiple attributes,\nby adding new output variables to the discriminator of the latent space. As shown in Figure 1 on test\nimages from the CelebA dataset [14], our model can make subtle changes to portraits that end up\nsuf\ufb01cient to alter the perceived value of attributes while preserving the natural aspect of the image\nand the identity of the person. Our experiments show that our model outperforms previous methods\nbased on adversarial training on the decoders\u2019 output like [18] in terms of both reconstruction loss\nand generation quality as measured by human subjects. We believe this disentanglement approach is\na serious competitor to the widespread adversarial losses on the decoder output for such tasks.\nIn the remainder of the paper, we discuss in more details the related work in Section 2. We then\npresent the training procedure in Section 3 before describing the network architecture and the\nimplementation in Section 4. Experimental results are shown in Section 5.\n\n2\n\n\f2 Related work\nThere is substantial literature on attribute-based and/or conditional image generation that can be split\nin terms of required supervision, with three different levels. At one extreme are fully supervised\napproaches developed to model known transformations, where examples take the form of (input,\ntransformation, result of the transformation). In that case, the model needs to learn the desired\ntransformation. 
This setting was previously explored to learn af\ufb01ne transformations [9], 3D rotations\n[26], lighting variations [12] and 2D video game animations [20]. The methods developed in these\nworks however rely on the supervised setting, and thus cannot be applied in our setup.\nAt the other extreme of the supervision spectrum lie fully unsupervised methods that aim at learning\ndeep neural networks that disentangle the factors of variations in the data, without speci\ufb01cation of\nthe attributes. Example methods are InfoGAN [4], or the predictability minimization framework\nproposed in [21]. The neural photo editor [3] disentangles factors of variations in natural images for\nimage edition. [8] introduced the beta-VAE, a modi\ufb01cation of the variational autoencoder (VAE)\nframework that can learn latent factorized representations in a completely unsupervised manner. This\nsetting is considerably harder than the one we consider, and in general, it may be dif\ufb01cult with these\nmethods to automatically discover high-level concepts such as gender or age.\nOur work lies in between the two previous settings. It is related to information as in [16]. Methods\ndeveloped for unsupervised domain transfer [10, 27, 22, 24] can also be applied in our case: given\ntwo different domains of images such as \u201cdrawings\u201d and \u201cphotograph\u201d, one wants to map an image\nfrom one domain to the other without supervision; in our case, a domain would correspond to an\nattribute value. The mappings are trained using adversarial training in pixel space as mentioned in\nthe introduction, using separate encoders and/or decoders per domain, and thus do not scale well to\nmultiple attributes. 
In this line of work but more speci\ufb01cally considering the problem of modifying\nattributes, the Invertible conditional GAN [18] \ufb01rst trains a GAN conditioned on the attribute values,\nand in a second step learns to map input images to the latent space of the GAN, hence the name of\ninvertible GANs. It is used as a baseline in our experiments. Antipov et al. [1] use a pre-trained face\nrecognition system instead of a conditional GAN to learn the latent space, and only focuses on the\nage attribute. The attribute-to-image approach [25] is a variational auto-encoder that disentangles\nforeground and background to generate images using attribute values only. Conditional generation is\nperformed by inferring the latent state given the correct attributes and then changing the attributes.\nAdditionally, our work is related to work on learning invariant latent spaces using adversarial training\nin domain adaptation [6], fair classi\ufb01cation [5] and robust inference [15]. The training criterion\nwe use for enforcing invariance is similar to the one used in those works, the difference is that the\nend-goal of these works is only to \ufb01lter out nuisance variables or sensitive information. In our case,\nwe learn generative models, and invariance is used as a means to force the decoder to use attribute\ninformation in its reconstruction.\nFinally, for the application of automatically modifying faces using attributes, the feature interpolation\napproach of [23] presents a means to generate alterations of images based on attributes using a\npre-trained network on ImageNet. 
While their approach is interesting from an application perspective, its inference is costly and, since it relies on pre-trained models, it cannot naturally incorporate factors or attributes that were not foreseen during pre-training.\n\n3 Fader Networks\n\nLet X be an image domain and Y the set of possible attributes associated with images in X, where in the case of people’s faces typical attributes are glasses/no glasses, man/woman, young/old. For simplicity, we consider here the case where attributes are binary, but our approach could be extended to categorical attributes. In that setting, Y = {0, 1}ⁿ, where n is the number of attributes. We have a training set D = {(x1, y1), ..., (xm, ym)} of m (image, attribute) pairs (xi ∈ X, yi ∈ Y). The end goal is to learn from D a model that will generate, for any attribute vector y′, a version of an input image x whose attribute values correspond to y′.\n\nEncoder-decoder architecture Our model, described in Figure 2, is based on an encoder-decoder architecture with domain-adversarial training on the latent space. The encoder E_θenc : X → R^N is a convolutional neural network with parameters θ_enc that maps an input image to its N-dimensional latent representation E_θenc(x). The decoder D_θdec : (R^N, Y) → X is a deconvolutional network with parameters θ_dec that produces a new version of the input image given its latent representation E_θenc(x) and any attribute vector y′. When the context is clear, we simply use D and E to denote D_θdec and E_θenc. The precise architectures of the neural networks are described in Section 4. 
The auto-encoding loss associated with this architecture is a classical mean squared error (MSE) that measures the quality of the reconstruction of a training input x given its true attribute vector y:\n\nL_AE(θ_enc, θ_dec) = (1/m) ∑_(x,y)∈D ‖D_θdec(E_θenc(x), y) − x‖²₂\n\nThe exact choice of the reconstruction loss is not fundamental in our approach, and adversarial losses such as PatchGAN [13] could be used in addition to the MSE at this stage to obtain better textures or sharper images, as in [10]. Using a mean absolute or mean squared error is still necessary to ensure that the reconstruction matches the original image.\nIdeally, modifying y in D(E(x), y) would generate images with different perceived attributes, but similar to x in every other aspect. However, without additional constraints, the decoder learns to ignore the attributes, and modifying y at test time has no effect.\n\nLearning attribute-invariant latent representations To avoid this behavior, our approach is to learn latent representations that are invariant with respect to the attributes. By invariance, we mean that given two versions x and x′ of the same object that are identical up to their attribute values, for instance two images of the same person with and without glasses, the two latent representations E(x) and E(x′) should be the same. When such an invariance is satisfied, the decoder must use the attribute to reconstruct the original image. 
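To make the reconstruction term L_AE concrete, here is a minimal NumPy sketch; the function name and array shapes are our own illustrative choices, not the paper’s implementation.

```python
import numpy as np

def autoencoder_loss(decoded, originals):
    """L_AE: mean over the batch of the squared L2 reconstruction error.

    `decoded` stands in for D(E(x), y) and `originals` for x; both are
    float arrays of shape (batch, channels, height, width).
    """
    per_image = np.sum((decoded - originals) ** 2, axis=(1, 2, 3))
    return float(np.mean(per_image))
```

For identical inputs the loss is exactly 0, and it grows with the per-pixel squared error summed over each image and averaged over the batch.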
Since the training set does not contain different versions of the same image, this constraint cannot be trivially added to the loss.\nWe hence propose to incorporate this constraint through adversarial training on the latent space. This idea is inspired by the work on predictability minimization [21] and adversarial training for domain adaptation [6, 15], where the objective is also to learn an invariant latent representation using an adversarial formulation of the learning objective. To that end, an additional neural network called the discriminator is trained to identify the true attributes y of a training pair (x, y) given E(x). The invariance is obtained by learning the encoder E such that the discriminator is unable to identify the right attributes. As in GANs [7], this corresponds to a two-player game where the discriminator aims at maximizing its ability to identify attributes, and E aims at preventing it from being a good discriminator. The exact structure of our discriminator is described in Section 4.\n\nDiscriminator objective The discriminator outputs the probabilities of an attribute vector, P_θdis(y|E(x)), where θ_dis are the discriminator’s parameters. Using the subscript k to refer to the k-th attribute, we have log P_θdis(y|E(x)) = ∑_k=1..n log P_θdis,k(yk|E(x)). Since the objective of the discriminator is to predict the attributes of the input image given its latent representation, its loss depends on the current state of the encoder and is written as:\n\nL_dis(θ_dis|θ_enc) = −(1/m) ∑_(x,y)∈D log P_θdis(y|E_θenc(x))    (1)\n\nAdversarial objective The objective of the encoder is now to compute a latent representation that optimizes two objectives. First, the decoder should be able to reconstruct x given E(x) and y; at the same time, the discriminator should not be able to predict y given E(x). 
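A minimal sketch of the discriminator loss of Eq. (1), assuming the discriminator outputs one Bernoulli probability per attribute (names and shapes are illustrative, not the authors’ code):

```python
import numpy as np

def discriminator_loss(probs, y):
    """L_dis (Eq. 1): negative mean log-likelihood of the true attributes y.

    probs[i, k] is the discriminator's estimate of P(y_k = 1 | E(x_i));
    y is a 0/1 array of the same (m, n) shape. Per-attribute terms are
    summed, matching log P(y|E(x)) = sum_k log P_k(y_k|E(x)).
    """
    log_p = y * np.log(probs) + (1 - y) * np.log(1 - probs)
    return float(-np.mean(np.sum(log_p, axis=1)))
```

With n attributes and a maximally uncertain discriminator (all probabilities 0.5), the loss is n·log 2, regardless of the labels.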
We consider that a mistake is made when the discriminator predicts 1 − yk for attribute k. Given the discriminator’s parameters, the complete loss of the encoder-decoder architecture is then:\n\nL(θ_enc, θ_dec|θ_dis) = (1/m) ∑_(x,y)∈D [ ‖D_θdec(E_θenc(x), y) − x‖²₂ − λ_E log P_θdis(1 − y|E_θenc(x)) ],    (2)\n\nwhere λ_E > 0 controls the trade-off between the quality of the reconstruction and the invariance of the latent representations. Large values of λ_E will restrain the amount of information about x contained in E(x) and result in blurry images, while low values limit the decoder’s dependency on the attribute code y and will result in poor effects when altering attributes.\n\nFigure 2: Main architecture. An (image, attribute) pair (x, y) is given as input. The encoder maps x to the latent representation z; the discriminator is trained to predict y given z, whereas the encoder is trained to make it impossible for the discriminator to predict y given z only. The decoder should reconstruct x given (z, y). At test time, the discriminator is discarded and the model can generate different versions of x when fed with different attribute values.\n\nLearning algorithm Overall, given the current state of the encoder, the optimal discriminator parameters satisfy θ*_dis(θ_enc) ∈ argmin_θdis L_dis(θ_dis|θ_enc). If we ignore problems related to multiple (and local) minima, the overall objective function is\n\n(θ*_enc, θ*_dec) = argmin_θenc,θdec L(θ_enc, θ_dec|θ*_dis(θ_enc)).\n\nIn practice, it is unreasonable to solve for θ*_dis(θ_enc) at each update of θ_enc. 
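Eq. (2) combines the reconstruction and adversarial terms; a NumPy sketch with hypothetical tensor names (not the paper’s code):

```python
import numpy as np

def fader_loss(decoded, originals, probs, y, lambda_e):
    """Eq. (2): reconstruction MSE plus the adversarial term rewarding the
    discriminator for assigning high probability to the *flipped*
    attributes 1 - y.

    probs[i, k] stands in for P(y_k = 1 | E(x_i)); all arguments are
    hypothetical placeholders for the network outputs.
    """
    mse = np.mean(np.sum((decoded - originals) ** 2, axis=(1, 2, 3)))
    flipped = 1 - y
    log_p_flip = flipped * np.log(probs) + (1 - flipped) * np.log(1 - probs)
    return float(mse - lambda_e * np.mean(np.sum(log_p_flip, axis=1)))
```

Minimizing this loss pushes the encoder both to reconstruct well and to make the discriminator believe the attributes are 1 − y.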
Following the practice of adversarial training for deep networks, we use stochastic gradient updates for all parameters, considering the current value of θ_dis as an approximation for θ*_dis(θ_enc). Given a training example (x, y), let us denote L_dis(θ_dis|θ_enc, x, y) the discriminator loss restricted to (x, y), and L(θ_enc, θ_dec|θ_dis, x, y) the auto-encoder loss restricted to (x, y). The update at time t, given the current parameters θ(t)_dis, θ(t)_enc, θ(t)_dec and the training example (x(t), y(t)), is:\n\nθ(t+1)_dis = θ(t)_dis − η ∇_θdis L_dis(θ(t)_dis|θ(t)_enc, x(t), y(t))\n[θ(t+1)_enc, θ(t+1)_dec] = [θ(t)_enc, θ(t)_dec] − η ∇_θenc,θdec L(θ(t)_enc, θ(t)_dec|θ(t+1)_dis, x(t), y(t)).\n\nThe details of training and models are given in the next section.\n\n4 Implementation\n\nWe adapt the architecture of our network from [10]. Let Ck be a Convolution-BatchNorm-ReLU layer with k filters. Convolutions use kernels of size 4 × 4, with a stride of 2 and a padding of 1, so that each layer of the encoder divides the size of its input by 2. We use leaky-ReLUs with a slope of 0.2 in the encoder, and simple ReLUs in the decoder.\nThe encoder consists of the following 7 layers:\n\nC16 − C32 − C64 − C128 − C256 − C512 − C512\n\nInput images have a size of 256 × 256. As a result, the latent representation of an image consists of 512 feature maps of size 2 × 2. 
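The stated geometry can be checked with the standard convolution output-size formula; a quick sketch (the helper name is ours):

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after one Ck layer (4x4 kernel, stride 2, padding 1)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 256
for _ in range(7):  # C16-C32-C64-C128-C256-C512-C512
    size = conv_out(size)
# each layer halves the resolution: 256 -> 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2
```

After the 7 stride-2 layers, `size` is 2, matching the 512 × 2 × 2 latent representation.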
In our experiments, using 6 layers gave us similar results, while 8 layers significantly decreased the performance, even when using more feature maps in the latent state.\nTo provide the decoder with image attributes, we append the latent code to each layer given as input to the decoder, where the latent code of an image is the concatenation of the one-hot vectors representing the values of its attributes (binary attributes are represented as [1, 0] and [0, 1]). We append the latent code as additional constant input channels for all the convolutions of the decoder. Denoting by n the number of attributes (hence a code of size 2n), the decoder is symmetric to the encoder, but uses transposed convolutions for the up-sampling:\n\nC512+2n − C512+2n − C256+2n − C128+2n − C64+2n − C32+2n − C16+2n\n\nThe discriminator is a C512 layer followed by a fully-connected neural network with two layers of size 512 and n respectively.\n\nModel        | Naturalness            | Accuracy\n             | Mouth  Smile  Glasses  | Mouth  Smile  Glasses\nReal Image   | 92.6   89.0   88.6     | 87.0   88.3   97.6\nIcGAN AE     | 22.7   21.7   14.8     | 88.1   91.7   86.2\nIcGAN Swap   | 11.4   22.9    9.6     | 10.1    9.9   47.5\nFadNet AE    | 88.4   75.2   78.8     | 91.8   90.1   94.5\nFadNet Swap  | 79.0   31.4   45.3     | 66.2   97.1   76.6\n\nTable 1: Perceptual evaluation of naturalness and swap accuracy for each model. The naturalness score is the percentage of images that were labeled as “real” by human evaluators to the question “Is this image a real photograph or a fake generated by a graphics engine?”. The accuracy score is the classification accuracy by human evaluators on the values of each attribute.\n\nDropout We found it beneficial to add dropout in our discriminator. 
We hypothesized that dropout helped the discriminator rely on a wider set of features to infer the current attributes, improving and stabilizing its accuracy, and consequently giving better feedback to the encoder. We set the dropout rate to 0.3 in all our experiments. Following [10], we also tried to add dropout in the first layers of the decoder, but in our experiments this turned out to significantly decrease the performance.\n\nDiscriminator cost scheduling Similarly to [2], we use a variable weight for the discriminator loss coefficient λ_E. We initially set λ_E to 0 and the model is trained like a normal auto-encoder. Then, λ_E is linearly increased to 0.0001 over the first 500,000 iterations to slowly encourage the model to produce invariant representations. This scheduling turned out to be critical in our experiments. Without it, we observed that the encoder was too affected by the loss coming from the discriminator, even for low values of λ_E.\n\nModel selection Model selection was first performed automatically using two criteria. First, we used the reconstruction error on original images as measured by the MSE. Second, we also want the model to properly swap the attributes of an image. For this second criterion, we train a classifier to predict image attributes. At the end of each epoch, we swap the attributes of each image in the validation set and measure how well the classifier performs on the decoded images. These two metrics were used to shortlist potentially good models. The final model was selected based on human evaluation of images from the train set reconstructed with swapped attributes.\n\n5 Experiments\n\n5.1 Experiments on the CelebA dataset\n\nExperimental setup We first present experiments on the CelebA dataset [14], which contains 200,000 images of celebrities of shape 178 × 218, annotated with 40 attributes. 
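Returning to the discriminator cost scheduling described in the implementation details, the linear warm-up of λ_E can be sketched as follows (the function name and signature are ours):

```python
def lambda_e_at(iteration, target=0.0001, warmup=500_000):
    """Linear warm-up of the discriminator weight lambda_E: 0 at iteration
    0, reaching `target` at `warmup` iterations and constant afterwards."""
    return target * (min(iteration, warmup) / warmup)
```

The encoder-decoder loss would use `lambda_e_at(t)` as the λ_E coefficient at training step t.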
We used the standard training, validation and test split. All pictures presented in the paper or used for evaluation were taken from the test set. For pre-processing, we cropped images to 178 × 178 and resized them to 256 × 256, which is the resolution used in all figures of the paper. Image values were normalized to [−1, 1]. All models were trained with Adam [11], using a learning rate of 0.002, β1 = 0.5, and a batch size of 32. We performed data augmentation by flipping images horizontally with probability 0.5 at each iteration. As a baseline, we used IcGAN [18] with the model provided by the authors, trained on the same dataset.⁴\n\n⁴https://github.com/Guim3/IcGAN\n\nFigure 3: Swapping the attributes of different faces. Zoom in for better resolution.\n\nQualitative evaluation Figure 3 shows examples of images generated when swapping different attributes: the generated images have a high visual quality and clearly handle the attribute value changes, for example by adding realistic glasses to the different faces. These generated images confirm that the latent representation learned by Fader Networks is invariant to the attribute values, yet captures the information needed to generate any version of a face, for any attribute value. Indeed, different glasses shapes and colors have been integrated into each original face: our model is not only adding “generic” glasses to all faces, but generates plausible glasses depending on the input.\n\nQuantitative evaluation protocol We performed a quantitative evaluation of Fader Networks on Mechanical Turk, using IcGAN as a baseline. We chose the three attributes Mouth (Open/Close), Smile (With/Without) and Glasses (With/Without), as they were attributes in common between IcGAN and our model. 
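The pre-processing and augmentation steps of the experimental setup can be sketched as follows (illustrative helper names; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(image):
    """Map 8-bit pixel values in [0, 255] to [-1, 1]."""
    return image.astype(np.float32) / 127.5 - 1.0

def augment(image):
    """Flip a (channels, height, width) image horizontally with probability 0.5."""
    if rng.random() < 0.5:
        return image[:, :, ::-1]
    return image
```

Cropping to 178 × 178 and resizing to 256 × 256 would happen before these steps, using any standard image library.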
We evaluated two different aspects of the generated images: naturalness, which measures the quality of generated images, and accuracy, which measures how well swapping an attribute value is reflected in the generation. Both measures are necessary to assess that we generate natural images, and that the swap is effective. We compare: REAL IMAGE, which provides original images without transformation; FADNET AE and ICGAN AE, which reconstruct original images without attribute alteration; and FADNET SWAP and ICGAN SWAP, which generate images with one swapped attribute, e.g., With Glasses → Without Glasses. Before being submitted to Mechanical Turk, all images were cropped and resized following the same processing as IcGAN. As a result, output images were displayed in 64 × 64 resolution, also preventing Workers from basing their judgment exclusively on the sharpness of the presented images.\nTechnically, we should also assess that the identity of a person is preserved when swapping attributes. This seemed to be a problem for GAN-based methods, but the reconstruction quality of our model is very good (RMSE on test of 0.0009, to be compared to 0.028 for IcGAN), and we did not observe this issue. Therefore, we did not evaluate this aspect.\nFor naturalness, we showed Mechanical Turk Workers the first 500 images from the test set, selected so that there are 250 images for each attribute value, 100 for each of the 5 different models presented above. For each image, we asked whether the image seems natural or generated. The description given to the Workers to explain their task showed 4 examples of real images and 4 examples of fake images (1 FADNET AE, 1 FADNET SWAP, 1 ICGAN AE, 1 ICGAN SWAP).\nThe accuracy of each model on each attribute was evaluated in a different classification task, resulting in a total of 15 experiments. 
For example, the FadNet/Glasses experiment consisted of asking Workers whether people with glasses added by FADNET SWAP effectively possess glasses, and vice-versa. This allows us to evaluate how perceptible the swaps are to the human eye. In each experiment, 100 images were shown (50 images per class, in the order they appear in the test set). In both quantitative evaluations, each experiment was performed by 10 Workers, resulting in 5,000 samples per experiment for naturalness, and 1,000 samples per classification experiment on swapped attributes. The results on both tasks are shown in Table 1.\n\nFigure 4: (Zoom in for better resolution.) Examples of multi-attribute swaps (Gender / Opened eyes / Eye glasses) performed by the same model. Left images are the originals.\n\nQuantitative results In the naturalness experiments, only around 90% of real images were classified as “real” by the Workers, indicating how demanding the task of generating natural-looking images is. Our model obtained high naturalness scores when reconstructing images without swapping attributes: 88.4%, 75.2% and 78.8%, compared to IcGAN reconstructions whose naturalness does not exceed 23%, whether for reconstructed or swapped images. For the swap, FADNET SWAP still consistently outperforms ICGAN SWAP by a large margin. However, the naturalness score varies a lot depending on the swapped attribute: from 79.0% for the opening of the mouth, down to 31.4% for the smile.\nClassification experiments show that reconstructions with FADNET AE and ICGAN AE have very high classification scores, and are even on par with real images on both Mouth and Smile. FADNET SWAP obtains an accuracy of 66.2% for the mouth, 76.6% for the glasses and 97.1% for the smile, indicating that our model can swap these attributes with a very high efficiency. 
On the other hand, with accuracies of 10.1%, 47.5% and 9.9% on these same attributes, ICGAN SWAP does not seem able to generate convincing swaps.\n\nMulti-attribute swapping We present qualitative results for the ability of our model to swap multiple attributes at once in Figure 4, by jointly modifying the gender, open eyes and glasses attributes. Even in this more difficult setting, our model can generate convincing images with multiple swaps.\n\n5.2 Experiments on the Flowers dataset\n\nWe performed additional experiments on the Oxford-102 dataset, which contains about 9,000 images of flowers classified into 102 categories [17]. Since the dataset does not contain labels other than the flower categories, we built a list of color attributes from the flower captions provided by [19]. Each flower comes with 10 different captions. For a given color, we gave a flower the associated color attribute if that color appears in at least 5 of the 10 captions. Although naive, this approach was enough to create accurate labels. We resized images to 64 × 64. Figure 5 shows reconstructed flowers with different values of the “pink” attribute. We can observe that the color of the flower changes in the desired direction, while the background remains cleanly unchanged.\n\nFigure 5: Examples of reconstructed flowers with different values of the pink attribute. First-row images are the originals. Increasing the value of that attribute turns flower colors pink, while decreasing it in images with originally pink flowers makes them turn yellow or orange.\n\n6 Conclusion\n\nWe presented a new approach to generate variations of images by changing attribute values. The approach is based on enforcing the invariance of the latent space w.r.t. the attributes. 
A key advantage of our method compared to many recent models [27, 10] is that it generates realistic high-resolution images without needing to apply a GAN to the decoder output. As a result, it could easily be extended to other domains such as speech or text, where backpropagating through the decoder can be very challenging, for instance because of the non-differentiable text generation process. Methods commonly used in vision to assess the visual quality of generated images, such as PatchGAN, could nevertheless be applied on top of our model.

Acknowledgments

The authors would like to thank Yedid Hoshen for initial discussions about the core ideas of the paper, and Christian Pursch and Alexander Miller for their help in setting up the experiments and Mechanical Turk evaluations. The authors are also grateful to David Lopez-Paz and Mouhamadou Moustapha Cisse for useful feedback and support on this project.

References

[1] Grigory Antipov, Moez Baccouche, and Jean-Luc Dugelay. Face aging with conditional generative adversarial networks. arXiv preprint arXiv:1702.01983, 2017.

[2] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[3] Andrew Brock, Theodore Lim, JM Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.

[4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[5] Harrison Edwards and Amos Storkey. Censoring representations with an adversary. 
arXiv preprint arXiv:1511.05897, 2015.

[6] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[8] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. Proceedings of ICLR 2017, 2017.

[9] Geoffrey Hinton, Alex Krizhevsky, and Sida Wang. Transforming auto-encoders. Artificial Neural Networks and Machine Learning – ICANN 2011, pages 44–51, 2011.

[10] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[12] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.

[13] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.

[14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

[15] Gilles Louppe, Michael Kagan, and Kyle Cranmer. 
Learning to pivot with adversarial networks. arXiv preprint arXiv:1611.01046, 2016.

[16] Michael F Mathieu, Junbo Jake Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5041–5049, 2016.

[17] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP'08. Sixth Indian Conference on, pages 722–729. IEEE, 2008.

[18] Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, and Jose M Álvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.

[19] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016.

[20] Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems, pages 1252–1260, 2015.

[21] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.

[22] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.

[23] Paul Upchurch, Jacob Gardner, Kavita Bala, Robert Pless, Noah Snavely, and Kilian Weinberger. Deep feature interpolation for image content changes. arXiv preprint arXiv:1611.05507, 2016.

[24] Lior Wolf, Yaniv Taigman, and Adam Polyak. Unsupervised creation of parameterized avatars. arXiv preprint arXiv:1704.05693, 2017.

[25] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. 
In European Conference on Computer Vision, pages 776–791. Springer, 2016.

[26] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107, 2015.

[27] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.