{"title": "Non-Adversarial Mapping with VAEs", "book": "Advances in Neural Information Processing Systems", "page_first": 7528, "page_last": 7537, "abstract": "The study of cross-domain mapping without supervision has recently attracted much attention. Much of the recent progress was enabled by the use of adversarial training as well as cycle constraints. The practical difficulty of adversarial training motivates research into non-adversarial methods. In a recent paper, it was shown that cross-domain mapping is possible without the use of cycles or GANs. Although promising, this approach suffers from several drawbacks including costly inference and an optimization variable for every training example preventing the method from using large training sets. We present an alternative approach which is able to achieve non-adversarial mapping using a novel form of Variational Auto-Encoder. Our method is much faster at inference time, is able to leverage large datasets and has a simple interpretation.", "full_text": "Non-Adversarial Mapping with VAEs\n\nYedid Hoshen\n\nFacebook AI Research\n\nAbstract\n\nThe study of cross-domain mapping without supervision has recently attracted\nmuch attention. Much of the recent progress was enabled by the use of adversarial\ntraining as well as cycle constraints. The practical dif\ufb01culty of adversarial training\nmotivates research into non-adversarial methods. In a recent paper, it was shown\nthat cross-domain mapping is possible without the use of cycles or GANs. Although\npromising, this approach suffers from several drawbacks including costly inference\nand an optimization variable for every training example preventing the method\nfrom using large training sets. 
We present an alternative approach which is able to achieve non-adversarial mapping using a novel form of Variational Auto-Encoder. Our method is much faster at inference time, is able to leverage large datasets and has a simple interpretation.

1 Introduction

Making analogies is a key component of human intelligence; evidence for this is the prominent role of analogy questions in standardized intelligence tests. In order to achieve true artificial intelligence, it is likely that computers would need to be able to learn to make analogies. Although making analogies with full supervision is closely related to standard supervised learning, which has been at the core of machine learning research, the unsupervised setting, which is far more challenging, has received less attention. The task has been attempted over the last few decades, but positive results have only recently been achieved for making analogies across very distant domains.

Let us consider two domains X and Y. Each domain contains an unordered set of samples. In the unsupervised setting, we do not have correspondences between the samples of the two domains. There are two different types of unsupervised analogies that can be learned between different domains: matching and mapping. The matching task seeks to find, for each sample x in domain X, the sample in domain Y which is most analogous to it. The degree of analogy is either defined by humans or by a more objective computerized measure. The mapping task attempts to map each sample x in the X domain to a new sample such that it appears to come from the Y domain while still being analogous to x. A special case is exact matching [1], in which the exact analogy for x appears in the Y domain training set; in this case the optimal match is also the optimal mapped analogy.

In this work we tackle unsupervised mapping across domains.
This task has only recently been\nsuccessfully attempted between suf\ufb01ciently different domains. Nearly all successful methods but one,\nhave used the combination of two constraints: i) Adversarial domain confusion: the method trains a\nmapping TXY (x) for mapping between domain X to domain Y. A discriminator DY () is trained to\ndistinguish between mapped X domain sample TXY (x) and original Y domain samples. Mapping\nfunction T () and discriminator D() are trained in an adversarial fashion iteratively. The objective\nis that at the end of training, mapped X domain samples will be indistinguishable from Y domain\nsamples. ii) The adversarial constraint, on its own, is not able to force mapped images to correspond\nto a semantically related Y domain sample. An additional circularity constraint was found necessary\nwhere mappings are learned from both X \u2192 Y and Y \u2192 X . The constraint is that a sample mapped\nto the other domain and back to the original domain will be unchanged by this mapping. This was\nfound effective at preserving semantic information across mappings.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fDespite the effectiveness of the two constraints, they present several issues. Adversarial optimization\nis a saddle point problem and its optimization is hard. This makes architecture selection challenging.\nScaling to high resolutions is dif\ufb01cult for adversarial training. This has only very recently been\nachieved, using cascades of generators, using a slow and complex training regime. Combining such\nhigh resolution generative architectures with cross-domain mapping presents a formidable challenge.\nThe circularity constraint presents its own challenges: the assumption that the mapping functions\nTXY and TY X are inverse functions, is equivalent to requiring the transformation between the two\ndomains to be one-to-one. 
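The one-to-one implication can be made concrete with a toy numpy sketch (illustrative linear maps standing in for the two mapping networks; not code from any of the papers discussed):

```python
import numpy as np

# Toy stand-ins for the two mapping networks: linear maps on 2-D points.
rng = np.random.default_rng(0)
A_xy = np.array([[2.0, 0.0], [0.0, 0.5]])   # plays the role of T_XY
A_yx = np.linalg.inv(A_xy)                   # plays the role of T_YX
x = rng.normal(size=(16, 2))                 # samples from "domain X"

def cycle_loss(x, t_xy, t_yx):
    """Cycle constraint: mapping X -> Y -> X should reproduce x."""
    x_cycled = (x @ t_xy.T) @ t_yx.T
    return float(np.mean((x_cycled - x) ** 2))

# The loss vanishes exactly when t_yx inverts t_xy -- i.e. the constraint
# is only satisfiable when the cross-domain transformation is one-to-one.
zero_loss = cycle_loss(x, A_xy, A_yx)
nonzero_loss = cycle_loss(x, A_xy, np.eye(2))
```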
This is not true even for the simplest transformations such as colorization or super-resolution.

Hoshen and Wolf recently introduced NAM [2, 3], a method for unsupervised mapping which does not rely on adversarial training during mapping or on circularity. The main idea behind NAM is to parametrize domain X by a pre-trained parametric generator G(z), taking parameters z which they call the latent code. The generator is trained using some off-the-shelf method such as an unconditional GAN (e.g. DCGAN [4], StackGAN [5], PGGAN [6]) or a VAE [7]. NAM also receives a set of training samples from the Y domain. NAM attempts to learn two sets of variables: i) the weights of the mapping function TXY(), and ii) a latent code zy for every training sample y. The latent code is optimized such that y = T(G(zy)). Both TXY() and the set of latent codes {zy | y ∈ Y} are trained jointly end-to-end.

NAM suffers from several significant issues: i) NAM learns a latent variable zy per training sample y. This means that every image only contributes a single code update during an epoch, which causes an imbalance between the training speeds of the mapping and the codes. ii) NAM does not have a forward encoder mapping a sample y to its X domain analogy. Instead, the analogy is found by minimizing ‖T(G(zy)) − y‖ over zy. This can require many iterations for a single sample, making it quite slow. iii) Although one of the greatest advantages of NAM is the ability to obtain multiple solutions, every solution requires solving the optimization problem in (ii), and there is no guarantee on the variability between the solutions obtained; NAM simply relies on the non-convexity of neural networks.

In this paper we present a new method, VAE-NAM, which overcomes the aforementioned issues with NAM. VAE-NAM replaces the per-example learned latent codes by a forward encoder E(), regressing example y directly to its code zy.
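As a minimal sketch of the NAM training scheme just described (toy linear stand-ins for G and T, hand-written gradients; all names here are hypothetical, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(16, 3))                       # Y-domain training samples
G = lambda z: z                                    # stand-in for the pre-trained generator
W = np.eye(3) + 0.05 * rng.normal(size=(3, 3))     # parameters of the mapping T
Z = 0.1 * rng.normal(size=(16, 3))                 # one learned latent code per image

def loss(W, Z):
    return float(np.mean((G(Z) @ W.T - Y) ** 2))

loss_before = loss(W, Z)
lr = 0.01
for _ in range(3000):
    err = G(Z) @ W.T - Y
    W -= lr * err.T @ G(Z)                         # gradient step on the shared mapping
    Z -= lr * err @ W                              # gradient step on every per-image code
loss_after = loss(W, Z)
```

Note that the number of learned codes in `Z` grows with the training set, which is exactly the scaling issue raised above.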
Learning a simple encoder encounters the same issues as with the circularity constraint. We choose instead to use a stochastic encoder that learns a latent distribution, similarly to a VAE. Along with the abilities (shared with NAM) to model many-to-one or one-to-many mappings and to enjoy non-adversarial training of the mapping stage, our final formulation has the following advantages: a feedforward mapping for analogies with no reliance on multiple iterations during evaluation, a principled parametric model for the set of possible multiple solutions, and a clear analogy to VAEs, allowing direct transfer of VAE research to VAE-NAM.

2 Related Work

Unsupervised domain alignment: Mapping across similar domains without supervision has been successfully achieved by classical methods such as Congealing [8]. Unsupervised translation across very different domains has only very recently begun to generate strong results, due to the advent of generative adversarial networks (GANs), and all state-of-the-art unsupervised translation methods we are aware of employ GAN technology. The most popular additional constraint is cycle-consistency: enforcing that samples mapped to the other domain and back are unchanged. This approach is taken by DiscoGAN [9], CycleGAN [10] and DualGAN [11]. Recently, StarGAN [12] extended the approach to more than two domains. Circularity has recently also been applied by [13] for solving the task of unsupervised word translation. Another type of constraint was proposed by CoGAN [14] and later extended by UNIT [15]: employing a shared latent space. UNIT learns a Variational Auto-Encoder (VAE) [16] for each of the domains and adversarially ensures that both share the same latent space. Although CoGAN showed that the shared latent constraint is weaker than the circularity constraint, UNIT showed that the combination yields the best result.
Similarly to the code released with UNIT [15], our work uses, for some of the tasks, a VGG “perceptual loss” that employs an Imagenet pre-trained network. This is a generic feature extraction method that is not domain specific (and is therefore still unsupervised). Our work is built upon NAM, but uses a learned probabilistic encoder rather than learning the codes directly, which has many advantages described in detail in later sections.

Unconditional Generative Modeling: Many methods were proposed for generative modeling of image distributions. Currently the most popular approaches rely on GANs and VAEs [7]. GAN-based methods are plagued by instability during training. Many methods were proposed to address this issue for unconditional generation, e.g., [17, 18, 19]. These modifications are typically not employed in cross-domain mapping works. Our method trains a generative model (typically a GAN) in the X domain completely independently from the Y domain, and can directly benefit from the latest advancements in the unconditional image generation literature. GLO [20] is an alternative to GAN, which iteratively fits per-image latent vectors (starting from random “noise”) and learns a mapping G() between the noise vectors and the training images. GLO is trained using a reconstruction loss, minimizing the difference between the training images and those generated from the noise vectors. Differently from our approach, it tackles unconditional generation rather than domain mapping.

3 VAE-NAM

In this section we present our proposed method: VAE-NAM. Formally, the task of unsupervised domain mapping can be written as follows: let X and Y be two image domains. The task is to find a transformation T() such that for every x ∈ X, the mapped T(x) is the analogy of x in the Y domain.
In this paper we specialize to image domains; however, we speculate that our method can be transferred to non-image domains.

3.1 Non-Adversarial Mapping (NAM)

NAM [3] is a recent method for unsupervised mapping across image domains. It takes as input an unconditional model of the X domain, as well as a set of Y domain training images {y}. If a set of X domain training images is given as input instead, NAM assumes it is used to train a parametric unconditional generative model G(z) by some other method. NAM does not specify requirements on this method as long as the resulting model satisfies: i) compactness: for every allowed set of parameters z, the generated image G(z) lies within the X domain; ii) completeness: for every possible image x ∈ X, there exists a set of parameters zx such that x = G(zx). Unconditional generative modeling is a very active research field, and although no technique known to us is able to satisfy the requirements exactly (for interesting datasets), some good approximations exist, e.g. state-of-the-art GANs or VAEs.

Given a pre-trained X domain generative model G(z) and a set of Y domain training images {y}, NAM estimates two sets of variables: i) the parameters of the transformation T() from domain X to domain Y; ii) a latent code zy for every Y domain training image, so that the generated X domain image from this latent code, G(zy), maps to the Y domain image y. The entire optimization problem is:

argmin_{T,{zy}} Σ_{y∈Y} ‖T(G(zy)) − y‖    (1)

NAM optimizes the parameters of T() as well as all latent codes {zy} jointly. Note that a latent code zy is estimated for every training image y.

NAM has significant advantages over other methods: it does not use adversarial training for learning a mapping, the mapping can be one-to-many, and multiple solutions can be obtained for a single input image. NAM is able to use pre-trained generative models, i.e.
a generative model of the target domain\nneeds to be estimated only once, and can be mapped to many other domains without retraining.\nUnfortunately, the NAM framework also has several issues both at training and evaluation time.\nHaving a latent code for every training image means that the number of variables to be estimated\nscales with the number of images. Having per-image latent code variables means the images do\nnot directly help each other estimate their latent code, but only very indirectly through the mapping\nfunction. Additionally, a gradient step is calculated for every latent code only once per epoch whereas\nthe mapping function has a gradient step once per batch. This causes an imbalance in the learning\nrates as well as sometimes poor estimation for each latent variable. There are also issues at inference:\nEvaluation by optimization implies that multiple (hundreds) of feedforward and backprop iterations\nare required. Furthermore multiple solutions are obtained by solving Eq. 1 multiple times, each with\na different random initialization, without a principled model for the set of acceptable solutions.\n\n3\n\n\f3.2 AE-NAM\n\nWe \ufb01rst present a simple method, AE-NAM, that overcomes the issues identi\ufb01ed in the previous\nsection.\nSimilarly to NAM, AE-NAM takes as input a pre-trained generator of the X domain, G(z), and a\nset of Y domain training images {y}. Differently from NAM, VAE-NAM does not directly estimate\na latent code for each training image independently. Instead it learns a feedforward encoder that\nestimates a code zy given y. We will denote the encoder function E().\nThe simplest solution is to have a deterministic encoder zy = E(y), encoding the input Y domain\nimage y into a latent code of domain X . We denote this method AE-NAM. 
AE-NAM estimates both E() and T() directly:

argmin_{T,E} Σ_{y∈Y} ‖T(G(E(y))) − y‖    (2)

The intuition for this method is that we simply autoencode the Y domain, however with the decoder decomposed into two parts: i) an X domain prior G() and ii) an X → Y mapping T(). The autoencoder forces X and Y to share the same latent space. UNIT [15] attempts to ensure a shared latent space using adversarial constraints on the X and Y domains. Our method avoids this requirement by using the pre-trained generator prior, which fixes the latent space.

This method has the benefit of simplicity, and fast evaluation due to its feedforward nature. AE-NAM, however, suffers from a significant drawback: it assumes a one-to-one mapping. In fact it has the same issues as the circularity constraint used by other methods. Every Y domain image y maps into exactly a single X domain image x̃ = G(E(y)). Such a mapping will fail in the cases for which the real transformation is one-to-many, e.g. image colorization: a single grayscale image can map into a whole space of RGB images. A key requirement from a mapping function is the ability to learn multiple possible solutions for a single input.

3.3 VAE-NAM

To address the variability issue suffered by AE-NAM we present our final approach: VAE-NAM. VAE-NAM encodes every Y domain image y into a distribution of latent codes P(zy|y) rather than just a single deterministic transformation. To learn the probabilistic encoding we use the VAE formulation first developed by Kingma and Welling [7].

In a probabilistic autoencoder, the observed (y) and latent (z) variables are modeled jointly as p(y, z). To infer from this model we would need to compute the posterior p(z|y); computing it exactly via Bayes' law requires marginalizing over z, which is hard in the general case. Instead, VAEs use the variational approximation, estimating the posterior distribution p(z|y) by a parametric function qλ(z|y).
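The one-to-many failure of a deterministic encoder, noted above, can be demonstrated with a toy calculation: under an L2 loss, a deterministic prediction facing two equally valid targets converges to their average (a hypothetical scalar example, not from the paper):

```python
import numpy as np

# Two equally valid "colorizations" of the same input, reduced to scalars.
targets = np.array([0.0, 1.0])
p = 0.7                                       # deterministic prediction, arbitrary init
for _ in range(500):
    p -= 0.1 * np.sum(2.0 * (p - targets))    # gradient of sum_i (p - t_i)^2
# p converges to the mean of the targets (0.5), matching neither valid answer --
# the "average of all possible solutions" behaviour described in the text.
```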
In our case, instead of directly evaluating p(z|y), we approximate it using a neural network. We choose a probability distribution which is simple and parametric, specifically a Gaussian, parametrized by µy and σ2y for every sample. We thus train a neural network encoder E() regressing y to µy and log(σ2y):

(µy, log σ2y) = E(y)    (3)

VAEs optimize the KL divergence between the true and approximate posteriors, KL(qλ(z|y)|p(z|y)). It can be shown that for an autoencoder this optimization criterion is equivalent to optimizing:

LVAE = −Eqλ(z|y)[log p(y|z)] + KL(qλ(z|y)|p(z))    (4)

where we follow traditional VAEs and choose a normal prior for p(z).

VAEs model p(y|z) as a neural network decoder D(), mapping latent code z to reconstruct the original image y = D(z). Differently from VAEs, however, we impose a strong domain prior on the decoder. The decoder D() is replaced by a combination of the X domain generative model G() and the X → Y mapping function T(). The decoder is therefore given by D(z) = T(G(z)). VAE-NAM is optimized end-to-end using the re-parametrization trick [7], where we simply use samples from the model and treat them as deterministic variables. The full optimization loss is therefore given as:

LVAE-NAM = ‖T(G(µy + ε ∗ σy)) − y‖ + KL(q(z|y)|p(z))    (5)

A new sample of normally distributed ε is drawn for every example. Under the normal prior, the KL divergence (summed over the latent dimensions) is given by:

KL(q(z|y)|p(z)) = −1/2 (1 + log(σ2y) − µ2y − σ2y)    (6)

Instead of using the Euclidean pixel loss used in the original VAE formulation, we use the perceptual loss (the theory is invariant to this modification).
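As a hypothetical numpy rendering of this training loss and of the sampling it enables, the sketch below implements the reparametrization trick and the standard diagonal-Gaussian KL; a plain squared error stands in for the perceptual loss, and all function names are illustrative:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # Standard KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims.
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=-1)

def vae_nam_loss(y, mu, logvar, decoder, rng):
    # Reparametrization trick: z = mu + eps * sigma, eps ~ N(0, I), so the
    # sample is a deterministic, differentiable function of (mu, logvar).
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    recon = np.sum((decoder(z) - y) ** 2, axis=-1)   # stands in for the perceptual loss
    return recon + gaussian_kl(mu, logvar)

def sample_analogies(mu, logvar, G, n_solutions, rng):
    # One encoder pass gives (mu, logvar); every additional analogy costs
    # only one more generator evaluation on a freshly sampled code.
    sigma = np.exp(0.5 * logvar)
    zs = mu + sigma * rng.normal(size=(n_solutions,) + mu.shape)
    return np.stack([G(z) for z in zs])

rng = np.random.default_rng(0)
G = lambda z: np.tanh(z)    # stand-in for the pre-trained X domain generator
analogies = sample_analogies(np.zeros(8), np.log(0.25) * np.ones(8), G, 5, rng)
```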
Optimization is straightforward: T() and E() can use the same learning rates, and we are able to use arbitrarily large amounts of training data (whereas only limited training sets can be used with NAM).

3.4 Inference and Multiple Solutions

By the end of training, all three networks E(), G() and T() have been estimated. To evaluate the solution for a new Y domain image y, we first encode:

(µy, log σ2y) = E(y)    (7)

For every component i of the code, we independently sample a normally distributed value εi ∼ N(0, 1) and set:

ziy = µi(y) + σi(y) ∗ εi    (8)

We then compute the X domain analogy by evaluating the generator:

x̃ = G(zy)    (9)

To obtain multiple solutions we simply sample multiple codes, each generation being a solution. Differently from NAM, we now possess a parametric model of the solution space. We can therefore sample the solutions with maximal variety, such as those along the components with maximal variance, or along a line segment in solution space.

Inference requires just a single evaluation of E() and as many generator evaluations as the number of solutions required. This is far more efficient than NAM, which requires hundreds of forward and backward steps for each solution.

3.5 Implementation Details

In this section we give a detailed description of the procedure used to generate the experiments presented in this paper.

Unconditional Generative Models: VAE-NAM takes as input a generative model of the X domain. We experimented with a variety of generative models and obtained the best performance with a standard DCGAN for the 32 × 32 and 64 × 64 resolution domains, including Shoes-RGB, Handbags-RGB, SVHN, MNIST and Cars.
We generally found that it was helpful to choose the latent code dimension to agree with the dataset complexity; we therefore used dimension 32 for MNIST and 100 for the rest.

Encoder: For the VAE encoder we used a standard modification of the DCGAN discriminator, where the sigmoid was removed from the output layer and the output dimension was increased to twice the latent code dimension (to model both µ and log(σ2)).

Mapping: We use a CRN [21] mapping function, as used by NAM. We modified the architecture for 64 × 64 images by removing the appropriate number of layers so that the central residual blocks operate on the same 32 × 32 resolution as in the original architecture. The number of channels used in the MNIST and SVHN experiments is 8; in all other experiments it is 32.

Optimization: The pre-training of the generative models was done using the code supplied by the authors with standard hyperparameter choices. VAE-NAM was optimized end-to-end over E() and T(). Differently from NAM, a single learning rate (3 × 10−3) was used across all parameters, as all parameters are updated with every training batch. We also use all Y domain training images available, rather than just 2000. This is due to VAE-NAM's parametric encoder allowing it to benefit from gradient updates from every training sample.

Table 1: MNIST→SVHN Translation quality measured by translated digit classification accuracy (%)

CycleGAN DistanceGAN NAM AE-NAM VAE-NAM

17.7 26.8 31.9 16.7 51.7

Table 2: SVHN→MNIST Translation quality measured by translated digit classification accuracy (%)

CycleGAN DistanceGAN NAM AE-NAM VAE-NAM

26.1 26.8 33.3 37.4 37.4

4 Experiments

In this section we evaluate the proposed VAE-NAM against NAM, which is at present the only unsupervised cross-domain mapping method that does not use adversarial training.
We evaluate the visual quality of generations, the accuracy of analogies, and the runtime during evaluation.

4.1 Quantitative Results

We conduct several quantitative experiments to benchmark the relative performance of VAE-NAM.

MNIST→SVHN: MNIST and SVHN are standard benchmarks for evaluating classification algorithms. Several domain mapping algorithms (including DiscoGAN and NAM) evaluated their performance on mapping between the two datasets. The protocol consists of mapping images from MNIST to SVHN (using DiscoGAN, DistanceGAN, NAM, AE-NAM and VAE-NAM); the mapped images are then classified by a pre-trained SVHN classification network. This is used to evaluate the percentage of images mapped to the same label in the other domain. The results can be seen in Tab. 1. VAE-NAM significantly outperformed NAM, DiscoGAN and DistanceGAN [22] on the more challenging MNIST to SVHN task (where much information needs to be hallucinated).

SVHN→MNIST: This experiment is performed in exactly the same manner as MNIST→SVHN, with the roles of the two datasets reversed. As we can see from Tab. 2, AE-NAM and VAE-NAM outperform all other methods. Note that AE-NAM achieves the same results as VAE-NAM (the best result is achieved with the KL-divergence loss term turned off) due to the many-to-one nature of this task.

Car2Car: In the car2car task, each domain consists of a set of distinct car models, displayed at an angle θ ∈ {−90, −75, ..., 90} degrees. The task is to map across the domains such that the model is translated to the other domain while the angle is preserved. We evaluate performance by the root median of the squares of the residuals between the mapped and ground truth orientations. This is measured by a pretrained angle regressor with a Network-in-Network architecture, trained on the X domain training images.
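The Car2Car error metric just described can be sketched as follows (hypothetical helper name; in the paper the orientations themselves come from the pretrained regressor):

```python
import numpy as np

def root_median_squared_residual(pred_angles, true_angles):
    # Root of the median squared residual between mapped and ground-truth
    # orientations; the median makes it robust to a minority of bad mappings.
    r = np.asarray(pred_angles, dtype=float) - np.asarray(true_angles, dtype=float)
    return float(np.sqrt(np.median(r ** 2)))

perfect = root_median_squared_residual([0, 15, -30], [0, 15, -30])
robust = root_median_squared_residual([3, 4, 100], [0, 0, 0])   # one large outlier
```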
VAE-NAM outperformed NAM on this task (and both significantly outperformed DiscoGAN, which presented this benchmark). Per-latent-variable optimization is more error prone than training an encoder, hence VAE-NAM's better performance. Due to the one-to-one nature of this task, AE-NAM performed similarly to VAE-NAM.

Table 3: Car2Car root median residual error (lower is better).

DiscoGAN NAM VAE-NAM

13.81 1.47 1.38

4.2 Qualitative Results

We present a qualitative evaluation of our method, both in comparison to other approaches and as an investigation into the effects of the different loss components. We evaluated the different methods on the edges2shoes dataset. It consists of 48000 images of shoes first collected by [23]. Domain X is the set of shoe RGB images, whereas domain Y is the set of their edge maps. No correspondences are given to any of the algorithms during training; the dataset therefore forms an unsupervised analogy dataset.

Comparison of methods: Several analogies by VAE-NAM and NAM can be observed in Fig. 1. We can see that VAE-NAM's visual quality is comparable to NAM's on this dataset, while VAE-NAM adheres better to the ground truth. We evaluated the merits of using VAE-NAM vs. the deterministic AE-NAM. It is evident that the deterministic AE-NAM yields more precise adherence to the target image than VAE-NAM. The AE-NAM result, however, is significantly less realistic than those obtained from VAE-NAM (e.g. as only a single solution is assigned to each image, it is an average of all possible solutions, and in many cases does not look particularly realistic).

Figure 1: Comparison between NAM (top row), AE-NAM (second row) and VAE-NAM (bottom rows), with the ground truth (GT) shown last.
AE-NAM also suffers from sample variety issues: it can be seen that analogies of different images have mostly the same color. This is remedied by switching to the VAE-NAM formulation, which enables a trade-off between reconstruction accuracy and modeling the variability of the solution space.

Exploring the latent space parametrized by σ2(y): VAE-NAM finds a Gaussian probability distribution of X domain analogies. In Fig. 1 we visually explore the set of analogies encoded by the learned probability distribution. We can see that the learned analogy distribution exhibits much variety, which is inherent in the nature of the one-to-many mapping problem.

4.3 Runtime Analysis

One of the greatest advantages of VAE-NAM over NAM is the availability of a forward encoder. By the end of training, T() is obtained for NAM, and both T() and E() are obtained for VAE-NAM. For NAM, at evaluation time every Y domain image requires solving the optimization problem in Eq. 1; this typically requires around 100 iterations, each involving a forward and a backward step. Each iteration requires the evaluation of G() and T() as well as the VGG loss, which dominates the evaluation time. For VAE-NAM, only a single evaluation of E() and G() is required. If multiple solutions are required, the runtime for both NAM and VAE-NAM increases linearly with the number of solutions.

Evaluating 100 analogies on a P100 GPU for 64 × 64 images took 0.013s for VAE-NAM, whereas NAM required 23s. Our proposed method is therefore about 2000× faster than NAM at evaluation time. We conclude that for applications for which evaluation runtime is important, VAE-NAM should always be preferred over NAM.

5 Discussion

Our method, VAE-NAM, presented a new interpretation for cross-domain mapping. We formulated the mapping as a variational autoencoder on the Y domain, with special properties.
The encoder E(y) maps the Y domain image into the pre-trained latent space of X. The decoder consists of two parts: the first is a pre-trained X domain generative model G(z) serving as a transformation prior, and the second is the learned mapping function T().

VAE-NAM has some relation to CoGAN and UNIT, which assume that analogous images in the X and Y domains share the same latent code. Various constraints are used to ensure this property, including adversarial and circularity constraints (although the latter was slightly relaxed in MUNIT). Our model is, however, significantly different. In CoGAN and later works the only relation between the two domains is through the latent code space Z. In NAM, only the X domain is directly related to the Z domain, whereas Y is only directly related to the X domain (and to Z only through X). This forces the two domains to share the same latent space without adversarial constraints. VAE-NAM adds a unidirectional connection from Y to Z, while the direct connection between X and Y is maintained. This ensures the two domains share the same latent space while also benefiting from the advantages of having a parametric encoding model.

The success of VAE-NAM suggests that the main component required for unsupervised cross-domain mapping is a strong domain prior. Given such a strong prior, the method learns to effectively map between the two domains. The formulation as a VAE allows future work to use the vast research developed by the auto-encoder research community, for example the Wasserstein autoencoder [24], or recurrence for dealing with temporal data [25].

Generalization to multiple domains occurs naturally, and has a possible cognitive interpretation. Let us assume the agent has a good generative model of a privileged domain, which the agent is very familiar with (in our notation, the domain is denoted X).
For every new domain Y, the agent will train a VAE, with the decoder factorized by the known generative model GX() and a learned mapping function TXY. The familiar domain therefore serves as a familiar grounding space, onto which all other knowledge is aligned. Further research is needed to verify if this model in fact accords with human cognitive patterns.

There are multiple ways in which the current work may be extended. Each input sample is encoded so that its latent posterior is pushed toward a unit Gaussian with identity covariance. This might be too restrictive, and does not fully explore the space of possible multiple analogies. We think that better probabilistic modeling of the latent space can be a fruitful direction.

NAM and VAE-NAM have so far only been applied to image domains. The next step in unsupervised mapping research is to explore the applicability of VAE-NAM to different modalities such as word translation and speech. Non-adversarial mapping is particularly exciting for tasks for which we already have good parametric models. Examples of settings for which excellent parametric generators have been developed include face and 3D modeling, as well as general physics simulations and game engines. Successful alignment between the world and a simulation can advance the fields of model-based learning and virtual reality.

6 Conclusions

We presented VAE-NAM, a method for learning to map across image domains without supervision. Differently from NAM, it uses a parametric stochastic function to map input samples to latent codes, rather than learning the latent codes by direct optimization. The model also parameterizes the latent space, and the multiple analogies that can arise from a single input sample. VAE-NAM is shown to be better at using large datasets than NAM, is simpler to train, generally converges to better solutions and is much faster at evaluation time.

References

[1] Yedid Hoshen and Lior Wolf.
Identifying analogies across domains. In ICLR, 2018.

[2] Yedid Hoshen and Lior Wolf. NAM - unsupervised cross-domain image mapping without cycles or GANs. In ICLR Workshop, 2018.

[3] Yedid Hoshen and Lior Wolf. NAM: Non-adversarial unsupervised domain mapping. In ECCV, 2018.

[4] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[5] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916, 2017.

[6] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[7] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[8] Erik G Miller, Nicholas E Matsakis, and Paul A Viola. Learning from one example through shared densities on transforms. In CVPR, 2000.

[9] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.

[10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

[11] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.

[12] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.

[13] Yedid Hoshen and Lior Wolf. An iterative closest point method for unsupervised word translation.
arXiv preprint arXiv:1801.06126, 2018.

[14] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NIPS, 2016.

[15] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.

[16] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In ICML, 2017.

[18] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NIPS, 2017.

[19] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.

[20] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. In ICML, 2018.

[21] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.

[22] Sagie Benaim and Lior Wolf. One-sided unsupervised domain mapping. In NIPS, 2017.

[23] Aron Yu and Kristen Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.

[24] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018.

[25] Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. In ICLR Workshop, 2014.
", "award": [], "sourceid": 3737, "authors": [{"given_name": "Yedid", "family_name": "Hoshen", "institution": "Facebook AI Research"}]}