{"title": "One-Shot Unsupervised Cross Domain Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 2104, "page_last": 2114, "abstract": "Given a single image $x$ from domain $A$ and a set of images from domain $B$, our task is to generate the analogous of $x$ in $B$. We argue that this task could be a key AI capability that underlines the ability of cognitive agents to act in the world and present empirical evidence that the existing unsupervised domain translation methods fail on this task. Our method follows a two step process. First, a variational autoencoder for domain $B$ is trained. Then, given the new sample $x$, we create a variational autoencoder for domain $A$ by adapting the layers that are close to the image in order to directly fit $x$, and only indirectly adapt the other layers. Our experiments indicate that the new method does as well, when trained on one sample $x$, as the existing domain transfer methods, when these enjoy a multitude of training samples from domain $A$. Our code is made publicly available at https://github.com/sagiebenaim/OneShotTranslation", "full_text": "One-Shot Unsupervised Cross Domain Translation\n\nSagie Benaim1 and Lior Wolf1,2\n\n1The School of Computer Science , Tel Aviv University, Israel\n\n2Facebook AI Research\n\nAbstract\n\nGiven a single image x from domain A and a set of images from domain B,\nour task is to generate the analogous of x in B. We argue that this task could\nbe a key AI capability that underlines the ability of cognitive agents to act in\nthe world and present empirical evidence that the existing unsupervised domain\ntranslation methods fail on this task. Our method follows a two step process. First,\na variational autoencoder for domain B is trained. Then, given the new sample\nx, we create a variational autoencoder for domain A by adapting the layers that\nare close to the image in order to directly \ufb01t x, and only indirectly adapt the other\nlayers. 
Our experiments indicate that the new method does as well, when trained\non one sample x, as the existing domain transfer methods, when these enjoy a\nmultitude of training samples from domain A. Our code is made publicly available\nat https://github.com/sagiebenaim/OneShotTranslation.\n\n1\n\nIntroduction\n\nA simpli\ufb01cation of an intuitive paradigm for accumulating knowledge by an intelligent agent is as\nfollows. The gained knowledge is captured by a model that retains previously seen samples and is\nalso able to generate new samples by blending the observed ones. The agent learns continuously by\nbeing exposed to a series of objects. Whenever a new sample is observed, the agent generates, using\nthe internal model, a virtual sample that is analogous to the observed one, and compares the observed\nand blended objects in order to update the internal model.\nThis variant of the perceptual blending framework [1], requires multiple algorithmic solutions. One\nmajor challenge is a speci\ufb01c case of \u201cthe learning paradox\u201d, i.e., how can one learn what it does not\nalready know, or, in the paradigm above, how can the analogous mental image be constructed if the\nobserved sample is unseen and potentially very different than anything that was already observed.\nComputationally, this generation step requires solving the task that we term one-shot unsupervised\ncross domain translation: given a single sample x from an unknown domain A and many samples\nor, almost equivalently, a model of domain B, generate a sample y \u2208 B that is analogous to x.\nWhile there has been a great deal of research dedicated to unsupervised domain translation, where\nmany samples from domain A are provided, the literature does not deal, as far as we know, with the\none-shot case.\nTo be clear, since parts of the literature may refer to these type of tasks as zero-shot learning, we are\nnot given any training images in A except for the image to be mapped x. 
Consider, for example, the\ntask posed in [2] of mapping zebras to horses. The existing methods can perform this task well, given\nmany training images of zebras and horses. However, it seems entirely possible to map a single zebra\nimage to the analogous horse image even without seeing any other zebra image.\nThe method we present, called OST (One Shot Translation), uses the two domains asymmetrically\nand employs two steps. First, a variational autoencoder is constructed for domain B. This allows\nus to encode samples from domain B effectively as well as generate new samples based on random\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\flatent space vectors. In order to encourage generality, we further augment B with samples produced\nby a slight rotation and with a random horizontal translation.\nIn the second phase, the variational autoencoder is cloned to create two copies that share the top\nlayers of the encoders and the bottom layers of the decoders, one for the samples in B and one\nfor the sample x in A. The autoencoders are trained with reconstruction losses as well as with a\nsingle-sample one-way circularity loss. The samples from domain B continue to train its own copy\nas in the \ufb01rst step, updating both the shared and the unshared layers. The gradient from sample x\nupdates only the unshared layers and not the shared layers. This way, the autoencoder of B is adjusted\nby x through the loss incurred on unshared layers for domain B by the circularity loss, and through\nsubsequent adaptation of the shared layers to \ufb01t the samples of B. This allows the shared layers to\ngradually adapt to the new sample x, but prevents over\ufb01tting on this single sample. Augmentation is\napplied, as before, to B and also to x for added stability.\nWe perform a wide variety of experiments and demonstrate that OST outperforms the existing\nalgorithms in the low-shot scenario. 
On most datasets the method also presents a comparable\naccuracy with a single training example to the accuracy obtained by the other methods for the\nentire set of domain A images. This success sheds new light on the potential mechanisms that\nunderlie unsupervised domain translation, since in the one-shot case, constraints on the inter-sample\ncorrelations in domain A do not apply.\n\n2 Previous Work\n\nUnsupervised domain translation methods receive two sets of samples, one from each domain, and\nlearn a function that maps between a sample in one domain and the analogous sample in the other\ndomain [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Such methods are unsupervised in the sense that the two\nsets are completely unpaired.\nThe mapping between the domains can be recovered based on multiple cues. First, shared objects\nbetween domains can serve as supervised samples. This is the case in the early unsupervised\ncross-lingual dictionary translation methods [13, 14, 15, 16], which identi\ufb01ed international words\n(\u2018computer\u2019, \u2018computadora\u2019,\u2018komp\u00fcter\u2019) or other words with a shared etymology by considering\ninter-language edit distances. These words were used as a seed set to bootstrap the mapping process.\nA second cue is that of object relations. It often holds that the pairwise similarities between objects in\ndomain A are preserved after the transformation to domain B. This was exploited in [5] using the L2\ndistances between classes. In the work on unsupervised word to word translation [9, 10, 11, 17], the\nrelations between words in each language are encoded by word vectors [18], and translation is well\napproximated by a linear transformation of one language\u2019s vectors to those of the second.\nA third cue is that of inner object relations. If the objects of domain A are complex and contain\nmultiple parts, then one can expect that after mapping, the counterpart in domain B would have a\nsimilar arrangement of parts. 
This was demonstrated by examining the distance between halves of\nimages in [5] and it also underlies unsupervised NLP translation methods that can translate a sentence\nin one language to a sentence in another, after observing unmatched corpora [12].\nAnother way to capture these inner-object relations is by constructing separate autoencoders for\nthe two domains, which share many of the weights [6, 7]. It is assumed that the low-level image\nproperties, such as texture and color, are domain-specific, and that the mid- and top-level properties\nare common to both domains.\nThe third cue is also manifested implicitly (in both autoencoder architectures and in other methods)\nby the structure of the neural network used to perform the cross-domain mapping [19]. The network\u2019s\ncapacity constrains the space of possible solutions, and the relatively shallow networks used, together\nwith their architecture, dictate the form of a solution. Taken together with the GAN [20] constraints that\nensure that the generated images are from the target domain, and restricted further by the circularity\nconstraint [2, 3, 4], much of the ambiguity in mapping is eliminated.\nIn the context of one-shot translation, it is not possible to find or to generate analogs in B to the\ngiven x \u2208 A, since the domain-invariant distance between the two domains is not defined. One can\ntry to use general purpose distances such as the perceptual distance, but this would make the work\nsemi-supervised such as [21, 22] (these methods are also not one-shot). Since there are no inter-object\nrelations in domain A, the only cue that can be used is of the third type.\nWe have made attempts to compare various image parts within x, thereby generalizing the image-halves\nsolution of [5]. However, this did not work. Instead, our work relies on the assumption that\nthe mid-level representation of domain A is similar to that of B, which, as mentioned above, is the\nunderlying assumption in autoencoder based cross-domain translation work [6, 7].\n\nFigure 1: Illustration of the two phases of training. (Phase I): Augmented samples from domain\nB, P (\u039b), are used to train a variational autoencoder for domain B. RBB denotes the space of\nreconstructed samples from P (\u039b). (Phase II): the variational autoencoder of phase I is cloned, while\nsharing the weights of part of the encoder (ES) and part of the decoder (GS). These shared parts,\nmarked with a snowflake, are frozen with respect to the sample x. For both phase I and phase II,\nwe train a discriminator DB to ensure that the generated images belong to the distribution of domain\nB. P (x) and P (\u039b) are translated to a common feature space, CE, using $E^U_A$ and $E^U_B$ respectively.\nC (resp. CG) is the space of features constructed after passing CE (resp. C) through the common\nencoder ES (resp. common decoder GS). RAB denotes the subspace of samples in B constructed\nfrom P (x), which is generated by augmenting x. RAA denotes the space of reconstructed samples\nfrom P (x). RABA denotes the subspace of samples in A constructed by translating P (x) to domain\nB and then back to A.\n\n3 One-Shot Translation\n\nIn the problem of unsupervised cross-domain translation, the learning algorithm is provided with\nunlabeled datasets from two domains, A and B. The goal is to learn a function T, which maps samples\nin domain A to analog samples in domain B. In the autoencoder based mapping technique [7], two\nencoders/decoders are learned. We denote the encoder for domain A (B) by EA (EB) and the decoder\nby GA (GB). In order to translate a sample x in domain A to domain B, one employs the encoder of\nA and the decoder of B, i.e., TAB = GB \u25e6 EA.\nA strong constraint on the form of the translation is given by sharing layers between the two\nautoencoders. The lower layers of the encoder and the top layers of the decoder are domain-specific\nand unshared. 
The encoder\u2019s top layers and decoder\u2019s bottom layers are shared. This sharing enforces\nthe same structure on the encoding of both domains and is crucial for the success of the translation.\nSpecifically, we write $E_A = E^S \\circ E^U_A$, $E_B = E^S \\circ E^U_B$, $G_A = G^U_A \\circ G^S$, and $G_B = G^U_B \\circ G^S$,\nwhere the superscripts S and U denote shared and unshared parts, respectively, and the subscripts\ndenote the domain. This structure is depicted in Fig. 1.\nIn addition to the networks that participate in the two autoencoders, an adversarial discriminator\n$D_B$ is trained in both phases, in order to model domain B. Domain A does not contain enough\nreal examples in the case of low-shot learning and, in addition, a domain A discriminator is less\nneeded since the task is to map from A to B. When mapping x (after augmentation) to B using the\ntransformation T, the discriminator $D_B$ is used to provide an adversarial signal.\n\n3.1 Phase One of Training\n\nIn the first phase, we employ a training set $\\Lambda$ of images from domain B and train a variational\nautoencoder for this domain. The method employs an augmentation operator that consists of small\nrandom rotations of the image and a horizontal translation. We denote by $P(\\Lambda)$ the training set\nconstructed by randomly augmenting every sample $s \\in \\Lambda$.\nThe following losses are used:\n\n$$\\mathcal{L}_{REC_B} = \\sum_{s \\in P(\\Lambda)} \\| G_B(E_B(s)) - s \\|_1 \\tag{1}$$\n$$\\mathcal{L}_{VAE_B} = \\sum_{s \\in P(\\Lambda)} \\mathrm{KL}(E_B \\circ P(\\Lambda) \\,\\|\\, \\mathcal{N}(0, I)) \\tag{2}$$\n$$\\mathcal{L}_{GAN_B} = \\sum_{s \\in P(\\Lambda)} -\\ell(D_B(G_B(E_B(s))), 0) \\tag{3}$$\n$$\\mathcal{L}_{D_B} = \\sum_{s \\in P(\\Lambda)} \\ell(D_B(\\overline{G_B}(E_B(s))), 0) + \\ell(D_B(s), 1) \\tag{4}$$\n\nwhere the first three losses are the reconstruction loss, the variational loss and the adversarial loss on\nthe generator, respectively, and the fourth loss is the loss of the GAN\u2019s discriminator, in which we\nuse the bar to indicate that $G_B$ is not updated during the backpropagation of this loss. $\\ell$ can be the\nbinary cross entropy or the least square loss, $\\ell(x, y) = (x - y)^2$ [23]. When training $E_B$ and $G_B$ in\nthe first phase, the following loss is minimized:\n\n$$\\mathcal{L}^{I} = \\mathcal{L}_{REC_B} + \\alpha_1 \\mathcal{L}_{VAE_B} + \\alpha_2 \\mathcal{L}_{GAN_B} \\tag{5}$$\n\nwhere $\\alpha_i$ are tradeoff parameters. At the same time we train $D_B$ to minimize $\\mathcal{L}_{D_B}$. Similarly to\nCycleGAN, $D_B$ can be a PatchGAN [24] discriminator, which checks if 70 \u00d7 70 overlapping patches\nof the image are real or fake.\n\n3.2 Phase Two of Training\n\nIn the second phase, we make use of the sample x from domain A, as well as the set $\\Lambda$. In case we\nare given more than one sample from domain A, we simply add the loss terms for each one of the\nsamples.\nDenote by $P(x)$ the set of random augmentations of x and the cross-domain encoding/decoding as:\n\n$$T_{BB} = G^U_B(G^S(E^S(E^U_B(x)))) \\tag{6}$$\n$$T_{BA} = G^U_A(\\overline{G^S}(\\overline{E^S}(E^U_B(x)))) \\tag{7}$$\n$$T_{AA} = G^U_A(\\overline{G^S}(\\overline{E^S}(E^U_A(x)))) \\tag{8}$$\n$$T_{AB} = G^U_B(\\overline{G^S}(\\overline{E^S}(E^U_A(x)))) \\tag{9}$$\n\nwhere the bar is used, as before, to indicate a detached clone not updated during backpropagation.\n$G^U_B$ and $G^U_A$ (resp. $E^U_B$ and $E^U_A$) are initialized with the weights of $G_A$ (resp. $E_A$) trained in phase I.\nThe following additional losses are used:\n\n$$\\mathcal{L}_{REC_A} = \\sum_{s \\in P(x)} \\| T_{AA}(s) - s \\|_1 \\tag{10}$$\n$$\\mathcal{L}_{cycle} = \\sum_{s \\in P(x)} \\| T_{BA}(T_{AB}(s)) - s \\|_1 \\tag{11}$$\n$$\\mathcal{L}_{GAN_{AB}} = \\sum_{s \\in P(x)} -\\ell(D_B(T_{AB}(s)), 0) \\tag{12}$$\n$$\\mathcal{L}_{D_{AB}} = \\sum_{s \\in P(x)} \\ell(D_B(T_{AB}(s)), 0) + \\ell(D_B(s), 1) \\tag{13}$$\n\nnamely, the reconstruction loss on x, a one-way cycle loss applied to x, and the generator and\ndiscriminator losses for domain B given the source sample x. In phase II we minimize the following\nloss:\n\n$$\\mathcal{L}^{II} = \\mathcal{L}^{I} + \\alpha_3 \\mathcal{L}_{REC_A} + \\alpha_4 \\mathcal{L}_{cycle} + \\alpha_5 \\mathcal{L}_{GAN_{AB}} \\tag{14}$$\n\nwhere $\\alpha_i$ are tradeoff parameters. Losses not in $\\mathcal{L}^{I}$ are minimized over the unshared layers of the\nencoders and decoders. We stress that losses in $\\mathcal{L}^{I}$ are still minimized over both the shared and\nunshared layers in phase II. At the same time we train $D_B$ to minimize $\\mathcal{L}_{D_B}$ and $\\mathcal{L}_{D_{AB}}$.\nNote that $G^S$ and $E^S$ enforce the same structure on x as they do on samples from domain B. Enforcing\nthis is crucial in making x and $T_{AB}(x)$ structurally aligned, as these layers typically encode structure\ncommon to both domains A and B [7, 6]. OST assumes that it is sufficient to train a VAE for\ndomain B only, in order for $G^S$ and $E^S$ to contain the features needed to represent x and its aligned\ncounterpart $T_{AB}(x)$. Given this assumption, it does not rely on samples from A to train $G^S$ and $E^S$.\n$G^S$ and $E^S$ are detached during backpropagation not just from the VAE\u2019s reconstruction loss in\ndomain A but also from the cycle and the $GAN_{AB}$ losses in $\\mathcal{L}^{II}$. As our experiments show, it is\nimportant to adapt these shared parts to x. This happens indirectly: during training the unshared\nlayers of $E^U_B$ and $G^U_B$ are updated via the one-shot cycle loss (Eq. 11). Due to this change, all three\nloss terms in $\\mathcal{L}^{I}$ are expected to increase and $G^S$ and $E^S$ are adapted to rectify this.\nSelective backpropagation plays a crucial role in OST. 
Its aim is to adapt the unshared layers of\ndomain A to the shared representation obtained based on the samples of domain B. Intuitively, LI\nlosses, which are formulated with samples of B only, can be backpropagated normally, since due\nto the number of samples in B, ES and GS generalize well to other samples in this domain. Based\non the shared latent space assumption, ES and GS would also \ufb01t samples in A. However, updating\nthe layers of GS and ES based on loss LII (with selective backpropagation turned off, as is done in\nthe ablation experiments of Tab. 1), would quickly lead to over\ufb01tting on x, since for every shared\nrepresentation, the unshared layers in domain A can still reconstruct this one sample. This increase\nin \ufb01tting capacity leads to an arbitrary mapping of x, and one can see that in this case, the mapping\nof x is highly unstable during training and almost arbitrary (Fig. 2). If the shared representation is\ncompletely \ufb01xed at phase II, as in row 8 of Tab. 1, the lack of adaptation hurts performance. This is\nanalogous to what was discovered in [25] in the context of adaptation in transfer learning.\nNote that we did not add the cycle loss in the reverse direction. Consider the MNIST (domain A) to\nSVHN (domain B) translation (Fig. 3). If we had the cycle-loss in the reverse direction, all SVHN\nimages (of all digits) would be translated to the single MNIST image (of a single digit) present in\ntraining. The cycle loss would then require that we reconstruct the original SVHN image from the\nsingle MNIST image (see rows 9 and 10 of Tab. 1).\n\n3.3 Network Architecture and Implementation\nWe consider x \u2208 A and samples in B to be images in R3\u00d7256\u00d7256. We compare our results to state\nof the art method, CycleGAN [2] and UNIT [7] and use the architecture of CycleGAN, shown to\nbe highly successful for a variety of datasets, for the encoders, decoders and discriminator. 
For a\nfair comparison, the same architecture is used when comparing OST to the CycleGAN and UNIT\nbaselines. The network architecture released with UNIT did not outperform the combination of the\nUNIT losses and the CycleGAN architecture for the datasets that are used in our experiments.\nBoth the shared and unshared encoders (resp. decoders) consist of between 1 and 2 stride-2 convolutions\n(resp. deconvolutions). The shared encoder consists of between 1 and 3 residual blocks after\nthe convolutional layers. The shared decoder also consists of between 1 and 3 residual blocks before\nits deconvolutional layers. The number of layers is selected to obtain the optimal CycleGAN results\nand is used for all architectures. Batch normalization and ReLU activations are used between layers.\n\nFigure 2: Mapping of an SVHN image to MNIST, with selective and with non-selective backprop.\nThe results are shown at different iterations. Without selective backpropagation, the result is\nunstable and arbitrary.\n\nFigure 3: (a) Translating MNIST images to SVHN images. x-axis is the number of samples in A\n(log-scale), y-axis is the accuracy of a pretrained classifier on the resulting translated images. The\naccuracy is averaged over 1000 independent runs for different samples. Blue: Our OST method.\nYellow: UNIT [7]. Red: CycleGAN [2]. (b) The same graph in the reverse direction.\n\nCycleGAN employs a number of additional techniques to stabilize training, which OST borrows. The\nfirst is the use of a PatchGAN discriminator [24], and the second is the use of least-square loss for\nthe discriminator [23] instead of negative log-likelihood loss. 
For the MNIST [26] to SVHN [27]\ntranslation and the reverse translation, the PatchGAN discriminator is not used, and, for these\nexperiments, where the input is in $\\mathbb{R}^{3 \\times 32 \\times 32}$, the standard DCGAN [28] architecture is used.\n\n4 Experiments\n\nWe compare OST, trained on a single sample x \u2208 A, to both UNIT and CycleGAN trained either\non x alone or with the entire training set of images from A. We conduct a number of quantitative\nevaluations, including style and content loss comparison as well as a classification accuracy test for\ntarget images. For the MNIST to SVHN translation and the reverse, we conduct an ablation study,\nshowing the importance of every component of our approach. For this task, we further evaluate our\napproach when more samples are presented, showing that OST is able to perform well on larger\ntraining sets. In all cases x is sampled from the training set of the other methods. The experiments\nare repeated multiple times and the mean results are reported.\n\nMNIST to SVHN Translation Using OST, we translated a randomly selected MNIST [26] image\nto a Street View House Numbers (SVHN) [27] image. We used a pretrained classifier for SVHN to\npredict a label for the translated image and compared it to the input MNIST image label.\nFig. 3(a) shows the accuracy of the translation for an increasing number of samples in A. The accuracy\nis the percentage of translations for which the label of the input image matches that given by a\npretrained classifier applied on the translated image. The same random selection of images was used\nfor baseline comparison, and the accuracy is measured on the training images translated from A to B,\nand not on a separate test set. The reverse translation experiment was also conducted and is shown in\nFig. 3(b). While increasing the number of samples increases the accuracy, OST outperforms the\nbaselines even when trained on the entire training set. 
We note that the accuracy of the unsupervised\nmapping is lower than for the supervised one or when using a pretrained perceptual loss [21].\nIn a second experiment, an ablation study is conducted. We consider our method where any of the\nfollowing are left out: first, augmentation on both the input image x \u2208 A and on images from B.\nSecond, the one-way cycle loss, Lcycle. Third, selective backpropagation is lifted, and gradients from\nlosses of LII are passed through the shared encoder and decoder, ES and GS. The results are reported\nin Tab. 1. We find that selective backpropagation has the largest effect on translation accuracy.\nOne-way cycle loss and augmentation contribute less to the one-shot performance.\n\nTable 1: Ablation study for the MNIST to SVHN translation (and vice versa). We consider the\ncontribution of various parts of our method on the accuracy. Translation is done for one sample.\n\nAugmentation | One-way cycle | Selective backprop | Accuracy (MNIST to SVHN) | Accuracy (SVHN to MNIST)\nFalse | False | False | 0.07 | 0.10\nTrue | False | False | 0.11 | 0.11\nFalse | True | False | 0.13 | 0.13\nTrue | True | False | 0.14 | 0.14\nFalse | False | True | 0.19 | 0.20\nTrue | False | True | 0.20 | 0.20\nFalse | True | True | 0.22 | 0.23\nTrue | True | No Phase II update of ES and GS | 0.16 | 0.15\nTrue | Two-way cycle | True | 0.20 | 0.13\nTrue | Two-way cycle | False | 0.11 | 0.12\nTrue | True | True | 0.23 | 0.23\n\nTable 2: (i) Measuring the perceptual distance [29] between inputs and their corresponding output\nimages of different style transfer tasks. Low perceptual loss indicates that much of the high-level\ncontent is preserved in the translation. (ii) Measuring the style difference between translated images\nand images from the target domain. We compute the average Gram matrix of translated images and\nimages from the target domain and find the average distance between them, as described in [29].\n\nComponent | Dataset | OST | UNIT [7] | CycleGAN [2] | UNIT [7] | CycleGAN [2]\n | Samples in A | 1 | 1 | 1 | All | All\n(i) Content | Summer2Winter | 0.64 | 3.20 | 3.53 | 1.41 | 0.41\n(i) Content | Winter2Summer | 0.73 | 3.10 | 3.48 | 1.38 | 0.40\n(i) Content | Monet2Photo | 3.75 | 6.82 | 5.80 | 1.46 | 1.41\n(i) Content | Photo2Monet | 1.47 | 2.92 | 2.98 | 2.01 | 1.46\n(ii) Style | Summer2Winter | 1.64 | 6.51 | 1.62 | 1.69 | 1.69\n(ii) Style | Winter2Summer | 1.58 | 6.80 | 1.31 | 1.69 | 1.66\n(ii) Style | Monet2Photo | 1.20 | 6.83 | 0.90 | 1.21 | 1.18\n(ii) Style | Photo2Monet | 1.95 | 7.53 | 1.91 | 2.12 | 1.88\n\nIn another experiment, we completely froze the shared encoder and decoder in phase II. In this case,\nthe mapping fails to produce images in the target distribution. In the SVHN to MNIST translation,\nfor instance, the background color of the translated images is gray and not black.\n\nStyle Transfer Tasks We consider the tasks of two-way translation from Images to Monet-style\npainting [2], Summer to Winter translation [2] and the reverse translations. To assess the quality of\nthese translations, we measure the perceptual distance [29] between input and translated images. This\nsupervised distance is minimized in style transfer tasks to preserve the translation\u2019s content, and so a\nlow value indicates that much of the content is preserved. Further, we compute the style difference\nbetween translated images and target domain images, as introduced in [29]. Tab. 2 shows that OST\ncaptures the target style in a similar manner to UNIT and CycleGAN when trained on many samples,\nas well as CycleGAN trained with a single sample. While the latter captures the style of the target\ndomain, it is unable to preserve the content, as indicated by the high perceptual distance. Sample\nresults obtained with OST are shown in Fig. 
4 and in the supplementary.\n\nDrawing Tasks We consider the translation of Google Maps to Aerial View photos [24], Facades\nto Images [30], Cityscapes to Labels [31] and the reverse translations. Sample results are shown in\nFig. 4 and in the supplementary. OST trained on a single sample, as well as CycleGAN and UNIT\ntrained on the entire training set, obtain aligned mappings, while CycleGAN and UNIT trained on\na single sample either failed to produce samples from the target distribution or failed to create an\naligned mapping. Tab. 3 shows that OST achieves a similar perceptual distance and style difference\nto CycleGAN and UNIT trained on the entire training set. This indicates that OST achieves a similar\ncontent similarity to the input image, and style difference to the target domain, as these methods.\n\nTable 3: (i) Perceptual distance [29] between the inputs and corresponding output images, for various\ndrawing tasks. (ii) Style difference between translated images and images from the target domain.\n(iii) Correctness of translation as evaluated by a user study.\n\nMethod | Images to Facades | Facades to Images | Images to Maps | Maps to Images | Labels to Cityscapes | Cityscapes to Labels\n(i) OST 1 | 4.76 | 5.05 | 2.36 | 2.49 | 3.34 | 2.39\n(i) UNIT [7] All | 3.85 | 4.80 | 2.30 | 2.42 | 2.61 | 2.18\n(i) CycleGAN [2] All | 3.79 | 4.49 | 2.11 | 2.49 | 2.73 | 2.28\n(ii) OST 1 | 3.57 | 7.88 | 1.50 | 2.24 | 0.67 | 1.13\n(ii) UNIT [7] All | 3.92 | 7.42 | 1.59 | 2.56 | 0.69 | 1.21\n(ii) CycleGAN [2] All | 3.81 | 7.03 | 1.30 | 2.33 | 0.77 | 1.22\n(iii) OST 1 | 91% | 90% | 67% | 83% | 66% | 56%\n(iii) UNIT [7] All | 86% | 83% | 75% | 81% | 63% | 37%\n(iii) CycleGAN [2] All | 93% | 84% | 81% | 97% | 72% | 45%\n\nTo further validate this, we asked 20 persons to rate whether the source image matches the target image\n(presenting the methods and samples in a random order) and list in Tab. 
3 the ratio of \u201cyes\u201d answers.\n\n5 Discussion\n\nBeing a one-shot technique, the method we present is suitable for agents that survey the environment\nand encounter images from unseen domains. In phase II, the autoencoder of domain B changes in\norder to adapt to domain A. This is desirable in the context of \u201clife long\u201d unsupervised learning,\nwhere new domains are to be encountered sequentially. However, phase II is geared toward the\nsuccess of translating x, and in the context of multi-one-shot domain adaptations, a more conservative\napproach would be required.\nIn this work, we translate one sample from a previously unseen domain A to domain B. An interesting\nquestion is the ability of mapping from a domain in which many samples have been seen to a new\ndomain, from which a single training sample is given. An analog two phase approach can be\nattempted, in which an autoencoder is trained on the source domain, replicated, and tuned selectively\non the target domain. The added dif\ufb01culty in this other direction is that adversarial training cannot be\nemployed directly on the target domain, since only one sample of it is seen. It is possible that one can\nstill model this domain based on the variability that exists in the familiar source domain.\n\nAcknowledgements\n\nThis project has received funding from the European Research Council (ERC) under the Euro-\npean Union\u2019s Horizon 2020 research and innovation programme (grant ERC CoG 725974). The\ncontribution of Sagie Benaim is part of Ph.D. 
thesis research conducted at Tel Aviv University.\n\nFigure 4: Translation for various tasks using OST (1 Sample), CycleGAN and UNIT (1 and Many\nSamples). Columns: Input, OST 1-shot, Cycle 1-shot, Unit 1-shot, Cycle all, Unit all. Rows: Facades To,\nTo Facades, Maps To, To Maps, To Cityscapes, Cityscapes To, Monet To, To Monet, To Summer, Summer To.\n\nReferences\n[1] Fauconnier, G., Turner, M.: The Way We Think: Conceptual Blending and the Mind\u2019s Hidden\nComplexities. Basic Books (2003)\n\n[2] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using\ncycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision.\n(2017)\n\n[3] Kim, T., Cha, M., Kim, H., Lee, J., Kim, J.: Learning to discover cross-domain relations with\ngenerative adversarial networks. International Conference on Machine Learning (ICML) (2017)\n\n[4] Yi, Z., Zhang, H., Tan, P., Gong, M.: Dualgan: Unsupervised dual learning for image-to-image\ntranslation. 2017 IEEE International Conference on Computer Vision (ICCV) (2017) 2868\u20132876\n\n[5] Benaim, S., Wolf, L.: One-sided unsupervised domain mapping. In: Advances in Neural\nInformation Processing Systems 30. (2017)\n\n[6] Liu, M.Y., Tuzel, O.: Coupled generative adversarial networks. In: Advances in Neural\nInformation Processing Systems 29. (2016) 469\u2013477\n\n[7] Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In:\nAdvances in neural information processing systems 30. (2017)\n\n[8] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified generative adversarial\nnetworks for multi-domain image-to-image translation. In: The IEEE Conference on Computer\nVision and Pattern Recognition (CVPR). 
(June 2018)\n\n[9] Conneau, A., Lample, G., Ranzato, M., Denoyer, L., J\u00e9gou, H.: Word translation without\n\nparallel data. International Conference on Learning Representations (ICLR) (2017)\n\n[10] Zhang, M., Liu, Y., Luan, H., Sun, M.: Adversarial training for unsupervised bilingual lexicon\ninduction. In: Proceedings of the 55th Annual Meeting of the Association for Computational\nLinguistics (Volume 1: Long Papers). Volume 1. (2017) 1959\u20131970\n\n[11] Zhang, M., Liu, Y., Luan, H., Sun, M.: Earth mover\u2019s distance minimization for unsupervised\nbilingual lexicon induction. In: Proceedings of the 2017 Conference on Empirical Methods in\nNatural Language Processing. (2017) 1934\u20131945\n\n[12] Lample, G., Conneau, A., Denoyer, L., Ranzato, M.: Unsupervised machine translation using\nmonolingual corpora only. In: International Conference on Learning Representations. (2018)\n[13] Fung, P., Yee, L.Y.: An IR approach for translating new words from nonparallel, comparable\ntexts. In: Proceedings of the 17th international conference on Computational linguistics-Volume\n1, Association for Computational Linguistics (1998) 414\u2013420\n\n[14] Rapp, R.: Automatic identi\ufb01cation of word translations from unrelated english and german\ncorpora. In: Proceedings of the 37th annual meeting of the Association for Computational\nLinguistics on Computational Linguistics. (1999)\n\n[15] Schafer, C., Yarowsky, D.: Inducing translation lexicons via diverse similarity measures and\nbridge languages. In: proceedings of the 6th conference on Natural language learning-Volume\n20, Association for Computational Linguistics (2002) 1\u20137\n\n[16] Koehn, P., Knight, K.: Learning a translation lexicon from monolingual corpora. In: Proceed-\nings of the ACL-02 workshop on Unsupervised lexical acquisition-Volume 9, Association for\nComputational Linguistics (2002) 9\u201316\n\n[17] Hoshen, Y., Wolf, L.: Non-adversarial unsupervised word translation. 
In: Conference on Empirical Methods in Natural Language Processing (EMNLP). (2018)\n\n[18] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)\n\n[19] Galanti, T., Wolf, L., Benaim, S.: The role of minimal complexity functions in unsupervised learning of semantic mappings. International Conference on Learning Representations (2018)\n\n[20] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27. (2014) 2672\u20132680\n\n[21] Taigman, Y., Polyak, A., Wolf, L.: Unsupervised cross-domain image generation. In: International Conference on Learning Representations (ICLR). (2017)\n\n[22] Hoshen, Y., Wolf, L.: NAM - unsupervised cross-domain image mapping without cycles or GANs. In: International Conference on Learning Representations (ICLR) workshop. (2018)\n\n[23] Mao, X., Li, Q., Xie, H., Lau, R., Wang, Z.: Multi-class generative adversarial networks with the l2 loss function. (11 2016)\n\n[24] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)\n\n[25] Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS\u201914, Cambridge, MA, USA, MIT Press (2014) 3320\u20133328\n\n[26] LeCun, Y., Cortes, C.: MNIST handwritten digit database. (2010)\n\n[27] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning. 
(2011)\n\n[28] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)\n\n[29] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II. (2016) 694\u2013711\n\n[30] Tyle\u010dek, R., \u0160\u00e1ra, R.: Spatial pattern templates for recognition of objects with regular structure. In: German Conference on Pattern Recognition. (2013)\n\n[31] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)", "award": [], "sourceid": 1077, "authors": [{"given_name": "Sagie", "family_name": "Benaim", "institution": "Tel Aviv University"}, {"given_name": "Lior", "family_name": "Wolf", "institution": "Facebook AI Research"}]}