{"title": "Style Transfer from Non-Parallel Text by Cross-Alignment", "book": "Advances in Neural Information Processing Systems", "page_first": 6830, "page_last": 6841, "abstract": "This paper focuses on style transfer on the basis of non-parallel text. This is an instance of a broad family of problems including machine translation, decipherment, and sentiment modification. The key challenge is to separate the content from other aspects such as style. We assume a shared latent content distribution across different text corpora, and propose a method that leverages refined alignment of latent representations to perform style transfer. The transferred sentences from one style should match example sentences from the other style as a population. We demonstrate the effectiveness of this cross-alignment method on three tasks: sentiment modification, decipherment of word substitution ciphers, and recovery of word order.", "full_text": "Style Transfer from Non-Parallel Text by\n\nCross-Alignment\n\nTianxiao Shen1 Tao Lei2 Regina Barzilay1 Tommi Jaakkola1\n\n1{tianxiao, regina, tommi}@csail.mit.edu 2tao@asapp.com\n\n1MIT CSAIL\n\n2ASAPP Inc.\n\nAbstract\n\nThis paper focuses on style transfer on the basis of non-parallel text. This is an\ninstance of a broad family of problems including machine translation, decipherment,\nand sentiment modi\ufb01cation. The key challenge is to separate the content from\nother aspects such as style. We assume a shared latent content distribution across\ndifferent text corpora, and propose a method that leverages re\ufb01ned alignment of\nlatent representations to perform style transfer. 
The transferred sentences from\none style should match example sentences from the other style as a population.\nWe demonstrate the effectiveness of this cross-alignment method on three tasks:\nsentiment modi\ufb01cation, decipherment of word substitution ciphers, and recovery\nof word order.1\n\n1\n\nIntroduction\n\nUsing massive amounts of parallel data has been essential for recent advances in text generation tasks,\nsuch as machine translation and summarization. However, in many text generation problems, we can\nonly assume access to non-parallel or mono-lingual data. Problems such as decipherment or style\ntransfer are all instances of this family of tasks. In all of these problems, we must preserve the content\nof the source sentence but render the sentence consistent with desired presentation constraints (e.g.,\nstyle, plaintext/ciphertext).\nThe goal of controlling one aspect of a sentence such as style independently of its content requires\nthat we can disentangle the two. However, these aspects interact in subtle ways in natural language\nsentences, and we can succeed in this task only approximately even in the case of parallel data.\nOur task is more challenging here. We merely assume access to two corpora of sentences with the\nsame distribution of content albeit rendered in different styles. Our goal is to demonstrate that this\ndistributional equivalence of content, if exploited carefully, suf\ufb01ces for us to learn to map a sentence\nin one style to a style-independent content vector and then decode it to a sentence with the same\ncontent but a different style.\nIn this paper, we introduce a re\ufb01ned alignment of sentence representations across text corpora.\nWe learn an encoder that takes a sentence and its original style indicator as input, and maps it to\na style-independent content representation. This is then passed to a style-dependent decoder for\nrendering. 
We do not use typical VAEs for this mapping since it is imperative to keep the latent content\nrepresentation rich and unperturbed. Indeed, richer latent content representations are much harder to\nalign across the corpora and therefore they offer more informative content constraints. Moreover, we\nreap additional information from cross-generated (style-transferred) sentences, thereby getting two\ndistributional alignment constraints. For example, positive sentences that are style-transferred into\nnegative sentences should match, as a population, the given set of negative sentences. We illustrate\nthis cross-alignment in Figure 1.\n\n1Our code and data are available at https://github.com/shentianxiao/language-style-transfer.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: An overview of the proposed cross-alignment method. X1 and X2 are two sentence\ndomains with different styles y1 and y2, and Z is the shared latent content space. Encoder E maps a\nsentence to its content representation, and generator G generates the sentence back when combined\nwith the original style. When combined with a different style, transferred \u02dcX1 is aligned with X2 and\n\u02dcX2 is aligned with X1 at the distributional level.\n\nTo demonstrate the \ufb02exibility of the proposed model, we evaluate it on three tasks: sentiment\nmodi\ufb01cation, decipherment of word substitution ciphers, and recovery of word order. In all of these\napplications, the model is trained on non-parallel data. On the sentiment modi\ufb01cation task, the model\nsuccessfully transfers the sentiment while keeping the content for 41.5% of review sentences according\nto human evaluation, compared to 41.0% achieved by the control-gen model of Hu et al. 
(2017).\nIt achieves strong performance on the decipherment and word order recovery tasks, reaching Bleu\nscores of 57.4 and 26.1 respectively, outperforming a comparable method without cross-alignment\nby 50.2 and 20.9 points.\n\n2 Related work\n\nStyle transfer in vision Non-parallel style transfer has been extensively studied in computer\nvision (Gatys et al., 2016; Zhu et al., 2017; Liu and Tuzel, 2016; Liu et al., 2017; Taigman et al.,\n2016; Kim et al., 2017; Yi et al., 2017). Gatys et al. (2016) explicitly extract content and style\nfeatures, and then synthesize a new image by combining \u201ccontent\u201d features of one image with \u201cstyle\u201d\nfeatures from another. More recent approaches learn generative networks directly via generative\nadversarial training (Goodfellow et al., 2014) from two given data domains X1 and X2. The key\ncomputational challenge in this non-parallel setting is aligning the two domains. For example,\nCoupledGANs (Liu and Tuzel, 2016) employ weight-sharing between networks to learn cross-domain\nrepresentation, whereas CycleGAN (Zhu et al., 2017) introduces cycle consistency, which relies\non transitivity to regularize the transfer functions. While our approach has a similar high-level\narchitecture, the discreteness of natural language does not allow us to reuse these models and\nnecessitates the development of new methods.\n\nNon-parallel transfer in natural language\nIn natural language processing, most tasks that involve\ngeneration (e.g., translation and summarization) are trained using parallel sentences. Our work most\nclosely relates to approaches that do not utilize parallel data, but instead guide sentence generation\nfrom an indirect training signal (Mueller et al., 2017; Hu et al., 2017). For instance, Mueller et al.\n(2017) manipulate the hidden representation to generate sentences that satisfy a desired property (e.g.,\nsentiment) as measured by a corresponding classi\ufb01er. 
However, their model does not necessarily\nenforce content preservation. More similar to our work, Hu et al. (2017) aims at generating sentences\nwith controllable attributes by learning disentangled latent representations (Chen et al., 2016). Their\nmodel builds on variational auto-encoders (VAEs) and uses independence constraints to enforce\nthat attributes can be reliably inferred back from generated sentences. While our model builds\non distributional cross-alignment for the purpose of style transfer and content preservation, these\nconstraints can be added in the same way.\n\nAdversarial training over discrete samples Recently, a wide range of techniques has addressed\nchallenges associated with adversarial training over discrete samples generated by recurrent networks\n(Yu et al., 2016; Lamb et al., 2016; Hjelm et al., 2017; Che et al., 2017). In our work, we employ\nthe Professor-Forcing algorithm (Lamb et al., 2016), which was originally proposed to close the\ngap between teacher-forcing during training and self-feeding during testing for recurrent networks.\nThis design \ufb01ts well with our scenario of style transfer that calls for cross-alignment. By using\n\n2\n\n\fcontinuous relaxation to approximate the discrete sampling process (Jang et al., 2016; Maddison\net al., 2016), the training procedure can be effectively optimized through back-propagation (Kusner\nand Hern\u00e1ndez-Lobato, 2016; Goyal et al., 2017).\n\n3 Formulation\n\nIn this section, we formalize the task of non-parallel style transfer and discuss the feasibility of the\nlearning problem. We assume the data are generated by the following process:\n\n1. a latent style variable y is generated from some distribution p(y);\n2. a latent content variable z is generated from some distribution p(z);\n3. 
a datapoint x is generated from the conditional distribution p(x|y, z).\n\nWe observe two datasets with the same content distribution but different styles y1 and y2, where\ny1 and y2 are unknown. Speci\ufb01cally, the two observed datasets X1 = {x1(1), \u00b7\u00b7\u00b7 , x1(n)} and\nX2 = {x2(1), \u00b7\u00b7\u00b7 , x2(m)} consist of samples drawn from p(x1|y1) and p(x2|y2) respectively. We want to\nestimate the style transfer functions between them, namely p(x1|x2; y1, y2) and p(x2|x1; y1, y2).\nA question we must address is when this estimation problem is feasible. Essentially, we only observe\nthe marginal distributions of x1 and x2, yet we are going to recover their joint distribution:\n\np(x1, x2|y1, y2) = \u222b_z p(z) p(x1|y1, z) p(x2|y2, z) dz\n\n(1)\n\nAs we only observe p(x1|y1) and p(x2|y2), y1 and y2 are unknown to us. If two different y and\ny\u2032 lead to the same distribution p(x|y) = p(x|y\u2032), then given a dataset X sampled from it, its\nunderlying style can be either y or y\u2032. Consider the following two cases: (1) both datasets X1 and\nX2 are sampled from the same style y; (2) X1 and X2 are sampled from styles y and y\u2032 respectively.\nThese two scenarios have different joint distributions, but the observed marginal distributions are the\nsame. To prevent such confusion, we constrain the underlying distributions as stated in the following\nproposition:\nProposition 1. In the generative framework above, x1 and x2\u2019s joint distribution can be recovered\nfrom their marginals only if for any different y, y\u2032 \u2208 Y, the distributions p(x|y) and p(x|y\u2032) are\ndifferent.\n\nThis proposition basically says that X generated from different styles should be \u201cdistinct\u201d enough,\notherwise the transfer task between styles is not well de\ufb01ned. While this seems trivial, it may not\nhold even for simpli\ufb01ed data distributions. 
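A minimal numerical sketch of such a failure, under the assumption that z is isotropic Gaussian and a style acts as a pure rotation (anticipating Example 1 below); all values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Style = orthogonal transform A (a 2-D rotation); content z ~ N(0, I).
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# x = A z has covariance A A^T = I, identical to Cov(z): two different
# styles (identity vs. rotation) yield the same marginal p(x|y), so the
# style is unrecoverable from marginals alone.
z = rng.standard_normal((100_000, 2))
x = z @ A.T
print(np.round(np.cov(x.T), 2))     # approximately the identity matrix

# If z is instead a mixture with an anisotropic component covariance,
# rotating that component changes it, so the marginals now differ
# across styles (the mechanism behind Lemma 1 below).
Sigma1 = np.diag([1.0, 0.1])
rotated = A @ Sigma1 @ A.T
print(np.allclose(rotated, Sigma1))  # False: the rotation is detectable
```

The second check only inspects one mixture component; it illustrates, rather than proves, why a richer p(z) makes the affine style identifiable.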
The following examples illustrate how the transfer (and\nrecovery) becomes feasible or infeasible under different model assumptions. As we shall see, for a\ncertain family of styles Y, the more complex the distribution of z, the more likely it is that the\ntransfer function can be recovered, and the easier the search for it becomes.\n\n3.1 Example 1: Gaussian\n\nConsider the common choice that z \u223c N (0, I) has a centered isotropic Gaussian distribution.\nSuppose a style y = (A, b) is an af\ufb01ne transformation, i.e. x = Az + b + \u03f5, where \u03f5 is a noise\nvariable. For b = 0 and any orthogonal matrix A, Az + b \u223c N (0, I) and hence x has the same\ndistribution for any such style y = (A, 0). In this case, the effect of rotation cannot be recovered.\nInterestingly, if z has a more complex distribution, such as a Gaussian mixture, then af\ufb01ne\ntransformations can be uniquely determined.\n\nLemma 1. Let z be a mixture of Gaussians p(z) = \u2211_{k=1}^{K} \u03c0k N (z; \u00b5k, \u03a3k). Assume K \u2265 2, and\nthere are two different \u03a3i \u2260 \u03a3j. Let Y = {(A, b) : |A| \u2260 0} be all invertible af\ufb01ne transformations,\nand p(x|y, z) = N (x; Az + b, \u03f5\u00b2I), in which \u03f5 is a noise scale. Then for all y \u2260 y\u2032 \u2208 Y, p(x|y) and\np(x|y\u2032) are different distributions.\n\nTheorem 1. If the distribution of z is a mixture of Gaussians which has more than two different\ncomponents, and x1, x2 are two af\ufb01ne transformations of z, then the transfer between them can be\nrecovered given their respective marginals.\n\n3\n\n\f3.2 Example 2: Word substitution\n\nConsider here another example where z is a bi-gram language model and a style y is a vocabulary in\nuse that maps each \u201ccontent word\u201d onto its surface form (lexical form). 
If we observe two realizations\nx1 and x2 of the same language z, the transfer and recovery problem becomes inferring a word\nalignment between x1 and x2.\nNote that this is a simpli\ufb01ed version of language decipherment or translation. Nevertheless, the\nrecovery problem is still suf\ufb01ciently hard. To see this, let M1, M2 \u2208 Rn\u00d7n be the estimated\nbi-gram probability matrices of data X1 and X2 respectively. Seeking the word alignment is equivalent\nto \ufb01nding a permutation matrix P such that P\u22a4M1P \u2248 M2, which can be expressed as the\noptimization problem\n\nmin_P \u2016P\u22a4M1P \u2212 M2\u2016\u00b2\n\nThe same formulation applies to graph isomorphism (GI) problems given M1 and M2 as the\nadjacency matrices of two graphs, suggesting that determining the existence and uniqueness of P is\nat least GI hard. Fortunately, if M as a graph is complex enough, the search problem can be more\ntractable. For instance, if the multiset of each vertex\u2019s incident edge weights is unique, then \ufb01nding the\nisomorphism can be done by simply matching these sets of edges. This assumption largely applies to\nour scenario where z is a complex language model. We empirically demonstrate this in the results\nsection.\nThe above examples suggest that z, as the latent content variable, should carry most of the complexity\nof the data x, while y, as the latent style variable, should have relatively simple effects. We construct the\nmodel accordingly in the next section.\n\n4 Method\n\nLearning the style transfer function under our generative assumption is essentially learning the\nconditional distributions p(x1|x2; y1, y2) and p(x2|x1; y1, y2). Unlike in vision, where images are\ncontinuous and hence the transfer functions can be learned and optimized directly, the discreteness\nof language requires us to operate through the latent space. 
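The edge-profile matching idea from the word-substitution example above can be sketched on toy bigram matrices (the sizes, random values, and uniqueness of profiles are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "bigram matrix" M1 over an n-word vocabulary; M2 is the same
# language under an unknown word substitution: M2 = P^T M1 P.
n = 6
M1 = rng.random((n, n))
perm = rng.permutation(n)
P = np.eye(n)[:, perm]          # permutation matrix, P[i, j] = 1 iff i == perm[j]
M2 = P.T @ M1 @ P

# If every word's multiset of incident bigram weights is unique (the
# assumption in Section 3.2), the alignment is found by matching sorted
# profiles instead of searching all n! permutations.
def profile(M, i):
    # sorted weights of word i's outgoing and incoming bigrams
    return tuple(np.sort(np.concatenate([M[i], M[:, i]])))

lookup = {profile(M1, i): i for i in range(n)}
recovered = np.array([lookup[profile(M2, j)] for j in range(n)])

P_hat = np.zeros((n, n))
P_hat[recovered, np.arange(n)] = 1.0
print(np.allclose(P_hat.T @ M1 @ P_hat, M2))   # True: alignment recovered
```

With continuous random weights the uniqueness assumption holds almost surely; for real bigram counts, ties and estimation noise are exactly what makes the full problem GI-hard in the worst case.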
Since x1 and x2 are conditionally\nindependent given the latent content variable z,\n\np(x1|x2; y1, y2) = \u222b_z p(x1, z|x2; y1, y2) dz = \u222b_z p(z|x2, y2) \u00b7 p(x1|y1, z) dz = E_{z\u223cp(z|x2,y2)}[p(x1|y1, z)]\n\n(2)\n\nThis suggests learning an auto-encoder model. Speci\ufb01cally, a style transfer from x2 to x1 involves\ntwo steps\u2014an encoding step that infers x2\u2019s content z \u223c p(z|x2, y2), and a decoding step which\ngenerates the transferred counterpart from p(x1|y1, z). In this work, we approximate and train\np(z|x, y) and p(x|y, z) using neural networks (where y \u2208 {y1, y2}).\nLet E : X \u00d7 Y \u2192 Z be an encoder that infers the content z for a given sentence x and a style y, and\nG : Y \u00d7 Z \u2192 X be a generator that generates a sentence x from a given style y and content z. E\nand G form an auto-encoder when applied to the same style, and thus we have the reconstruction loss\n\nLrec(\u03b8E, \u03b8G) = Ex1\u223cX1[\u2212 log pG(x1|y1, E(x1, y1))] + Ex2\u223cX2[\u2212 log pG(x2|y2, E(x2, y2))]\n\n(3)\n\nwhere \u03b8 are the parameters to estimate.\nIn order to make a meaningful transfer by \ufb02ipping the style, X1 and X2\u2019s content space must\ncoincide, as our generative framework presumed. To constrain that x1 and x2 are generated from the\nsame latent content distribution p(z), one option is to apply a variational auto-encoder (Kingma and\nWelling, 2013). 
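The two-step transfer just described (encode with the source style, then decode with the target style) can be written schematically; E and G below are toy stand-ins for illustration, not the paper's trained networks:

```python
# Schematic of the encode-then-decode transfer step. E and G are
# placeholder callables standing in for trained encoder/generator.
def transfer(x_src, E, G, y_tgt, y_src):
    z = E(x_src, y_src)      # encoding step: infer the source sentence's content
    return G(y_tgt, z)       # decoding step: render the content in the target style

# Toy stand-ins: "content" is the lowercased text, "style" is the casing.
E = lambda x, y: x.lower()
G = lambda y, z: z.upper() if y == "shout" else z

print(transfer("HELLO THERE", E, G, y_tgt="calm", y_src="shout"))   # hello there
```

The point of the sketch is only the factorization: the same E output feeds both the reconstruction path (same style) and the transfer path (flipped style).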
A VAE imposes a prior density p(z), such as z \u223c N (0, I), and uses a KL-divergence\nregularizer to align both posteriors pE(z|x1, y1) and pE(z|x2, y2) to it,\n\nLKL(\u03b8E) = Ex1\u223cX1[DKL(pE(z|x1, y1) \u2016 p(z))] + Ex2\u223cX2[DKL(pE(z|x2, y2) \u2016 p(z))]\n\n(4)\n\n4\n\n\fThe overall objective is to minimize Lrec + LKL, whose negation is the variational lower bound of\nthe data likelihood.\nHowever, as we have argued in the previous section, restricting z to a simple and even distribution\nand pushing most complexity to the decoder may not be a good strategy for non-parallel style transfer.\nIn contrast, a standard auto-encoder simply minimizes the reconstruction error, encouraging z to\ncarry as much information about x as possible. On the other hand, it lowers the entropy in p(x|y, z),\nwhich helps to produce meaningful style transfer in practice as we \ufb02ip between y1 and y2. Without\nexplicitly modeling p(z), it is still possible to force distributional alignment of p(z|y1) and p(z|y2).\nTo this end, we introduce two constrained variants of the auto-encoder.\n\n4.1 Aligned auto-encoder\n\nDispensing with VAEs, which make an explicit assumption about p(z) and align both posteriors to it,\nwe align pE(z|y1) and pE(z|y2) with each other, which leads to the following constrained optimization\nproblem:\n\n\u03b8\u2217 = arg min_\u03b8 Lrec(\u03b8E, \u03b8G)   s.t.   E(x1, y1) =d E(x2, y2),   x1 \u223c X1, x2 \u223c X2\n\n(5)\n\nwhere =d denotes equality in distribution. In practice, a Lagrangian relaxation of the primal problem is instead optimized. We introduce\nan adversarial discriminator D to align the aggregated posterior distribution of z from different\nstyles (Makhzani et al., 2015). D aims to distinguish between these two distributions:\n\nLadv(\u03b8E, \u03b8D) = Ex1\u223cX1[\u2212 log D(E(x1, y1))] + Ex2\u223cX2[\u2212 log(1 \u2212 D(E(x2, y2)))]\n\n(6)\n\nThe overall training objective is a min-max game played among the encoder E, generator G and\ndiscriminator D. 
They constitute an aligned auto-encoder:\n\nmin_{E,G} max_D  Lrec \u2212 \u03bbLadv\n\n(7)\n\nWe implement the encoder E and generator G using single-layer RNNs with GRU cells. E takes\nan input sentence x with initial hidden state y, and outputs the last hidden state z as its content\nrepresentation. G generates a sentence x conditioned on latent state (y, z). To align the distributions\nof z1 = E(x1, y1) and z2 = E(x2, y2), the discriminator D is a feed-forward network with a single\nhidden layer and a sigmoid output layer.\n\n4.2 Cross-aligned auto-encoder\n\nThe second variant, the cross-aligned auto-encoder, directly aligns the transferred samples from one\nstyle with the true samples from the other. Under the generative assumption, p(x2|y2) =\n\u222b_{x1} p(x2|x1; y1, y2) p(x1|y1) dx1, thus x2 (sampled from the left-hand side) should exhibit the\nsame distribution as transferred x1 (sampled from the right-hand side), and vice versa. Similar to our\n\ufb01rst model, the second model uses two discriminators D1 and D2 to align the populations. D1\u2019s job\nis to distinguish between real x1 and transferred x2, and D2\u2019s job is to distinguish between real x2\nand transferred x1.\nAdversarial training over the discrete samples generated by G hinders gradient propagation. Although\nsampling-based gradient estimators such as REINFORCE (Williams, 1992) can be adopted, training\nwith these methods can be unstable due to the high variance of the sampled gradient. Instead, we\nemploy two recent techniques to approximate the discrete training (Hu et al., 2017; Lamb et al.,\n2016). First, instead of feeding a single sampled word as the input to the generator RNN, we use the\nsoftmax distribution over words. Speci\ufb01cally, during the generating process of transferred x2\nfrom G(y1, z2), suppose at time step t the output logit vector is vt. 
We feed its peaked distribution\nsoftmax(vt/\u03b3) as the next input, where \u03b3 \u2208 (0, 1) is a temperature parameter.\nSecondly, we use Professor-Forcing (Lamb et al., 2016) to match the sequence of hidden states\ninstead of the output words, which contains the information about outputs and is smoothly distributed.\nThat is, the input to the discriminator D1 is the sequence of hidden states of either (1) G(y1, z1)\nteacher-forced by a real example x1, or (2) G(y1, z2) self-fed by previous soft distributions.\n\n5\n\n\fFigure 2: Cross-aligning between x1 and transferred x2. For x1, G is teacher-forced by its words\nw1w2 \u00b7\u00b7\u00b7 wt. For transferred x2, G is self-fed by previous output logits. The sequences of hidden\nstates h0, \u00b7\u00b7\u00b7 , ht and \u02dch0, \u00b7\u00b7\u00b7 , \u02dcht are passed to discriminator D1 to be aligned. Note that our \ufb01rst\nvariant, the aligned auto-encoder, is a special case of this, where only h0 and \u02dch0, i.e. z1 and z2, are\naligned.\n\nAlgorithm 1 Cross-aligned auto-encoder training. The hyper-parameters are set as \u03bb = 1, \u03b3 = 0.001,\nand the learning rate is 0.0001 for all experiments in this paper.\nInput: Two corpora of different styles X1, X2. Lagrange multiplier \u03bb, temperature \u03b3.\n\nInitialize \u03b8E, \u03b8G, \u03b8D1, \u03b8D2\nrepeat\n\nfor p = 1, 2; q = 2, 1 do\n\nSample a mini-batch of k examples {x(i)_p}_{i=1}^{k} from Xp\nGet the latent content representations z(i)_p = E(x(i)_p, yp)\nUnroll G from initial state (yp, z(i)_p) by feeding x(i)_p, and get the hidden state sequence h(i)_p\nUnroll G from initial state (yq, z(i)_p) by feeding the previous soft output distribution with\ntemperature \u03b3, and get the transferred hidden state sequence \u02dch(i)_p\n\nend for\nCompute the reconstruction loss Lrec by Eq. 
(3)\nCompute D1\u2019s (and symmetrically D2\u2019s) loss:\n\nLadv1 = \u2212(1/k) \u2211_{i=1}^{k} log D1(h(i)_1) \u2212 (1/k) \u2211_{i=1}^{k} log(1 \u2212 D1(\u02dch(i)_2))\n\n(8)\n\nUpdate {\u03b8E, \u03b8G} by gradient descent on the loss\n\nLrec \u2212 \u03bb(Ladv1 + Ladv2)\n\n(9)\n\nUpdate \u03b8D1 and \u03b8D2 by gradient descent on the losses Ladv1 and Ladv2 respectively\n\nuntil convergence\n\nOutput: Style transfer functions G(y2, E(\u00b7, y1)) : X1 \u2192 X2 and G(y1, E(\u00b7, y2)) : X2 \u2192 X1\n\nThe running procedure of our cross-aligned auto-encoder is illustrated in Figure 2. Note that\ncross-aligning strengthens the alignment of the latent variable z over the recurrent network of generator G.\nBy aligning the whole sequence of hidden states, it prevents z1 and z2\u2019s initial misalignment from\npropagating through the recurrent generating process, as a result of which the transferred sentence\nmay end up somewhere far from the target domain.\nWe implement both D1 and D2 using convolutional neural networks for sequence classi\ufb01cation (Kim,\n2014). The training algorithm is presented in Algorithm 1.\n\n6\n\n\f5 Experimental setup\n\nSentiment modi\ufb01cation Our \ufb01rst experiment focuses on text rewriting with the goal of changing\nthe underlying sentiment, which can be regarded as \u201cstyle transfer\u201d between negative and positive\nsentences. We run experiments on Yelp restaurant reviews, utilizing readily available user ratings\nassociated with each review. Following standard practice, reviews with a rating above three are\nconsidered positive, and those below three are considered negative. While our model operates at the\nsentence level, the sentiment annotations in our dataset are provided at the document level. We assume\nthat all the sentences in a document have the same sentiment. This is clearly an oversimpli\ufb01cation,\nsince some sentences (e.g., background) are sentiment neutral. 
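Stepping back to Algorithm 1 for a moment, the peaked softmax input and the discriminator loss of Eq. (8) can be sketched numerically; the random linear scorer and all shapes below are illustrative stand-ins, not the CNN discriminator the paper uses:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(v, gamma=1.0):
    e = np.exp((v - v.max()) / gamma)
    return e / e.sum()

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# A low temperature peaks the soft input distribution toward one-hot,
# which is what the generator self-feeds in place of a sampled word.
v = np.array([2.0, 1.0, 0.5])
print(softmax(v, gamma=0.001).round(3))    # -> [1. 0. 0.]

# Eq. (8): D1 should score teacher-forced hidden states h1 as "real" and
# self-fed transferred states h2_tilde as "fake". Batch size k, length T,
# and hidden size d are made up; D1 is a fixed random linear scorer.
k, T, d = 4, 5, 8
h1 = rng.standard_normal((k, T, d))
h2_tilde = rng.standard_normal((k, T, d))
w = rng.standard_normal(T * d)
D1 = lambda h: sigmoid(h.reshape(len(h), -1) @ w)

L_adv1 = -np.mean(np.log(D1(h1))) - np.mean(np.log(1 - D1(h2_tilde)))
print(L_adv1 > 0)   # True: both terms of the cross-entropy are positive
```

In training, the encoder and generator descend on Lrec − λ(Ladv1 + Ladv2) while each discriminator descends on its own Ladv, giving the min-max game of Eq. (7).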
Given that such sentences are more\ncommon in long reviews, we \ufb01lter out reviews that exceed 10 sentences. We further \ufb01lter the\nremaining sentences by eliminating those that exceed 15 words. The resulting dataset has 250K\nnegative sentences and 350K positive ones. The vocabulary size is 10K after replacing words\noccurring less than 5 times with the \u201c\u201d token. As a baseline model, we compare against the\ncontrol-gen model of Hu et al. (2017).\nTo quantitatively evaluate the transferred sentences, we adopt a model-based evaluation metric similar\nto the one used for image transfer (Isola et al., 2016). Speci\ufb01cally, we measure how often a transferred\nsentence has the correct sentiment according to a pre-trained sentiment classi\ufb01er. For this purpose,\nwe use the TextCNN model as described in Kim (2014). On our simpli\ufb01ed dataset for style transfer,\nit achieves a nearly perfect accuracy of 97.4%.\nWhile the quantitative evaluation provides some indication of transfer quality, it does not capture\nall the aspects of this generation task. Therefore, we also perform two human evaluations on 500\nsentences randomly selected from the test set2. In the \ufb01rst evaluation, the judges were asked to rate\ngenerated sentences in terms of their \ufb02uency and sentiment. Fluency was rated from 1 (unreadable)\nto 4 (perfect), while sentiment categories were \u201cpositive\u201d, \u201cnegative\u201d, or \u201cneither\u201d (which could be\ncontradictory, neutral or nonsensical). In the second evaluation, we evaluate the transfer process\ncomparatively. The annotator was shown a source sentence and the corresponding outputs of the\nsystems in a random order, and was asked \u201cWhich transferred sentence is semantically equivalent\nto the source sentence with an opposite sentiment?\u201d. The options were: both are satisfactory, A (or B)\nis better, or both are unsatisfactory. We collect two labels for each question. 
The label agreement and con\ufb02ict\nresolution strategy can be found in the supplementary material. Note that the two evaluations are not\nredundant. For instance, a system that always generates the same grammatically correct sentence\nwith the right sentiment independently of the source sentence will score high in the \ufb01rst evaluation\nsetup, but low in the second one.\n\nWord substitution decipherment Our second set of experiments involves decipherment of word\nsubstitution ciphers, which has been previously explored in the NLP literature (Dou and Knight, 2012;\nNuhn and Ney, 2013). These ciphers replace every word in the plaintext (natural language) with a cipher\ntoken according to a 1-to-1 substitution key. The decipherment task is to recover the plaintext from\nthe ciphertext. It is trivial if we have access to parallel data. However, we are interested in a\nnon-parallel decipherment scenario. For training, we select 200K sentences as X1, and apply a\nsubstitution cipher f on a different set of 200K sentences to get X2. While these sentences are\nnon-parallel, they are drawn from the same distribution, derived from the review dataset. The development and test\nsets have 100K parallel sentences D1 = {x(1), \u00b7\u00b7\u00b7 , x(n)} and D2 = {f (x(1)), \u00b7\u00b7\u00b7 , f (x(n))}. We\ncan quantitatively compare D1 and the transferred (deciphered) D2 using the Bleu score (Papineni\net al., 2002).\nClearly, the dif\ufb01culty of this decipherment task depends on the number of substituted words.\nTherefore, we report model performance with respect to the percentage of the substituted vocabulary. Note\nthat the transfer models do not know that f is a word substitution function. They learn it entirely\nfrom the data distribution.\nIn addition to having different transfer models, we introduce a simple decipherment baseline based\non word frequency. Speci\ufb01cally, we assume that words shared between X1 and X2 do not require\ntranslation. 
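A minimal version of this frequency-based baseline (toy corpora; shared words are copied, and the remaining words on each side are paired off by frequency rank, with ties broken by first occurrence):

```python
from collections import Counter

# Toy corpora: X2 is X1 under an unknown word substitution, except that
# "on" was left unchanged (a shared word). All data here are made up.
X1 = "the cat sat on the mat the cat".split()
X2 = "zz qq ww on zz rr zz qq".split()

c1, c2 = Counter(X1), Counter(X2)
shared = set(c1) & set(c2)
table = {w: w for w in shared}            # shared words need no translation

# Pair the remaining words by descending corpus frequency.
only2 = sorted((w for w in c2 if w not in shared), key=lambda w: -c2[w])
only1 = sorted((w for w in c1 if w not in shared), key=lambda w: -c1[w])
table.update(zip(only2, only1))

print([table[w] for w in X2])   # recovers the plaintext X1 exactly
```

On this toy example the frequency profile is unambiguous; on real data, many words share a frequency and the arbitrary tie-breaking is what limits the baseline at high substitution rates.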
The rest of the words are mapped based on their frequency, and ties are broken arbitrarily.\nFinally, to assess the dif\ufb01culty of the task, we report the accuracy of a machine translation system\ntrained on a parallel corpus (Klein et al., 2017).\n\n2We eliminated 37 sentences that were judged as neutral by the human judges.\n\n7\n\n\fMethod                        accuracy\nHu et al. (2017)              83.5\nVariational auto-encoder      23.2\nAligned auto-encoder          48.3\nCross-aligned auto-encoder    78.4\n\nTable 1: Sentiment accuracy of transferred sentences, as measured by a pretrained classi\ufb01er.\n\nMethod            sentiment   \ufb02uency   overall transfer\nHu et al. (2017)  70.8        3.2      41.0\nCross-align       62.6        2.8      41.5\n\nTable 2: Human evaluations on sentiment, \ufb02uency and overall transfer quality. Fluency rating is from\n1 (unreadable) to 4 (perfect). Overall transfer quality is evaluated in a comparative manner, where the\njudge is shown a source sentence and two transferred sentences, and decides whether they are both\ngood, both bad, or one is better.\n\nWord order recovery Our \ufb01nal experiments focus on the word ordering task, also known as bag\ntranslation (Brown et al., 1990; Schmaltz et al., 2016). By learning the style transfer functions\nbetween original English sentences X1 and shuf\ufb02ed English sentences X2, the model can be used to\nrecover the original word order of a shuf\ufb02ed sentence (or conversely to randomly permute a sentence).\nThe process to construct non-parallel training data and parallel testing data is the same as in the word\nsubstitution decipherment experiment. Again, the transfer models do not know that f is a shuf\ufb02e\nfunction and learn it entirely from the data.\n\n6 Results\n\nSentiment modi\ufb01cation Table 1 and Table 2 show the performance of various models for both\nhuman and automatic evaluation. The control-gen model of Hu et al. (2017) performs better in terms\nof sentiment accuracy in both evaluations. 
This is not surprising because their generation is directly\nguided by a sentiment classi\ufb01er. Their system also achieves a higher \ufb02uency score. However, these\ngains do not translate into improvements in terms of the overall transfer, where our model fared\nbetter. As can be seen from the examples listed in Table 3, our model is more consistent with the\ngrammatical structure and semantic meaning of the source sentence. In contrast, their model achieves\nsentiment change by generating an entirely new sentence which has little overlap with the original.\nThe discrepancy between the two experiments demonstrates the crucial importance of developing\nappropriate evaluation measures for comparing methods for style transfer.\n\nWord substitution decipherment Table 4 summarizes the performance of our model and the\nbaselines on the decipherment task, at various levels of word substitution. Consistent with our\nintuition, the last row in this table shows that the task is trivial when the parallel data is provided.\nIn the non-parallel case, the dif\ufb01culty of the task is driven by the substitution rate. Across all the\ntesting conditions, our cross-aligned model consistently outperforms its counterparts. The difference\nbecomes more pronounced as the task becomes harder. When the substitution rate is 20%, all\nmethods do a reasonably good job in recovering substitutions. However, when 100% of the words\nare substituted (as expected in real language decipherment), the poor performance of the variational\nauto-encoder and aligned auto-encoder rules out their application for this task.\n\nWord order recovery The last column in Table 4 demonstrates the performance on the word\norder recovery task. Order recovery is much harder\u2014even when trained with parallel data, the\nmachine translation model achieves only 64.6 Bleu score. 
Note that some generated orderings may be\ncompletely valid (e.g., reordering conjunctions), but the models will be penalized for producing them.\nIn this task, only the cross-aligned auto-encoder achieves grammatical reordering to a certain extent,\nas demonstrated by its Bleu score of 26.1. Other models fail this task, doing no better than no transfer.\n\n8\n\n\fFrom negative to positive\n\nconsistently slow .\nconsistently good .\nconsistently fast .\nmy goodness it was so gross .\nmy husband \u2019s steak was phenomenal .\nmy goodness was so awesome .\nit was super dry and had a weird taste to the entire slice .\nit was a great meal and the tacos were very kind of good .\nit was super \ufb02avorful and had a nice texture of the whole side .\n\nFrom positive to negative\n\ni love the ladies here !\ni avoid all the time !\ni hate the doctor here !\nmy appetizer was also very good and unique .\nmy bf was n\u2019t too pleased with the beans .\nmy appetizer was also very cold and not fresh whatsoever .\ncame here with my wife and her grandmother !\ncame here with my wife and hated her !\ncame here with my wife and her son .\n\nTable 3: Sentiment transfer samples. The \ufb01rst line is an input sentence, the second and third lines are\nthe generated sentences after sentiment transfer by Hu et al. (2017) and our cross-aligned auto-encoder,\nrespectively.\n\nMethod                        Substitution decipherment                     Order recovery\n                              20%    40%    60%    80%    100%\nNo transfer (copy)            56.4   21.4   6.3    4.5    0                 5.1\nUnigram matching              74.3   48.1   17.8   10.7   1.2               -\nVariational auto-encoder      79.8   59.6   44.6   34.4   0.9               5.3\nAligned auto-encoder          81.0   68.9   50.7   45.6   7.2               5.2\nCross-aligned auto-encoder    83.8   79.1   74.7   66.1   57.4              26.1\nParallel translation          99.0   98.9   98.2   98.5   97.2              64.6\n\nTable 4: Bleu scores of word substitution decipherment and word order recovery.\n\n7 Conclusion\n\nSystems that transfer language from one style to another have previously been trained using parallel data. 
In this work, we formulate the task as a decipherment problem with access only to non-parallel data. The two data collections are assumed to be generated by a latent variable generative model. Through this view, our method optimizes neural networks by forcing distributional alignment (invariance) over the latent space or sentence populations. We demonstrate the effectiveness of our method on tasks that permit quantitative evaluation, such as sentiment transfer, word substitution decipherment, and word ordering. The decipherment view also raises an interesting open question: when can the joint distribution p(x1, x2) be recovered given only the marginal distributions? We believe addressing this general question would advance style transfer research in both vision and NLP.

Acknowledgments

We thank Nicholas Matthews for helping to facilitate human evaluations, and Zhiting Hu for sharing his code. We also thank Jonas Mueller, Arjun Majumdar, Olga Simek, Danelle Shah, the MIT NLP group, and the reviewers for their helpful comments. This work was supported by MIT Lincoln Laboratory.

References

Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85, 1990.

Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.

Qing Dou and Kevin Knight. Large scale decipherment for out-of-domain machine translation.
In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 266-275. Association for Computational Linguistics, 2012.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414-2423, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

Kartik Goyal, Chris Dyer, and Taylor Berg-Kirkpatrick. Differentiable scheduled sampling for credit assignment. arXiv preprint arXiv:1704.06970, 2017.

R. Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and Yoshua Bengio. Boundary-seeking generative adversarial networks. arXiv preprint arXiv:1702.08431, 2017.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.

Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush.
OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810, 2017.

Matt J. Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-Softmax distribution. arXiv preprint arXiv:1611.04051, 2016.

Alex M. Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601-4609, 2016.

Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469-477, 2016.

Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Jonas Mueller, Tommi Jaakkola, and David Gifford. Sequence to better sequence: Continuous revision of combinatorial structures. In International Conference on Machine Learning (ICML), 2017.

Malte Nuhn and Hermann Ney. Decipherment complexity in 1:1 substitution ciphers. In ACL (1), pages 615-621, 2013.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics, 2002.

Allen Schmaltz, Alexander M. Rush, and Stuart Shieber. Word ordering without syntax. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2319-2324.
Association for Computational Linguistics, 2016.

Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473, 2016.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

A Proof of Lemma 1

Lemma 1. Let $z$ be a mixture of Gaussians, $p(z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(z; \mu_k, \Sigma_k)$. Assume $K \ge 2$ and that there are two different components with $\Sigma_i \neq \Sigma_j$. Let $\mathcal{Y} = \{(A, b) : |A| \neq 0\}$ be the set of all invertible affine transformations, and let $p(x|y, z) = \mathcal{N}(x; Az + b, \epsilon^2 I)$, in which $\epsilon$ is a noise level. Then for all $y \neq y' \in \mathcal{Y}$, $p(x|y)$ and $p(x|y')$ are different distributions.

Proof. For $y = (A, b)$,
$$p(x \mid y = (A, b)) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x;\, A\mu_k + b,\, A\Sigma_k A^\top + \epsilon^2 I)$$
For different $y = (A, b)$ and $y' = (A', b')$, $p(x|y) = p(x|y')$ entails that for $k = 1, \dots, K$,
$$A\mu_k + b = A'\mu_k + b', \qquad A\Sigma_k A^\top = A'\Sigma_k A'^\top$$
Since all transformations in $\mathcal{Y}$ are invertible, the second condition gives
$$(A^{-1}A')\,\Sigma_k\,(A^{-1}A')^\top = \Sigma_k$$
Suppose $\Sigma_k = Q_k D_k Q_k^\top$ is the orthogonal diagonalization of $\Sigma_k$. If $K = 1$, all solutions for $A^{-1}A'$ have the form
$$\left\{\, Q D^{1/2} U D^{-1/2} Q^\top \;\middle|\; U \text{ is orthogonal} \,\right\}$$
However, when $K \ge 2$ and there are two different $\Sigma_i \neq \Sigma_j$, the only solution common to both constraint sets is $A^{-1}A' = I$, i.e. $A = A'$, and consequently $b = b'$. Therefore, for all $y \neq y'$, $p(x|y) \neq p(x|y')$. $\square$
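The key step of the proof can also be checked numerically: a matrix of the form $D^{1/2} U D^{-1/2}$ preserves a single covariance, yet generically fails to preserve a second, different covariance, which is what pins down $A$ (and hence $b$) when $K \ge 2$. Below is a minimal sketch of this check (not part of the paper) using NumPy; the specific covariances and rotation angle are arbitrary illustrative choices.

```python
import numpy as np

# Two distinct diagonal covariances (taking Q = I for simplicity),
# matching the lemma's assumption Sigma_i != Sigma_j.
D1 = np.diag([4.0, 1.0])   # Sigma_1
D2 = np.diag([1.0, 9.0])   # Sigma_2

# An orthogonal rotation U, and the candidate solution
# M = D1^{1/2} U D1^{-1/2} from the K = 1 case of the proof.
theta = np.pi / 4
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
M = np.diag([2.0, 1.0]) @ U @ np.diag([0.5, 1.0])

# M is not the identity, yet it preserves Sigma_1 exactly ...
assert not np.allclose(M, np.eye(2))
assert np.allclose(M @ D1 @ M.T, D1)

# ... but it does NOT preserve the second, different covariance,
# illustrating why two distinct components force A = A'.
assert not np.allclose(M @ D2 @ M.T, D2)
```

Running the sketch confirms that the family of covariance-preserving maps collapses once a second, distinct covariance constraint is imposed, which is the heart of the identifiability argument.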