{"title": "ALICE: Towards Understanding Adversarial Learning for Joint Distribution Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 5495, "page_last": 5503, "abstract": "We investigate the non-identifiability issues associated with bidirectional adversarial training for joint distribution matching. Within a framework of conditional entropy, we propose both adversarial and non-adversarial approaches to learn desirable matched joint distributions for unsupervised and supervised tasks. We unify a broad family of adversarial models as joint distribution matching problems. Our approach stabilizes learning of unsupervised bidirectional adversarial learning methods. Further, we introduce an extension for semi-supervised learning tasks. Theoretical results are validated in synthetic data and real-world applications.", "full_text": "ALICE: Towards Understanding Adversarial\n\nLearning for Joint Distribution Matching\n\nChunyuan Li1, Hao Liu2, Changyou Chen3, Yunchen Pu1, Liqun Chen1,\n\ncl319@duke.edu\n\nRicardo Henao1 and Lawrence Carin1\n\n1Duke University 2Nanjing University 3University at Buffalo\n\nAbstract\n\nWe investigate the non-identi\ufb01ability issues associated with bidirectional adver-\nsarial training for joint distribution matching. Within a framework of conditional\nentropy, we propose both adversarial and non-adversarial approaches to learn\ndesirable matched joint distributions for unsupervised and supervised tasks. We\nunify a broad family of adversarial models as joint distribution matching problems.\nOur approach stabilizes learning of unsupervised bidirectional adversarial learning\nmethods. 
Further, we introduce an extension for semi-supervised learning tasks.\nTheoretical results are validated in synthetic data and real-world applications.\n\nIntroduction\n\n1\nDeep directed generative models are a powerful framework for modeling complex data distributions.\nGenerative Adversarial Networks (GANs) [1] can implicitly learn the data generating distribution;\nmore speci\ufb01cally, GAN can learn to sample from it. In order to do this, GAN trains a generator to\nmimic real samples, by learning a mapping from a latent space (where the samples are easily drawn)\nto the data space. Concurrently, a discriminator is trained to distinguish between generated and real\nsamples. The key idea behind GAN is that if the discriminator \ufb01nds it dif\ufb01cult to distinguish real from\narti\ufb01cial samples, then the generator is likely to be a good approximation to the true data distribution.\nIn its standard form, GAN only yields a one-way mapping, i.e., it lacks an inverse mapping mechanism\n(from data to latent space), preventing GAN from being able to do inference. The ability to compute\na posterior distribution of the latent variable conditioned on a given observation may be important\nfor data interpretation and for downstream applications (e.g., classi\ufb01cation from the latent variable)\n[2, 3, 4, 5, 6, 7]. Efforts have been made to simultaneously learn an ef\ufb01cient bidirectional model\nthat can produce high-quality samples for both the latent and data spaces [3, 4, 8, 9, 10, 11]. Among\nthem, the recently proposed Adversarially Learned Inference (ALI) [4, 10] casts the learning of such\na bidirectional model in a GAN-like adversarial framework. 
Speci\ufb01cally, a discriminator is trained to\ndistinguish between two joint distributions: that of the real data sample and its inferred latent code,\nand that of the real latent code and its generated data sample.\nWhile ALI is an inspiring and elegant approach, it tends to produce reconstructions that are not\nnecessarily faithful reproductions of the inputs [4]. This is because ALI only seeks to match two\njoint distributions, but the dependency structure (correlation) between the two random variables\n(conditionals) within each joint is not speci\ufb01ed or constrained. In practice, this results in solutions\nthat satisfy ALI\u2019s objective and that are able to produce real-looking samples, but have dif\ufb01culties\nreconstructing observed data [4]. ALI also has dif\ufb01culty discovering the correct pairing relationship\nin domain transformation tasks [12, 13, 14].\nIn this paper, (i) we \ufb01rst describe the non-identi\ufb01ability issue of ALI. To solve this problem, we\npropose to regularize ALI using the framework of Conditional Entropy (CE), hence we call the\nproposed approach ALICE. (ii) Adversarial learning schemes are proposed to estimate the conditional\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fentropy, for both unsupervised and supervised learning paradigms. (iii) We provide a uni\ufb01ed view\nfor a family of recently proposed GAN models from the perspective of joint distribution matching,\nincluding ALI [4, 10], CycleGAN [12, 13, 14] and Conditional GAN [15]. (iv) Extensive experiments\non synthetic and real data demonstrate that ALICE is signi\ufb01cantly more stable to train than ALI, in\nthat it consistently yields more viable solutions (good generation and good reconstruction), without\nbeing too sensitive to perturbations of the model architecture, i.e., hyperparameters. We also show\nthat ALICE results in more faithful image reconstructions. 
(v) Further, our framework can leverage paired data (when available) for semi-supervised tasks. This is empirically demonstrated on the discovery of relationships for cross-domain tasks based on image data.\n\n2 Background\n\nConsider two general marginal distributions q(x) and p(z) over x ∈ X and z ∈ Z. One domain can be inferred based on the other using the conditional distributions, q(z|x) and p(x|z). Further, the combined structure of both domains is characterized by the joint distributions q(x, z) = q(x)q(z|x) and p(x, z) = p(z)p(x|z).\n\nTo generate samples from these random variables, adversarial methods [1] provide a sampling mechanism that only requires gradient backpropagation, without the need to specify the conditional densities. Specifically, instead of sampling directly from the desired conditional distribution, the random variable is generated as a deterministic transformation of two inputs: the variable in the source domain, and an independent noise source, e.g., a Gaussian distribution. Without loss of generality, we use a universal distribution approximator specification [9], i.e., the sampling procedure for the conditionals x̃ ∼ p_θ(x|z) and z̃ ∼ q_φ(z|x) is carried out through the following two generating processes:\n\nx̃ = g_θ(z, ε), z ∼ p(z), ε ∼ N(0, I),  and  z̃ = g_φ(x, ζ), x ∼ q(x), ζ ∼ N(0, I),  (1)\n\nwhere g_θ(·) and g_φ(·) are two generators, specified as neural networks with parameters θ and φ, respectively. In practice, the inputs of g_θ(·) and g_φ(·) are simple concatenations, [z ε] and [x ζ], respectively. Note that (1) implies that p_θ(x|z) and q_φ(z|x) are parameterized by θ and φ, respectively, hence the subscripts.\n\nThe goal of GAN [1] is to match the marginal p_θ(x) = ∫ p_θ(x|z) p(z) dz to q(x). 
Note that q(x) denotes the true distribution of the data (from which we have samples) and p(z) is specified as a simple parametric distribution, e.g., an isotropic Gaussian. In order to do the matching, GAN trains an ω-parameterized adversarial discriminator network, f_ω(x), to distinguish between samples from p_θ(x) and q(x). Formally, the minimax objective of GAN is given by\n\nmin_θ max_ω L_GAN(θ, ω) = E_{x∼q(x)}[log σ(f_ω(x))] + E_{z∼p(z), x̃∼p_θ(x|z)}[log(1 − σ(f_ω(x̃)))],  (2)\n\nwhere σ(·) is the sigmoid function. The following lemma characterizes the solutions of (2) in terms of the marginals p_θ(x) and q(x).\n\nLemma 1 ([1]) The optimal decoder and discriminator, parameterized by {θ*, ω*}, correspond to a saddle point of the objective in (2), if and only if p_{θ*}(x) = q(x).\n\nAlternatively, ALI [4] matches the joint distributions p_θ(x, z) = p_θ(x|z)p(z) and q_φ(x, z) = q(x)q_φ(z|x), using an adversarial discriminator network similar to (2), f_ω(x, z), parameterized by ω. The minimax objective of ALI can then be written as\n\nmin_{θ,φ} max_ω L_ALI(θ, φ, ω) = E_{x∼q(x), z̃∼q_φ(z|x)}[log σ(f_ω(x, z̃))] + E_{z∼p(z), x̃∼p_θ(x|z)}[log(1 − σ(f_ω(x̃, z)))].  (3)\n\nLemma 2 ([4]) The optimum of the two generators and the discriminator, with parameters {θ*, φ*, ω*}, forms a saddle point of the objective in (3), if and only if p_{θ*}(x, z) = q_{φ*}(x, z).\n\nFrom Lemma 2, if a solution of (3) is achieved, it is guaranteed that all marginal and conditional distributions of the pair {x, z} match. 
Note that this implies that q_φ(z|x) and p_θ(z|x) match; however, (3) imposes no restrictions on these two conditionals. This is key for the identifiability issues of ALI described below.\n\n3 Adversarial Learning with Information Measures\n\nThe relationship (mapping) between the random variables x and z is not specified or constrained by ALI. As a result, it is possible that the matched distribution π(x, z) ≜ p_{θ*}(x, z) = q_{φ*}(x, z) is undesirable for a given application.\n\nTo illustrate this issue, Figure 1 shows all solutions (saddle points) to the ALI objective on a simple toy problem. The data and latent random variables can take two possible values, X = {x1, x2} and Z = {z1, z2}, respectively. In this case, their marginals q(x) and p(z) are known, i.e., q(x = x1) = 0.5 and p(z = z1) = 0.5. The matched joint distribution, π(x, z), can be represented as a 2 × 2 contingency table. Figure 1(a) represents all possible solutions of the ALI objective in (3), for any δ ∈ [0, 1]. Figures 1(b) and 1(c) represent opposite extreme solutions, when δ = 1 and δ = 0, respectively.\n\nFigure 1: Illustration of possible solutions to the ALI objective. The first row shows the mappings between the two domains; the second row shows the matched joint distribution, π(x, z), as contingency tables parameterized by δ ∈ [0, 1].\n\nNote that although we can generate “realistic” values of x from any sample of p(z), for 0 < δ < 1 we will have poor reconstruction ability, since the sequence x ∼ q(x), z̃ ∼ q_φ(z|x), x̃ ∼ p_θ(x|z̃) can easily result in x̃ ≠ x. The two (trivial) exceptions where the model can achieve perfect reconstruction correspond to δ ∈ {1, 0}, and are illustrated in Figures 1(b) and 1(c), respectively. 
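The δ-parameterized toy solutions above can be checked numerically. The sketch below is illustrative code (not part of the paper): it builds the 2 × 2 contingency table of Figure 1(a), the success probability of the reconstruction cycle x → z → x̂, and the conditional entropy H(x|z) that is later used as the regularizer.

```python
import numpy as np

def joint(delta):
    # 2x2 contingency table pi(x, z) from Figure 1(a); rows index {x1, x2},
    # columns index {z1, z2}. Both marginals are 0.5 for any delta in [0, 1].
    return np.array([[delta / 2.0, (1.0 - delta) / 2.0],
                     [(1.0 - delta) / 2.0, delta / 2.0]])

def reconstruction_prob(delta):
    # P(xhat = x) for the cycle x ~ q(x), z ~ q(z|x), xhat ~ p(x|z).
    pi = joint(delta)
    q_z_given_x = pi / pi.sum(axis=1, keepdims=True)   # q(z|x), rows sum to 1
    p_x_given_z = pi / pi.sum(axis=0, keepdims=True)   # p(x|z), columns sum to 1
    per_x = (q_z_given_x * p_x_given_z).sum(axis=1)    # P(xhat = x | x)
    return float((0.5 * per_x).sum())                  # average over q(x)

def cond_entropy(delta):
    # H(x|z) in nats under pi; zero iff the mapping z -> x is deterministic.
    pi = joint(delta)
    p_x_given_z = pi / pi.sum(axis=0, keepdims=True)
    mask = pi > 0
    return float(-(pi[mask] * np.log(p_x_given_z[mask])).sum())
```

For δ ∈ {0, 1} the cycle reconstructs perfectly and H(x|z) = 0, while at δ = 0.5 the cycle succeeds only half the time with H(x|z) = log 2, matching the failure mode described above.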
From this simple\nexample, we see that due to the \ufb02exibility of the joint distribution, \u03c0(x, z), it is quite likely to obtain\nan undesirable solution to the ALI objective. For instance, i) one with poor reconstruction ability or\nii) one where a single instance of z can potentially map to any possible value in X , e.g., in Figure 1(a)\nwith \u03b4 = 0.5, z1 can generate either x1 or x2 with equal probability.\nMany applications require meaningful mappings. Consider two scenarios:\n\u2022 A1: In unsupervised learning, one desirable property is cycle-consistency [12], meaning that the\ninferred z of a corresponding x, can reconstruct x itself with high probability. In Figure 1 this\ncorresponds to either \u03b4 \u2192 1 or \u03b4 \u2192 0, as in Figures 1(b) and 1(c).\n\u2022 A2: In supervised learning, the pre-speci\ufb01ed correspondence between samples imposes restrictions\non the mapping between x and z, e.g., in image tagging, x are images and z are tags. In this case,\npaired samples from the desired joint distribution are usually available, thus we can leverage this\nsupervised information to resolve the ambiguity between Figure 1(b) and (c).\n\nFrom our simple example in Figure 1, we see that in order to alleviate the identi\ufb01ability issues\nassociated with the solutions to the ALI objective, we have to impose constraints on the conditionals\nq\u03c6(z|x) and p\u03b8(z|x). Furthermore, to fully mitigate the identi\ufb01ability issues we require supervision,\ni.e., paired samples from domains X and Z.\nTo deal with the problem of undesirable but matched joint distributions, below we propose to use\nan information-theoretic measure to regularize ALI. 
This is done by controlling the “uncertainty” between pairs of random variables, i.e., x and z, using conditional entropies.\n\n3.1 Conditional Entropy\n\nConditional Entropy (CE) is an information-theoretic measure that quantifies the uncertainty of the random variable x when conditioned on z (or the other way around), under the joint distribution π(x, z):\n\nH^π(x|z) ≜ −E_{π(x,z)}[log π(x|z)],  and  H^π(z|x) ≜ −E_{π(x,z)}[log π(z|x)].  (4)\n\nThe uncertainty of x given z is linked with H^π(x|z); in fact, H^π(x|z) = 0 if and only if x is a deterministic mapping of z. Intuitively, by controlling the uncertainty of q_φ(z|x) and p_θ(z|x), we can restrict the solutions of the ALI objective to joint distributions whose mappings result in better reconstruction ability. Therefore, we propose to use the CE in (4), denoted as L^π_CE(θ, φ) = H^π(x|z) or H^π(z|x) (depending on the task; see below), as a regularization term in our framework, termed ALI with Conditional Entropy (ALICE), and defined as the following minimax objective:\n\nmin_{θ,φ} max_ω L_ALICE(θ, φ, ω) = L_ALI(θ, φ, ω) + L^π_CE(θ, φ).  (5)\n\nL^π_CE(θ, φ) depends on the underlying distributions of the random variables, parameterized by (θ, φ), as made clearer below. Ideally, we could select the desirable solutions of (5) by evaluating their CE, once all the saddle points of the ALI objective have been identified. However, in practice, L^π_CE(θ, φ) is intractable, because we do not have access to the saddle points beforehand. Below, we propose to approximate the CE in (5) during training, for both unsupervised and supervised tasks. Since x and z are symmetric in terms of CE according to (4), we use x to derive our theoretical results. 
Similar arguments hold for z, as discussed in the Supplementary Material (SM).\n\n3.2 Unsupervised Learning\n\nIn the absence of the explicit probability distributions needed for computing the CE, we can bound the CE using the criterion of cycle-consistency [12]. We denote the reconstruction of x as x̂, via the generating procedure (cycle) x̂ ∼ p_θ(x̂|z), z ∼ q_φ(z|x), x ∼ q(x). We desire that p_θ(x̂|z) have high likelihood for x̂ = x, for the x ∼ q(x) that begins the cycle x → z → x̂, and hence that x̂ be similar to the original x. Lemma 3 below shows that cycle-consistency is an upper bound of the conditional entropy in (4).\n\nLemma 3 For joint distributions p_θ(x, z) or q_φ(x, z), we have\n\nH^{q_φ}(x|z) ≜ −E_{q_φ(x,z)}[log q_φ(x|z)] = −E_{q_φ(x,z)}[log p_θ(x|z)] − E_{q_φ(z)}[KL(q_φ(x|z) ‖ p_θ(x|z))] ≤ −E_{q_φ(x,z)}[log p_θ(x|z)] ≜ L_Cycle(θ, φ),  (6)\n\nwhere q_φ(z) = ∫ q_φ(x, z) dx. The proof is in the SM. Note that the latent z is implicitly involved in L_Cycle(θ, φ) via E_{q_φ(x,z)}[·]. For the unsupervised case, we leverage (6) to optimize the following upper bound of (5):\n\nmin_{θ,φ} max_ω L_ALI(θ, φ, ω) + L_Cycle(θ, φ).  (7)\n\nNote that as ALI reaches its optimum, p_θ(x, z) and q_φ(x, z) reach the saddle point π(x, z); then L_Cycle(θ, φ) → H^{q_φ}(x|z) → H^π(x|z) in (4) accordingly, and thus (7) effectively approaches (5) (ALICE). Unlike L^π_CE(θ, φ) in (4), its upper bound, L_Cycle(θ, φ), can be easily approximated via Monte Carlo simulation. 
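Since L_Cycle is an expected reconstruction log-likelihood, its Monte Carlo estimate is straightforward. The sketch below is ours, not the paper's code, and assumes an isotropic Gaussian p_θ(x|z) with fixed variance σ², under which −log p_θ(x|z) reduces to a scaled ℓ2 reconstruction error up to an additive constant; encode and decode are placeholder callables for the two generators in (1).

```python
import numpy as np

def cycle_loss(xs, encode, decode, sigma2=1.0):
    # Monte Carlo estimate of L_Cycle = -E_{q(x,z)}[log p_theta(x|z)]:
    # push each x through the cycle x -> z -> xhat, then score the
    # reconstruction under a fixed-variance Gaussian decoder (the constant
    # log-normalizer is dropped).
    x_hat = decode(encode(xs))
    return float(np.mean(np.sum((xs - x_hat) ** 2, axis=-1)) / (2.0 * sigma2))
```

With a stochastic encoder, encode would also draw the noise ε of (1); any callable pair works here.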
Importantly, (7) can be readily added to ALI’s objective without additional changes to the original training procedure.\n\nThe cycle-consistency property has been previously leveraged in CycleGAN [12], DiscoGAN [13] and DualGAN [14]. However, in [12, 13, 14], cycle-consistency, L_Cycle(θ, φ), is implemented via ℓ_k losses, for k = 1, 2, and real-valued data such as images. As a consequence of an ℓ2-based pixel-wise loss, the generated samples tend to be blurry [8]. Recognizing this limitation, we further propose to enforce cycle-consistency (for better reconstruction) using fully adversarial training (for better generation), as an alternative to L_Cycle(θ, φ) in (7). Specifically, to reconstruct x, we specify an η-parameterized discriminator f_η(x, x̂) to distinguish between x and its reconstruction x̂:\n\nmin_{θ,φ} max_η L^A_Cycle(θ, φ, η) = E_{x∼q(x)}[log σ(f_η(x, x))] + E_{z∼q_φ(z|x), x̂∼p_θ(x̂|z)}[log(1 − σ(f_η(x, x̂)))].  (8)\n\nFinally, the fully adversarial training algorithm for unsupervised learning using the ALICE framework is the result of replacing L_Cycle(θ, φ) with L^A_Cycle(θ, φ, η) in (7); thus, for fixed (θ, φ), we maximize wrt {ω, η}.\n\nThe use of paired samples {x, x̂} in (8) is critical. It encourages the generators to mimic the reconstruction relationship implied in the first joint; otherwise, the model may reduce to the basic GAN discussed in Section 3, and generate any realistic sample in X. The objective in (8) enjoys many theoretical properties of GAN. 
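The pairing structure of (8) can be sketched directly. In the toy estimator below (illustrative only; f and reconstruct are caller-supplied placeholders), real inputs are the duplicated pairs (x, x) and fakes are (x, x̂); with an uninformative discriminator f ≡ 0, the estimate sits at the usual GAN saddle-point value −2 log 2.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def adv_cycle_loss(f, xs, reconstruct):
    # Monte Carlo estimate of L^A_Cycle in (8): f scores a pair
    # (x, candidate); the first term sees true pairs (x, x), the second
    # sees (x, xhat), with xhat the reconstruction through x -> z -> xhat.
    real = np.mean([np.log(sigmoid(f(x, x))) for x in xs])
    fake = np.mean([np.log(1.0 - sigmoid(f(x, reconstruct(x)))) for x in xs])
    return float(real + fake)
```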
Particularly, Proposition 1 guarantees the existence of the optimal generators and discriminator.\n\nProposition 1 The optimum of the generators and discriminator {θ*, φ*, η*} of the objective in (8) is achieved, if and only if E_{q_{φ*}(z|x)}[p_{θ*}(x̂|z)] = δ(x − x̂).\n\nThe proof is provided in the SM. Together with Lemmas 2 and 3, we can also show that:\n\nCorollary 1 When cycle-consistency is satisfied (the optimum in (8) is achieved), (i) a deterministic mapping enforces E_{q_φ(z)}[KL(q_φ(x|z) ‖ p_θ(x|z))] = 0, which indicates that the conditionals are matched. (ii) On the contrary, matched conditionals enforce H^{q_φ}(x|z) = 0, which indicates that the corresponding mapping becomes deterministic.\n\n3.3 Semi-supervised Learning\n\nWhen the objective in (7) is optimized in an unsupervised way, the identifiability issues associated with ALI are largely reduced, due to the cycle-consistency-enforcing bound in Lemma 3. This means that samples in the training data have been probabilistically “paired” with high certainty, by the conditionals p_θ(x|z) and q_φ(z|x), though perhaps not in the desired configuration. In real-world applications, obtaining correctly paired data samples for the entire dataset is expensive or even impossible. However, in some situations, obtaining paired data for a very small subset of the observations may be feasible. In such a case, we can leverage the small set of empirically paired samples to further provide guidance on selecting the correct configuration. This suggests that ALICE is suitable for semi-supervised classification.\n\nFor a paired sample drawn from the empirical distribution π̃(x, z), its desirable joint distribution is well specified. 
Thus, one can directly approximate the CE as\n\nH^π̃(x|z) ≈ −E_{π̃(x,z)}[log p_θ(x|z)] ≜ L_Map(θ),  (9)\n\nwhere the approximation (≈) arises from the fact that p_θ(x|z) is an approximation to π̃(x|z). For the supervised case, we leverage (9) to approximate (5) using the following minimax objective:\n\nmin_{θ,φ} max_ω L_ALI(θ, φ, ω) + L_Map(θ).  (10)\n\nNote that as ALI reaches its optimum, p_θ(x, z) and q_φ(x, z) reach the saddle point π(x, z); then L_Map(θ) → H^π̃(x|z) → H^π(x|z) in (4) accordingly, and thus (10) approaches (5) (ALICE).\n\nWe can employ standard losses for supervised learning objectives to approximate L_Map(θ) in (10), such as the cross-entropy or ℓ_k loss in (9). Alternatively, to also improve generation ability, we propose an adversarial learning scheme to directly match p_θ(x|z) to the paired empirical conditional π̃(x|z), using conditional GAN [15] as an alternative to L_Map(θ) in (10). The χ-parameterized discriminator f_χ is used to distinguish the true pair {x, z} from the artificially generated one {x̂, z} (conditioned on z), using\n\nmin_θ max_χ L^A_Map(θ, χ) = E_{(x,z)∼π̃(x,z)}[log σ(f_χ(x, z))] + E_{z∼π̃(z), x̂∼p_θ(x̂|z)}[log(1 − σ(f_χ(x̂, z)))].  (11)\n\nThe fully adversarial training algorithm for supervised learning using ALICE is the result of replacing L_Map(θ) with L^A_Map(θ, χ) in (10); thus, for fixed (θ, φ), we maximize wrt {ω, χ}.\n\nProposition 2 The optimum of the generators and discriminator {θ*, χ*} forms a saddle point of the objective in (11), if and only if π̃(x|z) = p_{θ*}(x|z) and π̃(x, z) = p_{θ*}(x, z).\n\nThe proof is provided in the SM. 
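The conditional-GAN surrogate in (11) has the same two-term shape as (8), with the discriminator conditioned on z. A minimal sketch under stated assumptions (f and generate are hypothetical callables, not the paper's implementation): true pairs come from the small paired set, and fakes reuse each paired z with a generated x̂.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def adv_map_loss(f, paired, generate):
    # Monte Carlo estimate of L^A_Map in (11): `paired` holds (x, z)
    # samples from the empirical joint; fakes pair the same z with a
    # generated xhat ~ p_theta(xhat | z).
    real = np.mean([np.log(sigmoid(f(x, z))) for x, z in paired])
    fake = np.mean([np.log(1.0 - sigmoid(f(generate(z), z))) for _, z in paired])
    return float(real + fake)
```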
Proposition 2 enforces that the generator will map to the correctly paired sample in the other space. Together with the theoretical result for ALI in Lemma 2, we have\n\nCorollary 2 When the optimum in (10) is achieved, π̃(x, z) = p_{θ*}(x, z) = q_{φ*}(x, z).\n\nCorollary 2 indicates that ALI’s drawbacks associated with identifiability issues can be alleviated in the fully supervised learning scenario. Two conditional GANs can be used to boost the performance, one for each direction of the mapping. When tying the weights of the discriminators of the two conditional GANs, ALICE recovers Triangle GAN [16]. In practice, samples from the paired set π̃(x, z) often contain enough information to readily approximate the sufficient statistics of the entire dataset. In such cases, we may use the following objective for semi-supervised learning:\n\nmin_{θ,φ} max_ω L_ALI(θ, φ, ω) + L_Cycle(θ, φ) + L_Map(θ).  (12)\n\nThe first two terms operate on the entire set, while the last term only applies to the paired subset. Note that we can train (12) fully adversarially by replacing L_Cycle(θ, φ) and L_Map(θ) with L^A_Cycle(θ, φ, η) and L^A_Map(θ, χ) in (8) and (11), respectively. In (12), each of the three terms is given equal weight in the experiments, unless specifically mentioned otherwise, but of course one may introduce additional hyperparameters to adjust the relative emphasis of each term.\n\n4 Related Work: A Unified Perspective for Joint Distribution Matching\n\nConnecting ALI and CycleGAN. We provide an information-theoretic interpretation of cycle-consistency, and show that it is equivalent to controlling conditional entropies and matching conditional distributions. When cycle-consistency is satisfied, Corollary 1 shows that the conditionals are matched in CycleGAN. 
They also train additional discriminators to guarantee the matching of the marginals for x and z, using the original GAN objective in (2). This reveals the equivalence between ALI and CycleGAN, as the latter can also guarantee the matching of the joint distributions p_θ(x, z) and q_φ(x, z). In practice, CycleGAN is easier to train, as it decomposes the joint distribution matching objective (as in ALI) into four subproblems. Our approach leverages a similar idea, and further improves it with adversarially learned cycle-consistency, when high-quality samples are of interest.\n\nFigure 2: Quantitative evaluation of generation (c) and reconstruction (d) results on toy data (a, b). Panels: (a) true x, (b) true z, (c) Inception Score, (d) MSE.\n\nStochastic Mapping vs. Deterministic Mapping. We propose to enforce cycle-consistency in ALI for the case when the two stochastic mappings are specified as in (1). When cycle-consistency is achieved, Corollary 1 shows that the bounded conditional entropy vanishes, and thus the corresponding mapping reduces to a deterministic one. In the literature, one deterministic mapping has been empirically tested in ALI’s framework [4], without explicitly specifying cycle-consistency. BiGAN [10] uses two deterministic mappings. In theory, deterministic mappings guarantee cycle-consistency in ALI’s framework. However, to achieve this, the model has to fit a delta distribution (deterministic mapping) to another distribution in the sense of the KL divergence (see Lemma 3). Due to the asymmetry of the KL divergence, the cost function pays an extremely low cost for generating fake-looking samples [17]. This explains the underfitting argument in [4] for the subpar reconstruction ability of ALI. Therefore, in ALICE, we explicitly add a cycle-consistency regularization to accelerate and stabilize training.\n\nConditional GANs as Joint Distribution Matching. 
Conditional GAN and its variants [15, 18, 19,\n20] have been widely used in supervised tasks. Our scheme to learn conditional entropy borrows the\nformulation of conditional GAN [15]. To the authors\u2019 knowledge, this is the \ufb01rst attempt to study the\nconditional GAN formulation as joint distribution matching problem. Moreover, we add the potential\nto leverage the well-de\ufb01ned distribution implied by paired data, to resolve the ambiguity issues of\nunsupervised ALI variants [4, 10, 12, 13, 14].\n5 Experimental Results\nThe code to reproduce these experiments is at https://github.com/ChunyuanLI/ALICE\n5.1 Effectiveness and Stability of Cycle-Consistency\nTo highlight the role of the CE regularization for unsupervised learning, we perform an experiment\non a toy dataset. q(x) is a 2D Gaussian Mixture Model (GMM) with 5 mixture components, and\np(z) is chosen as a standard Gaussian, N (0, I). Following [4], the covariance matrices and centroids\nare chosen such that the distribution exhibits severely separated modes, which makes it a relatively\nhard task despite its 2D nature. Following [21], to study stability, we run an exhaustive grid search\nover a set of architectural choices and hyper-parameters, 576 experiments for each method. We report\nMean Squared Error (MSE) and inception score (denoted as ICP) [22] to quantitatively evaluate the\nperformance of generative models. MSE is a proxy for reconstruction quality, while ICP re\ufb02ects the\nplausibility and variety of sample generation. Lower MSE and higher ICP indicate better results. See\nSM for the details of the grid search and the calculation of ICP.\nWe train on 2048 samples, and test on 1024 samples. The ground-truth test samples for x and z are\nshown in Figure 2(a) and (b), respectively. We compare ALICE, ALI and Denoising Auto-Encoders\n(DAEs) [23], and report the distribution of ICP and MSE values, for all (576) experiments in Figure 2\n(c) and (d), respectively. 
For reference, samples drawn from the “oracle” (ground-truth) GMM yield ICP = 4.977 ± 0.016. ALICE yields an ICP larger than 4.5 in 77% of experiments, while ALI’s ICP varies wildly across different runs. These results demonstrate that ALICE is more consistent and quantitatively reliable than ALI. The DAE yields the lowest MSE, as expected, but it also has the weakest generation ability. The comparatively low MSE of ALICE demonstrates its acceptable reconstruction ability compared to DAE, and a very significant improvement over ALI.\n\nFigure 3 shows the qualitative results on the test set. Since ALI’s results vary largely from trial to trial, we present the one with the highest ICP. In the figure, we color samples from different mixture components to highlight the correspondence between the ground truth, in Figure 2(a), and the reconstructions, in Figure 3 (first row, columns 2, 4 and 6, for ALICE, ALI and DAE, respectively). Importantly, though the reconstruction of ALI can recover the shape of the manifold in x (Gaussian mixture), each individual reconstructed sample can be substantially far away from its “original” mixture component (note the highly mixed coloring), hence the poor MSE. This occurs because the adversarial training in ALI only requires that the generated samples look realistic, i.e., that they be located near true samples in X, but the mapping between the observed and latent spaces (x → z and z → x) is not specified.\n\nFigure 3: Qualitative results on toy data. Panels (a) ALICE, (b) ALI, (c) DAEs. Two-column blocks represent the results of each method, with left for z and right for x. For the first row, left is sampling of z, and right is reconstruction of x. Colors indicate mixture component membership. The second row shows reconstructions, x, from linearly interpolated samples in z. 
In the SM we also consider ALI with various combinations of stochastic/deterministic\nmappings, and conclude that models with deterministic mappings tend to have lower reconstruction\nability but higher generation ability. In terms of the estimated latent space, z, in Figure 3 (\ufb01rst row,\ncolumns 1, 3 and 5, for ALICE, ALI and DAE, respectively), we see that ALICE results in a better\nlatent representation, in the sense of mapping consistency (samples from different mixture components\nremain clustered) and distribution consistency (samples approximate a Gaussian distribution). The\nresults for reconstruction of z and sampling of x are shown in the SM.\nIn Figure 3 (second row), we also investigate latent space interpolation between a pair of test set\nexamples. We use x1 = [\u22122.2,\u22122.2] and x9 = [2.2, 2.2], map them into z1 and z9, linearly\ninterpolate between z1 and z9 to get intermediate points z2, . . . , z8, and then map them back to the\noriginal space as x2, . . . , x8. We only show the index of the samples for better visualization. Figure 3\nshows that ALICE\u2019s interpolation is smooth and consistent with the ground-truth distributions.\nInterpolation using ALI results in realistic samples (within mixture components), but the transition is\nnot order-wise consistent. DAEs provides smooth transitions, but the samples in the original space\nlook unrealistic as some of them are located in low probability density regions of the true model.\nWe investigate the impact of different amount of regularization on three datasets, including the toy\ndataset, MNIST and CIFAR-10 in SM Section D. The results show that our regularizer can improve\nimage generation and reconstruction of ALI for a large range of weighting hyperparameter values.\n5.2 Reconstruction and Cross-Domain Transformation on Real Datasets\nTwo image-to-image translation tasks are considered. 
(i) Car-to-Car [24]: each domain (x and z) includes car images at 11 different angles, on which we seek to demonstrate the power of adversarially learned reconstruction and weak supervision. (ii) Edge-to-Shoe [25]: the x domain consists of shoe photos and the z domain consists of edge images, on which we report extensive quantitative comparisons. Cycle-consistency is applied on both domains. The goal is to discover the cross-domain relationship (i.e., cross-domain prediction), while maintaining reconstruction ability in each domain.\n\nAdversarially learned reconstruction. To demonstrate the effectiveness of our fully adversarial scheme in (8) (Joint A.) on real datasets, we use it in place of the ℓ2 losses in DiscoGAN [13]. In practice, feature matching [22] is used to help the adversarial objective in (8) reach its optimum. We also compared with a baseline scheme (Marginal A.) in [12], which adversarially discriminates between x and its reconstruction x̂.\n\nThe results are shown in Figure 4 (a). From top to bottom, each row shows ground-truth images, DiscoGAN (with the Joint A., ℓ2 loss and Marginal A. schemes, respectively) and BiGAN [10]. Note that BiGAN is the best ALI variant in our grid-search comparison. The proposed Joint A. scheme retains the crispness characteristic of adversarially trained models, while ℓ2 tends to be blurry. Marginal A. provides realistic car images, but not faithful reproductions of the inputs.\n\nFigure 4: Results on Car-to-Car task. (a) Reconstruction. (b) Prediction.\n\nFigure 5: SSIM and generated images on Edge-to-Shoe dataset. (a) Cross-domain transformation. (b) Reconstruction. (c) Generated edges.\n\nThis explains the observations in [12] in terms of no performance gain. The BiGAN learns the shapes of cars, but misses the textures. 
BiGAN's missing textures are a sign of underfitting, indicating that BiGAN is not easy to train.
Weak supervision DiscoGAN and BiGAN are unsupervised methods, and exhibit very different cross-domain pairing configurations across training epochs, which is indicative of non-identifiability issues. We leverage very weak supervision to help with convergence and to guide the pairing. The results are shown in Figure 4 (b). We run each method 5 times; the width of the colored lines reflects the standard deviation. We start with 1% true pairs for supervision, which yields significantly higher accuracy than DiscoGAN/BiGAN. We then provide 10% supervision in only 2 or 6 angles (of 11 total angles), which yields angle-prediction accuracy comparable to full-angle supervision at test time. This shows ALICE's ability in terms of zero-shot learning, i.e., predicting unseen pairs. In the SM, we show that enforcing different weak-supervision strategies affects the final pairing configurations, i.e., we can leverage supervision to obtain the desired joint distribution.
Quantitative comparison To quantitatively assess the generated images, we use structural similarity (SSIM) [26], an established image-quality metric that correlates well with human visual perception. SSIM values lie in [0, 1]; higher is better. The SSIM of ALICE on prediction and reconstruction is shown in Figure 5 (a)(b) for the Edge-to-Shoe task. As a baseline, we use DiscoGAN with ℓ2-based supervision (ℓ2-sup). BiGAN/ALI, highlighted with a circle, is outperformed by ALICE in two respects: (i) in the unpaired setting (0% supervision), cycle-consistency regularization (LCycle) yields significant performance gains, particularly on reconstruction; (ii) when supervision is leveraged (10%), SSIM increases significantly on prediction. The adversarial-based supervision (ℓA-sup) yields higher prediction SSIM than ℓ2-sup.
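For reference, the SSIM metric of [26] combines luminance, contrast and structure comparisons. Below is a minimal single-window numpy sketch; the standard metric additionally averages this quantity over local sliding windows, and the constants follow the common K1 = 0.01, K2 = 0.03 choice with images assumed scaled to [0, 1].

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM (no sliding window), after Wang et al. [26]."""
    c1 = (0.01 * data_range) ** 2    # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2    # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

rng = np.random.default_rng(0)
img = rng.random((32, 32))
print(ssim_global(img, img))                   # identical images -> exactly 1.0
print(ssim_global(img, rng.random((32, 32))))  # unrelated images -> low score
```

Identical inputs give numerator equal to denominator, hence SSIM = 1; statistically unrelated images have near-zero covariance and score close to 0.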
ALICE achieves very similar performance in the 50% and full-supervision setups, indicating its advantage in semi-supervised learning. Several generated edge images (with 50% supervision) are shown in Figure 5 (c); ℓA-sup tends to provide more details than ℓ2-sup. Both methods generate correctly paired edges, with quality higher than BiGAN and DiscoGAN. In the SM, we also report MSE metrics and results on the edge domain only, which are consistent with the results presented here.
One-side cycle-consistency When uncertainty in one domain is desirable, we consider one-side cycle-consistency. This is demonstrated on the CelebA face dataset [27]. Each face is associated with a 40-dimensional attribute vector. The results are shown in Figure 8 of the SM. In the first task, we consider images x generated from a 128-dimensional Gaussian latent space z, and apply LCycle on x. We compare ALICE and ALI on reconstruction in Figure 8 (a)(b); ALICE shows more faithful reproduction of the input subjects. In the second task, we consider z as the attribute space, from which the images x are generated. The mapping from x to z is then attribute classification. We only apply LCycle on the attribute domain, and L^A_Map on both domains. When 10% paired samples are considered, the predicted attributes still reach 86% accuracy, comparable to the fully supervised case. To test diversity on x, we first predict the attributes of a true face image, and then generate multiple images conditioned on the predicted attributes. Four examples are shown in Figure 8 (c).
6 Conclusion
We have studied the problem of non-identifiability in bidirectional adversarial networks.
A unified perspective of understanding various GAN models as joint distribution matching is provided to tackle this problem. This insight enables us to propose ALICE (with both adversarial and non-adversarial solutions) to reduce the ambiguity and to control the conditionals in unsupervised and semi-supervised learning. For future work, the proposed view can provide opportunities to leverage the advantages of each model, to advance joint-distribution modeling.

Acknowledgements We acknowledge Shuyang Dai, Chenyang Tao and Zihang Dai for helpful feedback/editing. This research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.

References
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[2] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
[4] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.
[5] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017.
[6] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.
[7] Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin.
VAE learning via Stein variational gradient descent. In NIPS, 2017.
[8] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
[9] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[10] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
[11] Y. Pu, W. Wang, R. Henao, L. Chen, Z. Gan, C. Li, and L. Carin. Adversarial symmetric variational autoencoder. In NIPS, 2017.
[12] J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[13] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
[14] Z. Yi, H. Zhang, and P. Tan. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
[15] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[16] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. In NIPS, 2017.
[17] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
[18] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[19] P. Isola, J. Zhu, T. Zhou, and A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[20] C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. In NIPS, 2017.
[21] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
[22] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[23] P. Vincent, H.
Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[24] S. Fidler, S. Dickinson, and R. Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NIPS, 2012.
[25] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.
[26] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.
[27] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.