{"title": "Discriminator optimal transport", "book": "Advances in Neural Information Processing Systems", "page_first": 6816, "page_last": 6826, "abstract": "Within a broad class of generative adversarial networks, we show that discriminator optimization process increases a lower bound of the dual cost function for the Wasserstein distance between the target distribution $p$ and the generator distribution $p_G$. It implies that the trained discriminator can approximate optimal transport (OT) from $p_G$ to $p$. Based on some experiments and a bit of OT theory, we propose discriminator optimal transport (DOT) scheme to improve generated images. We show that it improves inception score and FID calculated by un-conditional GAN trained by CIFAR-10, STL-10 and a public pre-trained model of conditional GAN trained by ImageNet.", "full_text": "Discriminator optimal transport\n\nAkinori Tanaka\n\nMathematical Science Team, RIKEN Center for Advanced Intelligence Project (AIP)\n1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan\n\nInterdisciplinary Theoretical and Mathematical Sciences Program (iTHEMS), RIKEN\n2-1 Hirosawa, Wako, Saitama 351-0198, Japan\n\nDepartment of Mathematics, Faculty of Science and Technology, Keio University\n3-14-1 Hiyoshi, Kouhoku-ku, Yokohama 223-8522, Japan\n\nakinori.tanaka@riken.jp\n\nAbstract\n\nWithin a broad class of generative adversarial networks, we show that the discriminator optimization process increases a lower bound of the dual cost function for the Wasserstein distance between the target distribution $p$ and the generator distribution $p_G$. It implies that the trained discriminator can approximate optimal transport (OT) from $p_G$ to $p$. Based on some experiments and a bit of OT theory, we propose the discriminator optimal transport (DOT) scheme to improve generated images. 
We show that it improves the inception score and FID calculated by unconditional GANs trained on CIFAR-10 and STL-10, and by a public pre-trained model of conditional GAN trained on ImageNet.\n\n1 Introduction\n\nThe generative adversarial network (GAN) [1] is one of the recent promising generative models. In this context, we prepare two networks, a generator $G$ and a discriminator $D$. $G$ generates fake samples $G(z)$ from noise $z$ and tries to fool $D$. $D$ classifies real samples $x$ and fake samples $y = G(z)$. In the training phase, we update them alternately until they reach an equilibrium state. In general, however, the training process is unstable and requires tuning of hyperparameters. Since the first successful implementation by convolutional neural nets [2], most of the literature has concentrated on how to improve the unstable optimization procedures, including changing objective functions [3, 4, 5, 6, 7, 8], adding penalty terms [9, 10, 11], techniques on the optimization processes themselves [12, 13, 14, 15], inserting new layers into the network [16, 17], and others we cannot list here completely.\n\nEven if one can make the optimization relatively stable and succeed in getting $G$ around an equilibrium, it sometimes fails to generate meaningful images. Bad images may include some unwanted structures like unnecessary shadows, strange symbols, and blurred edges of objects. For example, see the generated images surrounded by blue lines in Figure 1. These problems may be fixed by scaling up the network structure and the optimization process. Generically speaking, however, this needs large-scale computational resources, and if one wants to apply GANs to individual tasks on more compact devices, the above problem looks inevitable and crucial.\n\nThere is another problem. In many cases, we discard the trained discriminator $D$ after the training. This situation is in contrast to other latent space generative models. 
For example, the variational auto-encoder (VAE) [18] is also composed of two distinct networks, an encoder network and a decoder network. We can utilize both of them after the training: the encoder can be used as a data compressor, and the decoder can be regarded as a generator. Compared to this situation, it sounds wasteful to use only $G$ after the GAN training.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Each left image (blue): a sample from generator $G$. Each right image (red): the sample modified by our algorithm based on discriminator $D$. We use here the trained model available on https://github.com/pfnet-research/sngan_projection .\n\nFrom this viewpoint, it would be natural to ask how to use trained models $G$ and $D$ efficiently. Recent related works in the same spirit are discriminator rejection sampling (DRS) [19] and Metropolis-Hastings GAN (MH-GAN) [20]. In each case, they use the generator-induced distribution $p_G$ as a proposal distribution, and approximate the acceptance ratio of the proposed sample based on the trained $D$. Intuitively, a generated image $y = G(z)$ is accepted if the value $D(y)$ is relatively large; otherwise it is rejected. They show the theoretical background, and the methods actually improve scores on generated images in practice.\n\nIn this paper, we try a similar but more active approach, i.e. improving the generated image $y = G(z)$ directly by adding $\\delta y$ to $y$ such that $D(y + \\delta y) > D(y)$. The optimal transport (OT) theory guarantees that the improved samples can be regarded as approximate samples from the target distribution $p$. 
More concretely, our contributions are\n\n- Proposal of the discriminator optimal transport (DOT) based on the fact that the objective function for $D$ provides a lower bound of the dual cost function for the Wasserstein distance between $p$ and $p_G$.\n\n- Suggesting approximated algorithms and verifying that they improve the Earth Mover\u2019s distance (EMD) [21], inception score [13] and Fr\u00e9chet inception distance (FID) [15].\n\n- Pointing out a generality of DOT, i.e. if $G$\u2019s output domain is the same as $D$\u2019s input domain, then we can use any pair of trained $G$ and $D$ to improve generated samples.\n\nIn addition, we show some experimental results comparing DOT and a naive method of improving samples just by increasing the value of $D$, under a fair setting. One can download our code from https://github.com/AkinoriTanaka-phys/DOT.\n\n2 Background\n\n2.1 Generative Adversarial Nets\n\nThroughout this paper, we regard an image sample as a vector in a certain Euclidean space: $x \\in X$. We denote the latent space by $Z$ and a prior distribution on it by $p_Z(z)$. The ultimate goal of the GAN is making a generator $G : Z \\to X$ whose push-forward of the prior, $p_G(x) = \\int_Z dz \\, p_Z(z) \\, \\delta(x - G(z))$, reproduces the data-generating probability density $p(x)$. To achieve it, a discriminator $D : X \\to \\mathbb{R}$ and objective functions,\n\n$$V_D(G, D) = \\mathbb{E}_{x \\sim p}[f(-D(x))] + \\mathbb{E}_{y \\sim p_G}[f(D(y))], \\tag{1}$$\n\n$$V_G(G, D) = \\mathbb{E}_{y \\sim p_G}[g(D(y))], \\tag{2}$$\n\nare introduced. The choice of functions $f$ and $g$ corresponds to the choice of GAN update algorithm as explained below. Practically, $G$ and $D$ are parametric models $G_\\theta$ and $D_\\varphi$, and they are alternately updated as\n\n$$\\varphi \\leftarrow \\varphi + \\epsilon \\nabla_\\varphi V_D(G_\\theta, D_\\varphi), \\tag{3}$$\n\n$$\\theta \\leftarrow \\theta - \\epsilon \\nabla_\\theta V_G(G_\\theta, D_\\varphi), \\tag{4}$$\n\nuntil the updating dynamics reaches an equilibrium. 
One of the well-known choices for $f$ and $g$ is\n\n$$f(u) = -\\log(1 + e^u), \\qquad g(u) = -f(-u). \\tag{5}$$\n\nTheoretically speaking, it seems better to take $g(u) = f(u)$, which is called minimax GAN [22], to guarantee $p_G = p$ at the equilibrium as proved in [1]. However, it is well known that taking (5), called non-saturating GAN, enjoys better performance practically. As an alternative, we can choose the following $f$ and $g$ [6, 4]:\n\n$$f(u) = \\max(0, -1 - u), \\qquad g(u) = -u. \\tag{6}$$\n\nIt is also known to be relatively stable. In addition, $p_G = p$ at an equilibrium is proved at least in the theoretically ideal situation. Another famous choice is taking\n\n$$f(u) = -u, \\qquad g(u) = u. \\tag{7}$$\n\nThe resultant GAN is called WGAN [5]. We use (7) with gradient penalty (WGAN-GP) [9] in our experiment. WGAN is related to the concept of optimal transport (OT), which we review below, so one might think our method is available only when we use WGAN. But we would like to emphasize that such an OT approach is also useful even when we take the GANs described by (5) and (6), as we will show later.\n\n2.2 Spectral normalization\n\nSpectral normalization (SN) [16] is one of the standard normalizations on neural network weights to stabilize the training process of GANs. To explain it, let us define a norm for functions called the Lipschitz norm,\n\n$$\\|f\\|_{\\mathrm{Lip}} := \\sup \\left\\{ \\frac{\\|f(x) - f(y)\\|_2}{\\|x - y\\|_2} \\;\\middle|\\; x \\neq y \\right\\}. \\tag{8}$$\n\nFor example, $\\|\\mathrm{ReLU}\\|_{\\mathrm{Lip}} = \\|\\mathrm{lReLU}\\|_{\\mathrm{Lip}} = 1$ because their maximum gradient is 1. For a linear transformation $l_{W,b}$ with weight matrix $W$ and bias $b$, the norm $\\|l_{W,b}\\|_{\\mathrm{Lip}}$ is equal to the maximum singular value $\\sigma(W)$. Spectral normalization on $l_{W,b}$ is defined by dividing the weight $W$ in the linear transform by $\\sigma(W)$:\n\n$$SN(l_{W,b}) = l_{W/\\sigma(W), b}. \\tag{9}$$\n\nBy definition, it enjoys $\\|l_{W/\\sigma(W)}\\|_{\\mathrm{Lip}} = 1$. If we focus on neural networks, estimation of the upper bound of the norm is relatively easy because of the following property (footnote 1):\n\n$$\\|f \\circ g\\|_{\\mathrm{Lip}} \\le \\|f\\|_{\\mathrm{Lip}} \\cdot \\|g\\|_{\\mathrm{Lip}}. \\tag{10}$$\n\nFor example, suppose $f_{nn}$ is a neural network with ReLU or lReLU activations and spectral normalizations: $f_{nn}(x) = SN \\circ l_{W_L} \\circ f \\circ SN \\circ l_{W_{L-1}} \\circ \\cdots \\circ SN \\circ l_{W_1}(x)$, then the Lipschitz norm is bounded by one:\n\n$$\\|f_{nn}\\|_{\\mathrm{Lip}} \\le \\prod_{l=1}^{L} \\|l_{W_l/\\sigma(W_l)}\\|_{\\mathrm{Lip}} = 1. \\tag{11}$$\n\nThanks to this Lipschitz nature, the normalized network gradient behaves mildly during the repeated updates (3) and (4), and as a result, it stabilizes the wild and dynamic optimization process of GANs.\n\n1 This inequality can be understood as follows. Naively, the norm (8) is defined by the maximum gradient between two different points. Suppose $x_1$ and $x_2$ realize the maximum gradient for $g$, and $u_1$ and $u_2$ are the corresponding points for $f$; then the right-hand side (RHS) of the inequality (10) is equal to $\\|f(u_1) - f(u_2)\\|_2 / \\|u_1 - u_2\\|_2 \\times \\|g(x_1) - g(x_2)\\|_2 / \\|x_1 - x_2\\|_2$. If $g(x_i) = u_i$, it reduces to the left-hand side (LHS) of (10), but the condition is not satisfied in general, and the RHS takes a larger value than the LHS. This observation is actually important for the later part of this paper because the estimate of the norm based on the inequality seems to be an overestimate in many cases.\n\n2.3 Optimal transport\n\nAnother important background in this paper is optimal transport. Suppose there are two probability densities, $p(x)$ and $q(y)$, where $x, y \\in X$. Let us consider the cost for transporting one unit of mass from $x \\sim p$ to $y \\sim q$. The optimal cost is called the Wasserstein distance. 
Throughout this paper, we focus on the Wasserstein distance defined by the $l_2$-norm cost $\\|x - y\\|_2$:\n\n$$W(p, q) = \\min_{\\pi \\in \\Pi(p, q)} \\left( \\mathbb{E}_{(x, y) \\sim \\pi} \\left[ \\|x - y\\|_2 \\right] \\right). \\tag{12}$$\n\n$\\pi$ means a joint probability for transportation between $x$ and $y$. To realize it, we need to restrict $\\pi$ to satisfy the marginality conditions,\n\n$$\\int dx \\, \\pi(x, y) = q(y), \\qquad \\int dy \\, \\pi(x, y) = p(x). \\tag{13}$$\n\nAn optimal $\\pi^*$ satisfies $W(p, q) = \\mathbb{E}_{(x, y) \\sim \\pi^*}[\\|x - y\\|_2]$, and it realizes the most effective transport between two probability densities under the $l_2$ cost. Interestingly, (12) has the dual form\n\n$$W(p, q) = \\max_{\\|\\tilde{D}\\|_{\\mathrm{Lip}} \\le 1} \\left( \\mathbb{E}_{x \\sim p} \\left[ \\tilde{D}(x) \\right] - \\mathbb{E}_{y \\sim q} \\left[ \\tilde{D}(y) \\right] \\right). \\tag{14}$$\n\nThe duality is called Kantorovich-Rubinstein duality [23, 24]. Note that $\\|f\\|_{\\mathrm{Lip}}$ is defined in (8), and the dual variable $\\tilde{D}$ should satisfy the Lipschitz continuity condition $\\|\\tilde{D}\\|_{\\mathrm{Lip}} \\le 1$. One may wonder whether any relationship between the optimal transport plan $\\pi^*$ and the optimal dual variable $D^*$ exists or not. The following theorem is an answer and it plays a key role in this paper.\n\nTheorem 1 Suppose $\\pi^*$ and $D^*$ are optimal solutions of the primal (12) and the dual (14) problem, respectively. If $\\pi^*$ is a deterministic optimal transport described by a certain automorphism (footnote 2) $T : X \\to X$, then the following equations are satisfied:\n\n$$\\|D^*\\|_{\\mathrm{Lip}} = 1, \\tag{15}$$\n\n$$T(y) = \\arg\\min_x \\left\\{ \\|x - y\\|_2 - D^*(x) \\right\\}, \\tag{16}$$\n\n$$p(x) = \\int dy \\, \\delta\\left(x - T(y)\\right) q(y). \\tag{17}$$\n\n(Proof) It can be proved by combining well-known facts. See Supplementary Materials. \u25a1\n\n3 Discriminator optimal transport\n\nIf we apply spectral normalization to a discriminator $D$, it satisfies $\\|D\\|_{\\mathrm{Lip}} = K$ with a certain real number $K$. 
By redefining it as $\\tilde{D} = D/K$, it becomes 1-Lipschitz: $\\|\\tilde{D}\\|_{\\mathrm{Lip}} = 1$. It reminds us of equation (15), and one may expect a connection between OT and GAN. In fact, we can show the following theorem:\n\nTheorem 2 Each objective function of GAN using logistic (5), or hinge (6), or identity (7) loss with gradient penalty, provides a lower bound of the mean discrepancy of $\\tilde{D} = D/K$ between $p$ and $p_G$:\n\n$$V_D(G, D) \\le K \\left( \\mathbb{E}_{x \\sim p} \\left[ \\tilde{D}(x) \\right] - \\mathbb{E}_{y \\sim p_G} \\left[ \\tilde{D}(y) \\right] \\right). \\tag{18}$$\n\n(Proof) See Supplementary Materials. \u25a1\n\nIn the practical optimization process of GAN, $K$ could change its value during the training process, but it stays almost constant at least approximately as explained below.\n\n2 It is equivalent to assume there exists a unique solution of the corresponding Monge problem:\n\n$$\\min_{T : X \\to X} \\left( \\mathbb{E}_{y \\sim q} \\left[ \\|T(y) - y\\|_2 \\right] \\right), \\quad \\text{constrained by (17).}$$\n\nReconstructing $T^*$ from $\\pi^*$ without any assumption is a subtle problem and only guaranteed within strictly convex cost functions [25]. Unfortunately, it is not satisfied in our $l_2$ cost. However, there is a known method [26] to find a solution based on relaxing the cost to the strictly convex cost $\\|x - y\\|_2^{1+\\epsilon}$ with $\\epsilon > 0$. In our experiments, DOT works only when $\\|x - y\\|_2$ is small enough for given $y$. In this case, there is no big difference between $\\|x - y\\|_2$ and $\\|x - y\\|_2^{1+\\epsilon}$, and it suggests DOT approximates their solution.\n\nFigure 2: Logs of inception score (left), approximated Lipschitz constant of $D$ (middle), and approximated Lipschitz constant of $D \\circ G$ (right) on each GAN trained with CIFAR-10. Approximated Lipschitz constants are calculated from 500 random sample pairs. 
Error bars are plotted within $1\\sigma$ by 500 trials.\n\n3.1 Discriminator Optimal Transport (ideal version)\n\nThe inequality (18) implies that the update (3) of $D$ during GAN training maximizes the lower bound of the objective in (14), the dual form of the Wasserstein distance. In this sense, the optimization of $D$ in (3) can be regarded as solving the problem (14) approximately (footnote 3). If we apply (16) with $D^* \\approx \\tilde{D} = D/K$, the following transport of given $y \\sim p_G$,\n\n$$T_D(y) = \\arg\\min_x \\left\\{ \\|x - y\\|_2 - \\frac{1}{K} D(x) \\right\\}, \\tag{19}$$\n\nis expected to recover the sampling from the target distribution $p$ thanks to the equality (17).\n\n3.2 Discriminator Optimal Transport (practical version)\n\nTo check whether $K$ changes drastically or not during the GAN updates, we calculate approximated Lipschitz constants defined by\n\n$$K_{\\mathrm{eff}} = \\max \\left\\{ \\frac{|D(x) - D(y)|}{\\|x - y\\|_2} \\;\\middle|\\; x, y \\sim p_G \\right\\}, \\tag{20}$$\n\n$$k_{\\mathrm{eff}} = \\max \\left\\{ \\frac{|D \\circ G(z) - D \\circ G(z_y)|}{\\|z - z_y\\|_2} \\;\\middle|\\; z, z_y \\sim p_Z \\right\\}, \\tag{21}$$\n\nevery 5,000 iterations of GAN training on CIFAR-10 data with the DCGAN models explained in Supplementary Materials. As plotted in Figure 2, neither of them increases drastically. It is worth mentioning that the naive upper bound of the Lipschitz constant like (11) turns out to be an overestimate. For example, SNGAN has the naive upper bound 1, but (20) stays around 0.08 in Figure 2.\n\nTarget space DOT Based on these facts, we conclude that trained discriminators can approximate the optimal transport (16) by\n\n$$T_D^{\\mathrm{eff}}(y) = \\arg\\min_x \\left\\{ \\|x - y\\|_2 - \\frac{1}{K_{\\mathrm{eff}}} D(x) \\right\\}. \\tag{22}$$\n\nAs a preliminary experiment, we apply DOT to WGAN-GP trained on the 25gaussians dataset and the swissroll dataset. We use the gradient descent method shown in Algorithm 1 to search for the transported point $T_D^{\\mathrm{eff}}(y)$ for given $y \\sim p_G$. In Figure 3, we compare the DOT samples and samples naively transported by the discriminator, which is implemented by replacing the gradient in Algorithm 1 with $-\\frac{1}{K_{\\mathrm{eff}}} \\nabla_x D(x)$, i.e. just searching for $x$ with large $D(x)$ from the initial condition $x \\leftarrow y$ where $y \\sim p_G$. DOT outperforms the naive method qualitatively and quantitatively. On the 25gaussians, one might think the 4th (naively improved) samples are better than the 3rd (DOT) samples. However, the 4th samples are too concentrated and lack the variance around each peak. In fact, the value of the Earth Mover\u2019s distance (EMD), which measures how far the samples are from the real samples, is relatively large. On the swissroll, the 4th samples based on naive transport lack many relevant points close to the original data, and they are clearly bad. On the other hand, one can see that the 3rd (DOT) samples keep the swissroll shape and clean up the blurred shape in the original samples by the generator.\n\n3 This situation is similar to that of the VAE [18], whose objective function is a lower bound of the likelihood called the evidence lower bound (ELBO).\n\nAlgorithm 1 Target space optimal transport by gradient descent\nRequire: trained $D$, approximated $K_{\\mathrm{eff}}$ by (20), sample $y$, learning rate $\\epsilon$ and small vector $\\delta$\nInitialize $x \\leftarrow y$\nfor $n_{\\mathrm{trial}}$ in range($N_{\\mathrm{updates}}$) do\n    $x \\leftarrow x - \\epsilon \\nabla_x \\left\\{ \\|x - y + \\delta\\|_2 - \\frac{1}{K_{\\mathrm{eff}}} D(x) \\right\\}$    ($\\delta$ is for preventing overflow.)\nend for\nreturn $x$\n\nFigure 3: 2d experiments using a trained WGAN-GP model. 1,000 samples of, 1st: training samples, 2nd: generated samples by $G$, 3rd: samples by target space DOT, 4th: samples by naive transport, are plotted. Each EMD value is calculated by 100 trials. The error corresponds to $1\\sigma$. We use $\\delta = 0.001$. 
See the supplementary material for more details on this experiment.\n\nLatent space DOT The target space DOT works on low-dimensional data, but it turns out to be useless once we apply it to higher-dimensional data. See Figure 4 for example. An alternative, and more workable, idea is regarding $D \\circ G : Z \\to \\mathbb{R}$ as the dual variable for defining the Wasserstein distance between the \u201cpullback\u201d of $p$ by $G$ and the prior $p_Z$. Latent space OT itself is not a novel idea [27, 28], but there seems to be no literature using trained $G$ and $D$, to the best of our knowledge. The approximated Lipschitz constants of $D \\circ G$ also stay almost constant as shown in the right sub-figure in Figure 2, so we conclude that\n\n$$T_{D \\circ G}^{\\mathrm{eff}}(z_y) = \\arg\\min_z \\left\\{ \\|z - z_y\\|_2 - \\frac{1}{k_{\\mathrm{eff}}} D \\circ G(z) \\right\\} \\tag{23}$$\n\napproximates optimal transport in latent space. Note that if the prior $p_Z$ has non-trivial support, we need to restrict $z$ onto the support during the DOT process. In our Algorithm 2, we apply projection of the gradient. One of the major practical priors is the normal distribution $N(0, I_{D \\times D})$ where $D$ is the latent space dimension. If $D$ is large, it is well known that the support is concentrated on the $(D - 1)$-dimensional sphere with radius $\\sqrt{D}$, so the projection of the gradient $g$ can be calculated by $g - (g \\cdot z) z / \\sqrt{D}$ approximately. 
Even if we skip this procedure, transported images may look improved, but it downgrades inception scores and FIDs.\n\n[Figure 3 panel labels: EMD 0.052(08), 0.052(10), 0.065(11); EMD 0.021(05), 0.020(06), 0.160(22)]\n\nAlgorithm 2 Latent space optimal transport by gradient descent\nRequire: trained $G$ and $D$, approximated $k_{\\mathrm{eff}}$, sample $z_y$, learning rate $\\epsilon$, and small vector $\\delta$\nInitialize $z \\leftarrow z_y$\nfor $n_{\\mathrm{trial}}$ in range($N_{\\mathrm{updates}}$) do\n    $g = \\nabla_z \\left\\{ \\|z - z_y + \\delta\\|_2 - \\frac{1}{k_{\\mathrm{eff}}} D \\circ G(z) \\right\\}$    ($\\delta$ is for preventing overflow.)\n    if noise is generated by $N(0, I_{D \\times D})$ then\n        $g \\leftarrow g - (g \\cdot z) z / \\sqrt{D}$\n    end if\n    $z \\leftarrow z - \\epsilon g$\n    if noise is generated by $U([-1, 1])$ then\n        clip $z \\in [-1, 1]$\n    end if\nend for\nreturn $x = G(z)$\n\nFigure 4: Target space DOT sample (each left) and latent space DOT sample (each right). The former looks like adding meaningless noise, like the perturbations in adversarial examples [29]. On the other hand, the latent space DOT samples keep the shape of the image, and clean it.\n\n4 Experiments on latent space DOT\n\n4.1 CIFAR-10 and STL-10\n\nWe prepare pre-trained DCGAN models and ResNet models in various settings, and apply latent space DOT. In each case, inception score and FID are improved (Table 1). We can use an arbitrary discriminator $D$ to improve scores for a fixed $G$ as shown in Table 2. As one can see, DOT really works. But it needs tuning of hyperparameters. First, it is recommended to use as small an $\\epsilon$ as possible. A large $\\epsilon$ may accelerate the improvement, but it easily degrades the samples unless an appropriate $N_{\\mathrm{updates}}$ is chosen. Second, we recommend using $k_{\\mathrm{eff}}$ calculated from a sufficient number of samples. If not, it becomes relatively small and it may also degrade images. As a shortcut, $k_{\\mathrm{eff}} = 1$ also works. 
See Supplementary Materials for details and additional results including comparison to other methods.\n\nTable 1 entries, listed column by column:\nModel (DCGAN, Resnet): WGAN-GP; SNGAN(ns); SNGAN(hi); SAGAN(ns); SAGAN(hi); SAGAN(ns); SAGAN(hi)\nCIFAR-10, bare: 6.53(08), 27.84; 7.45(09), 20.74; 7.45(08), 20.47; 7.75(07), 25.37; 7.52(06), 25.78; 7.74(09), 22.13; 7.85(11), 21.53\nCIFAR-10, DOT: 7.45(05), 24.14; 7.97(14), 15.78; 8.02(16), 17.12; 8.50(01), 20.57; 8.38(05), 21.21; 8.49(13), 20.22; 8.50(12), 19.71\nSTL-10, bare: 8.69(07), 49.94; 8.67(01), 41.18; 8.83(12), 40.10; 8.68(01), 48.23; 9.29(13), 45.79; 9.33(08), 41.91\nSTL-10, DOT: 9.31(07), 44.45; 9.45(13), 34.84; 9.35(12), 34.85; 10.04(14), 41.19; 10.30(21), 40.51; 10.03(14), 39.48\n\nTable 1: (Inception score, FID) by usual sampling (bare) and DOT: Models in [16] and self-attention layer [17] are used. (ns) and (hi) mean models trained by (5) and (6). SGD with $\\epsilon = 0.01$ is applied 20 times for CIFAR-10 and 10 times for STL-10. $k_{\\mathrm{eff}}$ is calculated by 100 samples and $\\delta = 0.001$.\n\nTable 2 entries:\nD: without D; WGAN-gp; SNGAN(ns); SNGAN(hi); SAGAN(ns); SAGAN(hi)\nIS: 7.52(06); 8.03(11); 8.22(07); 8.38(07); 8.36(12); 8.38(05)\nFID: 25.78; 24.47; 21.45; 23.03; 21.07; 21.21\n\nTable 2: Results on scores by $G_{\\mathrm{SAGAN(ns)}}$ after latent space DOT using each $D$ from different training schemes using CIFAR-10 within the DCGAN architecture. Parameters for DOT are the same as in Table 1.\n\nFigure 5: Left images surrounded by blue lines are samples from the conditional generator. The number of updates $N_{\\mathrm{updates}}$ for DOT increases along the horizontal axis. Right images surrounded by red lines correspond to samples after 30 updates with Adam $(\\alpha, \\beta_1, \\beta_2) = (0.01, 0, 0.9)$ and $k_{\\mathrm{eff}}(y) = 1$.\n\n4.2 ImageNet\n\nConditional version of latent space DOT In this section, we show results on the ImageNet dataset. As pre-trained models, we utilize a pair of public models $(G, D)$ [30] of conditional GAN [31] (footnote 4). In conditional GAN, $G$ and $D$ are networks conditioned by label $y$. 
A typical objective function $V_D$ is therefore represented by an average over the label:\n\n$$V_D(G, D) = \\mathbb{E}_{y \\sim p(y)} \\left[ V_D\\left( G(\\cdot|y), D(\\cdot|y) \\right) \\right]. \\tag{24}$$\n\nBut, once $y$ is fixed, $G(z|y)$ and $D(x|y)$ can be regarded as usual networks with input $z$ and $x$ respectively. So, by repeating our argument so far, DOT in conditional GAN can be written as\n\n$$T_{D \\circ G}(z_y|y) = \\arg\\min_z \\left\\{ \\|z - z_y\\|_2 - \\frac{1}{k_{\\mathrm{eff}}(y)} D(G(z|y)|y) \\right\\}, \\tag{25}$$\n\nwhere $k_{\\mathrm{eff}}(y)$ is the approximated Lipschitz constant conditioned by $y$. It is calculated by\n\n$$k_{\\mathrm{eff}}(y) = \\max \\left\\{ \\frac{|D(G(z|y)|y) - D(G(z_y|y)|y)|}{\\|z - z_y\\|_2} \\;\\middle|\\; z, z_y \\sim p_Z \\right\\}. \\tag{26}$$\n\nExperiments We apply gradient descent updates with Adam $(\\alpha, \\beta_1, \\beta_2) = (0.01, 0, 0.9)$. We show results on 4 independent trials in Table 3. It is clear that DOT mildly improves each score. Note that we need some tuning of the hyperparameters $\\epsilon, N_{\\mathrm{updates}}$ as we already commented in 4.1.\n\n# updates=0\n\n# updates=4\n\n# updates=32\ntrial1 ($k_{\\mathrm{eff}}(y) = 1$): 37.61(88), 42.35\ntrial2 ($k_{\\mathrm{eff}}(y) = 1$): 37.02(73), 42.74\ntrial3: 36.88(79), 42.52\ntrial4: 37.29(07), 42.40\n\nTable 3: (Inception score, FID) for each update. The upper 2 cases are executed with $k_{\\mathrm{eff}}(y) = 1$ without calculating (26). We use 50 samples for each label $y$ to calculate $k_{\\mathrm{eff}}(y)$ in the lower 2 trials. 
$\\delta = 0.001$.\n\n36.99(75), 43.01\n36.26(98), 43.09\n36.87(84), 43.11\n36.49(54), 43.25\n\n37.25(84), 42.70\n36.97(63), 42.85\n37.51(01), 42.43\n37.29(86), 42.67\n\n36.40(91), 43.34\n36.68(59), 43.60\n36.64(63), 43.55\n36.23(98), 43.63\n\n# updates=16\n\n4 These are available on https://github.com/pfnet-research/sngan_projection .\n\nAlgorithm 3 Latent space conditional optimal transport by gradient descent\nRequire: trained $G$ and $D$, label $y$, approximated $k_{\\mathrm{eff}}(y)$, sample $z_y$, learning rate $\\epsilon$ and small vector $\\delta$\nInitialize $z \\leftarrow z_y$\nfor $n_{\\mathrm{trial}}$ in range($N_{\\mathrm{updates}}$) do\n    $g = \\nabla_z \\left\\{ \\|z - z_y + \\delta\\|_2 - \\frac{1}{k_{\\mathrm{eff}}(y)} D(G(z|y)|y) \\right\\}$    ($\\delta$ is for preventing overflow.)\n    if noise is generated by $N(0, I_{D \\times D})$ then\n        $g \\leftarrow g - (g \\cdot z) z / \\sqrt{D}$\n    end if\n    $z \\leftarrow z - \\epsilon g$\n    if noise is generated by $U([-1, 1])$ then\n        clip $z \\in [-1, 1]$\n    end if\nend for\nreturn $x = G(z|y)$\n\nEvaluation To calculate FID, we use the available 798,900 image files in the ILSVRC2012 dataset. We reshape each image to the size $299 \\times 299 \\times 3$, and feed all images to the public inception model to get the mean vector $m_w$ and the covariance matrix $C_w$ in the 2,048-dimensional feature space explained in Supplementary Materials.\n\n5 Conclusion\n\nIn this paper, we show the relevance of the discriminator optimal transport (DOT) method on various trained GAN models to improve generated samples. Let us conclude with some comments here.\n\nFirst, the DOT objective function in (22) reminds us of the objective for making adversarial examples [29]. There is a known fast algorithm to make adversarial examples making use of the piecewise-linear structure of the ReLU neural network [32]. The method would also be useful for accelerating DOT.\n\nSecond, latent space DOT can be regarded as improving the prior $p_Z$. A similar idea can also be found in [33]. 
In the usual context of the GAN, we fix the prior, but it may be possible to train the prior itself simultaneously by making use of the DOT techniques.\n\nWe leave these as future works.\n\nAcknowledgments\n\nWe would like to thank Asuka Takatsu for fruitful discussion and Kenichi Bannai for careful reading of this manuscript. This work was supported by computational resources provided by RIKEN AIP deep learning environment (RAIDEN) and RIKEN iTHEMS.\n\nReferences\n\n[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672\u20132680, 2014.\n\n[2] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.\n\n[3] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 271\u2013279, 2016.\n\n[4] Junbo Jake Zhao, Micha\u00ebl Mathieu, and Yann LeCun. Energy-based generative adversarial network. CoRR, abs/1609.03126, 2016.\n\n[5] Mart\u00edn Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.\n\n[6] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. CoRR, abs/1705.02894, 2017.\n\n[7] Thomas Unterthiner, Bernhard Nessler, Calvin Seward, G\u00fcnter Klambauer, Martin Heusel, Hubert Ramsauer, and Sepp Hochreiter. 
Coulomb gans: Provably optimal nash equilibria via potential fields. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\n[8] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and R\u00e9mi Munos. The cramer distance as a solution to biased wasserstein gradients. CoRR, abs/1705.10743, 2017.\n\n[9] Ishaan Gulrajani, Faruk Ahmed, Mart\u00edn Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5767\u20135777, 2017.\n\n[10] Henning Petzka, Asja Fischer, and Denis Lukovnikov. On the regularization of wasserstein gans. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\n[11] Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, and Liqiang Wang. Improving the improved training of wasserstein gans: A consistency term and its dual effect. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\n[12] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.\n\n[13] Tim Salimans, Ian J. 
Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226\u20132234, 2016.\n\n[14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\n[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6626\u20136637, 2017.\n\n[16] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\n[17] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 7354\u20137363, 2019.\n\n[18] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.\n\n[19] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian J. Goodfellow, and Augustus Odena. Discriminator rejection sampling. 
In 7th International Conference on Learning Representations,\nICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.\n\n[20] Ryan D. Turner, Jane Hung, Eric Frank, Yunus Saatchi, and Jason Yosinski. Metropolis-Hastings\ngenerative adversarial networks. In Proceedings of the 36th International Conference\non Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 6345\u20136353,\n2019.\n\n[21] R\u00e9mi Flamary and Nicolas Courty. POT Python optimal transport library, 2017.\n\n[22] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed,\nand Ian J. Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence\nat every step. In 6th International Conference on Learning Representations, ICLR 2018,\nVancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\n[23] C\u00e9dric Villani. Optimal transport: old and new, volume 338. Springer Science & Business\nMedia, 2008.\n\n[24] Gabriel Peyr\u00e9 and Marco Cuturi. Computational optimal transport. Foundations and Trends in\nMachine Learning, 11(5-6):355\u2013607, 2019.\n\n[25] Wilfrid Gangbo and Robert J. McCann. The geometry of optimal transportation. Acta\nMathematica, 177(2):113\u2013161, 1996.\n\n[26] Luis Caffarelli, Mikhail Feldman, and Robert McCann. Constructing optimal maps for Monge's\ntransport problem as a limit of strictly convex costs. Journal of the American Mathematical\nSociety, 15(1):1\u201326, 2002.\n\n[27] Eirikur Agustsson, Alexander Sage, Radu Timofte, and Luc Van Gool. Optimal transport maps\nfor distribution preserving operations on latent spaces of generative models. In 7th International\nConference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,\n2019, 2019.\n\n[28] Tim Salimans, Han Zhang, Alec Radford, and Dimitris N. Metaxas. Improving GANs using\noptimal transport. 
In 6th International Conference on Learning Representations, ICLR 2018,\nVancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\n[29] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J.\nGoodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International\nConference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014,\nConference Track Proceedings, 2014.\n\n[30] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In 6th International\nConference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30\n- May 3, 2018, Conference Track Proceedings, 2018.\n\n[31] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR,\nabs/1411.1784, 2014.\n\n[32] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial\nexamples. In 3rd International Conference on Learning Representations, ICLR 2015,\nSan Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.\n\n[33] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity\nnatural image synthesis. In 7th International Conference on Learning Representations, ICLR\n2019, New Orleans, LA, USA, May 6-9, 2019, 2019.\n", "award": [], "sourceid": 3691, "authors": [{"given_name": "Akinori", "family_name": "Tanaka", "institution": "RIKEN/Keio Univ."}]}