{"title": "Two Generator Game: Learning to Sample via Linear Goodness-of-Fit Test", "book": "Advances in Neural Information Processing Systems", "page_first": 11260, "page_last": 11271, "abstract": "Learning the probability distribution of high-dimensional data is a challenging problem. To solve this problem, we formulate a deep energy adversarial network (DEAN), which casts the energy model learned from real data into an optimization of a goodness-of-fit (GOF) test statistic. DEAN can be interpreted as a GOF game between two generative networks, where one explicit generative network learns an energy-based distribution that fits the real data, and the other implicit generative network is trained by minimizing a GOF test statistic between the energy-based distribution and the generated data, such that the underlying distribution of the generated data is close to the energy-based distribution. We design a two-level alternative optimization procedure to train the explicit and implicit generative networks, such that the hyper-parameters can also be automatically learned. Experimental results show that DEAN achieves high quality generations compared to the state-of-the-art approaches.", "full_text": "Two Generator Game: Learning to Sample via\n\nLinear Goodness-of-Fit Test\n\nLizhong Ding1 Mengyang Yu1 Li Liu1 Fan Zhu1\n\nYong Liu2 Yu Li3 Ling Shao1\n\n1Inception Institute of Arti\ufb01cial Intelligence (IIAI), Abu Dhabi, UAE.\n\n2Institute of Information Engineering, CAS, China.\n\n3King Abdullah University of Science and Technology (KAUST), Saudi Arabia.\n\nAbstract\n\nLearning the probability distribution of high-dimensional data is a challenging\nproblem. To solve this problem, we formulate a deep energy adversarial network\n(DEAN), which casts the energy model learned from real data into an optimization\nof a goodness-of-\ufb01t (GOF) test statistic. 
DEAN can be interpreted as a GOF game\nbetween two generative networks, where one explicit generative network learns an\nenergy-based distribution that fits the real data, and the other implicit generative\nnetwork is trained by minimizing a GOF test statistic between the energy-based\ndistribution and the generated data, such that the underlying distribution of the\ngenerated data is close to the energy-based distribution. We design a two-level\nalternative optimization procedure to train the explicit and implicit generative\nnetworks, such that the hyper-parameters can also be automatically learned.\nExperimental results show that DEAN achieves high quality generations compared to the\nstate-of-the-art approaches.\n\n1 Introduction\n\nLearning the probability distribution of high-dimensional data, such as images and natural language\ncorpora, is a challenging problem in machine learning. Traditionally, we define a parametric family\nof densities {p(x; \u03b8), \u03b8 \u2208 \u0398} and find the one with the maximum likelihood using data {x_i}_{i=1}^m in\n\u0398 (known as maximum likelihood estimation, MLE) [KW56]. However, the normalization factor\nintroduces difficulties during the MLE training, because it is an integration over all configurations of\nrandom variables. Markov chain Monte Carlo (MCMC) [ADFDJ03, SEMFV17] could be used, but\nthe distributions of real-world data, such as images, have an intriguing property: probability mass\nis concentrated in sharp ridges that are separated by large low-probability regions. 
This complexity of\nthe probability landscape is a road block and a challenge that MCMC methods have to meet [BCV13].\nInspired by the representation ability of the hierarchical models of deep learning [HS06, HOT06,\nB+09, LBH15, GBCB16], and as an alternative to MLE approaches, generative adversarial networks\n(GANs) [GPAM+14] represent an important milestone on the path towards more effective generative\nmodels [RMC16, CDH+16, NCT16, ACB17b, ZML17, LCC+17, BSAG18]. GANs are\nimplicit generative models (IGMs), which generate images drawn from an unknown complex\nhigh-dimensional distribution p(x), using an implicit distribution q(x; \u03b8_g) usually represented by\na deep network with parameter \u03b8_g. No estimation of likelihoods or exact inference is required\nin GAN-like models. This class of models has recently led to many impressive results [HLP+17,\nJZL+17, KALL17, BDS18], and different variants have been proposed for specific tasks, such as\nconditional GAN [MO14], Pix2Pix [IZZE17], CycleGAN [ZPIE17] and StarGAN [CCK+18].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn our opinion, the existing GAN models are fundamentally two-sample test problems. The goal\nof the two-sample test is to determine whether two distributions p and q are different, based on\nsamples D_x = {x_i}_{i=1}^n \u2282 X and D_{x'} = {x'_j}_{j=1}^m \u2282 X independently drawn from p and q,\nrespectively. GANs can be considered two-sample test problems because they need to decide whether\nthe underlying distribution p(x) of real data and an implicit distribution q(x), which generates fake\ndata, are different. From this perspective, we summarize existing GAN models into two categories\nbased on how they measure the discrepancy between p(x) and q(x). The first category is the integral\nprobability metric (IPM) [M\u00fcl97]. 
For a class F of functions, the IPM \u03b4 between two distributions p\nand q is defined as\n\n$$\\delta(p, q) = \\sup_{f \\in \\mathcal{F}} \\left| \\int_{\\mathcal{X}} f(x) p(x) \\, dx - \\int_{\\mathcal{X}} f(x') q(x') \\, dx' \\right|.$$\n\nIf F is a class of Lipschitz functions, \u03b4(p, q) is called the Wasserstein IPM. Wasserstein GANs\n(WGANs) were proposed based on the Wasserstein IPM [ACB17a, GAA+17]. If F is a unit\nball within the reproducing kernel Hilbert space (RKHS), \u03b4(p, q) is called the maximum mean\ndiscrepancy (MMD), which has been attracting much attention due to its solid theoretical foundations\n[SFG+09, GBR+12, GSS+12, ZGB13, DLL+18]. It is natural that MMD was introduced into\nGAN-type learning, named MMD-GAN [LSZ15, DRG15, STS+17, LCC+17, BSAG18, ASBG18]. The\nsecond category is the \u03b6-divergence [CS04], which is defined as\n\n$$\\delta_\\zeta(p, q) = \\int_{\\mathcal{X}} q(x) \\, \\zeta\\left( \\frac{p(x)}{q(x)} \\right) dx,$$\n\nwhere \u03b6 is a convex, lower-semicontinuous function satisfying \u03b6(1) = 0. For different \u03b6, we have\ndifferent \u03b4_\u03b6(p, q) and hence we can design different GAN models [NCT16]. For example, the\npioneering GAN [GPAM+14] is based on the Jensen-Shannon divergence, and the least squares GAN\nis related to the Pearson \u03c7^2 divergence [MLX+17], both of which are \u03b6-divergences.\nIn this paper, we propose a new paradigm that casts the generative adversarial learning as a\ngoodness-of-fit (GOF) test problem. It is fundamentally different from the existing GAN models principled\non two-sample tests. The aim of the goodness-of-fit (GOF) test is to determine how well a given\nmodel distribution p fits a set of given samples D_x = {x_i}_{i=1}^n from an unknown distribution q. 
The\nknowledge of p is what distinguishes the GOF test from the two-sample test, and brings higher power\n(i.e., probability of correctly rejecting the null hypothesis) to the GOF test statistics compared to the\ntwo-sample test statistics. Higher power in hypothesis testing suggests higher discriminability in\nGAN training. Specifically, by adopting the energy model to simulate the underlying distribution\nof the real data, we propose a deep energy adversarial network (DEAN) that casts the adversarial\nlearning as an optimization of a GOF test statistic. We adopt a variant of the finite set Stein discrepancy\n(vFSSD) [JXS+17] as the GOF test statistic, which is a linear-time nonparametric kernel test statistic\nand shows stronger power than the two-sample test statistic, MMD. The proposed DEAN can be\ninterpreted as a novel two generator game via GOF tests: one explicit generator is designed to learn\nan energy-based distribution (EBD), which maps the real data to a scalar energy-based probability,\nand the other implicit generator is trained by minimizing the vFSSD between the EBD and the\ngenerated data. We design a two-level alternative optimization procedure to train the two generators,\nsuch that the explicit one provides the formulation of the distribution and the implicit one produces\ngenuine-looking images. It is worth noting that the DEAN framework with two generators proposed\nin this paper is versatile and able to yield specific training algorithms for different architectures of\ndeep neural networks.\n\n2 Related Work\n\nEnergy-Based GANs. Energy-based models capture dependencies over random variables by defining\nan energy function. The energy function maps each configuration of random variables to a scalar\nenergy value, where lower energy values are assigned to more likely configurations. 
In general, the\nexact MLE of an energy model is challenging to calculate due to the difficulty of evaluating the\nnormalization constant and its gradient. To overcome this difficulty, deep energy models [NCKN11,\nXLZW16, SMSH18] and a series of energy-based GANs have been proposed, including EBGAN\n[ZML17], calibrated EBGAN [DAB+17], BEGAN [BSM17] and MAGAN [WCCD17]. In this\npaper, we adopt the energy-based models to fit the real data, and use the resulting energy-based\ndistribution as the known distribution in the GOF test to optimize the implicit generator.\nScore Matching and Stein\u2019s Method. Score matching was developed for the parameter estimation\nof unnormalized densities, because the partition function is eliminated in the score function\n[Hyv05, SMSH18]. For the GOF test, traditional methods need to calculate the likelihoods of the\nmodels. However, for large deep generative models, this is computationally intractable due to the\ncomplexity of the models. Recently, Stein\u2019s method [S+72, OGC17] was introduced into the kernel\ndomain [GM17, WL16, LW16, LW18, FWL17, DLL+19b], which combines Stein\u2019s identity with the\nRKHS theory. This is a likelihood-free method that depends on the known distribution p only through\nlogarithmic derivatives, and is closely related to score matching. The proposed statistic is referred to\nas kernel Stein discrepancy (KSD) [CSG16, LLJ16]. To improve the performance of KSD and MMD,\nJitkrittum et al. [JXS+17] proposed the finite set Stein discrepancy (FSSD) by introducing a witness\nfunction on a finite set. Inspired by [JXS+17], we introduce score matching and Stein\u2019s method into\nthe domain of generative adversarial learning, making the GOF test possible by providing one of the\ndistributions p and q. 
We eliminate the partition function by taking logarithmic derivatives of the\nenergy-based distribution that is directly involved in optimizing the implicit generator.\nGoodness-of-fit Test for Generative Model Learning. In recent years, two families of methods\nfor generative model learning have emerged [HYSX18]: generative adversarial networks (GANs) and\nautoencoders (AEs) or variational AEs (VAEs), which are two distinct paradigms and have both\nbeen studied extensively. Our paper and [PDB18] both introduce GOF tests into deep generative\nmodeling, but fall into different paradigms: [PDB18] is an AE-based method without adversarial\nlearning while our paper is a GAN-type approach. The HTAE (hypothesis testing AE) in [PDB18]\nminimized the reconstruction error, but no adversarial learning (min-max adversarial optimization)\nwas involved. The statistic in this paper is a kernel-based nonparametric GOF statistic. The\nShapiro-Wilk test in [PDB18] is a traditional parametric GOF statistic for testing normality.\n\n3 Background\n\nThe paradigm of generative adversarial networks (GANs) [GPAM+14] generates samples using a\ntraining procedure that pits a generator G against a discriminator D. D is trained to distinguish\ntraining samples from the samples produced by G, while G is trained to increase the probability of\nits samples being incorrectly classified as real data. In the original formulation [GPAM+14], the\ntraining procedure defines a minimax game\n\n$$\\min_G \\max_D \\; \\mathbb{E}_{x \\sim p(x)} [\\log D(x)] + \\mathbb{E}_{z \\sim p_z(z)} [\\log(1 - D(G(z)))],$$\n\nwhere p(x) is a data distribution in R^d, D is a function that maps R^d to [0, 1], and G is a function\nthat maps a noise vector z \u2208 R^m, drawn from a simple distribution p_z(z), to the ambient space of the\ntraining data. 
The idealized algorithm can be shown to converge and to minimize the Jensen-Shannon\ndivergence between the data generating distribution and the distribution parameterized by G.\nLet H_\u03ba be a reproducing kernel Hilbert space (RKHS) defined on the data domain X with the\nreproducing kernel \u03ba : X \u00d7 X \u2192 R. We consider the function class F as a unit ball in a universal\nRKHS H_\u03ba, since this class is rich enough to show equivalence between the zero expectation of\nthe statistics and the equality of two distributions [FBJ04, SGF+10, Ste01, MXZ06]. Universality\nrequires that \u03ba is continuous and H_\u03ba is dense in the space of bounded continuous functions C(X)\nwith respect to the L^\u221e norm. Gaussian and Laplacian RKHSs are universal [Ste01].\nThe mean embedding of a distribution p in F, written as \u03bc_\u03ba(p) \u2208 F, is defined such that E_{x\u223cp} f(x) =\n\u27e8f, \u03bc_\u03ba(p)\u27e9 for all f \u2208 F. The squared MMD between two distributions p and q is the squared RKHS\ndistance between their respective mean embeddings,\n\n$$\\mathrm{MMD}^2[\\mathcal{F}, p, q] = \\| \\mu_\\kappa(p) - \\mu_\\kappa(q) \\|_{\\mathcal{F}}^2 = \\mathbb{E}_{z, z'} \\, h(z, z'),$$\n\nwhere z = (x, y), z' = (x', y') and h(z, z') = \u03ba(x, x') + \u03ba(y, y') \u2212 \u03ba(x, y') \u2212 \u03ba(x', y). 
It has been\nproved that for a unit ball F in a universal RKHS, MMD[F, p, q] = 0 if and only if p = q [GBR+12].\nFor two sets of samples D_x = {x_i}_{i=1}^n \u2282 X \u2286 R^d, where x_i \u223c p i.i.d., and D_y = {y_j}_{j=1}^m \u2282\nY \u2286 R^d, where y_j \u223c q i.i.d., if we assume m = n, the minimum variance unbiased estimator of\nMMD^2[F, p, q] can be represented as\n\n$$\\mathrm{MMD}^2_{\\mathrm{Unb}}[\\mathcal{F}, p, q] = \\frac{1}{n(n - 1)} \\sum_{i \\neq j} h(z_i, z_j).$$\n\nThe typical two-sample test based GANs are MMD GANs [LSZ15, DRG15, STS+17, LCC+17],\nwhich train the parameter \u03b8_g by optimizing\n\n$$\\arg\\min_{\\theta_g} \\mathrm{MMD}^2_{\\mathrm{Unb}}[\\mathcal{F}, p, G(z; \\theta_g)].$$\n\n4 Deep Energy Adversarial Network\n\nIn this section, we present the deep energy adversarial network (DEAN). Our primary contribution\nis a new paradigm for generative adversarial learning, which consists of two generative networks:\nan explicit one that learns an energy-based distribution (EBD) fitting the real data, and an implicit\none that produces genuine-looking images by minimizing the discrepancy between the underlying\ndistribution of the generated data and the EBD produced by the explicit generative network.\nThe following characteristics make the proposed DEAN distinguishable from the existing GANs.\nFirst, DEAN makes it possible for the generative adversarial learning to approximate the underlying\ndistribution of the real data, not just produce fake data that mimics the real data. Second, the GOF\ntest is adopted to replace the two-sample test, such that the knowledge of p is used to increase the\ntest power (probability of correctly rejecting the null hypothesis). This power can be understood\nas the discriminability in GAN training. Third, DEAN can be considered an algorithm for training\ndeep energy-based models [NCKN11, XLZW16, SMSH18], where the implicit generator is used to\nprovide \u201cnegative\u201d samples. 
Fourth, the explicit generator plays a role similar to the discriminator\nof the existing GAN models. DEAN is a two generator game.\n\n4.1 Energy Estimator Network\nEnergy-based models E_\u03b8(x) : X \u2192 R associate an energy value with a sample x, where \u03b8 are the\nparameters. Ideally, high energy is assigned to the generated fake data, and low energy to real data.\nWe can obtain a distribution based on E_\u03b8(x),\n\n$$p(x; \\theta) = \\frac{1}{Z_\\theta} \\exp(-E_\\theta(x)).$$\n\nThe parameters \u03b8 of the energy function are often learned to maximize the likelihood of the data; the\nmain challenge in this optimization is evaluating the partition function Z_\u03b8 = \u222b_x exp(\u2212E_\u03b8(x)), which\nis an intractable sum or integral for most high-dimensional problems.\nNow we define the loss function of the explicit generative network (EGN) of DEAN as follows:\n\n$$\\min_{\\theta_e} \\; E(x; \\theta_e) + \\left[ \\gamma - E\\left( G(z; \\theta_g^*); \\theta_e \\right) \\right]^+, \\qquad (1)$$\n\nwhere E(x; \u03b8_e) is an energy model parameterized by \u03b8_e, [\u00b7]^+ = max(\u00b7, 0) and \u03b3 is a given positive\nmargin. We can use D_{x'} = {x'_i := G(z_i; \u03b8_g^*)}_{i=1}^n to denote the generated fake samples with\n\u03b8_g^* optimized in the implicit generator network, where n is the batch size. D_{x'} and the real data\nD_x = {x_i}_{i=1}^n are both fed into Equation (1), where the real data D_x = {x_i}_{i=1}^n is forced to have\nlow energy, while generated fake data D_{x'} is forced to have high energy. This loss function (1) is\npossibly the simplest energy-based loss and is the same as that of EBGAN [ZML17]. However, for\nthe DEAN framework, other energy-based losses can also be adopted, such as the losses in calibrated\nEBGAN [DAB+17], BEGAN [BSM17] and MAGAN [WCCD17].\nWhen the network parameters \u03b8_e^* are optimized, we can define a probability distribution\n\n$$p(x; \\theta_e^*) = \\frac{1}{Z_{\\theta_e^*}} \\exp(-E(x; \\theta_e^*)).$$\n\nWe take two cases of E(x; \u03b8_e) as examples. First, we consider the Gaussian-Bernoulli restricted\nBoltzmann machine (RBM) [HS06], which is a hidden variable graphical model consisting of a\ncontinuous observable variable, x \u2208 R^d, and a binary hidden variable, r \u2208 {\u00b11}^{d_h} (footnote 1). We write\n\n$$E(x; \\theta_e) = \\frac{1}{2} \\|x\\|^2 - b^T x - \\varsigma(B^T x + c),$$\n\nwhere \u03b8_e = {b, B, c} and \u03c2(v) = \u2211_{i=1}^n log(exp(v_i) + exp(\u2212v_i)). For optimized \u03b8_e^*, we have\n\n$$p(x; \\theta_e^*) = \\frac{1}{Z_{\\theta_e^*}} \\exp(-E(x; \\theta_e^*)).$$\n\nFootnote 1: The joint probability distribution of x and r is p(x, r) = (1/Z_{\u03b8_e}) exp(x^T B r + b^T x + c^T r \u2212 (1/2)\u2016x\u2016^2).\n\nSecond, we consider a deep auto-encoder as a more complex energy model\n\n$$E(x; \\theta_e) = \\| x - \\mathrm{AE}(x; \\theta_e) \\|,$$\n\nwhere AE(x; \u03b8_e) denotes a deep auto-encoder parameterized by \u03b8_e. For the optimized parameters \u03b8_e^*,\nwe can define\n\n$$p(x; \\theta_e^*) = \\frac{1}{Z_{\\theta_e^*}} \\exp(-E(x; \\theta_e^*)).$$\n\nIn the implicit generator network of DEAN shown in the next section, we will introduce a score\nfunction [Hyv05] to avoid calculating the partition function Z_{\u03b8_e^*},\n\n$$s(x; \\theta_e^*) = \\nabla_x \\log p(x; \\theta_e^*) = -\\nabla_x E(x; \\theta_e^*),$$\n\nsince Z_{\u03b8_e^*} is independent of x. We will fully exploit the knowledge of the distribution p by introducing\nthe Stein operator [S+72, OGC17]. 
It is the knowledge of p that distinguishes the GOF test from the\ntwo-sample test, and makes the DEAN paradigm fundamentally different from the existing GANs.\n\n4.2 GOF-driven Generator Network\n\nWe present the implicit generative network (IGN) of DEAN, which is trained by minimizing a GOF\ntest statistic between the energy-based distribution p(x; \u03b8_e^*) learned by the EGN and the generated\n(fake) data, such that the underlying distribution of the generated data is close to p(x; \u03b8_e^*).\nWe first introduce the Stein operator [S+72, OGC17], which depends on the distribution p\nonly through logarithmic derivatives. A Stein operator T_p takes a multivariate function f(x) =\n(f_1(x), . . . , f_d(x))^T \u2208 R^d as input and outputs a function (T_p f)(x) : R^d \u2192 R. The function T_p f\nhas the key property that, for all f in an appropriate function class, E_{x\u223cq}[(T_p f)(x)] = 0 if and only\nif p = q. Thus, this expectation can be used to test the goodness-of-fit: how well a model distribution\np fits a given set of samples {x_i}_{i=1}^n \u2282 X \u2286 R^d from an unknown distribution q.\nWe consider the function class F^d := F \u00d7 \u00b7\u00b7\u00b7 \u00d7 F, where F is a unit-norm ball in a universal\nRKHS. Assume that f_i \u2208 F for all i = 1, . . . , d such that f \u2208 F^d with the inner product \u27e8f, g\u27e9_{F^d} :=\n\u2211_{i=1}^d \u27e8f_i, g_i\u27e9_F for g \u2208 F^d. According to the reproducing property of F, f_i(x) = \u27e8f_i, \u03ba(x, \u00b7)\u27e9_F, and\nsince \u2202\u03ba(x, \u00b7)/\u2202x_i \u2208 F, we define\n\n$$\\omega_p(x, \\cdot) = \\frac{\\partial \\log p(x)}{\\partial x} \\kappa(x, \\cdot) + \\frac{\\partial \\kappa(x, \\cdot)}{\\partial x}.$$\n\nThe kernel Stein operator can be written as\n\n$$(T_p f)(x) = \\sum_{i=1}^d \\left( \\frac{\\partial \\log p(x)}{\\partial x_i} f_i(x) + \\frac{\\partial f_i(x)}{\\partial x_i} \\right) = \\langle f, \\omega_p(x, \\cdot) \\rangle_{\\mathcal{F}^d}.$$\n\nNow we introduce the kernel Stein discrepancy (KSD) [CSG16, LLJ16], which is formulated as\n\n$$\\mathrm{KSD}[\\mathcal{F}^d, p, D_x] = \\sup_{\\|f\\|_{\\mathcal{F}^d} \\le 1} \\langle f, \\mathbb{E}_{x \\sim q} \\, \\omega_p(x, \\cdot) \\rangle := \\| g(\\cdot) \\|_{\\mathcal{F}^d}, \\qquad (2)$$\n\nwhere g(\u00b7) = E_{x\u223cq} \u03c9_p(x, \u00b7).\nLet V = {v_1, . . . , v_J} \u2282 R^d be random vectors drawn from a distribution, where J is a pre-defined\nhyper-parameter. The statistic of the finite set Stein discrepancy (FSSD) [JXS+17] is defined as\n\n$$\\mathrm{FSSD}^2[\\mathcal{F}^d, p, D_x] = \\frac{1}{dJ} \\sum_{i=1}^d \\sum_{j=1}^J g_i^2(v_j),$$\n\nwhere g(\u00b7) is referred to as the Stein witness function, given in Equation (2).\nIn the following, we present a variant of FSSD as the loss function of the IGN of DEAN. Let\n\u03a9(x) \u2208 R^{d\u00d7J}, such that\n\n$$\\Omega(x)_{i,j} = \\omega_{p,i}(x, v_j) / \\sqrt{dJ}, \\qquad \\tau(x) = \\mathrm{vec}(\\Omega(x)) \\in \\mathbb{R}^{dJ},$$\n\nwhere vec(\u00b7) denotes the vectorization of matrices. The unbiased estimator of FSSD^2 is defined as\n\n$$\\widehat{\\mathrm{FSSD}^2}[\\mathcal{F}^d, p, D_x] = \\frac{2}{n(n - 1)} \\sum_{i < j} \\Delta(x_i, x_j),$$\n\nwhere \u0394(x, y) = \u03c4(x)^T \u03c4(y). Without loss of generality, we will adopt $\\widehat{\\mathrm{FSSD}^2}[p, D_x]$ as an abbreviation,\nsince the function class F^d is fixed when the kernel is given.\nNow we present a variant of $\\widehat{\\mathrm{FSSD}^2}[p, D_x]$ as the loss function. The reason for introducing a variant\nis as follows. According to Proposition 2 in [JXS+17], under the alternative hypothesis H_1 : p \u2260 q,\n\n$$n \\, \\widehat{\\mathrm{FSSD}^2} \\sim \\sqrt{n} \\, \\mathcal{N}(0, \\sigma_{H_1}) + n \\, \\mathrm{FSSD}^2,$$\n\nif \u03c3_{H_1} = 4\u03bc^T \u03a3_q \u03bc > 0, where \u03bc = E_{x\u223cq}[\u03c4(x)] and \u03a3_q = cov_{x\u223cq}[\u03c4(x)] \u2208 R^{dJ\u00d7dJ}. From the\nabove equation, we know that $n \\, \\widehat{\\mathrm{FSSD}^2}$ is highly dependent on the dimension of the data: when the\ndimension d increases, the dimension of \u03a3_q will also increase, and then the variance \u03c3_{H_1} becomes\nlarger. When the variance becomes larger, the resulting values of the statistic will become unstable.\nTo alleviate the impact of dimension and stabilize the statistic, we introduce\n\n$$\\mathrm{vFSSD}[p, D_x] = \\frac{1}{\\hat{\\sigma}_{H_1}} \\widehat{\\mathrm{FSSD}^2}[p, D_x]$$\n\nas the variant of $\\widehat{\\mathrm{FSSD}^2}[p, D_x]$ [JXS+17], where $\\hat{\\sigma}_{H_1}$ is an empirical estimate of \u03c3_{H_1} = 4\u03bc^T \u03a3_q \u03bc,\nwhich is the empirical variance of the limiting distribution of $\\sqrt{n} ( \\widehat{\\mathrm{FSSD}^2} - \\mathrm{FSSD}^2 )$. Now we\ndefine the loss function of the IGN as follows:\n\n$$\\min_{\\theta_g} \\max_{\\xi} \\; \\mathrm{vFSSD}_\\xi [p(x; \\theta_e^*), D_{x'}], \\qquad (3)$$\n\nwhere \u03be = {{v_i}_{i=1}^J, \u03c3_k} denotes the hyper-parameters of vFSSD, including the kernel parameter\n\u03c3_k and J test locations {v_i}_{i=1}^J.\nRemark: In Equation (3), the inner maximum is used to optimize the hyper-parameters \u03be of IGN\nitself. This is similar to the idea of Equation (3) in [LCC+17]. We can set random values for the\nhyper-parameters \u03be. If so, we solve the DEAN framework by alternately optimizing the loss function\nof EGN (Equation (1)) and the loss function of IGN (Equation (3)) with fixed \u03be. However, maximizing\nEquation (3) with respect to the hyper-parameters \u03be can increase the test power of vFSSD (footnote 2), which\nwill eventually force the IGN to produce more realistic-looking images.\nTherefore, we present the following two objectives, (4) and (5), to optimize Equation (3) and improve\nthe test power of DEAN.\n\n$$\\max_{\\xi} \\; \\mathrm{vFSSD}_\\xi [p(y; \\theta_e^*), D_{x'^*}], \\qquad (4)$$\n\nwhere D_{x'^*} = {x'^*_i := G(z_i; \u03b8_g^*)}_{i=1}^n and G(z_i; \u03b8_g^*) is a deep network with the optimized parameter\n\u03b8_g^*. The hyper-parameters \u03be = {{v_i}_{i=1}^J, \u03c3_k} will be optimized in Equation (4).\n\n$$\\min_{\\theta_g} \\; \\mathrm{vFSSD}_{\\xi^*} [p(y; \\theta_e^*), D_{x'}], \\qquad (5)$$\n\nwhere \u03be^* = {{v_i^*}_{i=1}^J, \u03c3_k^*} denotes the optimized hyper-parameters, and the parameters \u03b8_g for\nD_{x'} = {x'_i := G(z_i; \u03b8_g)}_{i=1}^n will be optimized.\nIn summary, DEAN is solved by alternately optimizing Equation (1) and Equation (3); Equation (3)\nis solved by alternately optimizing Equation (4) and Equation (5), if necessary. It is a two-level\nalternative optimization procedure. The energy-based probability p(x; \u03b8_e^*), playing the role of a\ndiscriminator, is trained to provide low energy to the real data, and high energy to the fake data\nproduced by the IGN G(z; \u03b8_g). The IGN is trained by minimizing vFSSD between the generated data\nand p(x; \u03b8_e^*), such that the underlying distribution of the generated data gradually becomes closer to\np(x; \u03b8_e^*) that fits the real data.\n\nFootnote 2: Please refer to Proposition 4 of [JXS+17].\n\nFinally, we characterize the solutions of DEAN. Let p_x and p_{x'} be the distributions of real and fake\ndata; p_e denotes the energy-based distribution. In DEAN, p_e is a bridge connecting p_x and p_{x'}. 
For\nthe IGN, the network is trained to have p_{x'} equal to p_e. Please refer to Theorem 1, which can be\neasily proved based on Theorem 1 of [JXS+17]. For the EGN, p_e is learned to estimate p_x. Please see\nTheorem 2, which can be proved according to Theorem 1 of [ZML17] and Theorem 1 of [GPAM+14]\nwith fixed p_{x'}. Different from GANs, which are implicit generative models (IGMs), DEAN can\nexplicitly estimate the underlying distribution of the real data after estimating \u03b8_e and \u03b8_g.\nTheorem 1 We assume that D_{x'} is drawn from p_{x'}. If \u03ba is a universal and analytic kernel;\n\n$$\\mathbb{E}_{a \\sim p_{x'}} \\mathbb{E}_{b \\sim p_e} \\left[ s^T(a) s(b) \\kappa(a, b) + s^T(b) \\nabla_a \\kappa(a, b) + s^T(a) \\nabla_b \\kappa(a, b) + \\sum_{i=1}^d \\frac{\\partial^2 \\kappa(a, b)}{\\partial a_i \\partial b_i} \\right] < \\infty$$\n\nwith s(a) = \u2207_a log p_e(a); E_{a\u223cp_{x'}} \u2016\u2207_a log p_e(a) \u2212 \u2207_a log p_{x'}(a)\u2016^2 < \u221e; lim_{\u2016a\u2016\u2192\u221e} p_e(a) g(a) =\n0, where g(\u00b7) is given in Eq. (2) in Section 4.2; then for any J \u2265 1, almost surely FSSD[p_e, D_{x'}] = 0 if\nand only if p_{x'} = p_e.\n\nTheorem 2 Let\n\n$$\\Lambda(\\theta_e) = E(x; \\theta_e) + \\left[ \\gamma - E\\left( G(z; \\theta_g^*); \\theta_e \\right) \\right]^+.$$\n\nThe minimum of \u039b(\u03b8_e) is achieved if\nand only if p_e = p_x. With the optimized \u03b8_e^*,\n\n$$\\int_{x, z} \\Lambda(\\theta_e^*) \\, p_x(x) \\, p_z(z) \\, dx \\, dz = \\gamma.$$\n\n5 Experiments\n\nHere, we conduct experiments to evaluate the performance of the proposed DEAN as compared with\nthe existing GAN models.\nWe compare DEAN with five related GAN models: DCGAN [RMC16], EBGAN [MLS+17], WGAN-GP\n[SGZ+16], MMD-GAN [LCC+17, BSAG18] and Scaled MMD-GAN (SMMD-GAN) [ASBG18].\nThe evaluations are conducted on three popular datasets: MNIST [LBBH98] (70,000\nimages, 28 \u00d7 28), CIFAR-10 [KH09] (60,000 images, 32 \u00d7 32), and CelebA [YLLT15] (202,599\nface images, resized and cropped to 160 \u00d7 160).\nFor MNIST and CIFAR-10, the IGN of DEAN adopts a DCGAN generator [RMC16] with vFSSD\nas the loss function. An auto-encoder with convolutional layers is adopted as the EGN of DEAN\n(analogous to the discriminators of the existing GANs). The loss of the discriminator is defined in (1),\nwhere we set \u03b3 = 1 as in [ZML17], and E(x; \u03b8_e) = \u2016x \u2212 AE(x; \u03b8_e)\u2016 (footnote 3). For CelebA, we use a ResNet\nas the IGN and an auto-encoder as the EGN. The input noise vector z \u2208 R^128 for the generator (IGN)\nis independently drawn from a standard normal distribution.\n\nFigure 1: Images generated by DEAN with fixed and optimized hyper-parameters \u03be on MNIST and\nCIFAR-10. Panels: (a) fixed \u03be, (b) optimized \u03be, (c) fixed \u03be, (d) optimized \u03be.\n\nKernel selection is important to the performance of kernel methods [DL14b, DL14a, DL17, LLD+18,\nLLD+18, DLL+19a, LLJ+19]. 
We introduce the mixture of linear and rational quadratic functions\ngiven in [BSAG18] as kernel functions for the DEAN framework:\n\n$$\\kappa^{\\mathrm{dot+rq}}(x, y) = \\kappa^{\\mathrm{dot}}(x, y) + \\kappa^{\\mathrm{rq}}(x, y), \\quad \\kappa^{\\mathrm{dot}}(x, y) = \\langle x, y \\rangle, \\quad \\kappa^{\\mathrm{rq}}(x, y) = \\sum_{\\alpha \\in A} \\kappa^{\\mathrm{rq}}_\\alpha(x, y), \\quad \\kappa^{\\mathrm{rq}}_\\alpha(x, y) = \\left( 1 + \\frac{\\|x - y\\|^2}{2\\alpha} \\right)^{-\\alpha},$$\n\nwhere A = {0.2, 0.5, 1, 2, 5}. If we simply calculate pixel-level kernels, the performance is poor due\nto the high dimension of the images. Following [LCC+17], we consider kernels defined on\ntop of a low-dimensional representation \u03c6_\u03b8 : X \u2192 R^s, which implies that \u03ba_\u03b8^{dot+rq}(x, y) =\n\u03ba^{dot+rq}(\u03c6_\u03b8(x), \u03c6_\u03b8(y)). In DEAN, we adopt the output of the inner layer of the auto-encoder as\nthe low-dimensional representation \u03c6_\u03b8. We set the number of test locations J = 5 to compute the\nvalue of vFSSD.\n\nFootnote 3: We also adopted RBM as the energy function at the initial stage. However, the performance of DEAN with\nRBM is not comparable to that with autoencoder, so we discarded the results.\n\nFigure 2: Faces generated by different GAN models trained on CelebA. Panels: (a) DCGAN, (b) WGAN-GP,\n(c) EBGAN, (d) MMD-GAN, (e) SMMD-GAN, (f) DEAN.\n\nAll models are trained on an NVIDIA Tesla V100 GPU. We adopt five EGN updates per IGN\nstep. For IGN, we optimize the hyper-parameter \u03be by Equation (4) for every update. We use initial\nlearning rates of 0.0001 for MNIST, CIFAR-10 and CelebA. We use the Adam optimizer [KB15]\nwith \u03b2_1 = 0.5, \u03b2_2 = 0.9.\nWe show the images generated by DEAN trained on MNIST and CIFAR-10 in Figure 1. We have two\nobservations: a) DEAN can produce genuine-looking images; b) the quality of the images generated\nwith optimized hyper-parameter \u03be is better than that with fixed hyper-parameter. The generated\nimages trained on CelebA are shown in Figure 2. 
We can see that the faces generated by DEAN are\nrealistic-looking. Among the compared methods, the face quality of SMMD-GAN is better than that\nof other GANs. However, the existing MMD loss may discourage the learning of fine details in\ndata [WSH19]. Therefore, higher discriminability of the loss function and automatically tunable\nhyper-parameters of DEAN may help to learn fine details of images.\n\n6 Conclusions\n\nIn this paper, we established the connection between the goodness-of-fit (GOF) test and generative\nadversarial learning, and proposed a new adversarial learning paradigm. It is a game between two\ngenerative networks, fundamentally different from the existing GANs principled on two-sample\ntests, which may open the door for research into generative-to-generative adversarial learning models.\nEmpirical evaluations have shown that DEAN can achieve high quality generations as compared to the\nstate-of-the-art approaches. Besides GOF-test GANs, in the near future, we will study\nindependence-test GANs via the Hilbert-Schmidt independence criterion (HSIC) [GFT+08].\n\nAcknowledgments\n\nThis work was supported in part by National Natural Science Foundation of China (No. 61703396),\nthe CCF-Tencent Open Fund and Shenzhen Government (GJHZ20180419190732022).\n\nReferences\n\n[ACB17a] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.\n\n[ACB17b] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein generative adversarial networks. In ICML, pages 214\u2013223, 2017.\n\n[ADFDJ03] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning. Machine learning, 50(1-2):5\u201343, 2003.\n\n[ASBG18] Michael Arbel, Dougal J. Sutherland, Miko\u0142aj Bi\u0144kowski, and Arthur Gretton. On gradient regularizers for MMD GANs. In NeurIPS 31, 2018.\n\n[B+09] Yoshua Bengio et al. 
Learning deep architectures for AI. Foundations and Trends\u00ae in Machine\nLearning, 2(1):1\u2013127, 2009.\n\n[BCV13] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new\nperspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798\u20131828,\n2013.\n\n[BDS18] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity\nnatural image synthesis. In ICLR, 2018.\n\n[BSAG18] Miko\u0142aj Bi\u0144kowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying\nMMD GANs. In ICLR, 2018.\n\n[BSM17] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: boundary equilibrium generative\nadversarial networks. arXiv preprint arXiv:1703.10717, 2017.\n\n[CCK+18] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo.\nStarGAN: Unified generative adversarial networks for multi-domain image-to-image translation.\nIn CVPR, 2018.\n\n[CDH+16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN:\nInterpretable representation learning by information maximizing generative adversarial nets. In\nNIPS 29, pages 2172\u20132180, 2016.\n\n[CS04] Imre Csisz\u00e1r and Paul C. Shields. Information theory and statistics: A tutorial. Foundations and\nTrends\u00ae in Communications and Information Theory, 1(4):417\u2013528, 2004.\n\n[CSG16] Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In\nICML, pages 2606\u20132615, 2016.\n\n[DAB+17] Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating\nenergy-based generative adversarial networks. In ICLR, 2017.\n\n[DL14a] Lizhong Ding and Shizhong Liao. Approximate consistency: Towards foundations of approximate\nkernel selection. 
In Proceedings of the European Conference on Machine Learning and Principles\nand Practice of Knowledge Discovery in Databases (ECML PKDD), pages 354\u2013369, 2014.\n\n[DL14b] Lizhong Ding and Shizhong Liao. Model selection with the covering number of the ball of\nRKHS. In Proceedings of the 23rd ACM International Conference on Information and Knowledge\nManagement (CIKM), pages 1159\u20131168, 2014.\n\n[DL17] Lizhong Ding and Shizhong Liao. An approximate approach to automatic kernel selection. IEEE\nTransactions on Cybernetics, 47(3):554\u2013565, 2017.\n\n[DLL+18] Lizhong Ding, Shizhong Liao, Yong Liu, Peng Yang, and Xin Gao. Randomized kernel selection\nwith spectra of multilevel circulant matrices. In Proceedings of the 32nd AAAI Conference on\nArtificial Intelligence (AAAI), pages 2910\u20132917, 2018.\n\n[DLL+19a] Lizhong Ding, Yong Liu, Shizhong Liao, Yu Li, Peng Yang, Yijie Pan, Chao Huang, Ling Shao,\nand Xin Gao. Approximate kernel selection with strong approximate consistency. In Proceedings\nof the 33rd AAAI Conference on Artificial Intelligence (AAAI), pages 3462\u20133469, 2019.\n\n[DLL+19b] Lizhong Ding, Zhi Liu, Yu Li, Shizhong Liao, Yong Liu, Peng Yang, Ge Yu, Ling Shao, and Xin\nGao. Linear kernel tests via empirical likelihood for high-dimensional data. In Proceedings of the\n33rd AAAI Conference on Artificial Intelligence (AAAI), pages 3454\u20133461, 2019.\n\n[DRG15] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via\nmaximum mean discrepancy optimization. In UAI, pages 258\u2013267, 2015.\n\n[FBJ04] Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised\nlearning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73\u201399,\n2004.\n\n[FWL17] Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized Stein\nvariational gradient descent. 
arXiv preprint arXiv:1707.06626, 2017.\n\n[GAA+17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville.\nImproved training of Wasserstein GANs. In NIPS 30, pages 5769\u20135779, 2017.\n\n[GBCB16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT\nPress, Cambridge, MA, USA, 2016.\n\n[GBR+12] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch\u00f6lkopf, and Alexander J.\nSmola. A kernel two-sample test. Journal of Machine Learning Research, 13:723\u2013773, 2012.\n\n[GFT+08] Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Sch\u00f6lkopf, and Alex J. Smola.\nA kernel statistical test of independence. In NIPS 21, pages 585\u2013592, 2008.\n\n[GM17] Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In ICML, pages\n1292\u20131301, 2017.\n\n[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,\nAaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS 27, pages 2672\u20132680,\n2014.\n\n[GSS+12] Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano\nPontil, Kenji Fukumizu, and Bharath K Sriperumbudur. Optimal kernel choice for large-scale\ntwo-sample tests. In NIPS 25, pages 1205\u20131213, 2012.\n\n[HLP+17] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. Stacked generative\nadversarial networks. In CVPR, 2017.\n\n[HOT06] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief\nnets. Neural Computation, 18(7):1527\u20131554, 2006.\n\n[HS06] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural\nnetworks. Science, 313(5786):504\u2013507, 2006.\n\n[HYSX18] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. On unifying deep generative\nmodels. In ICLR, 2018.\n\n[Hyv05] Aapo Hyv\u00e4rinen. 
Estimation of non-normalized statistical models by score matching. Journal of\nMachine Learning Research, 6:695\u2013709, 2005.\n\n[IZZE17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with\nconditional adversarial networks. In CVPR, pages 5967\u20135976, 2017.\n\n[JXS+17] Wittawat Jitkrittum, Wenkai Xu, Zolt\u00e1n Szab\u00f3, Kenji Fukumizu, and Arthur Gretton. A linear-time\nkernel goodness-of-fit test. In NIPS 30, pages 261\u2013270, 2017.\n\n[JZL+17] Yanghua Jin, Jiakai Zhang, Minjun Li, Yingtao Tian, Huachun Zhu, and Zhihao Fang. Towards\nthe automatic anime characters creation with generative adversarial networks. arXiv preprint\narXiv:1708.05509, 2017.\n\n[KALL17] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for\nimproved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.\n\n[KB15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\n[KH09] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.\nTechnical report, 2009.\n\n[KW56] Jack Kiefer and Jacob Wolfowitz. Consistency of the maximum likelihood estimator in the\npresence of infinitely many incidental parameters. Annals of Mathematical Statistics, pages\n887\u2013906, 1956.\n\n[LBBH98] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied\nto document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.\n\n[LCC+17] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnab\u00e1s P\u00f3czos. MMD GAN:\nTowards deeper understanding of moment matching network. In NIPS 30, pages 2200\u20132210,\n2017.\n\n[LLD+18] Yong Liu, Hailun Lin, Lizhong Ding, Weiping Wang, and Shizhong Liao. 
Fast cross-validation.\nIn Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), pages\n2497\u20132503, 2018.\n\n[LLJ16] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit\ntests. In ICML, pages 276\u2013284, 2016.\n\n[LLJ+19] Yong Liu, Shizhong Liao, Shali Jiang, Lizhong Ding, Hailun Lin, and Weiping Wang. Fast\ncross-validation for kernel-based algorithms. IEEE Transactions on Pattern Analysis and Machine\nIntelligence, 2019.\n\n[LSZ15] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML,\npages 1718\u20131727, 2015.\n\n[LW16] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian\ninference algorithm. In NIPS 29, pages 2378\u20132386, 2016.\n\n[LW18] Qiang Liu and Dilin Wang. Stein variational gradient descent as moment matching. In NeurIPS\n31, pages 8854\u20138863, 2018.\n\n[MLS+17] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. In ICLR,\n2017.\n\n[MLX+17] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley.\nLeast squares generative adversarial networks. In ICCV, pages 2813\u20132821, 2017.\n\n[MO14] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint\narXiv:1411.1784, 2014.\n\n[M\u00fcl97] Alfred M\u00fcller. Integral probability metrics and their generating classes of functions. Advances in\nApplied Probability, 29(2):429\u2013443, 1997.\n\n[MXZ06] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine\nLearning Research, 7:2651\u20132667, 2006.\n\n[NCKN11] Jiquan Ngiam, Zhenghao Chen, Pang W. Koh, and Andrew Y. Ng. Learning deep energy models.\nIn ICML, pages 1105\u20131112, 2011.\n\n[NCT16] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural\nsamplers using variational divergence minimization. 
In NIPS 29, pages 271\u2013279, 2016.\n\n[OGC17] Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration.\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695\u2013718,\n2017.\n\n[PDB18] Aaron Palmer, Dipak Dey, and Jinbo Bi. Reforming generative autoencoders via goodness-of-fit\nhypothesis testing. In UAI, pages 1009\u20131019, 2018.\n\n[RMC16] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep\nconvolutional generative adversarial networks. In ICLR, 2016.\n\n[S+72] Charles Stein et al. A bound for the error in the normal approximation to the distribution of a sum\nof dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical\nStatistics and Probability, Volume 2: Probability Theory. The Regents of the University of\nCalifornia, 1972.\n\n[SEMFV17] Sandro Sch\u00f6nborn, Bernhard Egger, Andreas Morel-Forster, and Thomas Vetter. Markov chain\nMonte Carlo for automated face image analysis. International Journal of Computer Vision,\n123(2):160\u2013183, 2017.\n\n[SFG+09] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Gert RG Lanckriet, and Bernhard\nSch\u00f6lkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions.\nIn NIPS 22, pages 1750\u20131758, 2009.\n\n[SGF+10] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch\u00f6lkopf, and Gert R. G.\nLanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine\nLearning Research, 11:1517\u20131561, 2010.\n\n[SGZ+16] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.\nImproved techniques for training GANs. In NIPS 29, pages 2234\u20132242, 2016.\n\n[SMSH18] Saeed Saremi, Arash Mehrjou, Bernhard Sch\u00f6lkopf, and Aapo Hyv\u00e4rinen. Deep energy estimator\nnetworks. arXiv preprint arXiv:1805.08306, 2018.\n\n[Ste01] Ingo Steinwart. 
On the influence of the kernel on the consistency of support vector machines.\nJournal of Machine Learning Research, 2:67\u201393, 2001.\n\n[STS+17] Dougal J. Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex\nSmola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean\ndiscrepancy. In ICLR, 2017.\n\n[WCCD17] Ruohan Wang, Antoine Cully, Hyung Jin Chang, and Yiannis Demiris. MAGAN: Margin adaptation\nfor generative adversarial networks. arXiv preprint arXiv:1704.03817, 2017.\n\n[WL16] Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized MLE for\ngenerative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.\n\n[WSH19] Wei Wang, Yuan Sun, and Saman Halgamuge. Improving MMD-GAN training with repulsive\nloss function. In ICLR, 2019.\n\n[XLZW16] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In\nICML, pages 2635\u20132644, 2016.\n\n[YLLT15] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. From facial parts responses to face\ndetection: A deep learning approach. In ICCV, pages 3676\u20133684, 2015.\n\n[ZGB13] Wojciech Zaremba, Arthur Gretton, and Matthew Blaschko. B-test: A non-parametric, low\nvariance kernel two-sample test. In NIPS 26, pages 755\u2013763, 2013.\n\n[ZML17] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network.\nIn ICLR, 2017.\n\n[ZPIE17] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image\ntranslation using cycle-consistent adversarial networks. 
In ICCV, pages 2242\u20132251, 2017.", "award": [], "sourceid": 6021, "authors": [{"given_name": "Lizhong", "family_name": "Ding", "institution": "Inception Institute of Artificial Intelligence"}, {"given_name": "Mengyang", "family_name": "Yu", "institution": "Inception Institute of Artificial Intelligence"}, {"given_name": "Li", "family_name": "Liu", "institution": "Inception Institute of Artificial Intelligence"}, {"given_name": "Fan", "family_name": "Zhu", "institution": "Inception Institute of Artificial Intelligence"}, {"given_name": "Yong", "family_name": "Liu", "institution": "Institute of Information Engineering, CAS"}, {"given_name": "Yu", "family_name": "Li", "institution": "King Abdullah University of Science and Technology"}, {"given_name": "Ling", "family_name": "Shao", "institution": "Inception Institute of Artificial Intelligence"}]}