{"title": "The Implicit Metropolis-Hastings Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 13954, "page_last": 13964, "abstract": "Recent works propose using the discriminator of a GAN to filter out unrealistic samples of the generator. We generalize these ideas by introducing the implicit Metropolis-Hastings algorithm. For any implicit probabilistic model and a target distribution represented by a set of samples, implicit Metropolis-Hastings operates by learning a discriminator to estimate the density-ratio and then generating a chain of samples. Since the approximation of density ratio introduces an error on every step of the chain, it is crucial to analyze the stationary distribution of such chain. For that purpose, we present a theoretical result stating that the discriminator loss upper bounds the total variation distance between the target distribution and the stationary distribution. Finally, we validate the proposed algorithm both for independent and Markov proposals on CIFAR-10, CelebA, ImageNet datasets.", "full_text": "The Implicit Metropolis-Hastings Algorithm\n\nKirill Neklyudov\n\nSamsung-HSE Laboratory\nHSE\u2217, Moscow, Russia\n\nSamsung AI Center Moscow\nk.necludov@gmail.com\n\nEvgenii Egorov\n\nSkoltech\u2020, Moscow, Russia\negorov.evgenyy@ya.ru\n\nAbstract\n\nDmitry Vetrov\n\nSamsung-HSE Laboratory\nHSE\u2217, Moscow, Russia\n\nSamsung AI Center Moscow\n\nvetrovd@yandex.ru\n\nRecent works propose using the discriminator of a GAN to \ufb01lter out unrealistic\nsamples of the generator. We generalize these ideas by introducing the implicit\nMetropolis-Hastings algorithm. For any implicit probabilistic model and a target\ndistribution represented by a set of samples, implicit Metropolis-Hastings operates\nby learning a discriminator to estimate the density-ratio and then generating a\nchain of samples. 
Since the approximation of the density ratio introduces an error on every step of the chain, it is crucial to analyze the stationary distribution of such a chain. For that purpose, we present a theoretical result stating that the discriminator loss upper bounds the total variation distance between the target distribution and the stationary distribution. Finally, we validate the proposed algorithm both for independent and Markov proposals on the CIFAR-10, CelebA, and ImageNet datasets.

1 Introduction

Learning a generative model from an empirical target distribution is one of the key tasks in unsupervised machine learning. Currently, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are among the most successful approaches to building such models. Unlike conventional sampling techniques, such as Markov Chain Monte-Carlo (MCMC), they operate by learning an implicit probabilistic model, which allows for sampling but not for density evaluation. Due to the availability of large amounts of empirical data, GANs find many applications in computer vision: image super-resolution (Ledig et al., 2017), image inpainting (Yu et al., 2018), and learning representations (Donahue et al., 2016).
Despite their practical success, GANs remain hard to analyze theoretically and do not provide any guarantees on the learned model. For now, most theoretical results assume optimality of the learned discriminator (critic), which never holds in practice (Goodfellow et al., 2014; Nowozin et al., 2016; Arjovsky et al., 2017). Moreover, there is empirical evidence that GANs do not learn to sample from the target distribution (Arora & Zhang, 2017).
Recently, the idea of postprocessing a GAN by filtering the generator was proposed in several works. 
Under the assumption that the learned discriminator evaluates the exact density ratio, they filter samples from a generator by rejection sampling (Azadi et al., 2018) or by the independent Metropolis-Hastings algorithm (Neklyudov et al., 2018; Turner et al., 2018). Since the assumption of discriminator optimality never holds in practice, we cannot be sure that the resulting distribution will be close to the target; we cannot even guarantee that we will improve on the output of the generator.
In this work, we present a theoretical result that justifies the heuristic proposed by Neklyudov et al. (2018); Turner et al. (2018) and generalizes the proposed algorithm to the case of any implicit probabilistic model — both independent and Markov. To do that, we consider some, possibly suboptimal, discriminator in the Metropolis-Hastings test and approach the problem from the MCMC perspective. Under reasonable assumptions, we derive an upper bound on the total variation distance between the target distribution and the stationary distribution of the produced chain that can be minimized w.r.t. the parameters of the discriminator.

∗National Research University Higher School of Economics
†Skolkovo Institute of Science and Technology

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

On real-world datasets (CIFAR-10, CelebA, ImageNet), we validate our approach using different deep generative models as independent proposals: DCGAN (Radford et al., 2015); Wasserstein GAN with gradient penalty (Gulrajani et al., 2017); variational auto-encoder (Kingma & Welling, 2014); BigGAN (Brock et al., 2018); MMD-GAN (Li et al., 2017). Every model is learned independently by optimizing its original objective, which allows us to test the algorithm on a wide range of different proposals. 
For every proposal, we learn the discriminator from scratch (except for BigGAN) and observe monotonic improvements of the metrics throughout the learning. Further, we construct a Markov proposal by traversing the latent space of the WPGAN generator via a Markov chain. Our experiments demonstrate that this proposal compares favorably against the independent proposal while using the same generator network.
We consider the provided theoretical analysis and the empirical evaluation as a result that allows one to alleviate or even eliminate the bias of any generative model learned from an empirical distribution. We summarize our main contributions as follows.

• We propose the implicit Metropolis-Hastings algorithm, which can be seen as an adaptation of the classical Metropolis-Hastings algorithm to the case of an implicit probabilistic model and an empirical target distribution (Section 3).

• We justify the algorithm proposed by Neklyudov et al. (2018) and Turner et al. (2018). In particular, we demonstrate that learning the discriminator via the binary cross-entropy minimizes an upper bound on the distance between the target distribution and the stationary distribution of the chain (Section 3.5).

• We empirically validate the obtained theoretical result on real-world datasets (CIFAR-10, CelebA, ImageNet) (Section 4.1). We also demonstrate empirical gains by applying our algorithm to Markov proposals (Section 4.2).

2 Background

2.1 The Metropolis-Hastings algorithm

The MH algorithm allows for sampling from an analytic target distribution p(x) by filtering samples from a proposal distribution q(x | y) that is also given in analytic form. 
It operates by sampling a chain of correlated samples that converge in distribution to the target (see Algorithm 1).

Algorithm 1 The Metropolis-Hastings algorithm
input density of target distribution p̂(x) ∝ p(x)
input proposal distribution q(x | y)
  y ← random init
  for i = 0 . . . n do
    sample proposal point x ∼ q(x | y)
    P = min{1, [p̂(x)q(y | x)] / [p̂(y)q(x | y)]}
    xi = x with probability P, xi = y with probability (1 − P)
    y ← xi
  end for
output {x0, . . . , xn}

Algorithm 2 Metropolis-Hastings GAN
input target dataset D
input learned generator q(x), discriminator d(·)
  y ∼ D initialize from the dataset
  for i = 0 . . . n do
    sample proposal point x ∼ q(x)
    P = min{1, [d(x)(1 − d(y))] / [(1 − d(x))d(y)]}
    xi = x with probability P, xi = y with probability (1 − P)
    y ← xi
  end for
output {x0, . . . , xn}

If we take a proposal distribution that is not conditioned on the previous point, we obtain the independent MH algorithm. It operates in the same way but samples all of the proposal points independently: q(x | y) = q(x).

2.2 Metropolis-Hastings GAN

Recent works (Neklyudov et al., 2018; Turner et al., 2018) propose to treat the generator of a GAN as an independent proposal distribution q(x) and to perform an approximate Metropolis-Hastings test via the discriminator. 
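As a concrete illustration, the acceptance test of Algorithm 2 fits in a few lines. The sketch below is a minimal NumPy version under stated assumptions: `sample_proposal` (one draw from the generator) and `d` (a discriminator with values in (0, 1)) are hypothetical stand-ins, not the paper's code.

```python
import numpy as np

def implicit_mh_independent(sample_proposal, d, x0, n_steps, rng):
    """Independent MH filtering of a generator (sketch of Algorithm 2).

    sample_proposal: callable returning one draw from the proposal q(x)
    d:               callable discriminator, values strictly in (0, 1)
    """
    y = x0
    chain = []
    for _ in range(n_steps):
        x = sample_proposal()
        # acceptance probability with the density ratio estimated via d
        p_acc = min(1.0, (d(x) * (1.0 - d(y))) / ((1.0 - d(x)) * d(y)))
        if rng.random() < p_acc:
            y = x                      # accept the proposal
        chain.append(y)                # otherwise keep the previous point
    return np.array(chain)
```

With the optimal discriminator d(x) = p(x)/(p(x) + q(x)) for a pair of 1D Gaussians, the filtered chain recovers the moments of the target even though all proposals come from the mismatched generator.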
The authors motivate this approximation by the fact that the optimal discriminator evaluates the true density ratio:

d∗(x) = p(x) / (p(x) + q(x)) = argmin_d [ −E_{x∼p(x)} log d(x) − E_{x∼q(x)} log(1 − d(x)) ].   (1)

Substituting the optimal discriminator into the acceptance test, one obtains the Metropolis-Hastings correction of a GAN, described in Algorithm 2.
In contrast to the previous works, we take the non-optimality of the discriminator as given and analyze the stationary distribution of the resulting chain for both independent and Markov proposals. In Section 3, we formulate the implicit Metropolis-Hastings algorithm and derive an upper bound on the total variation distance between the target distribution and the stationary distribution of the chain. Then, in Appendix F, we justify Algorithm 2 by relating the obtained upper bound to the binary cross-entropy.

3 The Implicit Metropolis-Hastings Algorithm

In this section, we describe the implicit Metropolis-Hastings algorithm and present a theoretical analysis of its stationary distribution. The implicit Metropolis-Hastings algorithm aims to sample from an empirical target distribution p(x), x ∈ R^D, while being able to sample from an implicit proposal distribution q(x | y). Given a discriminator d(x, y), it generates a chain of samples as described in Algorithm 3.

Algorithm 3 The implicit Metropolis-Hastings algorithm
input target dataset D
input implicit model q(x | y)
input learned discriminator d(·, ·)
  y ∼ D initialize from the dataset
  for i = 0 . . . n do
    sample proposal point x ∼ q(x | y)
    P = min{1, d(x, y) / d(y, x)}
    xi = x with probability P, xi = y with probability (1 − P)
    y ← xi
  end for
output {x0, . . . , xn}

We build our reasoning by first assuming that the chain is generated using some discriminator, then successively introducing conditions on the discriminator and upper bounding the distance between the chain and the target. Finally, we come up with an upper bound that can be minimized w.r.t. 
parameters of the discriminator. Here we consider the case of an implicit Markov proposal, but all of the derivations also hold for independent proposals.
The transition kernel of the implicit Metropolis-Hastings algorithm is

t(x | y) = q(x | y) min{1, d(x, y)/d(y, x)} + δ(x − y) ∫ dx′ q(x′ | y) [1 − min{1, d(x′, y)/d(y, x′)}].   (2)

Firstly, we require the proposal distribution q(x | y) and the discriminator d(x, y) to be continuous and positive on R^D × R^D. In Appendix A, we show that these requirements guarantee the following properties of the transition kernel t:

• the kernel t defines a correct conditional distribution;
• the Markov chain defined by t is irreducible;
• the Markov chain defined by t is aperiodic.

To ensure the existence of a unique invariant probabilistic measure of the chain, we should assume the recurrence of the chain (Theorem 10.0.1, Meyn & Tweedie (2012)). We satisfy the assumption on recurrence by introducing the minorization condition in the next subsection (Orey, 1971). Then the aforementioned properties imply the convergence of the Markov chain defined by the transition kernel t(x | y) to the stationary distribution t∞ (Theorem 4, Roberts et al. (2004)) from any starting point y.
Further, we want the stationary distribution t∞ of our Markov chain to be as close as possible to the target distribution p. 
To measure the closeness of distributions, we consider a standard metric for analysis in MCMC — the total variation distance

‖t∞ − p‖TV = (1/2) ∫ |t∞(x) − p(x)| dx.   (3)

We assume the proposal q(x | y) to be given, but different d(x, y) may lead to different t∞. That is why we want to derive an upper bound on the distance ‖t∞ − p‖TV and minimize it w.r.t. the parameters of the discriminator d(x, y). We derive this upper bound in three steps in the following subsections.

3.1 Fast convergence

In practice, estimating the stationary distribution t∞ by running a chain is impossible. Nevertheless, if we know that the chain converges fast enough, we can upper bound the distance ‖t∞ − p‖TV using the distance ‖t1 − p‖TV, where t1 is the one-step distribution t1(x) = ∫ t(x | y)t0(y)dy, and t0 is some initial distribution of the chain.
To guarantee fast convergence of a chain, we propose to use the minorization condition (Roberts et al., 2004). For a transition kernel t(x | y), it requires that there exist ε > 0 and a distribution ν such that

t(x | y) > εν(x)   ∀(x, y) ∈ R^D × R^D.   (4)

When a transition kernel satisfies the minorization condition, the Markov chain converges "fast" to the stationary distribution. We formalize this statement in the following proposition.

Proposition 1 Consider a transition kernel t(x | y) that satisfies the minorization condition t(x | y) > εν(x) for some ε > 0 and distribution ν. Then the distance between two consecutive steps decreases as

‖tn+2 − tn+1‖TV ≤ (1 − ε)‖tn+1 − tn‖TV,   (5)

where the distribution tk+1(x) = ∫ t(x | y)tk(y)dy.
This result can be considered a corollary of Theorem 8 in Roberts et al. (2004). 
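Proposition 1 can be checked numerically on a finite state space, where the kernel is a column-stochastic matrix. The sketch below builds a hypothetical 3-state kernel that satisfies the minorization condition with ε = 0.3 by construction (a mixture of a fixed distribution ν and an arbitrary stochastic matrix) and verifies the (1 − ε) contraction of consecutive TV distances; the specific ε, state count, and seed are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, k = 0.3, 3
nu = rng.dirichlet(np.ones(k))              # minorizing distribution nu
R = rng.dirichlet(np.ones(k), size=k).T     # arbitrary kernel: columns are distributions
T = eps * nu[:, None] + (1 - eps) * R       # t(x|y) >= eps * nu(x) for every y

tv = lambda a, b: 0.5 * np.abs(a - b).sum() # total variation distance

steps = [rng.dirichlet(np.ones(k))]         # arbitrary initial distribution t0
for _ in range(5):
    steps.append(T @ steps[-1])             # t_{k+1}(x) = sum_y t(x|y) t_k(y)

for n in range(4):                          # contraction of Proposition 1
    assert tv(steps[n + 2], steps[n + 1]) <= (1 - eps) * tv(steps[n + 1], steps[n]) + 1e-12
```

The contraction holds deterministically here: subtracting two distributions cancels the ε ν component of the kernel, and the remaining (1 − ε)-weighted part cannot increase TV distance.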
For consistency, we provide an independent proof of Proposition 1 in Appendix B.
To guarantee the minorization condition for our transition kernel t(x | y), we require the proposal q(x | y) to satisfy the minorization condition with some constant ε and distribution ν (note that for an independent proposal, the minorization condition holds automatically with ε = 1). Also, we limit the range of the discriminator as d(x, y) ∈ [b, 1] ∀x, y, where b is a positive constant that can be treated as a hyperparameter of the algorithm. These requirements imply

t(x | y) ≥ bq(x | y) > bεν(x).   (6)

Using Proposition 1 and the minorization condition (6) for t, we can upper bound the TV-distance between an initial distribution t0 and the stationary distribution t∞ of implicit Metropolis-Hastings:

‖t∞ − t0‖TV ≤ Σ_{i=0}^∞ ‖ti+1 − ti‖TV ≤ Σ_{i=0}^∞ (1 − bε)^i ‖t1 − t0‖TV = (1/(bε)) ‖t1 − t0‖TV.   (7)

Taking the target distribution p(x) as the initial distribution t0(x) of our chain t(x | y), we reduce the problem of estimating the distance ‖t∞ − p‖TV to the problem of estimating the distance ‖t1 − p‖TV:

‖t∞ − p‖TV ≤ (1/(bε)) ‖t1 − p‖TV = (1/(bε)) · (1/2) ∫ dx | ∫ t(x | y)p(y)dy − p(x) |.   (8)

However, the estimation of this distance raises two issues. Firstly, we need to get rid of the inner integral ∫ t(x | y)p(y)dy. Secondly, we need to bypass the evaluation of the densities t(x | y) and p(x). We address these issues in the following subsections.

3.2 Dealing with the integral inside of the nonlinearity

For now, assume that we have access to the densities t(x | y) and p(x). 
However, evaluation of the density t1(x) is infeasible in most cases. To estimate t1(x), one would like to resort to Monte-Carlo estimation:

t1(x) = ∫ t(x | y)p(y)dy = E_{y∼p(y)} t(x | y).   (9)

However, straightforward estimation of t1(x) results in a biased estimate of ‖t1 − p‖TV, since the expectation is inside of a nonlinear function. To overcome this problem, we upper bound this distance in the following proposition.

Proposition 2 For the kernel t(x | y) of the implicit Metropolis-Hastings algorithm, the distance between the initial distribution p(x) and the distribution t1(x) has the following upper bound:

‖t1 − p‖TV ≤ 2 ‖q(y | x)p(x) − q(x | y)p(y) d(x, y)/d(y, x)‖TV,   (10)

where the TV-distance on the right side is evaluated in the joint space (x, y) ∈ R^D × R^D.
For the proof of this proposition, see Appendix C. Note that the obtained upper bound no longer requires evaluation of an integral inside of a nonlinear function. Moreover, the right side of (10) has a reasonable motivation, since it is an averaged l1 error of the density-ratio estimation:

‖q(y | x)p(x) − q(x | y)p(y) d(x, y)/d(y, x)‖TV = (1/2) ∫ p(y)q(x | y) | q(y | x)p(x)/(q(x | y)p(y)) − d(x, y)/d(y, x) | dxdy.   (11)

In this formulation, we see that we could still achieve zero value of ‖t1 − p‖TV if we could take a discriminator that estimates the desired density ratio: d(x, y)/d(y, x) = q(y | x)p(x)/(q(x | y)p(y)).

3.3 Dealing with the evaluation of densities

For an estimation of the right side of (10), we still need the densities p(x) and q(x | y). To overcome this issue, we propose to upper bound the obtained TV distance via the KL-divergence. 
Then we show that the obtained KL divergence decomposes into two terms: the first term requires evaluation of densities but does not depend on the discriminator d(x, y), and the second term can be estimated by evaluating d(x, y) only on samples from p(x) and q(x | y).
To upper bound the TV-distance ‖α − β‖TV via the KL-divergence KL(α ‖ β), one can use the well-known Pinsker inequality:

2‖α − β‖²TV ≤ KL(α ‖ β).   (12)

However, Pinsker's inequality assumes that both α and β are distributions, which is not always true for the function q(x | y)p(y) d(x, y)/d(y, x) in (10). In the following proposition, we extend Pinsker's inequality to the case when one of the functions is not normalized.

Proposition 3 For a distribution α(x) and some positive function f(x) > 0 ∀x, the following inequality holds:

‖α − f‖²TV ≤ ((2Cf + 1)/6) (K̂L(α ‖ f) + Cf − 1),   (13)

where Cf is the normalization constant of the function f: Cf = ∫ f(x)dx, and K̂L(α ‖ f) is the formal evaluation of the KL divergence

K̂L(α ‖ f) = ∫ α(x) log(α(x)/f(x)) dx.   (14)

The proof of the proposition is in Appendix D.
Now we use this proposition to upper bound the right side of (10):

‖q(y | x)p(x) − q(x | y)p(y) d(x, y)/d(y, x)‖²TV ≤ ((2C + 1)/6) (K̂L(q(y | x)p(x) ‖ q(x | y)p(y) d(x, y)/d(y, x)) + C − 1).   (15)

Here C is the normalization constant of q(x | y)p(y) d(x, y)/d(y, x). For the multiplicative term (2C + 1)/6, we upper bound C as

C = ∫ q(x | y)p(y) (d(x, y)/d(y, x)) dxdy ≤ (1/b) ∫ q(x | y)p(y) dxdy = 1/b,   (16)

since we limit the range of the discriminator as d(x, y) ∈ [b, 1] ∀x, y.
Summing up the results (8), (10), (15), (16), we obtain the final upper bound as follows:

‖t∞ − p‖²TV ≤ (1/(b²ε²)) ‖t1 − p‖²TV ≤ (4/(b²ε²)) ‖q(y | x)p(x) − q(x | y)p(y) d(x, y)/d(y, x)‖²TV
≤ ((4 + 2b)/(3ε²b³)) ( E_{x∼p(x), y∼q(y | x)} [ log(d(y, x)/d(x, y)) + d(y, x)/d(x, y) ] − 1 + KL(q(y | x)p(x) ‖ q(x | y)p(y)) ).   (17)

Minimization of the resulting upper bound w.r.t. the discriminator d(x, y) is equivalent to the following optimization problem:

min_d E_{x∼p(x), y∼q(y | x)} [ log(d(y, x)/d(x, y)) + d(y, x)/d(x, y) ].   (18)

Thus, we derive a loss function that we can unbiasedly estimate and minimize w.r.t. the parameters of d(x, y). 
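The expectation in (18) can be estimated unbiasedly from paired samples alone, without any density evaluations. A small sketch under stated assumptions: `d` is a hypothetical vectorized pairwise discriminator, and its values are clipped into the assumed range [b, 1] from the analysis above.

```python
import numpy as np

def ub_loss(d, xs, ys, b=0.1):
    """Monte Carlo estimate of the upper-bound loss (18).

    d:  callable d(x, y) returning discriminator values (assumed vectorized);
        values are clipped to [b, 1] to respect the range assumption.
    xs: samples from the target p(x); ys: paired proposals y ~ q(y | x).
    """
    dxy = np.clip(d(xs, ys), b, 1.0)
    dyx = np.clip(d(ys, xs), b, 1.0)
    r = dyx / dxy                       # estimate of the inverse density ratio
    return np.mean(np.log(r) + r)       # E[log r + r]
```

Note that for any symmetric discriminator, d(x, y) = d(y, x), the ratio r equals one and the loss evaluates to exactly 1, which is its value at the optimum when the two joint distributions in (17) coincide.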
We analyze the optimal solution in the following subsection.

3.4 The optimal discriminator

By taking the derivative of objective (18), we show (see Appendix E) that the optimal discriminator d∗ must satisfy

d∗(x, y)/d∗(y, x) = q(y | x)p(x)/(q(x | y)p(y)).   (19)

When the loss function (18) achieves its minimum, it becomes

E_{x∼p(x), y∼q(y | x)} [ log(q(x | y)p(y)/(q(y | x)p(x))) + q(x | y)p(y)/(q(y | x)p(x)) ] = −KL(q(y | x)p(x) ‖ q(x | y)p(y)) + 1.   (20)

Substituting this equation into (17), we achieve ‖t∞ − p‖TV = 0. However, since we limit the range of the discriminator to d(x, y) ∈ [b, 1], the optimal solution can be achieved only when the density ratio lies in the following range:

q(y | x)p(x)/(q(x | y)p(y)) ∈ [b, b⁻¹]   ∀x, y.   (21)

Therefore, b should be chosen small enough that the range [b, b⁻¹] includes all possible values of the density ratio. Such b > 0 exists if the support of the target distribution is compact. Indeed, if we have positive p(x) and q(x | y) on a compact support, we can find the minimum of the density ratio and set b to that minimum. Moreover, taking a positive q(x | y) on a compact support yields the minorization condition for q(x | y).
If the support of the target distribution is not compact, we may resort to approximating the target distribution on some smaller compact support that contains, say, 99.9% of the whole mass of the target distribution. 
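To illustrate the choice of b, consider a toy 1D setting with an independent proposal, where the ratio q(y | x)p(x)/(q(x | y)p(y)) reduces to r(x)/r(y) with r = p/q. Restricting attention to a compact interval carrying roughly 99.9% of the target mass makes this ratio bounded and yields a valid b. The Gaussian densities and the interval below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Toy example: target p = N(0, 1), independent proposal q = N(0.5, 1.2).
grid = np.linspace(-3.3, 3.3, 2001)                # interval holding ~99.9% of p's mass
r = normal_pdf(grid, 0.0, 1.0) / normal_pdf(grid, 0.5, 1.2)   # r = p / q
b = r.min() / r.max()                              # r(x)/r(y) lies in [b, 1/b] on the grid
```

On the truncated support the ratio is bounded away from zero and infinity, so b is a strictly positive hyperparameter; on the full real line the same ratio is unbounded, which is exactly why the compactness assumption is needed.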
In practice, many problems of generative modeling are defined on a compact support; e.g., the distribution of images lies in a finite support since we represent an image by pixel values.

Table 1: Different losses for density-ratio estimation.

Proposal      Name                               Loss
Markov        Upper bound (UB)                   ∫ dxdy p(x)q(y | x) [ log(d(y, x)/d(x, y)) + d(y, x)/d(x, y) ]
Markov        Markov cross-entropy (MCE)         ∫ dxdy p(x)q(y | x) [ −log d(x, y) − log(1 − d(y, x)) ]
Independent   Conventional cross-entropy (CCE)   ∫ dxdy p(x)q(y) [ −log d(x)(1 − d(y)) ]

3.5 Relation to the cross-entropy

It is possible to upper bound the loss (18) by the binary cross-entropy. For a Markov proposal, it is

E_{x∼p(x), y∼q(y | x)} [ log(d(y, x)/d(x, y)) + d(y, x)/d(x, y) ] ≤ E_{x∼p(x), y∼q(y | x)} [ −log d(x, y) − log(1 − d(y, x)) + 1/b ].   (22)

In the case of an independent proposal, we factorize the discriminator as d(x, y) = d(x)(1 − d(y)) and obtain the following inequality (see Appendix F):

E_{x∼p(x), y∼q(y | x)} [ log(d(y, x)/d(x, y)) + d(y, x)/d(x, y) ] ≤ −E_{x∼p(x)} log d(x) − E_{y∼q(y)} log(1 − d(y)) + 1/b.   (23)

Thus, by learning a discriminator via the binary cross-entropy, we also minimize the distance ‖t∞ − p‖TV. This fact justifies Algorithm 2.

4 Experiments

We present the empirical evaluation of the proposed algorithm and theory for both independent and Markov proposals. For independent proposals, we validate our theoretical result by demonstrating monotonic improvement of the sampling procedure throughout the learning of the discriminator. Further, the implicit MH algorithm with a Markov proposal compares favorably against Algorithm 2 proposed by Neklyudov et al. (2018) and Turner et al. (2018). 
In both cases, sampling via the implicit MH algorithm always improves over straightforward sampling from the proposal. Code reproducing all experiments is available online³.

³https://github.com/necludov/implicit-MH

To assess our theoretical result in practice, we demonstrate that the minimization of the derived upper bounds (17), (22), (23) results in the minimization of the distance between the target distribution and the distribution of the chain. Since one can evaluate the total variation distance only when explicit densities are given, we show its monotonic decrease only for synthetic examples (Appendix G). Also, we provide an analysis of the algorithm with the growth of dimensionality (Appendix G).
For complex empirical distributions, we consider the problem of sampling from the space of images (the CIFAR-10, CelebA, and ImageNet datasets) and resort to the conventional metrics for performance evaluation: the Inception Score (IS) (Salimans et al., 2016) and the Fréchet Inception Distance (FID) (Heusel et al., 2017). Note that these metrics rely heavily on the implementation of the Inception network (Barratt & Sharma, 2018); therefore, for all experiments, we use the PyTorch version of the Inception V3 network (Paszke et al., 2017).

4.1 Independent proposals

Since we propose to use the implicit MH algorithm for any generative model learned from an empirical distribution, we consider five models that are learned with completely different objectives: Deep Convolutional GAN (DCGAN) (Radford et al., 2015), Variational Auto-Encoder (VAE) (Kingma & Welling, 2014), Wasserstein GAN with gradient penalty (WPGAN) (Gulrajani et al., 2017), MMD-GAN (Li et al., 2017), and BigGAN (Brock et al., 2018). We take the generative part of each already-learned model and treat it as an independent proposal distribution in Algorithm 3. For GANs, we take the generator; for the VAE, we take the decoder and the prior. 
Then we learn the discriminator from scratch for all models (except BigGAN, for which we finetune the head of the discriminator) and monitor the performance of Algorithm 3 over the iterations.
Our theoretical result says that the total variation distance between the stationary distribution and the target can be upper bounded by different losses (see Table 1). Note that we can also learn a discriminator by UB and MCE for independent proposals; however, in practice, we found that CCE performs slightly better. In Figure 1, we demonstrate that the minimization of CCE leads to better IS and FID throughout the learning of a discriminator (see plots for all models in Appendix H). However, for a finite empirical distribution, an expressive enough discriminator could overfit to the target dataset. In such a case, the implicit MH algorithm would become infeasible, since it would accept only samples that match points of the dataset. This can be averted by monitoring the acceptance rate and stopping early to prevent overfitting.

Figure 1: Monotonic improvements in terms of FID and IS while learning the discriminator by CCE. During the iterations, we evaluate the metrics several times (scatter) and then average them (solid lines). For a single metric evaluation, we use 10k samples. Higher values of IS and lower values of FID are better. Performance of the original models corresponds to the 0th iteration on the plots.

4.2 Markov proposals

To simulate Markov proposals, we take the same WPGAN as in the independent case and traverse its latent space by a Markov chain. Taking the latent vector zy for the previous image y, we sample the next vector zx via HMC and obtain the next image x = g(zx) from the generator g(·), thus simulating a Markov proposal q(x | y). 
Sampling via HMC from the Gaussian is equivalent to the interpolation\nbetween the previous accepted point zy and the random vector v:\n\nzx = cos(t)zy + sin(t)v,\n\nv \u223c N (0, I).\n\n(24)\n\nIn our experiments, we take t = \u03c0/3. For loss estimation, we condition samples from the proposal on\nsamples from the dataset x \u223c q(x| y), y \u223c p(y). However, to sample an image x \u223c q(x| y) we need\nto know the latent vector zy for an image y from the dataset. We \ufb01nd such vectors by optimization in\nthe latent space, minimizing the l2 reconstruction error (reconstructions are in Fig. 2).\nTo \ufb01lter a Markov proposal, we need to learn a pairwise discriminator, as suggested in Section 3. For\nthis purpose, we take the same architecture of the discriminator as in the independent case and put\n\n8\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u
fffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\uff
…the difference of its logits net(·) into the sigmoid:

d(x, y) = 1 / (1 + exp(net(y) − net(x)))    (25)

We then learn this discriminator by minimizing UB and MCE (see Table 1). In Figure 3, we demonstrate that our Markov proposal compares favorably not only against the original generator of WPGAN, but also against the chain obtained by the independent sampler (Algorithm 2). To provide the comparison, we evaluate both the performance (IS, FID) and the computational effort (rejection rate), showing that for the same rejection rate, our method yields better metrics.

Figure 2: Samples from CIFAR-10 (top line) and their reconstructions (bottom line).
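As a toy illustration of how a discriminator of this form can drive the chain, the sketch below runs a Barker-style acceptance test, accepting a proposed point with probability equal to the sigmoid of the logit difference. Here `net` and `propose` are hypothetical stand-ins (a closed-form Gaussian log density-ratio instead of a learned discriminator), and the argument convention of the acceptance probability is our assumption for the sketch, not necessarily the paper's exact rule.

```python
import numpy as np

# Toy setting (not the paper's learned models): target p = N(0, 1),
# independent proposal q = N(0, SIGMA_Q^2), so the log density-ratio
# log p(x) - log q(x) is available in closed form and plays the role
# of the trained discriminator's logit function net(.).
SIGMA_Q = 1.5

def net(x):
    # log N(x; 0, 1) - log N(x; 0, SIGMA_Q^2), dropping the x-independent constant
    return -0.5 * x**2 + 0.5 * (x / SIGMA_Q) ** 2

def propose(rng):
    # independent proposal: draws do not depend on the current state
    return rng.normal(0.0, SIGMA_Q)

def implicit_mh_chain(x0, n_steps, rng):
    """Accept proposal y over current state x with probability
    sigmoid(net(y) - net(x)) = 1 / (1 + exp(net(x) - net(y))),
    a Barker-style test built from the estimated density ratio."""
    x, xs = x0, []
    for _ in range(n_steps):
        y = propose(rng)
        if rng.random() < 1.0 / (1.0 + np.exp(net(x) - net(y))):
            x = y
        xs.append(x)
    return np.array(xs)
```

With the exact density ratio, the chain's stationary distribution is the target N(0, 1), so long runs should recover its mean and variance; with a learned discriminator, the approximation error in net(·) is exactly what the paper's total-variation bound controls.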
Figure 3: Comparison between different discriminators for the same generator of WPGAN in terms of performance (IS, FID) and computational effort (rejection rate). Higher values of IS and lower values of FID are better. For a single metric evaluation, we use 10k samples. For every snapshot of a discriminator, we evaluate the metrics 5 times (scatter) and then average them (solid lines).

5 Conclusion

In this paper, we propose the implicit Metropolis-Hastings algorithm for sampling from an empirical target distribution, assuming that the proposal is only able to generate samples (without access to its density). In the theoretical part of the paper, we upper-bound the distance between the target distribution and the stationary distribution of the chain. The contribution of the derived upper bound is two-fold: we justify the heuristic algorithm proposed by Neklyudov et al. (2018) and Turner et al. (2018), and we derive the loss functions for the case of a Markov proposal. Moreover, post-processing with the implicit Metropolis-Hastings algorithm can be seen as a theoretical justification of any generative model learned from an empirical target distribution. In the experimental part of the paper, we empirically validate the proposed algorithm on real-world datasets (CIFAR-10, CelebA, ImageNet) using different generative models as proposals. For all models and datasets, filtering via the proposed algorithm alleviates the gap between the target and proposal distributions.

6 Acknowledgements

This research is in part based on the work supported by Samsung Research, Samsung Electronics. Dmitry Vetrov and Kirill Neklyudov were supported by the Russian Science Foundation grant no. 19-71-30020.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Arora, S. and Zhang, Y. Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224, 2017.

Azadi, S., Olsson, C., Darrell, T., Goodfellow, I., and Odena, A. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.

Barratt, S. and Sharma, R. A note on the Inception Score. arXiv preprint arXiv:1801.01973, 2018.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. ICLR, 2014.

Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690, 2017.

Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203–2213, 2017.

Meyn, S. P. and Tweedie, R. L. Markov chains and stochastic stability. Springer Science & Business Media, 2012.

Neklyudov, K., Shvechikov, P., and Vetrov, D. Metropolis-Hastings view on variational inference and adversarial training. arXiv preprint arXiv:1810.07151, 2018.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.

Orey, S. Lecture notes on limit theorems for Markov chain transition probabilities. 1971.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Pollard, D. Asymptopia. Manuscript, Yale University, Dept. of Statistics, New Haven, Connecticut, 2000.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Roberts, G. O. and Rosenthal, J. S. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.

Roberts, G. O. and Rosenthal, J. S. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

Turner, R., Hung, J., Saatci, Y., and Yosinski, J. Metropolis-Hastings generative adversarial networks. arXiv preprint arXiv:1811.11357, 2018.

Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T. S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514, 2018.