{"title": "f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 271, "page_last": 279, "abstract": "Generative neural networks are probabilistic models that implement sampling using feedforward neural networks: they take a random input vector and produce a sample from a probability distribution defined by the network weights. These models are expressive and allow efficient computation of samples and derivatives, but cannot be used for computing likelihoods or for marginalization. The generative-adversarial training method allows to train such models through the use of an auxiliary discriminative neural network. We show that the generative-adversarial approach is a special case of an existing more general variational divergence estimation approach. We show that any $f$-divergence can be used for training generative neural networks. We discuss the benefits of various choices of divergence functions on training complexity and the quality of the obtained generative models.", "full_text": "f-GAN: Training Generative Neural Samplers using\n\nVariational Divergence Minimization\n\nSebastian Nowozin, Botond Cseke, Ryota Tomioka\n\nMachine Intelligence and Perception Group\n\n{Sebastian.Nowozin, Botond.Cseke, ryoto}@microsoft.com\n\nMicrosoft Research\n\nAbstract\n\nGenerative neural samplers are probabilistic models that implement sampling using\nfeedforward neural networks: they take a random input vector and produce a sample\nfrom a probability distribution de\ufb01ned by the network weights. These models\nare expressive and allow ef\ufb01cient computation of samples and derivatives, but\ncannot be used for computing likelihoods or for marginalization. The generative-\nadversarial training method allows to train such models through the use of an\nauxiliary discriminative neural network. 
We show that the generative-adversarial approach is a special case of an existing more general variational divergence estimation approach. We show that any f-divergence can be used for training generative neural samplers. We discuss the benefits of various choices of divergence functions on training complexity and the quality of the obtained generative models.

1 Introduction

Probabilistic generative models describe a probability distribution over a given domain X, for example a distribution over natural language sentences, natural images, or recorded waveforms.
Given a generative model Q from a class Q of possible models we are generally interested in performing one or more of the following operations:

• Sampling. Produce a sample from Q. By inspecting samples or calculating a function on a set of samples we can obtain important insight into the distribution or solve decision problems.
• Estimation. Given a set of iid samples {x1, x2, . . . , xn} from an unknown true distribution P, find Q ∈ Q that best describes the true distribution.
• Point-wise likelihood evaluation. Given a sample x, evaluate the likelihood Q(x).

Generative-adversarial networks (GAN) in the form proposed by [10] are an expressive class of generative models that allow exact sampling and approximate estimation. The model used in GAN is simply a feedforward neural network which receives as input a vector of random numbers, sampled, for example, from a uniform distribution. This random input is passed through each layer in the network and the final layer produces the desired output, for example, an image. Clearly, sampling from a GAN model is efficient because only one forward pass through the network is needed to produce one exact sample.
Such probabilistic feedforward neural network models were first considered in [22] and [3]; here we call these models generative neural samplers.
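Concretely, a generative neural sampler is nothing more than a feedforward pass applied to noise. A minimal NumPy sketch (the two-layer ReLU architecture and all layer sizes here are illustrative choices of ours, not the model of any particular paper):

```python
import numpy as np

def sample_generator(z, W1, b1, W2, b2):
    """Map random input vectors z through a small feedforward network.

    The weights define an implicit distribution over outputs: drawing a
    sample costs one forward pass, but the density of the produced sample
    is not available in closed form (no likelihoods, no marginalization).
    """
    h = np.maximum(0.0, z @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                # unconstrained output sample

rng = np.random.default_rng(0)
d_z, d_h, d_x = 4, 16, 2              # noise, hidden, and output dimensions
W1 = 0.5 * rng.standard_normal((d_z, d_h))
W2 = 0.5 * rng.standard_normal((d_h, d_x))
b1, b2 = np.zeros(d_h), np.zeros(d_x)

# 1000 exact samples: one forward pass each, no rejection sampling or MCMC.
z = rng.uniform(-1.0, 1.0, size=(1000, d_z))
x = sample_generator(z, W1, b1, W2, b2)
```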
GAN is also of this type, as is the decoder model of a variational autoencoder [18].
In the original GAN paper the authors show that it is possible to estimate neural samplers by approximate minimization of the symmetric Jensen-Shannon divergence,

D_JS(P||Q) = (1/2) D_KL(P || (1/2)(P + Q)) + (1/2) D_KL(Q || (1/2)(P + Q)),    (1)

where D_KL denotes the Kullback-Leibler divergence. The key technique used in GAN training is the introduction of a second "discriminator" neural network which is optimized simultaneously.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Because D_JS(P||Q) is a proper divergence measure between distributions, the true distribution P can be approximated well in case there are sufficient training samples and the model class Q is rich enough to represent P.
In this work we show that the principle of GANs is more general and we can extend the variational divergence estimation framework proposed by Nguyen et al. [25] to recover the GAN training objective and generalize it to arbitrary f-divergences.
More concretely, we make the following contributions over the state-of-the-art:

• We derive the GAN training objectives for all f-divergences and provide as examples additional divergence functions, including the Kullback-Leibler and Pearson divergences.
• We simplify the saddle-point optimization procedure of Goodfellow et al. [10] and provide a theoretical justification.
• We provide experimental insight into which divergence function is suitable for estimating generative neural samplers for natural images.

2 Method

We first review the divergence estimation framework of Nguyen et al. [25] which is based on f-divergences.
We then extend this framework from divergence estimation to model estimation.

2.1 The f-divergence Family

Statistical divergences such as the well-known Kullback-Leibler divergence measure the difference between two given probability distributions. A large class of different divergences are the so-called f-divergences [5, 21], also known as the Ali-Silvey distances [1]. Given two distributions P and Q that possess, respectively, an absolutely continuous density function p and q with respect to a base measure dx defined on the domain X, we define the f-divergence,

D_f(P||Q) = ∫_X q(x) f(p(x)/q(x)) dx,    (2)

where the generator function f : R+ → R is a convex, lower-semicontinuous function satisfying f(1) = 0. Different choices of f recover popular divergences as special cases in (2). We illustrate common choices in Table 1. See supplementary material for more divergences and plots.

2.2 Variational Estimation of f-divergences

Nguyen et al. [25] derive a general variational method to estimate f-divergences given only samples from P and Q. An equivalent result has also been derived by Reid and Williamson [28]. We will extend these results from merely estimating a divergence for a fixed model to estimating model parameters. We call this new method variational divergence minimization (VDM) and show that generative-adversarial training is a special case of our VDM framework.
For completeness, we first provide a self-contained derivation of Nguyen et al.'s divergence estimation procedure. Every convex, lower-semicontinuous function f has a convex conjugate function f*, also known as the Fenchel conjugate [15]. This function is defined as

f*(t) = sup_{u ∈ dom_f} {ut − f(u)}.    (3)

The function f* is again convex and lower-semicontinuous and the pair (f, f*) is dual to one another in the sense that f** = f.
Therefore, we can also represent f as f(u) = sup_{t ∈ dom_{f*}} {tu − f*(t)}.
Nguyen et al. leverage the above variational representation of f in the definition of the f-divergence to obtain a lower bound on the divergence,

D_f(P||Q) = ∫_X q(x) sup_{t ∈ dom_{f*}} { t p(x)/q(x) − f*(t) } dx
    ≥ sup_{T ∈ T} ( ∫_X p(x) T(x) dx − ∫_X q(x) f*(T(x)) dx )
    = sup_{T ∈ T} ( E_{x∼P}[T(x)] − E_{x∼Q}[f*(T(x))] ),    (4)

where T is an arbitrary class of functions T : X → R.

Name | D_f(P||Q) | Generator f(u) | T*(x)
Kullback-Leibler | ∫ p(x) log(p(x)/q(x)) dx | u log u | 1 + log(p(x)/q(x))
Reverse KL | ∫ q(x) log(q(x)/p(x)) dx | −log u | −q(x)/p(x)
Pearson χ² | ∫ (q(x) − p(x))²/p(x) dx | (u − 1)² | 2(p(x)/q(x) − 1)
Squared Hellinger | ∫ (√p(x) − √q(x))² dx | (√u − 1)² | (√(p(x)/q(x)) − 1) · √(q(x)/p(x))
Jensen-Shannon | (1/2) ∫ [ p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x))) ] dx | −(u + 1) log((1 + u)/2) + u log u | log(2p(x)/(p(x)+q(x)))
GAN | ∫ [ p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x))) ] dx − log(4) | u log u − (u + 1) log(u + 1) | log(p(x)/(p(x)+q(x)))

Table 1: List of f-divergences D_f(P||Q) together with generator functions. Part of the list of divergences and their generators is based on [26]. For all divergences we have f : dom_f → R ∪ {+∞}, where f is convex and lower-semicontinuous. Also we have f(1) = 0, which ensures that D_f(P||P) = 0 for any distribution P. As shown by [10], GAN is related to the Jensen-Shannon divergence through D_GAN = 2 D_JS − log(4).
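The lower bound (4) is easy to verify numerically. A small sketch for the KL divergence (f(u) = u log u, f*(t) = exp(t − 1), bound-attaining T*(x) = 1 + log p(x)/q(x) from Table 1) between two unit-variance Gaussians; the sample sizes and the deliberately suboptimal second choice of T are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
xp = rng.normal(0.0, 1.0, n)  # samples from P = N(0, 1)
xq = rng.normal(1.0, 1.0, n)  # samples from Q = N(1, 1)

def log_ratio(x):
    # log p(x)/q(x) for the two unit-variance Gaussians above
    return ((x - 1.0) ** 2 - x ** 2) / 2.0

def kl_lower_bound(T):
    # E_P[T(x)] - E_Q[f*(T(x))] with f*(t) = exp(t - 1) for f(u) = u log u
    return T(xp).mean() - np.exp(T(xq) - 1.0).mean()

true_kl = 0.5  # closed form: KL(N(0,1) || N(1,1)) = (mean difference)^2 / 2

# With the optimal T*(x) = 1 + log p(x)/q(x) the bound is tight up to
# Monte Carlo error; any other choice of T yields a smaller estimate.
tight = kl_lower_bound(lambda x: 1.0 + log_ratio(x))
loose = kl_lower_bound(lambda x: 0.5 * (1.0 + log_ratio(x)))
```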
The above derivation yields a lower bound because the class of functions T may contain only a subset of all possible functions. By taking the variation of the lower bound in (4) w.r.t. T, we find that under mild conditions on f [25], the bound is tight for

T*(x) = f'(p(x)/q(x)),    (5)

where f' denotes the first order derivative of f. This condition can serve as a guiding principle for choosing f and designing the class of functions T. For example, the popular reverse Kullback-Leibler divergence corresponds to f(u) = −log(u), resulting in T*(x) = −q(x)/p(x); see Table 1.
We list common f-divergences in Table 1 and provide their Fenchel conjugates f* and the domains dom_{f*} in Table 2. We provide plots of the generator functions and their conjugates in the supplementary materials.

2.3 Variational Divergence Minimization (VDM)

We now use the variational lower bound (4) on the f-divergence D_f(P||Q) in order to estimate a generative model Q given a true distribution P.
To this end, we follow the generative-adversarial approach [10] and use two neural networks, Q and T. Q is our generative model, taking as input a random vector and outputting a sample of interest. We parametrize Q through a vector θ and write Qθ. T is our variational function, taking as input a sample and returning a scalar. We parametrize T using a vector ω and write Tω.
We can train a generative model Qθ by finding a saddle-point of the following f-GAN objective function, where we minimize with respect to θ and maximize with respect to ω,

F(θ, ω) = E_{x∼P}[Tω(x)] − E_{x∼Qθ}[f*(Tω(x))].    (6)

To optimize (6) on a given finite training data set, we approximate the expectations using minibatch samples.
To approximate E_{x∼P}[·] we sample B instances without replacement from the training set. To approximate E_{x∼Qθ}[·] we sample B instances from the current generative model Qθ.

2.4 Representation for the Variational Function

To apply the variational objective (6) to different f-divergences, we need to respect the domain dom_{f*} of the conjugate function f*. To this end, we assume that the variational function Tω is represented in the form Tω(x) = g_f(Vω(x)) and rewrite the saddle objective (6) as follows:

F(θ, ω) = E_{x∼P}[g_f(Vω(x))] + E_{x∼Qθ}[−f*(g_f(Vω(x)))],    (7)

where Vω : X → R without any range constraints on the output, and g_f : R → dom_{f*} is an output activation function specific to the f-divergence used. In Table 2 we propose suitable output activation functions for the various conjugate functions f* and their domains.¹ Although the choice of g_f is somewhat arbitrary, we choose all of them to be monotone increasing functions so that a large output Vω(x) corresponds to the belief of the variational function that the sample x comes from the data distribution P, as in the GAN case; see Figure 1. It is also instructive to look at the second term −f*(g_f(v)) in the saddle objective (7). This term is typically (except for the Pearson χ² divergence) a decreasing function of the output Vω(x), favoring variational functions that output negative numbers for samples from the generator.

¹Note that for numerical implementation we recommend directly implementing the scalar function f*(g_f(·)) robustly instead of evaluating the two functions in sequence; see Figure 1.

Name | Output activation g_f | dom_{f*} | Conjugate f*(t) | f'(1)
Kullback-Leibler (KL) | v | R | exp(t − 1) | 1
Reverse KL | −exp(−v) | R− | −1 − log(−t) | −1
Pearson χ² | v | R | (1/4)t² + t | 0
Squared Hellinger | 1 − exp(−v) | t < 1 | t/(1 − t) | 0
Jensen-Shannon | log(2) − log(1 + exp(−v)) | t < log(2) | −log(2 − exp(t)) | 0
GAN | −log(1 + exp(−v)) | R− | −log(1 − exp(t)) | −log(2)

Table 2: Recommended final layer activation functions and critical variational function level defined by f'(1). The critical value f'(1) can be interpreted as a classification threshold applied to T(x) to distinguish between true and generated samples.

Figure 1: The two terms in the saddle objective (7) are plotted as a function of the variational function Vω(x).

We can see the GAN objective,

F(θ, ω) = E_{x∼P}[log Dω(x)] + E_{x∼Qθ}[log(1 − Dω(x))],    (8)

as a special instance of (7) by identifying each term in the expectations of (7) and (8). In particular, choosing the last nonlinearity in the discriminator as the sigmoid Dω(x) = 1/(1 + e^{−Vω(x)}) corresponds to the output activation function g_f(v) = −log(1 + e^{−v}); see Table 2.

2.5 Example: Univariate Mixture of Gaussians

To demonstrate the properties of the different f-divergences and to validate the variational divergence estimation framework, we perform an experiment similar to the one of [24].
Setup. We approximate a mixture of Gaussians by learning a Gaussian distribution. We represent our model Qθ using a linear function which receives a random z ∼ N(0, 1) and outputs Gθ(z) = μ + σz, where θ = (μ, σ) are the two scalar parameters to be learned. For the variational function Tω we use a neural network with two hidden layers having 64 units each and tanh activations. We optimize the objective F(θ, ω) by using the single-step gradient method presented in Section 3.
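A heavily simplified version of this experiment fits in a few lines. The sketch below deliberately deviates from the setup above to stay self-contained: the target is a single Gaussian rather than a mixture, σ is fixed at 1 so that only μ is learned, the variational function is affine, Tω(x) = ax + b, rather than a neural network, and we use the Pearson χ² divergence, for which f*(t) = t²/4 + t; all gradients are written out by hand:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target P = N(2, 1); generator Q_theta: x = mu + z with z ~ N(0, 1).
mu, a, b = 0.0, 0.0, 0.0   # generator mean and affine variational function
eta, B = 0.05, 512         # step size and minibatch size

for _ in range(3000):
    xp = rng.normal(2.0, 1.0, B)       # minibatch from P
    xq = mu + rng.normal(0.0, 1.0, B)  # reparametrized minibatch from Q
    w = (a * xq + b) / 2.0 + 1.0       # f*'(T(x)) = T(x)/2 + 1 on Q samples
    # Single-step iteration: ascend in (a, b), descend in mu, same minibatch.
    grad_a = xp.mean() - (w * xq).mean()   # dF/da
    grad_b = 1.0 - w.mean()                # dF/db
    grad_mu = -a * w.mean()                # dF/dmu via the chain rule x = mu + z
    a, b = a + eta * grad_a, b + eta * grad_b
    mu = mu - eta * grad_mu
```

After a few thousand iterations μ hovers around the target mean 2, consistent with the convergence analysis of the single-step method in Section 3.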
In each step we sample batches of size 1024 from p(x) and p(z) and we use a step-size of η = 0.01 for updating both ω and θ. We compare the results to the best fit provided by the exact optimization of D_f(P||Qθ) w.r.t. θ, which is feasible in this case by solving the required integrals in (2) numerically. We use (ω̂, θ̂) (learned) and θ* (best fit) to distinguish the parameter sets used in these two approaches.
Results. The left side of Table 3 shows the optimal divergence and objective values D_f(P||Qθ*) and F(ω̂, θ̂) as well as the corresponding (optimal) means and standard deviations. Note that the results are in line with the lower-bound property, having D_f(P||Qθ*) ≥ F(ω̂, θ̂). There is a good correspondence between the gap in objectives and the difference between the fitted means and standard deviations. The right side of Table 3 shows the results of the following experiment: (1) we train Tω and Qθ using a particular divergence, then (2) we estimate the divergence and re-train Tω while keeping Qθ fixed. As expected, Qθ performs best on the divergence it was trained with.
We present further details and plots of the fitted Gaussians and variational functions in the supplementary materials.

 | KL | KL-rev | JS | Jeffrey | Pearson
D_f(P||Qθ*) | 0.2831 | 0.2480 | 0.1280 | 0.5705 | 0.6457
F(ω̂, θ̂) | 0.2801 | 0.2415 | 0.1226 | 0.5151 | 0.6379
μ* | 1.0100 | 1.5782 | 1.3070 | 1.3218 | 0.5737
μ̂ | 1.0335 | 1.5624 | 1.2854 | 1.2295 | 0.6157
σ* | 1.8308 | 1.6319 | 1.7542 | 1.7034 | 1.9274
σ̂ | 1.8236 | 1.6403 | 1.7659 | 1.8087 | 1.9031

train \ test | KL | KL-rev | JS | Jeffrey | Pearson
KL | 0.2808 | 0.3423 | 0.1314 | 0.5447 | 0.7345
KL-rev | 0.3518 | 0.2414 | 0.1228 | 0.5794 | 1.3974
JS | 0.2871 | 0.2760 | 0.1210 | 0.5260 | 0.92160
Jeffrey | 0.2869 | 0.2975 | 0.1247 | 0.5236 | 0.8849
Pearson | 0.2970 | 0.5466 | 0.1665 | 0.7085 | 0.648

Table 3: Gaussian approximation of a mixture of Gaussians. Left: optimal objectives, and the learned mean and standard deviation: θ̂ = (μ̂, σ̂) (learned) and θ* = (μ*, σ*) (best fit). Right: objective values to the true distribution for each trained model. For each divergence, the lowest objective function value is achieved by the model that was trained for this divergence.

In summary, our results demonstrate that when the generative model is misspecified, the divergence function used for estimation has a strong influence on which model is learned.

3 Algorithms for Variational Divergence Minimization (VDM)

We now discuss numerical methods to find saddle points of the objective (6).
To this end, we distinguish two methods: first, the alternating method originally proposed by Goodfellow et al. [10], and second, a more direct single-step optimization procedure.
In our variational framework, the alternating gradient method can be described as a double-loop method; the internal loop tightens the lower bound on the divergence, whereas the outer loop improves the generator model. While the motivation for this method is plausible, in practice a popular choice is taking a single step in the inner loop, requiring two backpropagation passes for one outer iteration. Goodfellow et al. [10] provide a local convergence guarantee.

3.1 Single-Step Gradient Method

Motivated by the success of the alternating gradient method with a single inner step, we propose an even simpler algorithm shown in Algorithm 1. The algorithm differs from the original one in that there is no inner loop and the gradients with respect to ω and θ are computed in a single back-propagation.

Algorithm 1 Single-Step Gradient Method
1: function SingleStepGradientIteration(P, θt, ωt, B, η)
2:   Sample XP = {x1, . . . , xB} and XQ = {x'1, . . . , x'B} from P and Qθt, respectively.
3:   Update: ωt+1 = ωt + η ∇ω F(θt, ωt).
4:   Update: θt+1 = θt − η ∇θ F(θt, ωt).
5: end function

Analysis. Here we show that Algorithm 1 geometrically converges to a saddle point (θ*, ω*) if there is a neighborhood around the saddle point in which F is strongly convex in θ and strongly concave in ω. These assumptions are similar to those made in [10]. Formally, we assume:

∇θ F(θ*, ω*) = 0,  ∇ω F(θ*, ω*) = 0,  ∇²θ F(θ, ω) ⪰ δI,  ∇²ω F(θ, ω) ⪯ −δI,    (9)

for (θ, ω) in the neighborhood of (θ*, ω*). Note that although there could be many saddle points that arise from the structure of deep networks [6], they would not qualify as the solution of our variational framework under these assumptions.
For convenience, let us define πt = (θt, ωt). Now the convergence of Algorithm 1 can be stated as follows (the proof is given in the supplementary material):
Theorem 1. Suppose that there is a saddle point π* = (θ*, ω*) with a neighborhood that satisfies conditions (9). Moreover, we define J(π) = (1/2)||∇F(π)||²₂ and assume that in the above neighborhood, F is sufficiently smooth so that there is a constant L > 0 such that ||∇J(π') − ∇J(π)||₂ ≤ L ||π' − π||₂ for any π, π' in the neighborhood of π*. Then using the step-size η = δ/L in Algorithm 1, we have

J(πt) ≤ (1 − δ²/L)^t J(π0).

That is, the squared norm of the gradient ∇F(π) decreases geometrically.

3.2 Practical Considerations

Here we discuss principled extensions of the heuristic proposed in [10] and the real/fake statistics discussed by Larsen and Sønderby². Furthermore, we discuss practical advice that slightly deviates from the principled viewpoint.
Goodfellow et al. [10] noticed that training GAN can be significantly sped up by maximizing E_{x∼Qθ}[log Dω(x)] instead of minimizing E_{x∼Qθ}[log(1 − Dω(x))] for updating the generator.
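The practical effect of this swap is easiest to see for the GAN divergence, where Dω(x) = 1/(1 + exp(−Vω(x))); a small sketch comparing the two gradients with respect to the discriminator logit on a generated sample:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def grad_saturating(v):
    # d/dv log(1 - D(v)) = -D(v): nearly zero once the discriminator
    # confidently rejects the sample (v << 0, D(v) ~ 0).
    return -sigmoid(v)

def grad_alternative(v):
    # d/dv log D(v) = 1 - D(v): stays close to 1 for the same sample,
    # so the generator keeps receiving a strong learning signal.
    return 1.0 - sigmoid(v)

v = -6.0  # a generated sample the discriminator easily spots early in training
```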
In the more general f-GAN Algorithm 1 this means that we replace line 4 with the update

θt+1 = θt + η ∇θ E_{x∼Qθt}[g_f(Vωt(x))],    (10)

thereby maximizing the variational function output on the generated samples. We can show that this transformation preserves the stationary point as follows (which is a generalization of the argument in [10]): note that the only difference between the original direction (line 4) and (10) is the scalar factor f*'(Tω(x)), which is the derivative of the conjugate function f*. Since f*' is the inverse of f' (see Cor. 1.4.4, Chapter E, [15]), if T = T*, using (5), we can see that this factor would be the density ratio p(x)/q(x), which would be one at the stationary point. We found this transformation useful also for other divergences. We found Adam [17] and gradient clipping to be useful, especially in the large-scale experiment on the LSUN dataset.
The original implementation [10] of GANs³ and also Larsen and Sønderby monitor certain real and fake statistics, which are defined as the true positive and true negative rates of the variational function, viewing it as a binary classifier. Since our output activations g_f are all monotone, we can derive similar statistics for any f-divergence by only changing the decision threshold. Due to the link between the density ratio and the variational function (5), the threshold lies at f'(1) (see Table 2). That is, we can interpret the output of the variational function as classifying the input x as a true sample if the variational function Tω(x) is larger than f'(1), and classifying it as a generator sample otherwise.

4 Experiments

We now train generative neural samplers based on VDM on the MNIST and LSUN datasets.

MNIST Digits.
We use the MNIST training data set (60,000 samples, 28-by-28 pixel images) to train the generator and variational function model proposed in [10] for various f-divergences. With z ∼ Uniform100(−1, 1) as input, the generator model has two linear layers, each followed by batch normalization and ReLU activation, and a final linear layer followed by the sigmoid function. The variational function Vω(x) has three linear layers with exponential linear units [4] in between. The final activation is specific to each divergence and listed in Table 2. As in [27] we use Adam with a learning rate of α = 0.0002 and update weight β = 0.5. We use a batch size of 4096, sampled from the training set without replacement, and train each model for one hour. We also compare against variational autoencoders [18] with 20 latent dimensions.
Results and Discussion. We evaluate the performance using the kernel density estimation (Parzen window) approach used in [10]. To this end, we sample 16k images from the model and fit a Parzen window estimator with an isotropic Gaussian kernel whose bandwidth is selected by three-fold cross-validation. The final density model is used to evaluate the average log-likelihood on the MNIST test set (10k samples). We show the results in Table 4, and some samples from our models in Figure 2.
The use of the KDE approach to log-likelihood estimation has known deficiencies [33]. In particular, for the dimensionality used in MNIST (d = 784) the number of model samples required to obtain accurate log-likelihood estimates is infeasibly large. We found a large variability (up to 50 nats) between multiple repetitions. As such the results are not entirely conclusive.
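The evaluation procedure itself is simple to sketch. Below is a minimal Parzen-window log-likelihood in NumPy; the Gaussian toy data stands in for generator samples, and the cross-validated bandwidth selection is omitted:

```python
import numpy as np

def parzen_log_likelihood(test_x, model_samples, sigma):
    """Mean log-likelihood of test_x under an isotropic Gaussian Parzen
    window (kernel bandwidth sigma) fit to samples drawn from the model."""
    n, d = model_samples.shape
    # pairwise squared distances, shape [num_test, n]
    d2 = ((test_x[:, None, :] - model_samples[None, :, :]) ** 2).sum(axis=-1)
    log_k = -d2 / (2.0 * sigma ** 2) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    # log of the mean kernel value per test point, computed stably (log-sum-exp)
    m = log_k.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_k - m).mean(axis=1))
    return log_p.mean()

rng = np.random.default_rng(0)
model_samples = rng.standard_normal((2000, 2))  # stand-in for generator samples
test_x = rng.standard_normal((500, 2))
ll = parzen_log_likelihood(test_x, model_samples, sigma=0.5)
```

The high-variance behavior described above follows directly from this construction: in high dimensions the kernel sum is dominated by the few nearest model samples.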
We also trained the same KDE estimator on the MNIST training set, achieving a significantly higher holdout likelihood. However, it is reassuring to see that the model trained for the Kullback-Leibler divergence indeed achieves a high holdout likelihood compared to the GAN model.

²http://torch.ch/blog/2015/11/13/gan.html
³Available at https://github.com/goodfeli/adversarial

Training divergence | KDE ⟨LL⟩ (nats) | ± SEM
Kullback-Leibler | 416 | 5.62
Reverse Kullback-Leibler | 319 | 8.36
Pearson χ² | 429 | 5.53
Neyman χ² | 300 | 8.33
Squared Hellinger | −708 | 18.1
Jeffrey | −2101 | 29.9
Jensen-Shannon | 367 | 8.19
GAN | 305 | 8.97
Variational Autoencoder [18] | 445 | 5.36
KDE MNIST train (60k) | 502 | 5.99

Table 4: Kernel Density Estimation evaluation on the MNIST test data set. Each KDE model is built from 16,384 samples from the learned generative model. We report the mean log-likelihood on the MNIST test set (n = 10,000) and the standard error of the mean. The KDE MNIST train result uses 60,000 MNIST training images to fit a single KDE model.

Figure 2: MNIST model samples trained using KL, reverse KL, Hellinger, Jensen from top to bottom.

LSUN Natural Images. Through the DCGAN work [27] the generative-adversarial approach has shown real promise in generating natural-looking images. Here we use the same architecture as in [27] and replace the GAN objective with our more general f-GAN objective.
We use the large-scale LSUN database [35] of natural images of different categories. To illustrate the different behaviors of different divergences we train the same model on the classroom category of images, containing 168,103 images of classroom environments, rescaled and center-cropped to 96-by-96 pixels.
Setup. We use the generator architecture and training settings proposed in DCGAN [27].
The model receives z ∼ Uniform_{d_rand}(−1, 1) and feeds it through one linear layer and three deconvolution layers with batch normalization and ReLU activation in between. The variational function is the same as the discriminator architecture in [27] and follows the structure of a convolutional neural network with batch normalization, exponential linear units [4] and one final linear layer.
Results. Figure 3 shows 16 random samples from neural samplers trained using GAN, KL, and squared Hellinger divergences. All three divergences produce equally realistic samples. Note that differences in the learned distribution Qθ arise only when the generator model is not rich enough.

Figure 3: Samples from three different divergences: (a) GAN, (b) KL, (c) Squared Hellinger.

5 Related Work

We now discuss how our approach relates to existing work. Building generative models of real-world distributions is a fundamental goal of machine learning and much related work exists. We only discuss work that applies to neural network models.
Mixture density networks [2] are neural networks which directly regress the parameters of a finite parametric mixture model. When combined with a recurrent neural network this yields impressive generative models of handwritten text [12].
NADE [19] and RNADE [34] perform a factorization of the output using a predefined and somewhat arbitrary ordering of output dimensions. The resulting model samples one variable at a time, conditioning on the entire history of past variables. These models provide tractable likelihood evaluations and compelling results, but it is unclear how to select the factorization order in many applications.
Diffusion probabilistic models [31] define a target distribution as the result of a learned diffusion process which starts at a trivial known distribution.
The learned model provides exact samples and approximate log-likelihood evaluations.
Noise contrastive estimation (NCE) [14] is a method that estimates the parameters of unnormalized probabilistic models by performing non-linear logistic regression to discriminate the data from artificially generated noise. NCE can be viewed as a special case of GAN where the discriminator is constrained to a specific form that depends on the model (logistic regression classifier) and the generator (kept fixed) provides the artificially generated noise (see supplementary material).
The generative neural sampler models of [22] and [3] did not provide satisfactory learning methods; [22] used importance sampling and [3] expectation maximization. The main difference to GAN, and to our work, is in the learning objective, which is effective and computationally inexpensive.
Variational auto-encoders (VAE) [18, 29] are pairs of probabilistic encoder and decoder models which map a sample to a latent representation and back, trained using a variational Bayesian learning objective. The advantage of VAEs is in the encoder model, which allows efficient inference from observation to latent representation; overall they are a compelling alternative to f-GANs, and recent work has studied combinations of the two approaches [23].
As an alternative to the GAN training objective, the work [20] and independently [7] considered the use of the kernel maximum mean discrepancy (MMD) [13, 9] as a training objective for probabilistic models. This objective is simpler to train compared to GAN models because there is no explicitly represented variational function. However, it requires the choice of a kernel function, and the reported results so far seem slightly inferior compared to GAN.
MMD is a particular instance of a larger class of probability metrics [32] which all take the form D(P, Q) = sup_{T∈T} |E_{x∼P}[T(x)] − E_{x∼Q}[T(x)]|, where the function class T is chosen in a manner specific to the divergence. Beyond MMD, other popular metrics of this form are the total variation metric (also an f-divergence), the Wasserstein distance, and the Kolmogorov distance.
A previous attempt to enable minimization of the KL-divergence in deep generative models is due to Goodfellow et al. [11], where an approximation to the gradient of the KL divergence is derived.
In [16] another generalization of the GAN objective is proposed by using an alternative Jensen-Shannon divergence that interpolates between the KL and the reverse KL divergence and has Jensen-Shannon as its mid-point. We discuss this work in more detail in the supplementary materials.

6 Discussion

Generative neural samplers offer a powerful way to represent complex distributions without limiting factorizing assumptions. However, while the purely generative neural samplers as used in this paper are interesting, their use is limited: after training they cannot be conditioned on observed data and thus are unable to provide inferences.
We believe that in the future the true benefits of neural samplers for representing uncertainty will be found in discriminative models, and our presented methods extend readily to this case by providing additional inputs to both the generator and variational function, as in the conditional GAN model [8].
We hope that the practical difficulties of training with saddle point objectives are not an underlying feature of the model but instead can be overcome with novel optimization algorithms. Further work, such as [30], is needed to investigate and hopefully overcome these difficulties.
Acknowledgements.
We thank Ferenc Huszár for discussions on the generative-adversarial approach.

References

[1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. JRSS (B), pages 131–142, 1966.
[2] C. M. Bishop. Mixture density networks. Technical report, Aston University, 1994.
[3] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, 1998.
[4] D. A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289, 2015.
[5] I. Csiszár and P. C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1:417–528, 2004.
[6] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, pages 2933–2941, 2014.
[7] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, pages 258–267, 2015.
[8] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 2014.
[9] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. JASA, 102(477):359–378, 2007.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[11] I. J. Goodfellow. On distinguishability criteria for estimating generative models. In International Conference on Learning Representations (ICLR), 2015. arXiv:1412.6515.
[12] A. Graves.
Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[13] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In NIPS, pages 585–592, 2007.
[14] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pages 297–304, 2010.
[15] J. B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of convex analysis. Springer, 2012.
[16] F. Huszár. How (not) to train your generative model: scheduled sampling, likelihood, adversary? arXiv:1511.05101, 2015.
[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[18] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1402.0030, 2013.
[19] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.
[20] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015.
[21] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
[22] D. J. C. MacKay. Bayesian neural networks and density networks. Nucl. Instrum. Meth. A, 354(1):73–80, 1995.
[23] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. arXiv:1511.05644, 2015.
[24] T. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[25] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
[26] F. Nielsen and R. Nock. On the chi-square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10–13, 2014.
[27] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
[28] M. D. Reid and R. C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12(Mar):731–817, 2011.
[29] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
[30] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[31] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using non-equilibrium thermodynamics. In ICML, pages 2256–2265, 2015.
[32] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. Lanckriet. Hilbert space embeddings and metrics on probability measures. JMLR, 11:1517–1561, 2010.
[33] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. arXiv:1511.01844, 2015.
[34] B. Uria, I. Murray, and H. Larochelle. RNADE: The real-valued neural autoregressive density-estimator. In NIPS, pages 2175–2183, 2013.
[35] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015.