{"title": "f-GANs in an Information Geometric Nutshell", "book": "Advances in Neural Information Processing Systems", "page_first": 456, "page_last": 464, "abstract": "Nowozin \\textit{et al} showed last year how to extend the GAN \\textit{principle} to all $f$-divergences. The approach is elegant but falls short of a full description of the supervised game, and says little about the key player, the generator: for example, what does the generator actually converge to if solving the GAN game means convergence in some space of parameters? How does that provide hints on the generator's design and compare to the flourishing but almost exclusively experimental literature on the subject? In this paper, we unveil a broad class of distributions for which such convergence happens --- namely, deformed exponential families, a wide superset of exponential families ---. We show that current deep architectures are able to factorize a very large number of such densities using an especially compact design, hence displaying the power of deep architectures and their concinnity in the $f$-GAN game. This result holds given a sufficient condition on \\textit{activation functions} --- which turns out to be satisfied by popular choices. The key to our results is a variational generalization of an old theorem that relates the KL divergence between regular exponential families and divergences between their natural parameters. We complete this picture with additional results and experimental insights on how these results may be used to ground further improvements of GAN architectures, via (i) a principled design of the activation functions in the generator and (ii) an explicit integration of proper composite losses' link function in the discriminator.", "full_text": "f-GANs in an Information Geometric Nutshell\n\nAditya Krishna Menon\u2020,\u2021\nRichard Nock\u2020,\u2021,\u00a7\n\u2020Data61, \u2021the Australian National University and \u00a7the University of Sydney\n\nRobert C. 
Williamson\u2021,\u2020\n\nZac Cranko\u2021,\u2020\n\nLizhen Qu\u2020,\u2021\n\n{firstname.lastname, aditya.menon, bob.williamson}@data61.csiro.au\n\nAbstract\n\nNowozin et al showed last year how to extend the GAN principle to all f-\ndivergences. The approach is elegant but falls short of a full description of the\nsupervised game, and says little about the key player, the generator: for example,\nwhat does the generator actually converge to if solving the GAN game means\nconvergence in some space of parameters? How does that provide hints on the gen-\nerator\u2019s design and compare to the \ufb02ourishing but almost exclusively experimental\nliterature on the subject? In this paper, we unveil a broad class of distributions for\nwhich such convergence happens \u2014 namely, deformed exponential families, a wide\nsuperset of exponential families \u2014. We show that current deep architectures are\nable to factorize a very large number of such densities using an especially compact\ndesign, hence displaying the power of deep architectures and their concinnity in the\nf-GAN game. This result holds given a suf\ufb01cient condition on activation functions\n\u2014 which turns out to be satis\ufb01ed by popular choices. The key to our results is a\nvariational generalization of an old theorem that relates the KL divergence between\nregular exponential families and divergences between their natural parameters. We\ncomplete this picture with additional results and experimental insights on how\nthese results may be used to ground further improvements of GAN architectures,\nvia (i) a principled design of the activation functions in the generator and (ii) an\nexplicit integration of proper composite losses\u2019 link function in the discriminator.\n\n1\n\nIntroduction\n\nIn a recent paper, Nowozin et al. [30] showed that the GAN principle [15] can be extended to the\nvariational formulation of all f-divergences. 
In the GAN game, there is an unknown distribution P which we want to approximate using a parameterised distribution Q. Q is learned by a generator by finding a saddle point of a function which we summarize for now as f-GAN(P, Q), where f is a convex function (see eq. (7) below for its formal expression). A part of the generator's training involves as a subroutine a supervised adversary — hence the saddle point formulation — called the discriminator, which tries to guess whether randomly generated observations come from P or Q. Ideally, at the end of this supervised game, we want Q to be close to P, and a good measure of this is the f-divergence I_f(P‖Q), also known as the Ali-Silvey distance [1, 12]. Initially, one choice of f was considered [15]. Nowozin et al. significantly grounded the game and expanded its scope by showing that for any f convex and suitably defined [30, Eq. 4]:

    f-GAN(P, Q) ≤ I_f(P‖Q) .                                             (1)

The inequality is an equality if the discriminator is powerful enough. So, solving the f-GAN game can give guarantees on how distant P and Q are from each other in terms of f-divergence. This elegant characterization of the supervised game unfortunately falls short of justifying or elucidating all parameters of the supervised game [30, Section 2.4], and the paper is also silent regarding a key part of the game: the link between distributions in the variational formulation and the generator, the

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
main player which learns a parametric model of a density. In doing so, the f-GAN approach and its members remain within an information theoretic framework that relies on divergences between distributions only [30]. In the GAN world at large, this position contrasts with other prominent approaches that explicitly optimize geometric distortions between the parameters or support of distributions [6, 14, 16, 21, 22], and raises the problem of connecting the f-GAN approach to any sort of information geometric optimization. One such information-theoretic/information-geometric identity is well known: the Kullback-Leibler (KL) divergence between two distributions of the same (regular) exponential family equals a Bregman divergence D between their natural parameters [2, 4, 7, 9, 35], which we can summarize as:

    I_{f_KL}(P‖Q) = D(θ‖ϑ) .                                             (2)

Here, θ and ϑ are respectively the natural parameters of P and Q. Hence, distributions are points on a manifold on the right-hand side, a powerful geometric statement [4]; however, being restricted to the KL divergence or "just" exponential families, it certainly falls short of the power to explain the GAN game. To our knowledge, the only generalizations known fall short of the f-divergence formulation and are not amenable to the variational GAN formulation [5, Theorem 9], [13, Theorem 3].

Our first contribution is such an identity that connects the general I_f-divergence formulation in eq. (1) to the general D (Bregman) divergence formulation in eq. (2). We now briefly state it, postponing the details to Section 3:

    f-GAN(P, escort(Q)) = D(θ‖ϑ) + Penalty(Q) ,                          (3)

for P and Q (with respective parameters θ and ϑ) which happen to lie in a superset of exponential families called deformed exponential families, which have received extensive treatment in statistical physics and differential information geometry over the last decade [3, 25]. The right-hand side of eq. (3) is the information geometric part [4], in which D is a Bregman divergence.
Therefore, the f-GAN problem can be equivalent to a geometric optimization problem [4], as for the Wasserstein GAN and its variants [6]. Notice also that Q appears in the game in the form of an escort [5]. The differences vanish only for exponential families (escort(Q) = Q, Penalty(Q) = 0 and f = KL).

Our second contribution drills down into the information-theoretic and information-geometric parts of (3). In particular, from the former standpoint, we completely specify the parameters of the supervised game, unveiling a key parameter left arbitrary in [30] (explicitly incorporating the link function of proper composite losses [32]). From the latter standpoint, we show that the standard deep generator architecture is powerful at modelling complex escorts of any deformed exponential family, factorising a number of escorts in the order of the total inner layers' dimensions, and this factorization happens for an especially compact design. This hints at a simple sufficient condition on the activation function to guarantee the escort modelling, and it turns out that this condition is satisfied, exactly or in a limit sense, by most popular activation functions (ELU, ReLU, Softplus, ...). We also provide experiments¹ that display the uplift that can be obtained through a principled design of the activation function (generator), or tuning of the link function (discriminator).

Due to the lack of space, a supplement (SM) provides the proofs of the results in the main file and additional experiments. A longer version with a more exhaustive treatment of related results is available [27]. The rest of this paper is as follows. § 2 presents definitions, § 3 formally presents eq.
(3), § 4 derives consequences for deep learning, § 5 completes the supervised game picture of [30], § 6 presents experiments and a last section concludes.

2 Definitions

Throughout this paper, the domain X of observations is a measurable set. We begin with two important classes of distortion measures, f-divergences and Bregman divergences.

Definition 1 For any two distributions P and Q having respective densities P and Q absolutely continuous with respect to a base measure µ, the f-divergence between P and Q, where f: R+ → R is convex with f(1) = 0, is

    I_f(P‖Q) := E_{X∼Q}[f(P(X)/Q(X))] = ∫_X Q(x) · f(P(x)/Q(x)) dµ(x) .  (4)

¹The code used for our experiments is available through https://github.com/qulizhen/fgan_info_geometric

For any convex differentiable ϕ: R^d → R, the (ϕ-)Bregman divergence between θ and ϑ is:

    D_ϕ(θ‖ϑ) := ϕ(θ) − ϕ(ϑ) − (θ − ϑ)⊤∇ϕ(ϑ) ,                            (5)

where ϕ is called the generator of the Bregman divergence.

f-divergences are the key distortion measure of information theory, while Bregman divergences are the key distortion measure of information geometry. A distribution P from a (regular) exponential family with cumulant C: Θ → R and sufficient statistics φ: X → R^d has density P_C(x|θ, φ) := exp(φ(x)⊤θ − C(θ)), where Θ is a convex open set, C is convex and ensures normalization on the simplex (we leave implicit the associated dominating measure [3]).
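As a numerical sanity check of Definition 1 and eq. (5) (ours, not part of the paper): for unit-variance Gaussians, the natural parameter is the mean and the cumulant is C(θ) = θ²/2, so the KL divergence of eq. (2) should match the Bregman divergence between natural parameters (for this family D_C is symmetric, so the argument order is immaterial). A sketch in NumPy, with arbitrarily chosen means:

```python
import numpy as np

def cumulant(theta):
    # cumulant of N(theta, 1) in natural coordinates: C(theta) = theta^2 / 2
    return 0.5 * theta ** 2

def bregman(C, grad_C, theta, vartheta):
    # eq. (5): D_C(theta || vartheta)
    return C(theta) - C(vartheta) - (theta - vartheta) * grad_C(vartheta)

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

# eq. (4) with f(t) = t log t, i.e. the KL divergence, on a fine grid
x = np.linspace(-15.0, 15.0, 200001)
dx = x[1] - x[0]
mu_p, mu_q = 1.3, -0.4
P, Q = gauss(x, mu_p), gauss(x, mu_q)
ratio = P / Q
I_KL = float(np.sum(Q * ratio * np.log(ratio)) * dx)   # I_f(P||Q) with f(t) = t log t

D = float(bregman(cumulant, lambda t: t, mu_p, mu_q))  # = (mu_p - mu_q)^2 / 2
```

Both quantities come out equal to (µ_p − µ_q)²/2 up to discretization error, as the identity predicts.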
A fundamental theorem ties Bregman divergences and f-divergences: when P and Q belong to the same exponential family, and denoting their respective densities P_C(x|θ, φ) and Q_C(x|ϑ, φ), it holds that I_KL(P‖Q) = D_C(ϑ‖θ). Here, I_KL is the Kullback-Leibler (KL) f-divergence (f := x ↦ x log x). Remark that the arguments in the Bregman divergence are permuted with respect to those in eq. (2) in the introduction. This also holds if we consider f_KL in eq. (2) to be the Csiszár dual of f [8], namely f_KL: x ↦ −log x, since in this case I_{f_KL}(P‖Q) = I_KL(Q‖P) = D_C(θ‖ϑ). We made this choice in the introduction for the sake of readability in presenting eqs. (1 — 3). We now define generalizations of exponential families, following [5, 13]. Let χ: R+ → R+ be non-decreasing [25, Chapter 10]. We define the χ-logarithm, log_χ, as log_χ(z) := ∫_1^z (1/χ(t)) dt. The χ-exponential is exp_χ(z) := 1 + ∫_0^z λ(t) dt, where λ is defined by λ(log_χ(z)) := χ(z). In the case where the integrals are improper, we consider the corresponding limit in the argument / integrand.

Definition 2 [5] A distribution P from a χ-exponential family (or deformed exponential family, χ being implicit) with convex cumulant C: Θ → R and sufficient statistics φ: X → R^d has density given by P_{χ,C}(x|θ, φ) := exp_χ(φ(x)⊤θ − C(θ)), with respect to a dominating measure µ. Here, Θ is a convex open set and θ is called the coordinate of P. The escort density (or χ-escort) of P_{χ,C} is

    P̃_{χ,C} := (1/Z) · χ(P_{χ,C}) ,   Z := ∫_X χ(P_{χ,C}(x|θ, φ)) dµ(x) .  (6)

Z is the escort's normalization constant.

We leave implicit the dominating measure and denote P̃ the escort distribution of P whose density is given by eq. (6). We shall name χ the signature of the deformed (or χ-)exponential family, and sometimes drop indexes to save readability without ambiguity, noting e.g. P̃ for P̃_{χ,C}. Notice that normalization in the escort is ensured by a simple integration [5, Eq. 7]. For the escort to exist, we require that Z < ∞ and therefore that χ(P) is finite almost everywhere. Such a requirement would be satisfied in the GAN game. There is another generalization of regular exponential families, known as generalized exponential families [13, 27]. The starting point of our result is the following theorem, in which the information-theoretic part is not amenable to the variational GAN formulation.

Theorem 3 [5][36] For any two χ-exponential distributions P and Q with respective densities P_{χ,C}, Q_{χ,C} and coordinates θ, ϑ, it holds that D_C(θ‖ϑ) = E_{X∼Q̃}[log_χ(Q_{χ,C}(X)) − log_χ(P_{χ,C}(X))].

We now briefly frame the now popular (f-)GAN adversarial learning [15, 30]. We have a true unknown distribution P over a set of objects, e.g. 3D pictures, which we want to learn. In the GAN setting, this is the objective of a generator, which learns a distribution Q_θ parameterized by vector θ. Q_θ works by passing (the support of) a simple, uninformed distribution, e.g. standard Gaussian, through a possibly complex function, e.g. a deep net whose parameters are θ and which maps to the support of the objects of interest. Fitting Q_θ
involves an adversary (the discriminator) as subroutine, which fits classifiers, e.g. deep nets, parameterized by ω. The generator's objective is to come up with arg min_θ L_f(θ), with L_f(θ) the discriminator's objective:

    L_f(θ) := sup_ω { E_{X∼P}[T_ω(X)] − E_{X∼Q_θ}[f⋆(T_ω(X))] } ,        (7)

where ⋆ is the Legendre conjugate [10] and T_ω: X → R integrates the classifier of the discriminator and is therefore parameterized by ω. L_f is a variational approximation to an f-divergence [30]; the discriminator's objective is to segregate true (P) from fake (Q_θ) data. The original GAN choice [15],

    f_GAN(z) := z log z − (z + 1) log(z + 1) + 2 log 2                   (8)

(the constant ensures f(1) = 0), can be replaced by any convex f meeting mild assumptions.

3 A variational information geometric identity for the f-GAN game

We deliver a series of results that will bring us to formalize eq. (3). First, we define a new set of distortion measures, that we call KL_χ divergences.

Definition 4 For any χ-logarithm and distributions P, Q having respective densities P and Q absolutely continuous with respect to base measure µ, the KL_χ divergence between P and Q is defined as KL_χ(P‖Q) := E_{X∼P}[−log_χ(Q(X)/P(X))].

Since χ is non-decreasing, −log_χ is convex and so any KL_χ divergence is an f-divergence. When χ(z) := z, KL_χ is the KL divergence. In what follows, the base measure µ and absolute continuity are implicit, as is the convention that P (resp. Q) is the density of P (resp. Q). We now relate KL_χ divergences to f-divergences.
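To make Definition 4 concrete, here is a small sketch (ours; NumPy) that evaluates a KL_χ divergence by numerically integrating the χ-logarithm, and checks that the identity signature χ(z) = z recovers the standard KL divergence on an arbitrarily chosen discrete pair:

```python
import numpy as np

def log_chi(z, chi, n=100001):
    # chi-logarithm: log_chi(z) = integral_1^z dt / chi(t), trapezoidal rule
    # (np.diff is negative when z < 1, which gives the signed integral for free)
    t = np.linspace(1.0, z, n)
    y = 1.0 / chi(t)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def kl_chi(P, Q, chi):
    # Definition 4: KL_chi(P||Q) = E_{X~P}[ -log_chi(Q(X)/P(X)) ]
    return sum(p * -log_chi(q / p, chi) for p, q in zip(P, Q))

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])

kl_chi_id = kl_chi(P, Q, lambda t: t)          # identity signature chi(z) = z
kl_direct = float(np.sum(P * np.log(P / Q)))   # standard KL divergence
```

With the identity signature the two values coincide up to quadrature error; other non-decreasing signatures give other f-divergences.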
Let ∂f be the subdifferential of convex f and let I_{P,Q} := [inf_x P(x)/Q(x), sup_x P(x)/Q(x)) ⊆ R+ denote the range of density ratios of P over Q. Our first result states that if there is a selection of the subdifferential which is upperbounded on I_{P,Q}, then the f-divergence I_f(P‖Q) is equal to a KL_χ divergence.

Theorem 5 Suppose that P, Q are such that there exists a selection ξ ∈ ∂f with sup ξ(I_{P,Q}) < ∞. Then there exists χ: R+ → R+ non-decreasing such that I_f(P‖Q) = KL_χ(Q‖P).

Theorem 5 essentially covers most if not all relevant GAN cases, as the assumption has to be satisfied in the GAN game for its solution not to be vacuous up to a large extent (eq. (7)). We provide a more complete treatment in the extended version [27]. The proof of Theorem 5 (in SM, Section I) is constructive: it shows how to pick a χ which satisfies all requirements. It brings the following interesting corollary: under mild assumptions on f, there exists a χ that fits for all densities P and Q. A prominent example of f that fits is the original GAN choice, for which we can pick

    χ_GAN(z) := 1 / log(1 + 1/z) .                                       (9)

We now define a slight generalization of KL_χ-divergences and allow for χ to depend on the choice of the expectation's X, granted that for any of these choices it will meet the constraints to be R+ → R+ and also increasing, and therefore define a valid signature. For any f: X → R+, we denote KL_{χ_f}(P‖Q) := E_{X∼P}[−log_{χ_{f(X)}}(Q(X)/P(X))], where for any p ∈ R+, χ_p(t) := (1/p) · χ(tp). Whenever f = 1, we just write KL_χ as we already did in Definition 4. We note that for any x ∈ X, χ_{f(x)} is increasing and non-negative because of the properties of χ and f, so χ_{f(x)}(t) defines a χ-logarithm.
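The GAN instance of Theorem 5 can be checked numerically: with f_GAN as in eq. (8) and χ_GAN as in eq. (9), one has −log_{χ_GAN}(z) = f_GAN(z) (since ∫_1^z log(1 + 1/t) dt = −f_GAN(z)), hence I_{f_GAN}(P‖Q) = KL_{χ_GAN}(Q‖P). A sketch (ours; NumPy, arbitrary discrete P and Q):

```python
import numpy as np

def f_gan(z):
    # eq. (8): the original GAN generator f
    return z * np.log(z) - (z + 1.0) * np.log(z + 1.0) + 2.0 * np.log(2.0)

def chi_gan(z):
    # eq. (9)
    return 1.0 / np.log(1.0 + 1.0 / z)

def neg_log_chi(z, chi, n=200001):
    # -log_chi(z) = -integral_1^z dt / chi(t), trapezoidal rule
    t = np.linspace(1.0, z, n)
    y = 1.0 / chi(t)
    return -float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])

I_f = float(np.sum(Q * f_gan(P / Q)))                                  # Definition 1
KL_chi_val = sum(q * neg_log_chi(p / q, chi_gan) for p, q in zip(P, Q))  # KL_chi(Q||P)
```

The two quantities agree up to quadrature error, as Theorem 5 states for this choice of χ.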
We are ready to state a theorem that connects KL_χ-divergences and Theorem 3.

Theorem 6 Letting P := P_{χ,C} and Q := Q_{χ,C} for short in Theorem 3, we have E_{X∼Q̃}[log_χ(Q(X)) − log_χ(P(X))] = KL_{χ_Q̃}(Q̃‖P) − J(Q), with J(Q) := KL_{χ_Q̃}(Q̃‖Q).

(Proof in SM, Section II) To summarize, we know that under mild assumptions relative to the GAN game, f-divergences coincide with KL_χ divergences (Theorem 5). We also know from Theorem 6 that KL_{χ.} divergences quantify the geometric proximity between the coordinates of generalized exponential families (Theorem 3). Hence, finding a geometric (parameter-based) interpretation of the variational f-GAN game as described in eq. (7) can be done via a variational formulation of the KL_χ divergences appearing in Theorem 6. Since the penalty J(Q) does not belong to the GAN game (it does not depend on P), it reduces our focus to KL_{χ_Q̃}(Q̃‖P).

Theorem 7 KL_{χ_Q̃}(Q̃‖P) admits the variational formulation

    KL_{χ_Q̃}(Q̃‖P) = sup_{T ∈ R−−^X} { E_{X∼P}[T(X)] − E_{X∼Q̃}[(−log_{χ_Q̃})⋆(T(X))] } ,   (10)

with R−− := R \ R++. Furthermore, letting Z denote the normalization constant of the χ-escort of Q, the optimum T⋆: X → R−− in eq. (10) is T⋆(x) = −(1/Z) · (χ(Q(x))/χ(P(x))).

(Proof in SM, Section III) Hence, the variational f-GAN formulation can be captured in an information-geometric framework by the following identity, using Theorems 3, 5, 7.

Corollary 8 (the variational information-geometric f-GAN identity) Using notations from Theorems 6, 7 and letting θ (resp. ϑ) denote the coordinate of P (resp. Q), we have:

    sup_{T ∈ R−−^X} { E_{X∼P}[T(X)] − E_{X∼Q̃}[(−log_{χ_Q̃})⋆(T(X))] } = D_C(θ‖ϑ) + J(Q) .   (11)

We shall also name for short vig-f-GAN the identity in eq. (11). We note that we can drill further down into the identity, expressing in particular the Legendre conjugate (−log_{χ_Q̃})⋆ with an equivalent "dual" (negative) χ-logarithm in the variational problem [27]. The left-hand side of eq. (11) has the exact same overall shape as the variational objective of [30, Eqs 2, 6]. However, it tells the formal story of GANs in significantly greater detail, in particular for what concerns the generator. For example, eq. (11) yields a new characterization of the generator's convergence: because D_C is a Bregman divergence, it satisfies the identity of the indiscernibles. So, solving the f-GAN game [30] can guarantee convergence in the parameter space (ϑ vs θ). In the realm of GAN applications, it makes sense to consider that P (the true distribution) can be extremely complex. Therefore, even when deformed exponential families are significantly more expressive than regular exponential families [25], extra care should be taken before arguing that complex applications comply with such a geometric convergence in the parameter space. One way to circumvent this problem is to build distributions in Q that factorize many deformed exponential families. This is one strong point of deep architectures that we shall prove next.

4 Deep architectures in the vig-f-GAN game

In the GAN game, the distribution Q in eq. (11) is built by the generator (call it Q_g), by passing the support of a simple distribution (e.g. uniform, standard Gaussian), Q_in, through a series of non-linear transformations. Letting Q_in denote the corresponding density, we now compute Q_g.
Our generator g: X → R^d consists of two parts: a deep part and a last layer. The deep part is, given some L ∈ N, the computation of a non-linear transformation φ_L: X → R^{d_L} as

    R^{d_l} ∋ φ_l(x) := v(W_l φ_{l−1}(x) + b_l) , ∀l ∈ {1, 2, ..., L} ,   (12)
    φ_0(x) := x ∈ X .                                                    (13)

v is a function computed coordinate-wise, such as (leaky) ReLUs or ELUs [11, 17, 23, 24], W_l ∈ R^{d_l×d_{l−1}}, b_l ∈ R^{d_l}. The last layer computes the generator's output from φ_L: g(x) := v_out(Γφ_L(x) + β), with Γ ∈ R^{d×d_L}, β ∈ R^d; in general, v_out ≠ v and v_out fits the output to the domain at hand, ranging from linear [6, 20] to non-linear functions like tanh [30]. Our generator captures the high-level features of some state of the art generative approaches [31, 37].

To carry out our analysis, we make the assumption that the network is reversible, which requires that v_out, Γ, W_l (l ∈ {1, 2, ..., L}) be invertible. We note that popular examples can be invertible (e.g. DCGAN, if we use µ-ReLU, dimensions match and fractional-strided convolutions are invertible). At this reasonable price, we get the generator's density in closed form, and it shows the following: for any continuous signature χ_net, there exists an activation function v such that the deep part in the network factors as escorts for the χ_net-exponential family.
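A minimal sketch of eqs. (12)-(13) and of the reversibility assumption (ours; NumPy): square layers with almost surely invertible random weights, an invertible leaky-ReLU for v, and a linear v_out, so that the input can be reconstructed exactly from the output. Dimensions, slopes and scalings are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 3, 2
eps = 0.2  # leaky-ReLU slope; eps > 0 makes v invertible

def v(z):
    return np.where(z > 0.0, z, eps * z)

def v_inv(z):
    return np.where(z > 0.0, z, z / eps)

# square, almost surely invertible random layers W_l, biases b_l (eqs. (12)-(13))
Ws = [0.3 * rng.normal(size=(d, d)) + np.eye(d) for _ in range(L)]
bs = [rng.normal(size=d) for _ in range(L)]
Gamma = 0.3 * rng.normal(size=(d, d)) + np.eye(d)
beta = rng.normal(size=d)

def g(x):
    phi = x
    for W, b in zip(Ws, bs):
        phi = v(W @ phi + b)      # deep part, eq. (12)
    return Gamma @ phi + beta     # last layer with linear v_out

def g_inv(z):
    phi = np.linalg.solve(Gamma, z - beta)
    for W, b in reversed(list(zip(Ws, bs))):
        phi = np.linalg.solve(W, v_inv(phi) - b)
    return phi

x = rng.normal(size=d)
x_rec = g_inv(g(x))               # reversibility: x is recovered from g(x)
```

With invertible v_out (e.g. tanh on its range) the same inversion scheme applies with an extra v_out⁻¹ step.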
Let 1_i denote the i-th canonical basis vector.

Theorem 9 For all v_out, Γ, W_l invertible (l ∈ {1, 2, ..., L}), for any continuous signature χ_net, there exist an activation v and b_l ∈ R^d (∀l ∈ {1, 2, ..., L}) such that for any output z, letting x := g^{−1}(z), Q_g(z) factorizes as Q_g(z) = (Q_in(x)/Q̃_deep(x)) · 1/(H_out(x) · Z_net), with Z_net > 0 a constant, H_out(x) := ∏_{i=1}^d |v'_out(γ_i⊤φ_L(x) + β_i)|, γ_i := Γ⊤1_i, and (letting w_{l,i} := W_l⊤1_i):

    Q̃_deep(x) := ∏_{l=1}^L ∏_{i=1}^d P̃_{χ_net, b_{l,i}}(x | w_{l,i}, φ_{l−1}) .   (14)

Table 1: Some strongly/weakly admissible couples (v, χ); k is a constant (e.g. such that v(0) = 0) (see text).

    ReLU:        v(z) = max{0, z};  χ(z) = 1_{z>0} (1_. is the indicator function)
    Leaky-ReLU:  v(z) = z if z > 0, εz if z ≤ 0  (δ ≤ 0, 0 < ε ≤ 1, dom(v) = [δ/ε, +∞))
    (α,β)-ELU:   v(z) = βz if z > 0, α(exp(z) − 1) if z ≤ 0  (β ≥ α > 0)
    prop-τ:      v(z) = k + τ⋆(z)/τ⋆(0);  χ(z) = τ'⁻¹ ∘ (τ⋆)⁻¹(τ⋆(0) · z)  (⋆ is the Legendre conjugate)
    Softplus:    v(z) = k + log_2(1 + exp(z));  χ(z) = log 2 · (1 − 2^{−z})
    µ-ReLU:      v(z) = k + (z + √((1−µ)² + z²))/2  (µ ∈ [0, 1))
    LSU:         v(z) = 0 if z < −1, (1+z)²/4 if z ∈ [−1, 1], z if z > 1

(Proof in SM, Section IV) The relationship between the inner layers of a deep net and deformed exponential families (Definition 2) follows from the theorem: rows in the W_l define coordinates, the φ_l define "deep" sufficient statistics, the b_l are cumulants and the crucial part, the χ-family, is given by the activation function v. Notice also that the b_l are learned, and so the deformed exponential families' normalization is in fact learned and not specified. We see that Q̃_deep factors escorts, and in number. What is notable is the compactness achieved by the deep representation: the total dimension of all deep sufficient statistics in Q̃_deep (eq. (14)) is L · d. To handle this, a shallow net with a single inner layer would require a matrix W of space Ω(L² · d²). The deep net g requires only O(L · d²) space to store all the W_l. The proof of Theorem 9 is constructive: it builds v as a function of χ. In fact, the proof also shows how to build χ from the activation function v in such a way that Q̃_deep factors χ-escorts. The following Lemma essentially says that this is possible for all strongly admissible activations v.

Definition 10 An activation function v is strongly admissible iff dom(v) ∩ R+ ≠ ∅ and v is C¹, lowerbounded, strictly increasing and convex. v is weakly admissible iff for any ε > 0, there exists v_ε strongly admissible such that ‖v − v_ε‖_{L1} < ε, where ‖f‖_{L1} := ∫ |f(t)| dt.

Lemma 11 The following holds: (i) for any strongly admissible v, there exists a signature χ such that Theorem 9 holds; (ii) the (γ,γ)-ELU (for any γ > 0) and Softplus are strongly admissible.
ReLU is weakly admissible.

(Proof in SM, Section V) The proof uses a trick for ReLU which can easily be repeated for the (α, β)-ELU, and for leaky-ReLU, with the constraint that the domain has to be lowerbounded. Table 1 provides some examples of strongly / weakly admissible activations. It includes a wide class of so-called "prop-τ activations", where τ is the negative of a concave entropy, defined on [0, 1] and symmetric around 1/2 [29]. This concludes our treatment of the information geometric part of the vig-f-GAN identity. We now complete it with a treatment of its information-theoretic part.

5 A complete proper loss picture of the supervised GAN game

In their generalization of the GAN objective, Nowozin et al. [30] leave untold a key part of the supervised game: they split in eq. (7) the discriminator's contribution in two, T_ω = g_f ∘ V_ω, where V_ω: X → R is the actual discriminator, and g_f is essentially a technical constraint to ensure that V_ω(.) is in the domain of f⋆. They leave the choice of g_f "somewhat arbitrary" [30, Section 2.4]. We now show that if one wants the supervised loss to have the desirable property of being proper composite [32]², then g_f is not arbitrary. We proceed in three steps, first unveiling a broad class of proper f-GANs that deal with this property. The initial motivation of eq. (7) was that the inner maximisation may be seen as the f-divergence between P and Q_θ [26], L_f(θ) = I_f(P‖Q_θ). In fact, this variational representation of an f-divergence holds more generally: by [33, Theorem 9], we know that for any convex f and invertible link function Ψ: (0, 1) → R, we have:

    inf_{T: X→R} E_{(X,Y)∼D}[ℓ_Ψ(Y, T(X))] = −(1/2) · I_f(P‖Q) ,         (15)

where D is the distribution over (observations × {fake, real}) and the loss function ℓ_Ψ is defined by:

    ℓ_Ψ(+1, z) := −f′(Ψ⁻¹(z)/(1 − Ψ⁻¹(z))) ;  ℓ_Ψ(−1, z) := f⋆(f′(Ψ⁻¹(z)/(1 − Ψ⁻¹(z)))) ,   (16)

assuming f differentiable. Note now that picking Ψ(z) = f′(z/(1 − z)) with z := T(x) and simplifying eq. (15) with P[Y = fake] = P[Y = real] = 1/2 in the GAN game yields eq. (7). For other link functions, however, we get an equally valid class of losses whose optimisation will yield a meaningful estimate of the f-divergence. The losses of eq. (16) belong to the class of proper composite losses with link function Ψ [32]. Thus (omitting parameters θ, ω), we rephrase eq. (7) and refer to the proper f-GAN formulation as inf_Q L_Ψ(Q), with ℓ as per eq. (16):

    L_Ψ(Q) := sup_{T: X→R} { E_{X∼P}[−ℓ_Ψ(+1, T(X))] + E_{X∼Q}[−ℓ_Ψ(−1, T(X))] } .   (17)

Note also that it is trivial to start from a suitable proper composite loss, and derive the corresponding generator f for the f-divergence as per eq. (15). Finally, our proper composite loss view of the f-GAN game allows us to elicit g_f in [30]: it is the composition of f′ and Ψ in eq. (16). The use of proper composite losses as part of the supervised GAN formulation sheds further light on another aspect of the game: the connection between the value of the optimal discriminator and the density ratio between the generator and discriminator distributions. Instead of the optimal T⋆(x) = f′(P(x)/Q(x)) for eq. (7) [30, Eq. 5], we now have, with the more general eq. (17), the result T⋆(x) = Ψ((1 + Q(x)/P(x))⁻¹). We now show that proper f-GANs can easily be adapted to eq. (11). We let χ•(t) := 1/χ⁻¹(1/t).

Theorem 12 For any χ, define ℓ_x(−1, z) := −log_{(χ•)_{1/Q̃(x)}}(−z), and let ℓ(+1, z) := −z. Then L_Ψ(Q) in eq. (17) equals eq. (11). Its link in eq. (17) is Ψ_x(z) = −1/χ_{Q̃(x)}(z/(1 − z)).

(Proof in SM, Section VI) Hence, in the proper composite view of the vig-f-GAN identity, the generator rules over the supervised game: it tampers with both the link function and the loss — but only for fake examples. Notice also that when z = −1, the fake examples' loss satisfies ℓ_x(−1, −1) = 0 regardless of x, by definition of the χ-logarithm.

²Informally, Bayes rule realizes the optimum and the loss accommodates any real-valued predictor.

6 Experiments

Two of our theoretical contributions are (A) the fact that, on the generator's side, numerous activation functions v comply with the design of its density as factoring escorts (Lemma 11), and (B) the fact that, on the discriminator's side, the so-called output activation function g_f of [30] in fact aggregates two components of proper composite losses, one of which, the link function Ψ, should be a fine knob to operate (Theorem 12).
We have tested these two possibilities with the idea that an experimental validation should provide substantial ground to be competitive with mainstream approaches, leaving space for finer tuning in specific applications. Also, in order not to mix their effects, we have treated (A) and (B) separately.

Architectures and datasets — We provide in SM (Section VI) the details of all experiments. To summarize, we consider two architectures in our experiments: DCGAN [31] and the multilayer feedforward network (MLP) used in [30]. Our datasets are MNIST [19] and the LSUN tower category [38].

Comparison of varying activations in the generator (A) — We have compared $\mu$-ReLUs with $\mu$ varying in $\{0, 0.1, \ldots, 1\}$ (hence, we include ReLU as the baseline $\mu = 1$), the Softplus, and the Least Square Unit (LSU, Table 1) activation (Figure 1). For each choice of activation function, all inner layers of the generator use the same activation. We evaluate the activation functions using both DCGAN and the MLP of [30] as the architectures. As training divergence, we adopt both GAN [15] and Wasserstein GAN (WGAN, [6]). Results are shown in Figure 1 (left).

[Figure 1: three panels — "µ-ReLU"; "Softplus / LSU / ReLU"; "Discriminator: varying link"]

Figure 1: Summary of our results on MNIST, on experiment A (left + center) and B (right). Left: comparison of different values of $\mu$ for the $\mu$-ReLU activation in the generator (ReLU = 1-ReLU, see text). Thicker horizontal dashed lines show the ReLU average baseline: for each color, points above the baseline represent values of $\mu$ for which ReLU is beaten on average. Center: comparison of different activations in the generator, for the same architectures as in the left plot. Right: comparison of different link functions in the discriminator (see text; best viewed in color).

Three behaviours emerge when varying $\mu$: either it is globally equivalent to ReLU (GAN DCGAN), with local variations that can be better ($\mu = 0.7$) or worse ($\mu = 0$); or it is almost consistently better than ReLU (WGAN MLP); or worse (GAN MLP). The best results were obtained for GAN DCGAN, and we note that the ReLU baseline was essentially beaten for values of $\mu$ yielding smaller variance, and hence smaller uncertainty in the results. The comparison between different activation functions (Figure 1, center) reveals that ($\mu$-)ReLU performs best overall, yet with some variation among architectures. We note in particular that, just as in the intra-$\mu$-ReLU comparisons (Figure 1, left), ReLU performs relatively worse than the other criteria for WGAN MLP, indicating that the best-fit activation may differ between architectures, which is good news. Visual results on LSUN (SM, Table A5) also display the quality of results when changing the $\mu$-ReLU activation.

Comparison of varying link functions in the discriminator (B) — We have compared the replacement of the sigmoid function by the link corresponding to the entropy that is theoretically optimal in boosting algorithms, the Matsushita entropy [18, 28], for which $\Psi_{\mathrm{MAT}}(z) \doteq (1/2) \cdot (1 + z/\sqrt{1 + z^2})$. Figure 1 (right) displays the comparison Matsushita vs "standard" (more specifically, we use the sigmoid in the case of GAN [30], and none in the case of WGAN, to follow current implementations [6]). We evaluate with both DCGAN and MLP on MNIST (same hyperparameters as for the generators; ReLU activation for all hidden layers of the generators).
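Both the sigmoid and the Matsushita transfer $\Psi_{\mathrm{MAT}}$ squash the real-valued discriminator output into $(0,1)$, are monotone increasing, and map $0$ to $1/2$; the visible difference is that $\Psi_{\mathrm{MAT}}$ approaches its asymptotes polynomially rather than exponentially. A minimal sketch of the swap (the function names are ours, not from any GAN codebase):

```python
import numpy as np

def sigmoid(z):
    # Standard GAN output link [30].
    return 1.0 / (1.0 + np.exp(-z))

def psi_matsushita(z):
    # Psi_MAT(z) = (1/2) * (1 + z / sqrt(1 + z^2)), the Matsushita link [18, 28].
    return 0.5 * (1.0 + z / np.sqrt(1.0 + z * z))

z = np.linspace(-5.0, 5.0, 101)
for link in (sigmoid, psi_matsushita):
    out = link(z)
    assert np.all((out > 0.0) & (out < 1.0))   # both map R into (0, 1)
    assert np.all(np.diff(out) > 0.0)          # both are monotone increasing

print(sigmoid(0.0), psi_matsushita(0.0))       # both equal 0.5 at z = 0
# Matsushita decays polynomially toward its asymptotes (heavier tails):
print(1.0 - sigmoid(5.0), 1.0 - psi_matsushita(5.0))
```

In an implementation, swapping the link amounts to replacing the discriminator's output activation while keeping the proper composite loss of eq. (16).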
Experiments tend to display that tuning the link may indeed bring additional uplift: for GANs, Matsushita is indeed better than the sigmoid link for both DCGAN and MLP, while it remains very competitive with the no-link (equivalently, identity-link) WGAN, at least for DCGAN.

7 Conclusion

It is hard to exaggerate the success of GAN approaches in modelling complex domains, and with their success comes an increasing need for a rigorous theoretical understanding [34]. In this paper, we complete the supervised understanding of the generalization of GANs introduced in [30], and provide a theoretical background to understand its unsupervised part, showing in particular how deep architectures can be powerful at tackling the generative part of the game. Experiments display that the tools we develop may help to further improve the state of the art.

8 Acknowledgments

The authors thank the reviewers, Shun-ichi Amari, Giorgio Patrini and Frank Nielsen for numerous comments.

References

[1] S.-M. Ali and S.-D.-S. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society B, 28:131–142, 1966.
[2] S.-I. Amari. Differential-Geometrical Methods in Statistics. Springer-Verlag, Berlin, 1985.
[3] S.-I. Amari. Information Geometry and Its Applications. Springer-Verlag, Berlin, 2016.
[4] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
[5] S.-I. Amari, A. Ohara, and H. Matsuzoe. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometries. Physica A: Statistical Mechanics and its Applications, 391:4308–4319, 2012.
[6] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.
[7] K.-S. Azoury and M.-K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. MLJ, 43(3):211–246, 2001.
[8] A. Ben-Tal, A. Ben-Israel, and M. Teboulle. Certainty equivalents and information measures: Duality and extremal principles. J. of Math. Anal. Appl., pages 211–236, 1991.
[9] J.-D. Boissonnat, F. Nielsen, and R. Nock. Bregman Voronoi diagrams. DCG, 44(2):281–307, 2010.
[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[11] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In 4th ICLR, 2016.
[12] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
[13] R.-M. Frongillo and M.-D. Reid. Convex foundations for generalized maxent models. In 33rd MaxEnt, pages 11–16, 2014.
[14] A. Genevay, G. Peyré, and M. Cuturi. Sinkhorn-autodiff: Tractable Wasserstein learning of generative models. CoRR, abs/1706.00292, 2017.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS*27, pages 2672–2680, 2014.
[16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A.-C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.
[17] R.-H.-R. Hahnloser, R. Sarpeshkar, M.-A. Mahowald, R.-J. Douglas, and H.-S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947–951, 2000.
[18] M.-J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109–128, 1999.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[20] H. Lee, R. Ge, T. Ma, A. Risteski, and S. Arora. On the ability of neural nets to express distributions. CoRR, abs/1702.07028, 2017.
[21] Y. Li, K. Swersky, and R.-S. Zemel. Generative moment matching networks. In 32nd ICML, pages 1718–1727, 2015.
[22] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. CoRR, abs/1705.08991, 2017.
[23] A.-L. Maas, A.-Y. Hannun, and A.-Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In 30th ICML, 2013.
[24] V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In 27th ICML, pages 807–814, 2010.
[25] J. Naudts. Generalized Thermostatistics. Springer, 2011.
[26] X. Nguyen, M.-J. Wainwright, and M.-I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
[27] R. Nock, Z. Cranko, A.-K. Menon, L. Qu, and R.-C. Williamson. f-GANs in an information geometric nutshell. CoRR, abs/1707.04385, 2017.
[28] R. Nock and F. Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201–1208, 2008.
[29] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31:2048–2059, 2009.
[30] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: training generative neural samplers using variational divergence minimization. In NIPS*29, pages 271–279, 2016.
[31] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th ICLR, 2016.
[32] M.-D. Reid and R.-C. Williamson. Composite binary losses. JMLR, 11, 2010.
[33] M.-D. Reid and R.-C. Williamson. Information, divergence and risk for binary experiments. JMLR, 12:731–817, 2011.
[34] T. Salimans, I.-J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS*29, pages 2226–2234, 2016.
[35] M. Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. In 29th ICML, 2012.
[36] R.-F. Vigelis and C.-C. Cavalcante. On φ-families of probability distributions. J. Theor. Probab., 21:1–15, 2011.
[37] L. Wolf, Y. Taigman, and A. Polyak. Unsupervised creation of parameterized avatars. CoRR, abs/1704.05693, 2017.
[38] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.