{"title": "Practical and Consistent Estimation of f-Divergences", "book": "Advances in Neural Information Processing Systems", "page_first": 4070, "page_last": 4080, "abstract": "The estimation of an f-divergence between two probability distributions based on\nsamples is a fundamental problem in statistics and machine learning. Most works\nstudy this problem under very weak assumptions, in which case it is provably hard.\nWe consider the case of stronger structural assumptions that are commonly satisfied\nin modern machine learning, including representation learning and generative\nmodelling with autoencoder architectures. Under these assumptions we propose and\nstudy an estimator that can be easily implemented, works well in high dimensions,\nand enjoys faster rates of convergence. We verify the behavior of our estimator\nempirically in both synthetic and real-data experiments, and discuss its direct\nimplications for total correlation, entropy, and mutual information estimation.", "full_text": "Practical and Consistent Estimation of f-Divergences\n\nPaul K. Rubenstein\u21e4\n\nMax Planck Institute for Intelligent Systems, T\u00fcbingen\n& Machine Learning Group, University of Cambridge\n\npaul.rubenstein@tuebingen.mpg.de\n\nOlivier Bousquet, Josip Djolonga, Carlos Riquelme, Ilya Tolstikhin\n\nGoogle Research, Brain Team, Z\u00fcrich\n\n{obousquet, josipd, rikel, tolstikhin}@google.com\n\nAbstract\n\nThe estimation of an f-divergence between two probability distributions based on\nsamples is a fundamental problem in statistics and machine learning. Most works\nstudy this problem under very weak assumptions, in which case it is provably hard.\nWe consider the case of stronger structural assumptions that are commonly satis\ufb01ed\nin modern machine learning, including representation learning and generative\nmodelling with autoencoder architectures. 
Under these assumptions we propose and study an estimator that can be easily implemented, works well in high dimensions, and enjoys faster rates of convergence. We verify the behavior of our estimator empirically in both synthetic and real-data experiments, and discuss its direct implications for total correlation, entropy, and mutual information estimation.

1 Introduction and related literature

The estimation and minimization of divergences between probability distributions based on samples are fundamental problems of machine learning. For example, maximum likelihood learning can be viewed as minimizing the Kullback-Leibler divergence KL(P_data ‖ P_model) with respect to the model parameters. More generally, generative modelling (most famously Variational Autoencoders and Generative Adversarial Networks [21, 12]) can be viewed as minimizing a divergence D(P_data ‖ P_model) where P_model may be intractable. In variational inference, an intractable posterior p(z|x) is approximated with a tractable distribution q(z) chosen to minimize KL(q(z) ‖ p(z|x)). The mutual information between two variables I(X, Y), core to information theory and Bayesian machine learning, is equivalent to KL(P_{X,Y} ‖ P_X P_Y). Independence testing often involves estimating a divergence D(P_{X,Y} ‖ P_X P_Y), while two-sample testing (does P = Q?) involves estimating a divergence D(P ‖ Q).
Additionally, one approach to domain adaptation, in which a classifier is learned on a distribution P but tested on a distinct distribution Q, involves learning a feature map φ such that a divergence D(φ#P ‖ φ#Q) is minimized, where φ# represents the push-forward operation [3, 11].

In this work we consider the well-known family of f-divergences [7, 24] that includes amongst others the KL, Jensen-Shannon (JS), χ², and α-divergences, as well as the Total Variation (TV) and squared Hellinger (H²) distances, the latter two of which play an important role in the statistics literature [2]. A significant body of work exists studying the estimation of the f-divergence D_f(Q ‖ P) between general probability distributions Q and P. While the majority of this focuses on α-divergences and closely related Rényi-α divergences [35, 37, 22], many works address specifically the KL-divergence [34, 39], with fewer considering f-divergences in full generality [28, 20, 26, 27]. Although the KL-divergence is the most frequently encountered f-divergence in the machine learning literature,

*Part of this work was done during an internship at Google.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

in recent years there has been a growing interest in other f-divergences [30], in particular in the variational inference community where they have been employed to derive alternative evidence lower bounds [5, 23, 9].

The main challenge in computing D_f(Q ‖ P) is that it requires knowledge of either the densities of both Q and P, or the density ratio dQ/dP. In studying this problem, assumptions of differing strength can be made about P and Q.
In the weakest agnostic setting, we may be given only a finite number of i.i.d. samples from the distributions without any further knowledge about their densities. As an example of stronger assumptions, both distributions may be mixtures of Gaussians [17, 10], or we may have access to samples from Q and have full knowledge of P [15, 16], as in e.g. model fitting.

Most of the literature on f-divergence estimation considers the weaker agnostic setting. The lack of assumptions makes such work widely applicable, but comes at the cost of needing to work around estimation of either the densities of P and Q [37, 22] or the density ratio dQ/dP [28, 20] from samples. Both of these estimation problems are provably hard [2, 28] and suffer rates (the speed at which the error of an estimator decays as a function of the number of samples N) of order N^{-1/d} when P and Q are defined over R^d, unless their densities are sufficiently smooth. This is a manifestation of the curse of dimensionality, and rates of this type are often called nonparametric. One could hope to estimate D_f(P ‖ Q) without explicitly estimating the densities or their ratio and thus avoid suffering nonparametric rates; however, a lower bound of the same order N^{-1/d} was recently proved for α-divergences [22], a sub-family of f-divergences. While some works considering the agnostic setting provide rates for the bias and variance of the proposed estimator [28, 22] or even exponential tail bounds [37], it is more common to only show that the estimators are asymptotically unbiased or consistent without proving specific rates of convergence [39, 35, 20].

Motivated by recent advances in machine learning, we study a setting in which much stronger structural assumptions are made about the distributions. Let X and Z be two finite-dimensional Euclidean spaces. We estimate the divergence D_f(Q_Z ‖ P_Z) between two probability distributions P_Z and Q_Z, both defined over Z.
P_Z has known density p(z), while Q_Z with density q(z) admits the factorization q(z) := ∫_X q(z|x) q(x) dx, where access to independent samples from the distribution Q_X with unknown density q(x) and full knowledge of the conditional distribution Q_{Z|X} with density q(z|x) are assumed. In most cases Q_Z is intractable due to the integral, and so is D_f(Q_Z ‖ P_Z). As a concrete example, these assumptions are often satisfied in applications of modern unsupervised generative modeling with deep autoencoder architectures, where X and Z would be data and latent spaces, P_Z the prior, Q_X the data distribution, Q_{Z|X} the encoder, and Q_Z the aggregate posterior.

Given independent observations X_1, ..., X_N from Q_X, the finite mixture Q̂^N_Z := (1/N) Σ_{i=1}^N Q_{Z|X_i} can be used to approximate the continuous mixture Q_Z. Our main contribution is to approximate the intractable D_f(Q_Z ‖ P_Z) with D_f(Q̂^N_Z ‖ P_Z), a quantity that can be estimated to arbitrary precision using Monte-Carlo sampling since both distributions have known densities, and to theoretically study conditions under which this approximation is reasonable. We call D_f(Q̂^N_Z ‖ P_Z) the Random Mixture (RAM) estimator and derive rates at which it converges to D_f(Q_Z ‖ P_Z) as N grows. We also provide similar guarantees for RAM-MC, a practical Monte-Carlo based version of RAM. By side-stepping the need to perform density estimation, we obtain parametric rates of order N^{-γ}, where γ > 0 is independent of the dimension (see Tables 1 and 2), although the constants may still in general show exponential dependence on dimension. This is in contrast to the agnostic setting, where both nonparametric rates and constants are exponential in dimension.

Our results have immediate implications for existing literature. For the particular case of the KL divergence, a similar approach has been heuristically applied independently by several authors for estimating the mutual information [36] and total correlation [6].
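In code, the estimator just described amounts to a few lines. The sketch below is our own illustration, not the paper's reference implementation: it assumes, for concreteness, that P_Z is a standard normal prior and each Q_{Z|X_i} is a diagonal Gaussian, and it computes RAM-MC for the KL divergence using the random mixture itself as the importance-sampling proposal.

```python
import numpy as np

def log_normal_pdf(z, mu, var):
    # Log-density of a diagonal Gaussian N(mu, diag(var)), summed over the last axis.
    return -0.5 * np.sum((z - mu) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

def ram_mc_kl(mu, var, M=128, seed=0):
    """RAM-MC estimate of KL(Q_hat^N || P) for P = N(0, I).

    mu, var: (N, d) arrays parameterizing the N mixture components
    Q_{Z|X_i} = N(mu[i], diag(var[i])) of the random mixture Q_hat^N.
    The proposal is Q_hat^N itself: draw M samples from the mixture.
    """
    rng = np.random.default_rng(seed)
    N, d = mu.shape
    idx = rng.integers(N, size=M)                       # pick mixture components
    z = mu[idx] + rng.standard_normal((M, d)) * np.sqrt(var[idx])
    # log q_hat_N(z): log-mean-exp over the N component log-densities.
    comp = log_normal_pdf(z[:, None, :], mu[None, :, :], var[None, :, :])
    log_q = np.logaddexp.reduce(comp, axis=1) - np.log(N)
    log_p = log_normal_pdf(z, np.zeros(d), np.ones(d))
    # For f(t) = t log t and proposal pi = q_hat_N, the importance-weighted
    # summand f(q/p) * p / pi reduces to log(q/p).
    return np.mean(log_q - log_p)
```

If every component equals the prior, the estimate is exactly zero; if all components equal N(μ, I), it concentrates around the closed-form value ‖μ‖²/2.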
Our results provide strong theoretical grounding for these existing methods by showing sufficient conditions for their consistency.

A final piece of related work is [4], which proposes to reduce the gap introduced by Jensen's inequality in the derivation of the classical evidence lower bound (ELBO) by using multiple Monte-Carlo samples from the approximate posterior Q_{Z|X}. This is similar in flavour to our approach, but fundamentally different, since we use multiple samples from the data distribution to reduce a different Jensen gap. To avoid confusion, we note that replacing the "regularizer" term E_X[KL(Q_{Z|X} ‖ P_Z)] of the classical ELBO with the expectation of our estimator E_{X^N}[KL(Q̂^N_Z ‖ P_Z)] results in an upper bound on the classical ELBO (see Proposition 1) but is itself not in general an evidence lower bound:

E_X[E_{Q_{Z|X}} log p(X|Z) − KL(Q_{Z|X} ‖ P_Z)] ≤ E_X[E_{Q_{Z|X}} log p(X|Z)] − E_{X^N}[KL(Q̂^N_Z ‖ P_Z)].

The remainder of the paper is structured as follows. In Section 2 we introduce the RAM and RAM-MC estimators and present our main theoretical results, including rates of convergence for the bias (Theorems 1 and 2) and tail bounds (Theorems 3 and 4). In Section 3 we validate our results in both synthetic and real-data experiments. In Section 4 we discuss further applications of our results. We conclude in Section 5.

2 Random mixture estimator and convergence results

In this section we introduce our f-divergence estimator and present theoretical guarantees for it. We assume the existence of probability distributions P_Z and Q_Z defined over Z with known density p(z) and intractable density q(z) = ∫ q(z|x) q(x) dx respectively, where Q_{Z|X} is known. Q_X, defined over X, is unknown; however, we have an i.i.d. sample X^N = {X_1, ..., X_N} from it. Our ultimate goal is to estimate the intractable f-divergence D_f(Q_Z ‖ P_Z), defined by:

Definition 1 (f-divergence).
Let f be a convex function on (0, ∞) with f(1) = 0. The f-divergence D_f between distributions Q_Z and P_Z admitting densities q(z) and p(z) respectively is

D_f(Q_Z ‖ P_Z) := ∫ f(q(z)/p(z)) p(z) dz.

Many commonly used divergences such as Kullback-Leibler and χ² are f-divergences. All the divergences considered in this paper together with their corresponding f can be found in Appendix A. Of them, possibly the least well-known in the machine learning literature are the f_β-divergences [32]. These symmetric divergences are continuously parameterized by β ∈ (0, ∞]. Special cases include squared Hellinger (H²) for β = 1/2, Jensen-Shannon (JS) for β = 1, and Total Variation (TV) for β = ∞.

In our setting Q_Z is intractable, and so is D_f(Q_Z ‖ P_Z). Substituting Q_Z with a sample-based finite mixture Q̂^N_Z := (1/N) Σ_{i=1}^N Q_{Z|X_i} leads to our proposed Random Mixture estimator (RAM):

D_f(Q̂^N_Z ‖ P_Z) := D_f((1/N) Σ_{i=1}^N Q_{Z|X_i} ‖ P_Z). (1)

Although Q̂^N_Z is a function of X^N, we omit this dependence in notation for brevity. In this section we identify sufficient conditions under which D_f(Q̂^N_Z ‖ P_Z) is a "good" estimator of D_f(Q_Z ‖ P_Z). More formally, we establish conditions under which the estimator is asymptotically unbiased, concentrates to its expected value, and can be practically estimated using Monte-Carlo sampling.

2.1 Convergence rates for the bias of RAM

The following proposition shows that D_f(Q̂^N_Z ‖ P_Z) upper bounds D_f(Q_Z ‖ P_Z) in expectation for any finite N, and that the upper bound becomes tighter with increasing N:

Proposition 1. Let M ≤ N be integers. Then

D_f(Q_Z ‖ P_Z) ≤ E_{X^N}[D_f(Q̂^N_Z ‖ P_Z)] ≤ E_{X^M}[D_f(Q̂^M_Z ‖ P_Z)]. (2)

Proof sketch (full proof in Appendix B.1). The first inequality follows from Jensen's inequality, using the facts that f is convex and Q_Z = E_{X^N}[Q̂^N_Z].
The second holds since a sample X^M can be drawn by sub-sampling (without replacement) M entries of X^N, and by applying Jensen again.

As a function of N, the expectation is a decreasing sequence that is bounded below. By the monotone convergence theorem, the sequence converges. Theorems 1 and 2 in this section give sufficient conditions under which the expectation of RAM converges to D_f(Q_Z ‖ P_Z) as N → ∞ for a variety of f, and provide rates at which this happens, summarized in Table 1. The two theorems are proved using different techniques and assumptions. These assumptions, along with those of existing methods (see Table 3), are discussed at the end of this section.

Theorem 1 (Rates of the bias). If E_{X∼Q_X}[χ²(Q_{Z|X}, Q_Z)] and KL(Q_Z ‖ P_Z) are finite, then the bias E_{X^N}[D_f(Q̂^N_Z ‖ P_Z)] − D_f(Q_Z ‖ P_Z) decays with rate as given in the first row of Table 1.

Proof sketch (full proof in Appendix B.2). There are two key steps to the proof.
The first is to bound the bias by E_{X^N}[D_f(Q̂^N_Z, Q_Z)]. For the KL this is an equality. For D_{f_β} this holds because for β ≤ 1/2 it is a Hilbertian metric and its square root satisfies the triangle inequality [14]. The second step is to bound E_{X^N}[D_f(Q̂^N_Z, Q_Z)] in terms of E_{X^N}[χ²(Q̂^N_Z, Q_Z)], which is the variance of the average of N i.i.d. random variables and therefore decomposes as E_{X∼Q_X}[χ²(Q_{Z|X}, Q_Z)]/N.

Table 1: Rate of the bias E_{X^N}[D_f(Q̂^N_Z ‖ P_Z)] − D_f(Q_Z ‖ P_Z).

D_f | KL | TV | χ² | H² | JS | D_{f_β}, 1/2<β<1 | D_{f_β}, 1<β<∞ | D_{f_α}, 1<α<∞
Theorem 1 | N^{-1} | N^{-1/2} | – | N^{-1/2} | N^{-1/4} | N^{-1/4} | N^{-1/2} | –
Theorem 2 | N^{-1/3} log N | N^{-1/2} | N^{-1} | N^{-2/5} | N^{-1/3} log N | N^{-1/3} | N^{-1/4} | N^{-(α+1)/(α+5)}

Table 2: Rate ψ(N) of high-probability bounds for D_f(Q̂^N_Z ‖ P_Z) (Theorem 3).

D_f | KL | TV | χ² | H² | JS | D_{f_β}, 1/2<β<1 | D_{f_β}, 1<β<∞ | D_{f_α}, 1<α≤3 | D_{f_α}, 3<α<∞
ψ(N) | N^{-1/6} log N | N^{-1/2} | N^{-1/2} | – | N^{-1/6} log N | N^{-1/6} | N^{-1/2} | N^{-1/2} | N^{(1-3α)/(α+5)}

Theorem 2 (Rates of the bias). If E_{X∼Q_X, Z∼P_Z}[q⁴(Z|X)/p⁴(Z)] is finite, then the bias E_{X^N}[D_f(Q̂^N_Z ‖ P_Z)] − D_f(Q_Z ‖ P_Z) decays with rate as given in the second row of Table 1.

Proof sketch (full proof in Appendix B.4). Denoting by q̂_N(z) the density of Q̂^N_Z, the proof is based on the inequality f(q̂_N(z)/p(z)) − f(q(z)/p(z)) ≤ ((q̂_N(z) − q(z))/p(z)) · f′(q̂_N(z)/p(z)), which holds due to convexity of f, applied to the bias. The integral of this inequality is bounded by controlling f′, requiring subtle treatment when f′ diverges as the density ratio q̂_N(z)/p(z) approaches zero.

2.2 Tail bounds for RAM and practical estimation with RAM-MC

Theorems 1 and 2 describe the convergence of the expectation of RAM over X^N, which in practice may be intractable. Fortunately, the following shows that RAM rapidly concentrates to its expectation.

Theorem 3 (Tail bounds for RAM).
Suppose that χ²(Q_{Z|x} ‖ P_Z) ≤ C < ∞ for all x and for some constant C. Then the RAM estimator D_f(Q̂^N_Z ‖ P_Z) concentrates to its mean in the following sense. For N > 8 and for any δ > 0, with probability at least 1 − δ it holds that

|D_f(Q̂^N_Z ‖ P_Z) − E_{X^N}[D_f(Q̂^N_Z ‖ P_Z)]| ≤ K · ψ(N) √(log(2/δ)),

where K is a constant and ψ(N) is given in Table 2.

Proof sketch (full proof in Appendix B.5). These results follow by applying McDiarmid's inequality. To apply it we need to show that RAM, viewed as a function of X^N, has bounded differences. We show that when replacing X_i ∈ X^N with X′_i, the value of D_f(Q̂^N_Z ‖ P_Z) changes by at most O(N^{-1/2} ψ(N)). The proof of this proceeds similarly to that of Theorem 2.

In practice it may not be possible to evaluate D_f(Q̂^N_Z ‖ P_Z) analytically. We propose to use Monte-Carlo (MC) estimation, since both densities q̂_N(z) and p(z) are assumed to be known. We consider importance sampling with proposal distribution π(z|X^N), highlighting the fact that π can depend on the sample X^N. If π(z|X^N) = p(z), this reduces to normal MC sampling. We arrive at the RAM-MC estimator based on M i.i.d. samples Z^M := {Z_1, ..., Z_M} from π(z|X^N):

D̂^M_f(Q̂^N_Z ‖ P_Z) := (1/M) Σ_{m=1}^M f(q̂_N(Z_m)/p(Z_m)) · p(Z_m)/π(Z_m|X^N). (3)

Table 3: Rate of the bias for other estimators of D_f(P, Q).

| KL | TV | χ² | H² | JS | D_{f_β}, 1/2<β<1 | D_{f_β}, 1<β<∞ | D_{f_α}, 1<α<∞
Krishnamurthy et al. [22] | – | – | – | – | – | – | – | N^{-1/2} + N^{-3s/(2s+d)}
Nguyen et al. [28] | N^{-1/2} | – | – | – | – | – | – | –
Moon and Hero [26] | N^{-1/2} | – | N^{-1/2} | N^{-1/2} | N^{-1/2} | N^{-1/2} | N^{-1/2} | N^{-1/2}

Theorem 4 (RAM-MC is unbiased and consistent). E[D̂^M_f(Q̂^N_Z ‖ P_Z)] = E[D_f(Q̂^N_Z ‖ P_Z)] for any proposal distribution π.
If π(z|X^N) = p(z) or π(z|X^N) = q̂_N(z), then under mild assumptions* on the moments of q(Z|X)/p(Z), and denoting by ψ(N) the rate given in Table 2, we have

Var_{X^N, Z^M}[D̂^M_f(Q̂^N_Z ‖ P_Z)] = O(M^{-1}) + O(ψ(N)²).

Proof sketch (*full statement and proof in Appendix B.6). By the law of total variance,

Var_{X^N, Z^M}[D̂^M_f] = E_{X^N}[Var[D̂^M_f | X^N]] + Var_{X^N}[D_f(Q̂^N_Z ‖ P_Z)].

The first of these terms is O(M^{-1}) by standard results on MC integration, subject to the assumptions on the moments. Using the fact that Var[Y] = ∫_0^∞ P(|Y − E Y| > √t) dt for any random variable Y, we bound the second term by integrating the exponential tail bound of Theorem 3.

Through use of the Efron-Stein inequality, rather than integrating the tail bound provided by McDiarmid's inequality, it is possible for some choices of f to weaken the assumptions under which the O(ψ(N)²) variance is achieved: from uniform boundedness of χ²(Q_{Z|X} ‖ P_Z) to boundedness in expectation. In general, a variance better than O(M^{-1}) is not possible using importance sampling. However, the constant, and hence the practical performance, may vary significantly depending on the choice of π. We note in passing that through Chebyshev's inequality it is possible to derive confidence bounds for RAM-MC of a form similar to Theorem 3, but with an additional dependence on M and a worse dependence on δ. For brevity we omit this.

2.3 Discussion: assumptions and summary

All the rates in this section are independent of the dimension of the space Z over which the distributions are defined.
However, the constants may exhibit some dependence on the dimension. Accordingly, for fixed N, the bias and variance may generally grow with the dimension.

Although the data distribution Q_X will generally be unknown, in some practical scenarios such as deep autoencoder models, P_Z may be chosen by design and Q_{Z|X} learned subject to architectural constraints. In such cases, the assumptions of Theorems 2 and 3 can be satisfied by making suitable restrictions (we conjecture also for Theorem 1). For example, suppose that P_Z is N(0, I_d) and Q_{Z|X} is N(μ(X), Σ(X)) with Σ diagonal. Then the assumptions hold if there exist constants K, ε > 0 such that ‖μ(X)‖ < K and Σ_ii(X) ∈ [ε, 1] for all i (see Appendix B.7). In practice, numerical stability often requires the diagonal entries of Σ to be lower bounded by a small number (e.g. 10^{-6}). If X is compact (as for images) then such a K is guaranteed to exist; if not, choosing K very large yields an insignificant constraint.

Table 3 summarizes the rates of the bias for some existing methods. In contrast to our proposal, the assumptions of these estimators may in practice be difficult to verify. For the estimator of [22], both densities p and q must belong to the Hölder class of smoothness s, be supported on [0, 1]^d, and satisfy 0 < η₁ < p, q < η₂ < ∞ on the support for known constants η₁, η₂. For that of [28], the density ratio p/q must satisfy 0 < η₁ < p/q < η₂ < ∞ and belong to a function class G whose bracketing entropy (a measure of the complexity of a function class) is properly bounded. The condition on the bracketing entropy is quite strong and ensures that the density ratio is well behaved. For the estimator of [26], both p and q must have the same bounded support and satisfy 0 < η₁ < p, q < η₂ < ∞ on the support.
Moreover, p and q must have continuous bounded derivatives of order d (which is stronger than the assumptions of [22]), and f must have derivatives of order at least d.

In summary, the RAM estimator D_f(Q̂^N_Z ‖ P_Z) for D_f(Q_Z ‖ P_Z) is consistent, since it concentrates to its expectation E_{X^N}[D_f(Q̂^N_Z ‖ P_Z)], which in turn converges to D_f(Q_Z ‖ P_Z). It is also practical, because it can be efficiently estimated with Monte-Carlo sampling via RAM-MC.

3 Empirical evaluation

In the previous section we showed that our proposed estimator has a number of desirable theoretical properties. Next we demonstrate its practical performance. First, we present a synthetic experiment investigating the behaviour of RAM-MC in controlled settings where all distributions and divergences are known. Second, we investigate the use of RAM-MC in a more realistic setting to estimate a divergence between the aggregate posterior Q_Z and prior P_Z in pretrained autoencoder models. For experimental details not included in the main text, see Appendix C.²

3.1 Synthetic experiments

The data model. Our goal in this subsection is to test the behaviour of the RAM-MC estimator for various d = dim(Z) and f-divergences. We choose a setting in which Q^λ_Z, parametrized by a scalar λ, and P_Z are both d-variate normal distributions for d ∈ {1, 4, 16}. We use RAM-MC to estimate D_f(Q^λ_Z, P_Z), which can be computed analytically for the KL, χ², and squared Hellinger divergences in this setting (see Appendix C.1.1). Namely, we take P_Z and Q_X to be standard normal distributions over Z = R^d and X = R^20 respectively, and Z ∼ Q^λ_{Z|X} to be a linear transform of X plus a fixed isotropic Gaussian noise, with the linear function parameterized by λ. By varying λ we can interpolate between different values for D_f(Q^λ_Z ‖ P_Z).

The estimators. In Figure 1 we show the behaviour of RAM-MC with N ∈ {1, 500} and M = 128 compared to the ground truth as λ is varied.
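As an aside, the analytic ground truth is simple to compute in this Gaussian setting. For example, the KL between a diagonal-covariance Gaussian and the standard normal has the well-known closed form sketched below (this helper is our own illustration, not code from the paper's notebook):

```python
import numpy as np

def kl_gauss_vs_std_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I_d) ), in nats.

    Closed form: 0.5 * (sum(var) + ||mu||^2 - d - sum(log var)).
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    return 0.5 * (var.sum() + mu @ mu - mu.size - np.log(var).sum())
```

For instance, `kl_gauss_vs_std_normal([0, 0], [1, 1])` is exactly 0, and a pure mean shift μ gives ‖μ‖²/2.
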
The columns of Figure 1 correspond to the different dimensions d ∈ {1, 4, 16}, and the rows to the KL, χ², and H² divergences, respectively. We also include two baseline methods: first, a plug-in method based on kernel density estimation [26]; second, and only for the KL case, the M1 method of [28] based on density ratio estimation.

The experiment. To produce each plot, the following was performed 10 times, with the mean result giving the bold lines and the standard deviation giving the error bars. First, N points X^N were drawn from Q_X. Then M = 128 points Z^M were drawn from Q̂^N_Z and RAM-MC (3) was evaluated. For the plug-in estimator, the densities q̂(z) and p̂(z) were estimated by kernel density estimation with 500 samples from Q_Z and P_Z respectively, using the default settings of the Python library scipy.stats.gaussian_kde. The divergence was then estimated via MC sampling using 128 samples from Q_Z and the surrogate densities. The M1 estimator involves solving a convex linear program in N variables to maximize a lower bound on the true divergence; see [28] for more details. Although the M1 estimator can in principle be used for arbitrary f-divergences, its implementation requires hand-crafted derivations that are supplied only for the KL in [28], which are the ones we use.

Discussion. The results of this experiment empirically support Proposition 1 and Theorems 1, 2, and 4: (i) in expectation, RAM-MC upper bounds the true divergence; (ii) by increasing N from 1 to 500 we clearly decrease both the bias and the variance of RAM-MC. When the dimension d increases, the bias for fixed N also increases. This is consistent with the theory in that, although the rates are independent of d, the constants are not. We note that by side-stepping the issue of density estimation, RAM-MC performs favourably compared to the plug-in and M1 estimators, more so in higher dimensions (d = 16).
In particular, the shape of the RAM-MC curve follows that of the truth for each divergence, while that of the plug-in estimator does not for larger dimensions. In some cases the plug-in estimator can even take negative values because of the large variance.

3.2 Real-data experiments

The data model. To investigate the behaviour of RAM-MC in a more realistic setting, we consider Variational Autoencoders (VAEs) and Wasserstein Autoencoders (WAEs) [21, 38]. Both models involve learning an encoder Q^θ_{Z|X} with parameter θ mapping from high-dimensional data to a lower-dimensional latent space, and a decoder mapping in the reverse direction. A prior distribution

² A python notebook to reproduce all experiments is available at https://github.com/google-research/google-research/tree/master/f_divergence_estimation_ram_mc.

Figure 1: (Section 3.1) Estimating D_f(N(μ_λ, Σ_λ) ‖ N(0, I_d)) for various f, d, and parameters μ_λ and Σ_λ indexed by λ ∈ R. The horizontal axes correspond to λ ∈ [−2, 2], the columns to d ∈ {1, 4, 16}, and the rows to the KL, χ², and H² divergences respectively. Blue are true divergences, black and red are RAM-MC estimates (3) for N ∈ {1, 500} respectively, green is the M1 estimator of [28], and orange are plug-in estimates based on Gaussian kernel density estimation [26]. N = 500 and M = 128 in all the plots if not specified otherwise. Error bars depict one standard deviation over 10 experiments.

P_Z is specified, and the optimization objectives of both models are of the form "reconstruction + distribution matching penalty". The penalty of the VAE was shown by [19] to be equivalent to KL(Q^θ_Z ‖ P_Z) + I(X, Z), where I(X, Z) is the mutual information of a sample and its encoding. The WAE penalty is D(Q^θ_Z ‖ P_Z) for any divergence D that can practically be estimated.
Following [38], we trained models using the Maximum Mean Discrepancy (MMD), a kernel-based distance on distributions, and a divergence estimated using a GAN-style classifier, leading to WAE-MMD and WAE-GAN respectively [13, 12]. For more information about VAE and WAE, see Appendix C.2.1.

The experiment. We consider models pre-trained on the CelebA dataset [25], and use them to evaluate the RAM-MC estimator as follows. We take the test dataset as the ground-truth Q_X, and embed it into the latent space via the trained encoder. As a result, we obtain a ≈20k-component Gaussian mixture for Q_Z, the empirical aggregate posterior. Since Q_Z is a finite, not continuous, mixture, the true D_f(Q_Z ‖ P_Z) can be estimated using a large number of MC samples (we used 10⁴). Note that this is very costly and involves evaluating 2·10⁴ Gaussian densities for each of the 10⁴ MC points. We repeated this evaluation 10 times and report means and standard deviations. RAM-MC is evaluated using N ∈ {2⁰, 2¹, ..., 2¹⁴} and M ∈ {10, 10³}. For each combination (N, M), RAM-MC was computed 50 times, with the means plotted as bold lines and standard deviations as error bars. In Figure 2 we show the result of performing this for the KL divergence on six different models. For each dimension d ∈ {32, 64, 128}, we chose two models from the classes (VAE, WAE-MMD, WAE-GAN). See Appendix C.2 for further details and similar plots for the H²-divergence.

Discussion. The results are encouraging. In all cases RAM-MC achieves a reasonable accuracy with N relatively small, even for the bottom-right model where the true KL divergence (≈ 1910) is very big. We see evidence supporting Theorem 4, which says that the variance of RAM-MC is mostly determined by whichever of M^{-1} and ψ(N)² is larger: when N is small, the variance of RAM-MC does not change significantly with M; however, when N is large, increasing M significantly reduces the variance.
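A practical detail worth making explicit (our own sketch, not code from the paper's notebook): with ≈20k mixture components, the individual Gaussian densities underflow in high dimensions, so log q̂_N(z) should be computed via a log-sum-exp over component log-densities rather than by summing raw densities.

```python
import numpy as np

def mixture_log_density(z, mu, var):
    """log q_hat_N(z) for an equally-weighted diagonal-Gaussian mixture.

    z: (M, d) evaluation points; mu, var: (N, d) component parameters.
    Uses log-sum-exp (via np.logaddexp.reduce) to avoid underflow of the
    individual component densities.
    """
    comp = -0.5 * np.sum(
        (z[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :]
        + np.log(2 * np.pi * var[None, :, :]),
        axis=-1,
    )  # (M, N) component log-densities
    return np.logaddexp.reduce(comp, axis=1) - np.log(mu.shape[0])
```

The same routine evaluates both the RAM-MC integrand and, when the mixture is used as the proposal, the importance weights.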
Also, we found there to be two general modes of behaviour of RAM-MC across the six trained models we considered. In the bottom row of Figure 2 we see that the decrease in bias with

Figure 2: (Section 3.2) Estimates of KL(Q^θ_Z ‖ P_Z) for pretrained autoencoder models with RAM-MC as a function of N for M = 10 (green) and M = 1000 (red), compared to an accurate MC estimate of the ground truth (blue). Lines and error bars represent means and standard deviations over 50 trials.

N is very obvious, supporting Proposition 1 and Theorems 1 and 2. In contrast, in the top row it is less obvious, because the comparatively larger variance for M = 10 dominates reductions in the bias. Even in this case, both the bias and variance of RAM-MC with M = 1000 become negligible for large N. Importantly, the behaviour of RAM-MC does not degrade in higher dimensions.

The baseline estimators (plug-in [26] and M1 [28]) perform so poorly that we decided not to include them in the plots (doing so would distort the y-axis scale). In contrast, even with a relatively modest N = 2⁸ and M = 1000 samples, RAM-MC behaves reasonably well in all cases.

4 Applications: total correlation, entropy, and mutual information estimates

In this section we describe in detail some direct consequences of our new estimator and its guarantees. Our theory may also apply to a number of machine learning domains where estimating entropy, total correlation or mutual information is either the final goal or part of a broader optimization loop.

Total correlation and entropy estimation. The differential entropy, which is defined as H(Q_Z) = −∫_Z q(z) log q(z) dz, is often a quantity of interest in machine learning.
While this is intractable in general, straightforward computation shows that for any P_Z,

H(Q_Z) − E_{X^N}[H(Q̂^N_Z)] = E_{X^N}[KL(Q̂^N_Z ‖ P_Z)] − KL(Q_Z ‖ P_Z).

Therefore, our results provide sufficient conditions under which H(Q̂^N_Z) converges to H(Q_Z) and concentrates to its mean. We now examine some consequences for Variational Autoencoders (VAEs).

Total Correlation is considered by [6]: TC(Q_Z) := KL(Q_Z ‖ Π_{i=1}^{d_Z} Q_{Z_i}) = Σ_{i=1}^{d_Z} H(Q_{Z_i}) − H(Q_Z), where Q_{Z_i} is the i-th marginal of Q_Z. This is added to the VAE loss function to encourage Q_Z to be factorized, resulting in the β-TC-VAE algorithm. By the second equality above, estimation of TC can be reduced to estimation of H(Q_Z) (only slight modifications are needed to treat H(Q_{Z_i})).

Two methods are proposed in [6] for estimating H(Q_Z), both of which assume a finite dataset of size D. One of these, named Minibatch Weighted Sample (MWS), coincides with H(Q̂^N_Z) + log D estimated with a particular form of MC sampling. Our results therefore imply inconsistency of the MWS method due to the constant log D offset. In the context of [6] this is not actually problematic, since a constant offset does not affect gradient-based optimization techniques. Interestingly, although the derivations of [6] suppose a data distribution of finite support, our results show that minor modifications result in an estimator suitable for both finite- and infinite-support data distributions.

Mutual information estimation. The mutual information (MI) between variables with joint distribution Q_{Z,X} is defined as I(Z, X) := KL(Q_{Z,X} ‖ Q_Z Q_X) = E_X[KL(Q_{Z|X} ‖ Q_Z)]. Several recent papers have estimated or optimized this quantity in the context of autoencoder architectures, coinciding with our setting [8, 19, 1, 31].
In particular, [36] propose the following estimator, based on replacing Q_Z with Q̂^N_Z, and prove it to be a lower bound on the true MI:

I^N_{TCPC}(Z, X) = E_{X^N}[ (1/N) Σ_{i=1}^N KL[Q_{Z|X_i} ‖ Q̂^N_Z] ] ≤ I(Z, X).

The gap can be written as I(Z, X) − I^N_{TCPC}(Z, X) = E_{X^N} KL[Q̂^N_Z ‖ P_Z] − KL[Q_Z ‖ P_Z], where P_Z is any distribution. Therefore, our results also provide sufficient conditions under which I^N_{TCPC} converges and concentrates to the true mutual information.

5 Conclusion

We introduced a practical estimator for the f-divergence D_f(Q_Z ‖ P_Z), where Q_Z = ∫ Q_{Z|X} dQ_X, samples from Q_X are available, and P_Z and Q_{Z|X} have known densities. The RAM estimator is based on approximating the true Q_Z with data samples as a random mixture Q̂^N_Z = (1/N) Σ_n Q_{Z|X_n}. We denote by RAM-MC the version of the estimator in which D_f(Q̂^N_Z ‖ P_Z) is itself estimated with MC sampling. We proved rates of convergence and concentration for both RAM and RAM-MC, in terms of the sample size N and the number of MC samples M, under a variety of choices of f. Synthetic and real-data experiments strongly support the validity of our proposal in practice, and our theoretical results provide guarantees for methods previously proposed heuristically in the existing literature.

Future work will investigate the use of our proposals in optimization loops, in contrast to pure estimation.
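The RAM-MC construction summarized above admits a very short implementation. The following NumPy sketch (our illustration, with hypothetical names, assuming diagonal-Gaussian conditionals Q_{Z|X_n} and a standard normal prior P_Z) handles the KL case f(t) = t log t:

```python
import numpy as np

def ram_mc_kl(mu, var, M=1000, seed=0):
    """RAM-MC sketch: MC estimate of KL(Q_hat || P_Z) with prior P_Z = N(0, I),
    where Q_hat = (1/N) sum_n N(mu[n], diag(var[n])) is the random mixture
    built from N encoded data points (mu, var of shape (N, d))."""
    rng = np.random.default_rng(seed)
    N, d = mu.shape
    idx = rng.integers(N, size=M)                       # pick mixture components
    z = mu[idx] + np.sqrt(var[idx]) * rng.standard_normal((M, d))
    # Log-density of each sample under each component, shape (M, N).
    comp = -0.5 * (d * np.log(2 * np.pi) + np.log(var[None]).sum(-1)
                   + ((z[:, None] - mu[None]) ** 2 / var[None]).sum(-1))
    m = comp.max(axis=1, keepdims=True)                 # stable log-mean-exp
    log_q = (m + np.log(np.exp(comp - m).mean(axis=1, keepdims=True))).ravel()
    log_p = -0.5 * (d * np.log(2 * np.pi) + (z ** 2).sum(-1))  # N(0, I) prior
    return (log_q - log_p).mean()                       # (1/M) sum log(q/p)
```

Other choices of f follow similarly by MC-averaging the appropriate function of the density ratio q̂(z)/p(z) instead of its logarithm.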
When Q^θ_{Z|X} depends on a parameter θ and the goal is to minimize D_f(Q^θ_Z ‖ P_Z) with respect to θ, RAM-MC provides a practical surrogate loss that can be minimized using stochastic gradient methods.

Acknowledgements

Thanks to Alessandro Ialongo, Niki Kilbertus, Luigi Gresele, Giambattista Parascandolo, Mateo Rojas-Carulla and the rest of the Empirical Inference group at the MPI, and Ben Poole, Sylvain Gelly, Alexander Kolesnikov and the rest of the Brain Team in Zurich for stimulating discussions, support and advice.

References

[1] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In ICML, pages 159–168, 2018.

[2] Alexandre B. Tsybakov. Introduction to nonparametric estimation. 2009.

[3] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.

[4] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[5] Liqun Chen, Chenyang Tao, Ruiyi Zhang, Ricardo Henao, and Lawrence Carin Duke. Variational inference and model selection with generalized evidence bounds. In ICML, 2018.

[6] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

[7] Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.

[8] Adji B Dieng, Yoon Kim, Alexander M Rush, and David M Blei. Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863, 2018.

[9] Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via χ upper bound minimization.
In Advances in Neural Information Processing Systems, pages 2732–2741, 2017.

[10] J.-L. Durrieu, J.-Ph. Thiran, and Finnian Kelly. Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4833–4836. IEEE, 2012.

[11] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[13] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[14] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, 2005.

[15] A. O. Hero, B. Ma, O. Michel, and J. Gorman. Alpha divergence for classification, indexing and retrieval. Comm. and Sig. Proc. Lab. (CSPL), Dept. EECS, Univ. Michigan, Ann Arbor, Tech. Rep. 328, 2001.

[16] A. O. Hero, B. Ma, O. J. J. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 2002.

[17] John R Hershey and Peder A Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), volume 4, pages IV-317. IEEE, 2007.

[18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[19] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. 2016.

[20] T. Kanamori, T. Suzuki, and M. Sugiyama. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2), 2012.

[21] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[22] A. Krishnamurthy, A. Kandasamy, B. Póczos, and L. Wasserman. Nonparametric estimation of Rényi divergence and friends. In ICML, 2014.

[23] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.

[24] Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.

[25] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

[26] K. Moon and A. Hero. Ensemble estimation of multivariate f-divergence. In 2014 IEEE International Symposium on Information Theory, pages 356–360, 2014.

[27] K. Moon and A. Hero. Multivariate f-divergence estimation with confidence. In NeurIPS, 2014.

[28] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[29] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process.
Lett., 21(1):10–13, 2014.

[30] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[31] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[32] Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 55(3):639–653, 2003.

[33] Leandro Pardo. Statistical inference based on divergence measures. Chapman and Hall/CRC, 2005.

[34] F. Perez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In IEEE International Symposium on Information Theory, 2008.

[35] B. Póczos and J. Schneider. On the estimation of alpha-divergences. In AISTATS, 2011.

[36] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A Alemi, and George Tucker. On variational lower bounds of mutual information. In ICML, 2018.

[37] S. Singh and B. Póczos. Generalized exponential concentration inequality for Rényi divergence estimation. In ICML, 2014.

[38] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018.

[39] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances.
IEEE Transactions on Information Theory, 55(5), 2009.