{"title": "Nonparametric Density Estimation under Adversarial Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 10225, "page_last": 10236, "abstract": "We study minimax convergence rates of nonparametric density estimation under a large class of loss functions called ``adversarial losses'', which, besides classical L^p losses, includes maximum mean discrepancy (MMD), Wasserstein distance, and total variation distance. These losses are closely related to the losses encoded by discriminator networks in generative adversarial networks (GANs). In a general framework, we study how the choice of loss and the assumed smoothness of the underlying density together determine the minimax rate. We also discuss implications for training GANs based on deep ReLU networks, and more general connections to learning implicit generative models in a minimax statistical sense.", "full_text": "Nonparametric Density Estimation\n\nwith Adversarial Losses\n\nShashank Singh1,2,\u2217\n\nAnanya Uppal3\n\nBoyue Li4\n\nChun-Liang Li1 Manzil Zaheer1\n\nBarnab\u00e1s P\u00f3czos1\n\n1Machine Learning Department\n3Department of Mathematical Sciences\n\n2Department of Statistics & Data Science\n4Language Technologies Institute\n\nCarnegie Mellon University\n\n\u2217Corresponding Author: sss1@cs.cmu.edu\n\nAbstract\n\nWe study minimax convergence rates of nonparametric density estimation under a\nlarge class of loss functions called \u201cadversarial losses\u201d, which, besides classical Lp\nlosses, includes maximum mean discrepancy (MMD), Wasserstein distance, and\ntotal variation distance. These losses are closely related to the losses encoded by\ndiscriminator networks in generative adversarial networks (GANs). In a general\nframework, we study how the choice of loss and the assumed smoothness of\nthe underlying density together determine the minimax rate. 
We also discuss implications for training GANs based on deep ReLU networks, and more general connections to learning implicit generative models in a minimax statistical sense.

1 Introduction

Generative modeling, that is, modeling the distribution from which data are drawn, is a central task in machine learning and statistics. Often, prior information is insufficient to guess the form of the data distribution. In statistics, generative modeling in these settings is usually studied from the perspective of nonparametric density estimation, in which histogram, kernel, orthogonal series, and nearest-neighbor methods are popular approaches with well-understood statistical properties [49, 47, 14, 7]. Recently, machine learning has made significant empirical progress in generative modeling, using such tools as generative adversarial networks (GANs) and variational autoencoders (VAEs). Computationally, these methods are quite distinct from classical density estimators; they usually rely on deep neural networks, fit by black-box optimization, rather than a mathematically prescribed smoothing operator, such as convolution with a kernel or projection onto a finite-dimensional subspace.

Setting implementation aside, from the perspective of statistical analysis, these recent methods differ from classical density estimators in at least two main ways. First, they are implicit, rather than explicit (or prescriptive), generative models [12, 29]; that is, rather than an estimate of the probability of a set or the density at a point, they return novel samples from the data distribution.
Second, in many recent models, loss is measured not with L^p distances (as is conventional in nonparametric statistics [49, 47]), but rather with weaker losses, such as

    d_{F_D}(P, Q) = sup_{f ∈ F_D} | E_{X∼P}[f(X)] − E_{X∼Q}[f(X)] |,   (1)

where F_D is a discriminator class of bounded, Borel-measurable functions, and P and Q lie in a generator class F_G of Borel probability measures on a sample space X. Specifically, GANs often use losses of this form because (1) can be approximated by a discriminator neural network.

This paper attempts to help bridge the gap between traditional nonparametric statistics and these recent advances by studying these two differences from a statistical minimax perspective. Specifically, under traditional statistical smoothness assumptions, we identify (i.e., prove matching upper and lower bounds on) minimax convergence rates for density estimation under several losses of the form (1). We also discuss some consequences this has for particular neural network implementations of GANs based on these losses. Finally, we study connections between minimax rates for explicit and implicit generative modeling, under a plausible notion of risk for implicit generative models.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1.1 Adversarial Losses

The quantity (1) has been extensively studied, in the case that F_D is a reproducing kernel Hilbert space (RKHS), under the name maximum mean discrepancy (MMD; [18, 46]), and, in a wider context, under the name integral probability metric (IPM; [31, 42, 43, 8]). [5] also called (1) the F_D-distance or, when F_D is a family of functions that can be implemented by a neural network, the neural network distance.
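As a purely illustrative numerical companion to (1): when the discriminator class is a small, hand-picked finite set of bounded functions, the supremum becomes a maximum over differences of empirical means. Everything below (the two distributions, the four discriminators, the sample sizes) is a hypothetical sketch, not a construction from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two samples: X ~ P (uniform on [0,1]) and Y ~ Q (Beta(2,2), concentrated near 1/2).
X = rng.uniform(0.0, 1.0, size=5000)
Y = rng.beta(2.0, 2.0, size=5000)

# A tiny, hypothetical discriminator class F_D of bounded functions.
discriminators = [
    lambda x: np.cos(2 * np.pi * x),       # first Fourier mode
    lambda x: np.sin(2 * np.pi * x),
    lambda x: np.cos(4 * np.pi * x),
    lambda x: np.clip(2 * x - 1, -1, 1),   # a bounded ramp
]

def adversarial_loss(X, Y, fs):
    """Plug-in estimate of d_{F_D}(P,Q) = sup_f |E_P f - E_Q f| over a finite class."""
    return max(abs(f(X).mean() - f(Y).mean()) for f in fs)

d_hat = adversarial_loss(X, Y, discriminators)
```

Since each discriminator here is bounded by 1, the estimated loss is automatically at most 2, and it is exactly 0 when the two samples coincide, illustrating that d_{F_D} is a pseudometric.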
We settled on the name "adversarial loss" because, without assuming any structure on F_D, this matches the intuition of the expression (1), namely that of an adversary selecting the most distinguishing linear projection f ∈ F_D between the true density P and our estimate P̂ (e.g., by the discriminator network in a GAN).

One can check that d_{F_D} : F_G × F_G → [0, ∞] is a pseudometric (i.e., it is non-negative and satisfies the triangle inequality, and d_{F_D}(P, Q) > 0 ⇒ P ≠ Q, although d_{F_D}(P, Q) = 0 need not imply P = Q unless F_D is sufficiently rich). Many popular (pseudo)metrics between probability distributions, including L^p [49, 47], Sobolev [24, 30], maximum mean discrepancy (MMD; [46])/energy [45, 36], total variation [48], (1-)Wasserstein/Kantorovich-Rubinstein [20, 48], Kolmogorov-Smirnov [22, 40], and Dudley [13, 1] metrics, can be written in this form for appropriate choices of F_D.

The main contribution of this paper is a statistical analysis of the problem of estimating a distribution P from n IID observations using the loss d_{F_D}, in a minimax sense over P ∈ F_G, for fairly general nonparametric smoothness classes F_D and F_G. General upper and lower bounds are given in terms of decay rates of coefficients of functions in terms of an (arbitrary) orthonormal basis of L² (including, e.g., Fourier or wavelet bases); note that this does not require F_D or F_G to have any inner product structure, only that F_D ⊆ L¹.
We also discuss some consequences for density estimators based on neural networks (such as GANs), and consequences for the closely related problem of implicit generative modeling (i.e., of generating novel samples from a target distribution, rather than estimating the distribution itself), in terms of which GANs and VAEs are usually cast.

Paper Organization: Section 2 provides our formal problem statement and required notation. Section 3 discusses related work on nonparametric density estimation, with further discussion of the theory of GANs provided in the Appendix. Sections 4 and 5 contain our main theoretical upper and lower bound results, respectively. Section 6 develops our general results from Sections 4 and 5 into concrete minimax convergence rates for some important special cases. Section 7 uses our theoretical results to upper bound the error of perfectly optimized GANs. Section 8 establishes some theoretical relationships between the convergence of optimal density estimators and optimal implicit generative models. The Appendix provides proofs of our theoretical results, further applications, further discussion of related and future work, and experiments on simulated data that support our theoretical results.

2 Problem Statement and Notation

We now provide a formal statement of the problem studied in this paper in a very general setting, and then define the notation required for our specific results.

Formal Problem Statement: Let P ∈ F_G be an unknown probability measure on a sample space X, from which we observe n IID samples X_{1:n} = X_1, ..., X_n ∼ P. In this paper, we are interested in using the samples X_{1:n} to estimate the measure P, with error measured using the adversarial loss d_{F_D}.
Specifically, for various choices of the spaces F_D and F_G, we seek to bound the minimax rate

    M(F_D, F_G) := inf_{P̂} sup_{P ∈ F_G} E_{X_{1:n}} [ d_{F_D}(P, P̂(X_{1:n})) ]

of estimating distributions assumed to lie in a class F_G, where the infimum is taken over all estimators P̂ (i.e., all (potentially randomized) functions P̂ : X^n → F_G). We will discuss both the case when F_G is known a priori and the adaptive case when it is not.

2.1 Notation

For a non-negative integer n, we use [n] := {1, 2, ..., n} to denote the set of positive integers at most n. For sequences {a_n}_{n∈N} and {b_n}_{n∈N} of non-negative reals, a_n ≲ b_n (and, similarly, b_n ≳ a_n) indicates the existence of a constant C > 0 such that lim sup_{n→∞} a_n/b_n ≤ C, while a_n ≍ b_n indicates a_n ≲ b_n ≲ a_n. For functions f : R^d → R, we write

    lim_{‖z‖→∞} f(z) := sup_{{z_n}_{n∈N} : ‖z_n‖→∞} lim_{n→∞} f(z_n),

where the supremum is taken over all diverging R^d-valued sequences. Note that, by equivalence of finite-dimensional norms, the exact choice of the norm ‖·‖ does not matter here. We will also require summations of the form Σ_{z∈Z} f(z) in cases where Z is a (potentially infinite) countable index set and {f(z)}_{z∈Z} is summable but not necessarily absolutely summable. Therefore, to ensure that the summation is well-defined, the order of summation will need to be specified, depending on the application (as in, e.g., Section 6).

Fix the sample space X = [0, 1]^d to be the d-dimensional unit cube, over which λ denotes the usual Lebesgue measure.
Given a measurable function f : X → R, let, for any Borel measure µ on X, p ∈ [1, ∞], and L > 0,

    ‖f‖_{L^p_µ} := ( ∫_X |f|^p dµ )^{1/p}   and   L^p_µ(L) := { f : X → R : ‖f‖_{L^p_µ} < L }

(taking the appropriate limit if p = ∞) denote the Lebesgue norm and the Lebesgue ball of radius L, respectively. Fix an orthonormal basis B = {φ_z}_{z∈Z} of L²_λ indexed by a countable family Z. To allow probability measures P without densities (i.e., P not absolutely continuous with respect to µ), we assume each basis element φ_z : X → R is a bounded function, so that P̃_z := E_{X∼P}[φ_z(X)] is well-defined. For constants L > 0 and p ≥ 1 and a real-valued net {a_z}_{z∈Z}, our results pertain to generalized ellipses of the form

    H_{p,a}(L) = { f ∈ L¹(X) : ( Σ_{z∈Z} a_z^p |f̃_z|^p )^{1/p} ≤ L }

(where f̃_z := ∫_X f φ_z dµ is the zth coefficient of f in the basis B).
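The ellipse constraint can be made concrete with a hand-picked coefficient sequence. The sketch below (illustrative numbers, not from the paper) takes p = 2 with Sobolev-type weights a_z = |z|^s and coefficients |f̃_z| = |z|^{−3}, so that Σ_z a_z² |f̃_z|² converges for s = 2 but diverges for s = 3; such an f would belong to H_{2,a} only under the smaller smoothness weight:

```python
# Partial sums of sum_z a_z^2 |f_z|^2 with a_z = z^s and |f_z| = z^{-3} (z = 1, 2, ...).
# Hypothetical coefficients, chosen so the full sum is finite iff s < 2.5.
def ellipse_partial_sum(s, N):
    return sum(float(z) ** (2 * s - 6) for z in range(1, N + 1))

# s = 2: terms z^{-2} are summable, so partial sums stabilize.
stable_gap = abs(ellipse_partial_sum(2, 4000) - ellipse_partial_sum(2, 2000))
# s = 3: terms are identically 1, so partial sums grow linearly in N.
diverging = ellipse_partial_sum(3, 2000)
```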
We sometimes omit the dependence on L (e.g., H_{p,a} = H_{p,a}(L)) when its value does not matter (e.g., when discussing rates of convergence). A particular case of interest is the scale of Sobolev spaces, defined for s, L ≥ 0 and p ≥ 1 by

    W^{s,p}(L) = { f ∈ L¹(X) : ( Σ_{z∈Z} |z|^{sp} |f̃_z|^p )^{1/p} ≤ L }.

For example, when B is the standard Fourier basis and s is an integer, for a constant factor c depending only on s and the dimension d,

    W^{s,p}(cL) := { f ∈ L^p_λ : ‖f^{(s)}‖_{L^p_λ} < L }

corresponds to the natural standard smoothness class of L^p_λ functions having sth-order (weak) derivatives f^{(s)} in L^p_λ(L) [24].

3 Related Work

Our results apply directly to many of the losses that have been used in GANs, including 1-Wasserstein distance [3, 19], MMD [25], Sobolev distances [30], and the Dudley metric [1]. As discussed in the Appendix, slightly different assumptions are required to obtain results for the Jensen-Shannon divergence (used in the original GAN formulation of [17]) and other f-divergences [33].

Given their generality, our results relate to many prior works on distribution estimation, including classical work in nonparametric statistics and empirical process theory, as well as more recent work studying Wasserstein distances and MMD.
Here, we briefly survey known results for these problems. There have also been a few other statistical analyses of the GAN framework; due to space constraints, we discuss those works in the Appendix.

L²_λ distances: Classical work on nonparametric statistics has typically focused on the problem of smooth density estimation under L²_λ loss, corresponding to the adversarial loss d_{F_D} with F_D = L²_λ(L_D) (the Hölder dual of L²) [49, 47]. In this case, when F_G = W^{t,2}(L_G) is a Sobolev class, the minimax rate is typically M(F_D, F_G) ≍ n^{−t/(2t+d)}, matching the rates given by our main results.

Maximum Mean Discrepancy (MMD): When F_D is a reproducing kernel Hilbert space (RKHS), the adversarial loss d_{F_D} has been widely studied under the name maximum mean discrepancy (MMD) [18, 46]. When the RKHS kernel is translation-invariant, one can express F_D in the form H_{2,a}, where a is determined by the spectrum of the kernel, and so our analysis holds for MMD losses with translation-invariant kernels (see Example 6). To the best of our knowledge, minimax rates for density estimation under MMD loss have not been established in general; our analysis suggests that density estimation under an MMD loss is essentially equivalent to the problem of estimating kernel mean embeddings studied in [46], as both amount to density estimation while ignoring bias, and both typically have a parametric n^{−1/2} minimax rate.
Note that the related problems of estimating MMD itself, and of using it in statistical tests for homogeneity and dependence, have received extensive theoretical treatment [18, 35].

Wasserstein Distances: When F_D = W^{1,∞}(L) is the class of 1-Lipschitz functions, d_{F_D} is equivalent to the (order-1) Wasserstein (also called earth-mover's or Kantorovich-Rubinstein) distance. In this case, when F_G contains all Borel measurable distributions on X, minimax bounds have been established under very general conditions (essentially, when the sample space X is an arbitrary totally bounded metric space) in terms of covering numbers of X [50, 39, 23]. In the particular case that X is a bounded subset of R^d of full dimension (i.e., having non-empty interior, comparable to the case X = [0,1]^d that we study here), these results imply a minimax rate of M(F_D, F_G) ≍ n^{−min{1/2, 1/d}}, matching our rates. Notably, these upper bounds are derived using the empirical distribution, which cannot benefit from smoothness of the true distribution (see [50]). At the same time, it is not obvious how to generalize smoothing estimators to sample spaces that are not sufficiently nice subsets of R^d.

Sobolev IPMs: The closest work to the present is [26], which we believe was the first work to analyze how convergence rates jointly depend on (Sobolev) smoothness restrictions on both F_D and F_G. Specifically, for Sobolev spaces F_D = W^{s,p} and F_G = W^{t,q} with p, q ≥ 2 (compare our Example 4), they showed

    n^{−(s+t)/(2t+d)} ≲ M(W^{s,2}, W^{t,2}) ≲ n^{−(s+t)/(2(s+t)+d)}.   (2)

Our main results in Sections 4 and 5 improve on this in two main ways. First, our results generalize to, and are tight for, many spaces besides Sobolev spaces. Examples include when F_D is a reproducing kernel Hilbert space (RKHS) with translation-invariant kernel, or when F_G is the class of all Borel probability measures.
Our bounds also allow other (e.g., wavelet) estimators, whereas the bounds of [26] are for the (uniformly L^∞_λ-bounded) Fourier basis. Second, the lower and upper bounds in (2) diverge by a factor polynomial in n. We tighten the upper bound to match the lower bound, identifying, for the first time, minimax rates for many problems of this form (e.g., M(W^{s,2}, W^{t,2}) ≍ n^{−(s+t)/(2t+d)} in the Sobolev case above). Our analysis has several interesting implications:

1. When s > d/2, the convergence becomes parametric: M(W^{s,2}, F_G) ≍ n^{−1/2}, for any class of distributions F_G. This highlights that the loss d_{F_D} is quite weak for large s, and matches known minimax results for the Wasserstein case s = 1 [10, 39].

2. Our upper bounds, as in [26], are for smoothing estimators (namely, the orthogonal series estimator (3)). In contrast, previous analyses of Wasserstein loss focused on convergence of the (unsmoothed) empirical distribution P̂_E to the true distribution, which typically occurs at a rate of order n^{−1/d} + n^{−1/2}, where d is the intrinsic dimension of the support of P [10, 50, 39]. Moreover, if F_G includes all Borel probability measures, this rate is minimax optimal [39]. The loose upper bound of [26] left open the questions of whether (when s < d/2) a very small amount of smoothness (t ∈ (0, 2s²/(d − 2s)]) improves the minimax rate and, more importantly, whether smoothed estimators are outperformed by P̂_E in this regime. Our results imply that, for s < d/2, the minimax rate strictly improves with smoothness t, and that, as long as the support of P has full dimension, the smoothed estimator always converges faster than P̂_E. An important open problem is to simultaneously leverage settings where P is smooth and has support of low intrinsic dimension; many data (e.g., images) likely enjoy both these properties.

3.
[26] suggested over-smoothing the estimate (the smoothing parameter ζ discussed in Equation (3) below was set to ζ ≍ n^{1/(2(s+t)+d)}, compared to ζ ≍ n^{1/(2t+d)} in the case of L²_λ loss), and hence it was not clear how to design estimators that adapt to unknown smoothness under the losses d_{W^{s,p}}. We show that the optimal smoothing (ζ ≍ n^{1/(2t+d)}) under d_{W^{s,p}} loss is identical to that under L²_λ loss, and we use this to design an adaptive estimator (see Corollary 5).

4. Our bounds imply improved performance bounds for optimized GANs, discussed in Section 7.

4 Upper Bounds for Orthogonal Series Estimators

This section gives upper bounds on the adversarial risk of the following density estimator. For any finite set Z ⊆ Z, let P̂_Z be the truncated series estimate

    P̂_Z := Σ_{z∈Z} P̂_z φ_z,   where, for any z ∈ Z,   P̂_z := (1/n) Σ_{i=1}^n φ_z(X_i).   (3)

Z is a tuning parameter that typically corresponds to a smoothing parameter; for example, when B is the Fourier basis and Z = {z ∈ Z^d : ‖z‖_∞ ≤ ζ} for some ζ > 0, P̂_Z is equivalent to a kernel density estimator using a sinc product kernel K_h(x) = Π_{j=1}^d (2/h) sin(2πx_j/h)/(2πx_j/h) with bandwidth h = 1/ζ [34].

We now present our main upper bound on the minimax rate of density estimation under adversarial losses. The upper bound is achieved by the orthogonal series estimator in Equation (3), but we expect kernel and other standard linear density estimators to converge at the same rate.

Theorem 1 (Upper Bound). Suppose that µ(X) < ∞ and there exist constants L_D, L_G > 0 and real-valued nets {a_z}_{z∈Z}, {b_z}_{z∈Z} such that F_D = H_{p,a}(X, L_D) and F_G = H_{q,b}(X, L_G), where p, q ≥ 1. Let p' = p/(p−1) denote the Hölder conjugate of p.
Then, for any P ∈ F_G,

    E_{X_{1:n}} [ d_{F_D}(P, P̂_Z) ] ≤ (L_D c_{p'} / √n) ‖ { ‖φ_z‖_{L^∞_P} / a_z }_{z∈Z} ‖_{p'} + L_D L_G ‖ { 1/(a_z b_z) }_{z∈Z∖Z} ‖_{1/(1−1/p−1/q)}.   (4)

The two terms in the bound (4) demonstrate a bias-variance tradeoff, in which the first term (variance) increases with the truncation set Z and is typically independent of the class F_G of distributions, while the second term (bias) decreases with Z at a rate depending on the complexity of F_G.

Corollary 2 (Sufficient Conditions for Parametric Rate). Consider the setting of Theorem 1. If

    A := Σ_{z∈Z} ‖φ_z‖²_{L^∞_P} / a_z² < ∞   and   max{a_z, b_z} → ∞ whenever ‖z‖ → ∞,

then the minimax rate is parametric; specifically, M(F_D, F_G) ≤ L_D √(A/n). In particular, letting c_z := sup_{x∈X} |φ_z(x)| for each z ∈ Z, this occurs whenever Σ_{z∈Z} c_z²/a_z² < ∞. In many contexts (e.g., if P ≪ λ and λ ≪ P), the simpler condition Σ_{z∈Z} c_z²/a_z² < ∞ suffices.
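To make the estimator (3) concrete, here is a one-dimensional sketch with the Fourier basis φ_z(x) = e^{2πizx} on [0,1]; the target density, the sample size, and the cutoff ζ = 5 are illustrative choices, not prescriptions from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, zeta = 2000, 5

# Sample from a smooth density on [0,1]: p(x) = 1 + 0.5*cos(2*pi*x), via rejection.
def sample(n):
    out = []
    while len(out) < n:
        x = rng.uniform(0, 1, n)
        u = rng.uniform(0, 1.5, n)          # envelope: constant 1.5 dominates p
        out.extend(x[u < 1 + 0.5 * np.cos(2 * np.pi * x)])
    return np.array(out[:n])

X = sample(n)

# Truncated series estimate (Eq. (3)) with Z = {z : |z| <= zeta}.
def p_hat(x, X, zeta):
    est = np.zeros_like(x, dtype=complex)
    for z in range(-zeta, zeta + 1):
        coef = np.exp(-2j * np.pi * z * X).mean()   # empirical coefficient \hat P_z
        est += coef * np.exp(2j * np.pi * z * x)
    return est.real

grid = np.linspace(0, 1, 201)
vals = p_hat(grid, X, zeta)
```

Since the target density here is band-limited, the estimate should integrate to about 1 and track the true peak (1.5 at x = 0) and trough (0.5 at x = 1/2) up to sampling noise.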
The assumption max{ lim_{‖z‖→∞} a_z, lim_{‖z‖→∞} b_z } = ∞ is quite mild; for example, the Riemann-Lebesgue lemma and the assumption that F_D is bounded in L^∞_λ ⊆ L¹_λ together imply that this condition always holds if B is the Fourier basis. The first, and slightly weaker, condition in terms of ‖φ_z‖²_{L^∞_P} is useful when we restrict F_G; e.g., if B is the wavelet basis (defined in the Appendix) and F_G contains only discrete distributions supported on at most k points, then ‖φ_{i,j}‖²_{L^∞_P} = 0 for all but k values of j ∈ [2^i] at each resolution i ∈ N.

5 Minimax Lower Bound

In this section, we lower bound the minimax risk M(F_D, F_G) of distribution estimation under d_{F_D} loss over F_G, for the case when F_D = H_{p,a} and F_G := H_{q,b} are generalized ellipses. As we show in some examples in Section 6, our lower bound rate matches our upper bound rate in Theorem 1 for many spaces F_D and F_G of interest. Our lower bound also suggests that the assumptions in Corollary 2 are typically necessary to guarantee the parametric convergence rate n^{−1/2}.

Theorem 3 (Minimax Lower Bound). Fix X = [0,1]^d, and let p_0 denote the uniform density (with respect to Lebesgue measure) on X. Suppose {p_0} ∪ {φ_z}_{z∈Z} is an orthonormal basis in L²_µ, and {a_z}_{z∈Z} and {b_z}_{z∈Z} are two real-valued nets. Let L_D, L_G ≥ 0 and p, q ≥ 2.
For any Z ⊆ Z, let

    A_Z := |Z|^{1/2} sup_{z∈Z} a_z   and   B_Z := |Z|^{1/2} sup_{z∈Z} b_z.

Then, for F_D = H_{p,a}(L_D) and F_G := H_{q,b}(L_G), for any Z ⊆ Z satisfying

    B_Z ≥ 16 L_G √(n / log 2)   and   (2 L_G / B_Z) Σ_{z∈Z} ‖φ_z‖_{L^∞_µ} ≤ 1,   (5)

we have

    M(F_D, F_G) ≥ L_G L_D |Z| / (64 A_Z B_Z) = L_G L_D / ( 64 (sup_{z∈Z} a_z)(sup_{z∈Z} b_z) ).

As in most minimax lower bounds, our proof relies on constructing a finite set Ω_G of "worst-case" densities in F_G, lower bounding the distance d_{F_D} over Ω_G, and then letting elements of Ω_G shrink towards the uniform distribution p_0 at a rate such that the average information (here, Kullback-Leibler) divergence between each p ∈ Ω_G and p_0 does not grow with n. The first condition in (5) ensures that the information divergence between each p ∈ Ω_G and p_0 is sufficiently small, and typically results in a tuning of Z identical (in rate) to its optimal tuning in the upper bound (Theorem 1).

The second condition in (5) is needed to ensure that the "worst-case" densities we construct are everywhere non-negative. Hence, this condition is not needed for lower bounds in the Gaussian sequence model, as in Theorem 2.3 of [26]. However, failure of this condition (asymptotically) corresponds to the breakdown point of the asymptotic equivalence between the Gaussian sequence model and the density estimation model in the regime of very low smoothness (e.g., in the Sobolev setting, when t < d/2; see [9]), and so finer analysis is needed to establish lower bounds here.

6 Examples

In this section, we apply our bounds from Sections 4 and 5 to compute concrete minimax convergence rates for two example choices of F_D and F_G, namely Sobolev spaces and reproducing kernel Hilbert spaces.
Due to space constraints, we consider only the Fourier basis here; in the Appendix, we also discuss an estimator in the Sobolev case using the Haar wavelet basis.

For the purposes of this section, suppose that X = [0, 2π]^d, Z = Z^d, and, for each z ∈ Z, φ_z is the zth standard Fourier basis element, given by φ_z(x) = e^{i⟨z,x⟩} for all x ∈ X. In this case, we will always choose the truncation set Z to be of the form Z := {z ∈ Z : ‖z‖_∞ ≤ ζ} for some ζ > 0, so that |Z| ≍ ζ^d. Moreover, for every z ∈ Z, ‖φ_z‖_{L^∞_µ} = 1, and hence C_Z ≤ 1.

Example 4 (Sobolev Spaces). Suppose that, for some s, t ≥ 0, a_z = ‖z‖^s_∞ and b_z = ‖z‖^t_∞. Then, setting ζ = n^{1/(2t+d)} in Theorems 1 and 3 gives that there exist constants C > c > 0 such that

    c n^{−min{1/2, (s+t)/(2t+d)}} ≤ M(W^{s,2}, W^{t,2}) ≤ C n^{−min{1/2, (s+t)/(2t+d)}}.   (6)

Combining the observation that the s-Hölder space W^{s,∞} ⊆ W^{s,2} with the lower bound (over W^{s,∞}) in Theorem 3.1 of [26], we have that (6) also holds when W^{s,2} is replaced with W^{s,p} for any p ∈ [2, ∞] (e.g., in the case of the Wasserstein metric d_{W^{1,∞}}).

So far, we have assumed the smoothness t of the true distribution P is known, and used it to tune the parameter ζ of the estimator. However, in reality, t is not known. In the next result, we leverage the fact that the rate-optimal choice ζ = n^{1/(2t+d)} above does not rely on the loss parameter s, together with Theorem 1, to construct an adaptively minimax estimator, i.e., one that is minimax and fully data-dependent.
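A generic leave-one-out cross-validation construction of this kind can be sketched as follows (the paper's actual ζ̂ is given in its Appendix; the version below is a standard L²-loss CV scheme with illustrative choices, a real Fourier basis and Beta(2,2) data, rather than the paper's construction):

```python
import numpy as np

def basis(X, zeta):
    """Real orthonormal Fourier basis on [0,1]: 1, sqrt(2)cos(2*pi*k*x), sqrt(2)sin(2*pi*k*x)."""
    cols = [np.ones_like(X)]
    for k in range(1, zeta + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * X))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * X))
    return np.stack(cols)                        # shape (2*zeta + 1, n)

def loo_cv_score(X, zeta):
    """Leave-one-out CV estimate of L2 risk (up to the constant int p^2) for cutoff zeta."""
    Phi = basis(X, zeta)
    n = X.shape[0]
    c = Phi.mean(axis=1)                         # empirical coefficients \hat P_z
    c_loo = (n * c[:, None] - Phi) / (n - 1)     # leave-one-out coefficients
    fit = np.sum(c ** 2)                         # int \hat p^2, by Parseval
    cross = (c_loo * Phi).sum(axis=0).mean()     # (1/n) sum_i \hat p_{-i}(X_i)
    return fit - 2 * cross

rng = np.random.default_rng(2)
X = rng.beta(2, 2, size=1000)                    # smooth density 6x(1-x) on [0,1]
scores = {z: loo_cv_score(X, z) for z in range(0, 15)}
zeta_hat = min(scores, key=scores.get)           # data-driven cutoff
```

Because this data-driven choice never consults the loss parameter s, it is compatible with the observation above that the optimal ζ under d_{W^{s,p}} loss coincides with the optimal ζ under L² loss.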
There is a large literature on adaptive nonparametric density estimation under L²_µ loss; see [14] for an accessible high-level discussion and [16] for a technical but comprehensive review.

Corollary 5 (Adaptive Upper Bound for Sobolev Spaces). There exists an adaptive choice ζ̂ : X^n → N of the hyperparameter ζ (independent of s, t) such that, for any s, t ≥ 0, there exists a constant C > 0 (independent of n) such that

    sup_{P ∈ W^{t,2}} E_{X_{1:n} ∼ P} [ d_{W^{s,2}}(P, P̂_{Z_{ζ̂(X_{1:n})}}) ] ≤ C M(W^{s,2}, W^{t,2}).   (7)

Due to space constraints, we present the actual construction of the adaptive ζ̂ in the Appendix; in brief, it is a standard construction based on leave-one-out cross-validation under L²_µ loss, which is known (e.g., see Sections 7.2.1 and 7.5.2 of [28]) to be adaptively minimax under L²_µ loss. Using the fact that our upper bound (Theorem 1) uses a choice of ζ that is independent of the loss parameter s, we show that the d_{W^{s,∞}} risk of P̂_ζ can be factored into its L²_µ risk and a component (ζ^{−s}) that is independent of t. Since the L²_µ risk can be rate-minimized independently of t, it follows that the d_{W^{s,∞}} risk can be rate-minimized independently of t. Adaptive minimaxity then follows from Theorem 3.

Example 6 (Reproducing Kernel Hilbert Space/MMD Loss). Suppose H_k is a reproducing kernel Hilbert space (RKHS) with reproducing kernel k : X × X → R [4, 6].
If k is translation-invariant (i.e., there exists κ ∈ L²_µ such that, for all x, y ∈ X, k(x, y) = κ(x − y)), then Bochner's theorem (see, e.g., Theorem 6.6 of [51]) implies that, up to constant factors,

    H_k(L) := { f ∈ H_k : ‖f‖_{H_k} ≤ L } = { f ∈ H_k : Σ_{z∈Z} |f̃_z|² / |κ̃_z|² < L² }.

Thus, in the setting of Theorem 1, we have H_k = H_{2,a}, where a_z = 1/|κ̃_z| satisfies Σ_{z∈Z} a_z^{−2} = ‖κ‖²_{L²_µ} < ∞. Corollary 2 then gives M(H_k(L_D), F_G) ≤ L_D ‖κ‖_{L²_µ} n^{−1/2} for any class F_G. It is well known that MMD can always be estimated at the parametric rate n^{−1/2} [18]; however, to the best of our knowledge, only recently has it been shown that any probability distribution can be estimated at the rate n^{−1/2} under MMD loss [41], emphasizing the fact that MMD is a very weak metric. This has important implications for applications such as two-sample testing [35].

7 Consequences for Generative Adversarial Neural Networks (GANs)

This section discusses implications of our minimax bounds for GANs. Neural networks in this section are assumed to be fully connected, with rectified linear unit (ReLU) activations. [26] used their upper bound result (2) to prove a similar theorem, but, since their upper bound was loose, the resulting theorem was also loose. The following results are immediate consequences of our improvement (Theorem 1) over the upper bound (2) of [26], and so we refer to that paper for the proof.
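The weakness of the MMD metric discussed in Example 6 is easy to probe empirically with the standard unbiased U-statistic estimator of squared MMD; the Gaussian kernel, bandwidth, and sample sizes below are illustrative choices, not tied to the paper's assumptions:

```python
import numpy as np

def mmd2_unbiased(X, Y, bw=0.2):
    """Unbiased estimate of squared MMD with a Gaussian (translation-invariant)
    kernel k(x, y) = exp(-(x - y)^2 / (2 bw^2)); 1-d sketch."""
    def gram(A, B):
        return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * bw ** 2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))   # off-diagonal mean
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

rng = np.random.default_rng(3)
same = mmd2_unbiased(rng.uniform(size=500), rng.uniform(size=500))   # ~ 0 at O(1/n) scale
diff = mmd2_unbiased(rng.uniform(size=500), rng.beta(5, 1, size=500))
```

With matched distributions the unbiased estimate fluctuates around zero at the O(1/n) scale, consistent with the parametric n^{−1/2} behavior of MMD itself, while clearly different distributions give a macroscopic value.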
Key ingredients are an oracle inequality proven in [26], an upper bound such as Theorem 1, and the bounds of [52] on the size of a neural network needed to approximate functions in a Sobolev class.

In the following, F_D denotes the set of functions that can be encoded by the discriminator network and F_G denotes the set of distributions that can be encoded by the generator network. P_n := (1/n) Σ_{i=1}^n 1_{{X_i}} denotes the empirical distribution of the observed data X_{1:n} ∼ P.

Theorem 7 (Improvement of Theorem 3.1 in Liang [26]). Let s, t > 0, and fix a desired approximation accuracy ε > 0. Then, there exists a GAN architecture, in which

1. the discriminator F_D has at most O(log(1/ε)) layers and O(ε^{−d/s} log(1/ε)) parameters,
2. and the generator F_G has at most O(log(1/ε)) layers and O(ε^{−d/t} log(1/ε)) parameters,

such that, if P̂*(X_{1:n}) := argmin_{P̂ ∈ F_G} d_{F_D}(P_n, P̂) is the optimized GAN estimate of P, then

    sup_{P ∈ W^{t,2}} E_{X_{1:n}} [ d_{W^{s,2}}(P, P̂*(X_{1:n})) ] ≤ C ( ε + n^{−min{1/2, (s+t)/(2t+d)}} ).

The discriminator and generator in the above theorem can be implemented as described in [52]. The assumption that the GAN is perfectly optimized may be strong; see [32, 27] for discussion of this. Though we do not present this result due to space constraints, we can similarly improve the upper bound of [26] (their Theorem 3.2) for very deep neural networks, further improving on the previous state-of-the-art bounds of [2] (which did not leverage smoothness assumptions on P).

8 Minimax Comparison of Explicit and Implicit Generative Models

In this section, we draw formal connections between our work on density estimation (explicit generative modeling) and the problem of implicit generative modeling under an appropriate measure of risk.
In the sequel, we fix a class $\mathcal{F}_G$ of probability measures on a sample space $\mathcal{X}$ and a loss function $\ell : \mathcal{F}_G \times \mathcal{F}_G \to [0, \infty]$ measuring the distance of an estimate $\hat{P}$ from the true distribution $P$. $\ell$ need not be an adversarial loss $d_{\mathcal{F}_D}$, but our discussion does apply to all $\ell$ of this form.

8.1 A Minimax Framework for Implicit Generative Models

Thus far, we have analyzed the minimax risk of density estimation, namely

$$M_D(\mathcal{F}_G, \ell, n) = \inf_{\hat{P}} \sup_{P \in \mathcal{F}_G} R_D(P, \hat{P}), \quad \text{where} \quad R_D(P, \hat{P}) = \mathbb{E}_{X_{1:n} \overset{IID}{\sim} P}\left[\ell(P, \hat{P}(X_{1:n}))\right] \quad (8)$$

denotes the density estimation risk of $\hat{P}$ at $P$ and the infimum is taken over all estimators (i.e., (potentially randomized) functions $\hat{P} : \mathcal{X}^n \to \mathcal{F}_G$). Whereas density estimation is a classical statistical problem to which we have already contributed novel results, our motivations for studying this problem arose from a desire to better understand recent work on implicit generative modeling. Implicit generative models, such as GANs [3, 17] and VAEs [21, 37], address the problem of sampling, in which we seek to construct a generator that produces novel samples from the distribution $P$ [29]. In our context, a generator is a function $\hat{X} : \mathcal{X}^n \times \mathcal{Z} \to \mathcal{X}$ that takes in $n$ IID samples $X_{1:n} \sim P$ and a source of randomness (a.k.a. latent variable) $Z \sim Q_Z$ with known distribution $Q_Z$ (independent of $X_{1:n}$) on a space $\mathcal{Z}$, and returns a novel sample $\hat{X}(X_{1:n}, Z) \in \mathcal{X}$.

Evaluating the performance of implicit generative models, both in theory and in practice, is difficult, with solutions continuing to be proposed [44], some of which have proven controversial. Some of this controversy stems from the fact that many of the most straightforward evaluation objectives are optimized by a trivial generator that 'memorizes' the training data (e.g., $\hat{X}(X_{1:n}, Z) = X_Z$, where $Z$ is uniformly distributed on $[n]$). One objective that can avoid this problem is as follows. For simplicity, fix the distribution $Q_Z$ of the latent random variable $Z \sim Q_Z$ (e.g., $Q_Z = \mathcal{N}(0, I)$). For a fixed training set $X_{1:n} \overset{IID}{\sim} P$ and latent distribution $Z \sim Q_Z$, we define the implicit distribution of a generator $\hat{X}$ as the conditional distribution $P_{\hat{X}(X_{1:n}, Z)|X_{1:n}}$ over $\mathcal{X}$ of the random variable $\hat{X}(X_{1:n}, Z)$ given the training data. Then, for any $P \in \mathcal{F}_G$, we define the implicit risk of $\hat{X}$ at $P$ by

$$R_I(P, \hat{X}) := \mathbb{E}_{X_{1:n} \sim P}\left[\ell\left(P, P_{\hat{X}(X_{1:n}, Z)|X_{1:n}}\right)\right].$$

We can then study the minimax risk of sampling, $M_I(\mathcal{F}_G, \ell, n) := \inf_{\hat{X}} \sup_{P \in \mathcal{F}_G} R_I(P, \hat{X})$. A few remarks about $M_I(\mathcal{F}_G, \ell, n)$: First, we implicitly assumed $\ell(P, P_{\hat{X}(X_{1:n}, Z)|X_{1:n}})$ is well-defined, which is not obvious unless $P_{\hat{X}(X_{1:n}, Z)|X_{1:n}} \in \mathcal{F}_G$. We discuss this assumption further below. Second, since the risk $R_I(P, \hat{X})$ depends on the unknown true distribution $P$, we cannot calculate it in practice. Third, for the same reason (because $R_I(P, \hat{X})$ depends directly on $P$ rather than on the particular data $X_{1:n}$), it can detect lack-of-diversity issues such as mode collapse.
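Since $R_I$ depends on the unknown $P$, practical evaluation typically substitutes held-out real samples and a computable adversarial loss such as MMD [18, 44]. Below is a minimal sketch (ours, not the paper's): a biased V-statistic estimate of squared MMD with a Gaussian kernel, comparing real samples against generator output; the toy 'generators' and all names are illustrative assumptions.

```python
import numpy as np

def mmd2_biased(xs, ys, sigma=1.0):
    # Biased (V-statistic) estimate of squared MMD with a Gaussian kernel:
    # mean k(x, x') + mean k(y, y') - 2 mean k(x, y); always >= 0.
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))
    return k(xs, xs).mean() + k(ys, ys).mean() - 2.0 * k(xs, ys).mean()

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=500)          # held-out samples from P
fake_good = rng.normal(0.0, 1.0, size=500)     # generator matching P
fake_bad = np.full(500, real[0])               # mode-collapsed generator

good_score = mmd2_biased(real, fake_good)
bad_score = mmd2_biased(real, fake_bad)
```

The mode-collapsed generator scores far worse under MMD, illustrating the third remark above: losses that compare against a diverse reference sample penalize lack of diversity.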
As we discuss in the Appendix, these latter two points are distinctions from the recent work of [5] on generalization in GANs.

8.2 Comparison of Explicit and Implicit Generative Models

Algorithmically, sampling is a very distinct problem from density estimation; for example, many computationally efficient Monte Carlo samplers rely on the fact that a function proportional to the density of interest can be computed much more quickly than the exact (normalized) density function [11]. In this section, we show that, given unlimited computational resources, the problems of density estimation and sampling are equivalent in a minimax statistical sense. Since exactly minimax estimators ($\operatorname{argmin}_{\hat{P}} \sup_{P \in \mathcal{F}_G} R_D(P, \hat{P})$) often need not exist, the following weaker notion is useful for stating our results:

Definition 8 (Nearly Minimax Sequence). A sequence $\{\hat{P}_k\}_{k \in \mathbb{N}}$ of density estimators (resp., $\{\hat{X}_k\}_{k \in \mathbb{N}}$ of generators) is called nearly minimax over $\mathcal{F}_G$ if $\lim_{k \to \infty} \sup_{P \in \mathcal{F}_G} R_D(P, \hat{P}_k) = M_D(\mathcal{F}_G, \ell, n)$ (resp., $\lim_{k \to \infty} \sup_{P \in \mathcal{F}_G} R_I(P, \hat{X}_k) = M_I(\mathcal{F}_G, \ell, n)$).

The following theorem identifies sufficient conditions under which, in the statistical minimax framework described above, density estimation is no harder than sampling. The idea behind the proof is as follows: If we have a good sampler $\hat{X}$ (i.e., with $R_I(\hat{X})$ small), then we can draw $m$ 'fake' samples from $\hat{X}$.
We can use these 'fake' samples to construct a density estimate $\hat{P}$ of the implicit distribution of $\hat{X}$ such that, under the technical assumptions below, $R_D(\hat{P}) - R_I(\hat{X}) \to 0$ as $m \to \infty$.

Theorem 9 (Conditions under which Density Estimation is Statistically no Harder than Sampling). Let $\mathcal{F}_G$ be a family of probability distributions on a sample space $\mathcal{X}$. Suppose

(A1) $\ell : \mathcal{P} \times \mathcal{P} \to [0, \infty]$ is non-negative, and there exists $C_\triangle > 0$ such that, for all $P_1, P_2, P_3 \in \mathcal{F}_G$, $\ell(P_1, P_3) \leq C_\triangle (\ell(P_1, P_2) + \ell(P_2, P_3))$.
(A2) $M_D(\mathcal{F}_G, \ell, m) \to 0$ as $m \to \infty$.
(A3) For all $m \in \mathbb{N}$, we can draw $m$ IID samples $Z_1, \ldots, Z_m \overset{IID}{\sim} Q_Z$ of the latent variable $Z$.
(A4) There exists a nearly minimax sequence of samplers $\hat{X}_k : \mathcal{X}^n \times \mathcal{Z} \to \mathcal{X}$ such that, for each $k \in \mathbb{N}$, almost surely over $X_{1:n}$, $P_{\hat{X}_k(X_{1:n}, Z)|X_{1:n}} \in \mathcal{F}_G$.

Then, $M_D(\mathcal{F}_G, \ell, n) \leq C_\triangle M_I(\mathcal{F}_G, \ell, n)$.

Assumption (A1) is a generalization of the triangle inequality (and reduces to the triangle inequality when $C_\triangle = 1$). This weaker assumption applies, for example, when $\ell$ is the Jensen-Shannon divergence (with $C_\triangle = 2$) used in the original GAN formulation of [17], even though this divergence does not satisfy the triangle inequality [15]. Assumption (A2) is equivalent to the existence of a uniformly $\ell$-risk-consistent estimator over $\mathcal{F}_G$, a standard property of most distribution classes $\mathcal{F}_G$ over which density estimation is studied (e.g., our Theorem 1). Assumption (A3) is a natural design criterion of implicit generative models; usually, $Q_Z$ is a simple parametric distribution such as a standard normal. Finally, Assumption (A4) is the most mysterious, because, currently, little is known about the minimax theory of samplers when $\mathcal{F}_G$ is a large space.
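The proof strategy sketched above can be made concrete: push $m$ latent draws through the sampler (Assumption (A3)) and smooth the resulting 'fake' samples into an explicit density estimate. The following is a hypothetical illustration, not the paper's construction; the toy generator, the choice of Gaussian KDE (standing in for any consistent density estimator, per Assumption (A2)), the bandwidth, and the grid are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def generator(x_train, z):
    # Hypothetical trained sampler X_hat(X_{1:n}, Z): for illustration only,
    # it rescales the latent draw by the training sample's moments.
    return x_train.mean() + x_train.std() * z

# Step 1: draw m latent samples Z_1, ..., Z_m ~ Q_Z and push them
# through the sampler to obtain m 'fake' samples.
x_train = rng.normal(0.0, 1.0, size=200)
z = rng.normal(0.0, 1.0, size=2000)            # Q_Z = N(0, 1)
fake = generator(x_train, z)

# Step 2: smooth the fake samples into an explicit density estimate P_hat
# via a Gaussian kernel density estimator.
def kde(points, x, h=0.2):
    sq = (x[:, None] - points[None, :]) ** 2
    return np.exp(-sq / (2.0 * h ** 2)).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

grid = np.linspace(-5.0, 5.0, 1001)
density = kde(fake, grid)
mass = density.sum() * (grid[1] - grid[0])     # Riemann sum; close to 1
```

The resulting $\hat{P}$ inherits the sampler's accuracy up to the density estimator's error, which Assumption (A2) drives to zero as $m \to \infty$.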
On one hand, since $M_I(\mathcal{F}_G, \ell, n)$ is an infimum over $\hat{X}$, Theorem 9 continues to hold if we restrict the class of samplers (e.g., to those satisfying Assumption (A4) or those we can compute). On the other hand, even without restricting $\hat{X}$, this assumption may not be too restrictive, because nearly minimax samplers are necessarily close to $P \in \mathcal{F}_G$. For example, if $\mathcal{F}_G$ contains only smooth distributions but $\hat{X}$ is the trivial empirical sampler described above, then $\ell(P, P_{\hat{X}})$ should be large and $\hat{X}$ is unlikely to be minimax optimal.

Finally, in practice, we often do not know estimators that are nearly minimax for finite samples, but may have estimators that are rate-optimal (e.g., as given by Theorem 1), i.e., that satisfy

$$C := \limsup_{n \to \infty} \frac{\sup_{P \in \mathcal{F}_G} R_I(P, \hat{X})}{M_I(\mathcal{F}_G, \ell, n)} < \infty.$$

Under this weaker assumption, it is straightforward to modify our proof to conclude that

$$\limsup_{n \to \infty} \frac{M_D(\mathcal{F}_G, \ell, n)}{M_I(\mathcal{F}_G, \ell, n)} \leq C_\triangle C.$$

The converse result ($M_D(\mathcal{F}_G, \ell, n) \geq M_I(\mathcal{F}_G, \ell, n)$) is simple to prove in many cases, and is related to the well-studied problem of Monte Carlo sampling [38]; we discuss this briefly in the Appendix.

9 Conclusions

Given the recent popularity of implicit generative models in many applications, it is important to theoretically understand why these models appear to outperform classical methods for similar problems. This paper provided new minimax bounds for density estimation under adversarial losses, both with and without adaptivity to smoothness, and gave several applications, including both traditional statistical settings and perfectly optimized GANs.
We also gave simple conditions under which minimax bounds for density estimation imply bounds for the problem of implicit generative modeling, suggesting that sampling is typically not statistically easier than density estimation. Thus, for example, the strong curse of dimensionality that is known to afflict nonparametric density estimation [49] should also limit the performance of implicit generative models such as GANs. The Appendix describes several specific avenues for further investigation, including whether the curse of dimensionality can be avoided when data lie on a low-dimensional manifold.

Acknowledgments

This work was partly supported by NSF grant IIS1563887, the DARPA D3M program, AFRL FA8750-17-2-0212, and the NSF Graduate Research Fellowship DGE-1252522.

References

[1] Ehsan Abbasnejad, Javen Shi, and Anton van den Hengel. Deep Lipschitz networks and Dudley GANs, 2018. URL https://openreview.net/forum?id=rkw-jlb0W.

[2] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[4] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[5] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pages 224–232, 2017.

[6] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

[7] Gérard Biau and Luc Devroye. Lectures on the nearest neighbor method. Springer, 2015.

[8] Leon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab.
Geometrical insights for implicit generative modeling. arXiv preprint arXiv:1712.07822, 2017.

[9] Lawrence D Brown, Cun-Hui Zhang, et al. Asymptotic nonequivalence of nonparametric experiments when the smoothness index is 1/2. The Annals of Statistics, 26(1):279–287, 1998.

[10] Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492–2500, 2012.

[11] Siddhartha Chib and Edward Greenberg. Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327–335, 1995.

[12] Peter J Diggle and Richard J Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society, Series B (Methodological), pages 193–227, 1984.

[13] RM Dudley. Speeds of metric probability convergence. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 22(4):323–332, 1972.

[14] Sam Efromovich. Orthogonal series density estimation. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):467–476, 2010.

[15] Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860, 2003.

[16] Alexander Goldenshluger and Oleg Lepski. On adaptive minimax density estimation on R^d. Probability Theory and Related Fields, 159(3-4):479–543, 2014.

[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[18] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.
Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[20] Leonid Vasilevich Kantorovich and Gennady S Rubinstein. On a space of completely additive functions. Vestnik Leningrad. Univ., 13(7):52–59, 1958.

[21] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[22] Andrey Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., 4:83–91, 1933.

[23] Jing Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. arXiv preprint arXiv:1804.10556, 2018.

[24] Giovanni Leoni. A first course in Sobolev spaces. American Mathematical Soc., 2017.

[25] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2200–2210, 2017.

[26] Tengyuan Liang. How well can generative adversarial networks (GAN) learn densities: A nonparametric view. arXiv preprint arXiv:1712.08244, 2017.

[27] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.

[28] Pascal Massart. Concentration inequalities and model selection, volume 6. Springer, 2007.

[29] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[30] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. arXiv preprint arXiv:1711.04894, 2017.

[31] Alfred Müller. Integral probability metrics and their generating classes of functions.
Advances in Applied Probability, 29(2):429–443, 1997.

[32] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.

[33] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[34] Mark Owen. Practical signal processing. Cambridge University Press, 2007.

[35] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry A Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, pages 3571–3577, 2015.

[36] Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.

[37] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

[38] Christian P Robert. Monte Carlo methods. Wiley Online Library, 2004.

[39] Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855, 2018.

[40] Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2):279–281, 1948.

[41] Bharath Sriperumbudur et al. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.

[42] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. Non-parametric estimation of integral probability metrics. In 2010 IEEE International Symposium on Information Theory Proceedings (ISIT), pages 1428–1432.
IEEE, 2010.

[43] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert RG Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.

[44] D. Sutherland, H-Y Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR, 2017. URL https://arxiv.org/abs/1611.04488.

[45] Gábor J Székely, Maria L Rizzo, Nail K Bakirov, et al. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.

[46] Ilya Tolstikhin, Bharath K Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002–3048, 2017.

[47] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009.

[48] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[49] Larry Wasserman. All of Nonparametric Statistics. Springer Science & Business Media, 2006.

[50] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.

[51] Holger Wendland. Scattered data approximation, volume 17. Cambridge University Press, 2004.

[52] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks.
Neural Networks, 94:103–114, 2017.