Review for NeurIPS paper: A Universal Approximation Theorem of Deep Neural Networks for Expressing Probability Distributions

NeurIPS 2020

A Universal Approximation Theorem of Deep Neural Networks for Expressing Probability Distributions

Review 1

Summary and Contributions: This paper examines the ability to approximate a distribution constructed by push-forward with a neural network. As generative models, flow models, etc. become more popular, it is very common to construct new distributions by extruding distributions with generators. This paper proves that when a generator is constructed with a neural network, a constant distribution can be approximated with a sufficiently large network and under arbitrary errors. Errors on the distribution were measured in Wasserstein, MMD, and KSD, and the rate of width and breadth required for the error sought was also revealed.

Strengths: This is an important result that neatly formulates a widely treated problem. The results are also good and general.

Weaknesses: I see this paper as a positive, but I have the following unclear points. Is it possible to describe the number of weights needed for the network for the approximation? Some important approximation capability papers investigates a relation btw a number of their weights and the approximation power. How does this affect them? Is it possible to give a similar rate if there is no density in the distribution? GANs, for example, generates a distribution in high-dimensional space by push-forward a low-dimensional noise distribution. Therefore, the generated distribution does not possess a natural density function. What would be the result in that case? Regarding citations, I felt that some relevant papers were missing. In the context of the flow model, there are similar studies that should discuss the differences. The last paper (Lin+ 2018) is the universality of ResNet, but since it is for an R^d to R^d mapping, it can be discussed as a generator. Kong, Z., & Chaudhuri, K. (2020). The Expressive Power of a Class of Normalizing Flow Models. arXiv preprint arXiv:2006.00392. Bailey, B., & Telgarsky, M. J. (2018). Size-noise tradeoffs in generative networks. In Advances in Neural Information Processing Systems (pp. 6489-6499). Lin, H., & Jegelka, S. (2018). Resnet with one-neuron hidden layers is a universal approximator. In Advances in neural information processing systems (pp. 6169-6178).

Correctness: Good enough, as far as I've checked.

Clarity: It's written clearly enough.

Relation to Prior Work: A few papers are missing references, but they are not fatal.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper proves a universal approximation theorem for probability distributions, instead of continuous functions. The authors build an intuitive proof based on first using empirical measures. The authors also show that these empirical measures converge to the correct target measures as more data is collected. These convergence rates are calculated for the 1-Wasserstein distance, MMD, and KSD.

Strengths: The math is very precise and well presented. The authors did an excellent job with presenting the complexities of their theory in a very clear way. Line 173 - The high level comparison between the authors’ work and the proof of the universal approximation theorem is very clear and intuitive. These types of theorems are usually hard to digest and this paragraph is very helpful for the reader to read over before jumping into the details. The structure of the paper is very easy to follow and clear. The authors present their results excellently.

Weaknesses: Line 62 - The authors’ third contribution doesn’t sound like a contribution, honestly. Constructing an explicit neural network is novel. Of course, it helps that the authors are not talking about arbitrary functions, which are hard to relate to neural networks. The authors present several results from other papers in their work. This is fine, but it seems like too much of this work is just referencing other papers. For example, the authors make it sound like Proposition 1 and 2 are trivial or known results. Moreover, they make it seem like Lines 248-274 are a review from other papers. This is not inherently bad because it helps the reader understand more about the theorems that the authors prove, but it reduces the novel contributions that the authors are able to present in an 8-page paper.

Correctness: Line 130 - I think that the authors want a strict inequality on Assumption K1 because they mention that the measures are non-zero.

Clarity: The paper is extremely clear and well written. Bravo. Below are a few suggestions on parts that I was confused on. There are a few small typos. Line 72 - network network should probably be neural network. Line 185 - rational should be rationale. Weird space on Line 218. Equation 4.2 - gamma(dxdy) should be gamma(dx,dy). Line 286 - linear should be affine. Line 113 - The equation for Wasserstein distance, MMD, and KSD all use inputs p and pi but Equation 2.2 uses p and q. Mathematically this is correct but is not as consistent as it could be and causes confusion at first glance because the notation switched. Line 133 - Is the maximum taken over all m, n > 0 so that m + n <= 1? If so, then I don’t understand why k needs to be twice differentiable, instead of just once differentiable. Because the two Equations in (2.5) only need k to be once differentiable. Theorem 4.1 - The authors did not define \mathcal P_2(R^d) yet. At first glance, this may seem like the space of degree 2 polynomials defined on R^d, so I looked in the appendix to find out they mean that mu has finite second moments. The authors should mention this somewhere in the main text for the reader. Moreover, by “Lebesgue density” I think that the authors mean that mu is an absolutely continuous measure, but it is not clear.

Relation to Prior Work: Prior work is thoroughly discussed and well presented.

Reproducibility: Yes

Additional Feedback: Overall, I really enjoyed this paper. The authors did an excellent job.

Review 3

Summary and Contributions: This paper proves that the gradients of deep ReLU networks can map a given probability to a sequence of probability density distributions converging to a target distribution measured by Wasserstein distance, MMD or KSD of a certain kernels. This is the first work obtaining a sufficiently general result on the approximation power of neural networks for probability distributions.

Strengths: The statement is clear and solid in theory. It extends the approximation theory of neural networks to the regime of probability distributions, and adds our knowledge on generative models.

Weaknesses: 1. This connection between the optimal transportation theory and generative models have been shown by previous literatures, for example Ref 30 given in the paper. The proofs in this work are simple applications of the well-known optimal transportation theory and the results are quite incremental and not surprising. 2. People will care more about the approximation capability of the models proven to be useful in practice. This paper only makes a statement on the gradient networks of deep ReLU networks. It is not clear how to train such networks in practice or how they are related to popular probability modeling methods like GAN, VAE or normalizing flow.

Correctness: I think the claims are correct, though I do not carefully check all the details of the proof.

Clarity: The paper is easy to read and statements are clear.

Relation to Prior Work: It describes its difference from previous theoretical work. But adding more words on the implication of this result for algorithms designed from the same optimal transport perspective will be better.

Reproducibility: Yes

Additional Feedback: