__ Summary and Contributions__: ***** UPDATE *****
I realize I might have been harsh in my evaluation.
I believe the paper would have been more suited for a more theory oriented statistics conference / journal, but this is a recurrent problem in NeurIPS and I shouldn't have taken it out on the authors.
While their theoretical result is really interesting, I also didn't appreciate that the authors barely mentioned previous work on statistical learning bounds with optimal transport. There have been recent efforts on the topic by several teams, and they should at least acknowledge them.
However, if other reviewers took the time to thoroughly review the proof of the main result, I'm willing to increase my score.
******
The authors provide a thorough statistical study of the Wasserstein distance of an empirical measure to a given hypothesis class (i.e. a given set of functions).
Thanks to a dual formulation of the problem, for which strong duality holds, they derive convergence rates for that distance. In particular, in the compact case they get a central limit theorem, which allows them to do hypothesis testing.
Finally, they test their method on a toy two-sample problem.

__ Strengths__: The main strength of this paper is its thorough analysis of the distance it considers from a statistical standpoint. The distributional limit they obtain could be very useful for two-sample testing with the Wasserstein distance.

__ Weaknesses__: While the paper is very thorough on the theoretical side, I believe its main motivation (two-sample testing) is not clearly explained. In Section 4, where the authors use their theoretical results to perform two-sample tests, I did not understand how they proceed and how they choose to accept or reject the hypothesis.

__ Correctness__: I haven't checked the theoretical claims in detail. However, the methodology seems correct.

__ Clarity__: In my opinion the paper lacks clarity. While it exhibits impressive theoretical results, it does not do a good job of explaining how they can be used by the community. Specifically, I found paragraph 3 of the introduction very vague; the authors should focus on a simple example to make their point. Most importantly, Section 4 should be rewritten so that the reader can understand how to apply the theoretical results to hypothesis testing (see questions below for specific issues).

__ Relation to Prior Work__: I believe there should be a discussion of how this result relates to standard convergence results for the Wasserstein distance, which suffer from a curse of dimensionality. The authors should also mention other works that break the curse of dimensionality thanks to the cardinality of the space (Sommerfeld et al.) or to regularization (Genevay et al.).

__ Reproducibility__: No

__ Additional Feedback__: Questions:
- In practice, do you use Theorem 2 or Theorem 4 for the two-sample problem? How do you choose between them?
- How do you derive the rejection threshold from your theorems in Section 4?
- How do you compute the 95% quantile for Rn?
I encourage the authors to rewrite the application / empirical section, to clarify how to apply their theoretical results for hypothesis testing.
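To make the third question concrete: one generic way a 95% quantile of a two-sample statistic can be estimated is a pooled bootstrap under the null. This is only an illustrative sketch, not the authors' procedure; the mean-difference statistic standing in for Rn, the sample sizes, and the resampling scheme are all assumptions.

```python
import numpy as np

def bootstrap_quantile(x, y, statistic, level=0.95, n_boot=1000, seed=None):
    """Estimate the `level` quantile of a two-sample statistic by pooled
    bootstrap under the null hypothesis that x and y share a distribution.
    Illustrative only; not the procedure from the paper under review."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])  # pool the samples under the null
    stats = []
    for _ in range(n_boot):
        xb = rng.choice(pooled, size=len(x), replace=True)
        yb = rng.choice(pooled, size=len(y), replace=True)
        stats.append(statistic(xb, yb))
    return np.quantile(stats, level)

# Toy usage with a simple |mean difference| statistic in place of Rn:
x = np.random.default_rng(0).normal(0.0, 1.0, 200)
y = np.random.default_rng(1).normal(0.0, 1.0, 200)
threshold = bootstrap_quantile(x, y, lambda a, b: abs(a.mean() - b.mean()), seed=2)
# Reject the null when the observed statistic exceeds `threshold`.
```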

__ Summary and Contributions__: The paper studies computing the Wasserstein distance between a probability measure and a set of probability measures, where two distributions belong to the same set (are equivalent) if they have the same expectation over a set of specified transformations. The authors establish weak duality of this distance with a tractable expression involving a function transform, and strong duality under some constraints (compactness of the domain, and two other non-compact examples). There are many examples throughout the text specializing the results to particular settings, and some simulations illustrating the convergence rate results.

__ Strengths__: I find the contribution strong, as it addresses an important problem: the difficulty of approximating continuous measures with empirical distributions, mitigated here by limiting the classes of interest. The authors hypothesize and later show that limiting the projection to a set of probability measures reduces the number of samples required to learn approximate distributions. This seems very relevant to ML applications and makes intuitive sense too.
Furthermore, the authors connect their analysis with existing results, provide multiple illustrative examples, and some numerical validation results.

__ Weaknesses__: I did not find clear weaknesses in this submission; the contribution seems reasonable and complete.
The main remaining task may be to further extend the practical impact of the analysis, but this is something the authors address in the broader impact section and are interested in pursuing.
At the authors' discretion, in a revised version of the manuscript I would suggest briefly expanding the discussion of the rates of convergence compared to normal convergence, but this is a minor comment and the provided rates remain clear nonetheless.

__ Correctness__: The claims and method seem correct, although I did not verify the proofs.
The empirical methodology also seems correct and supports results and intuition.

__ Clarity__: Yes, the paper is clear and well written.

__ Relation to Prior Work__: Yes, relevant references are discussed, and comparisons with known results are present as well.

__ Reproducibility__: Yes

__ Additional Feedback__: A few minor typos:
- page 4 line 152: wavet --> wavelet
- page 7 line 245: othe --> of the
- page 7 line 246: "(...) technique for constructing improving dual functions (...)"
AFTER REBUTTAL:
I thank the authors for their responses. I will maintain my original review of strong accept.

__ Summary and Contributions__: This paper considers estimating a constrained Wasserstein distance obtained by restricting the test functions to a given class. Strong duality results are established for both compact and non-compact domains. Asymptotic properties of the constrained Wasserstein distance are developed for test function classes involving linear projections. Finally, preliminary numerical experiments are conducted on simulated mixtures of Gaussians.
Edit after rebuttal:
I am satisfied with the authors' responses and will maintain my score of strong accept.

__ Strengths__: (1) Both the strong duality and statistical convergence results are novel and generalize the Kantorovich duality in optimal transport and the duality for Wasserstein distributionally robust optimization. The theoretical developments are highly non-trivial and demonstrate the authors' mastery of techniques from infinite-dimensional optimization, statistical theory, and optimal transport theory. The derived results have an important impact and strengthen our understanding of both distributionally robust optimization and optimal transport.
(2) The authors pay close attention to the mathematical rigor of their derivations.
(3) Numerical experiments on synthetic datasets confirm the theoretical findings.

__ Weaknesses__: (1) The statistical convergence is established only for function classes involving a one-dimensional linear projection. The result is good but somewhat expected, which makes it less strong. Can you establish results for general function classes in terms of their complexity?
(2) The problem formulation (1) and its reformulation (3) remind me of the definition of the Integral Probability Metric (IPM), which is defined as the maximum difference of expectations of the test functions under the two compared distributions. It would be helpful to relate to and compare with IPMs in terms of their respective pros and cons regarding computational tractability, application scope, etc.
(3) Can you conduct more experiments on real datasets? The current numerical experiments only consider simulated data from a mixture of Gaussians. It would be useful to see how the theoretical results are applied to real-world data such as image datasets with a more complicated structure.
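For reference on point (2), the standard definition of the IPM for a class $\mathcal{F}$ of test functions is:

```latex
\mathrm{IPM}_{\mathcal{F}}(P, Q) \;=\; \sup_{f \in \mathcal{F}} \Big| \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \Big|
```

Taking $\mathcal{F}$ to be the class of 1-Lipschitz functions recovers the Wasserstein-1 distance, which is what makes the comparison with the paper's constrained formulation natural.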

__ Correctness__: Yes.

__ Clarity__: Yes. The paper is well organized and the results are clearly stated.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper studies the property of the Wasserstein distance of capturing the difference between distributions on a function class of interest. The duality result generalizes the Kantorovich–Rubinstein duality. The formulation also enjoys a parametric convergence rate, beating the curse of dimensionality.
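For context, the classical Kantorovich–Rubinstein duality that the paper's result generalizes expresses the Wasserstein-1 distance as a supremum over 1-Lipschitz test functions:

```latex
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{P}[f] - \mathbb{E}_{Q}[f]
```

The paper's duality presumably replaces the 1-Lipschitz ball with a general function class, which is what yields the parametric rate.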

__ Strengths__: As far as I checked, the theoretical claims in this paper are solid. I think the results are new and provide important insight into how and why the Wasserstein distance works.

__ Weaknesses__: I did not find any major limitation of this work. A few minor things: I am not sure the NeurIPS review process is well suited to checking such a theoretically heavy paper (I did not check all the proofs in time), and in the appendix some math equations run off the margin.

__ Correctness__: As far as I checked, the claims are correct. The experiments section also demonstrates the theoretical results.

__ Clarity__: I think the paper is clearly written, although if there is extra space I think it would be a good idea to add a conclusion section that discusses the results and concludes the work.

__ Relation to Prior Work__: I think this work does not really have previous work to be directly compared against, and related work is well discussed in Section 1.

__ Reproducibility__: Yes

__ Additional Feedback__: Minor typos:
p.2, line 49: "an hypothesis" -> "a hypothesis"
p.3, line 102: "an hypothesis" -> "a hypothesis"
p.7, line 245: "the proof othe results" -> "the proof of the results"
p.7, line 254: "[8, Theoerem 1]" -> "[8, Theorem 1]"
p.8, line 278: "each iterations" -> "each iteration"
p.8, line 281: "Guassians" -> "Gaussians"