NeurIPS 2020

Robust Density Estimation under Besov IPM Losses

Review 1

Summary and Contributions: The authors study the robustness of nonparametric density estimation under Besov integral probability metrics (IPMs) when the data come from a mixture with an unknown outlier distribution. To this end, they prove minimax convergence rates for a wavelet estimator.

Strengths: The authors provide a theoretically sound framework showing the convergence of wavelet estimators for (robust) density estimation under different kinds of contamination (structured and unstructured). As this is relevant for many applications and its relation to GANs is shown, I consider it interesting. Further, the paper is well written for the most part, and the prior work used for the different conclusions is referenced.

Weaknesses: There are some (minor) spelling mistakes, such as “theorem 5” (e.g., lines 183 and 297) or “of the form 7” (lines 295 and 306). Similarly, more detailed references to the appendix would be preferable. More details regarding the implications for real-world data would be beneficial. A test on synthetic data for different contaminations would be ideal. However, as the article is theoretical, this is not necessarily required.

Correctness: Yes, the paper seems to be technically sound.

Clarity: Yes, the paper is well written.

Relation to Prior Work: It is clearly discussed how their findings relate to recent work such as Uppal et al. [2019], Chen et al. [2016], and Liu and Gao [2017]. They also include a discussion on how their work relates to the earlier work of Kim and Scott [2012] and Vandermeulen and Scott [2013].

Reproducibility: Yes

Additional Feedback: Figure 1 requires some more explanation in the caption. I would prefer consistent writing of "form" and "equation". Also, many equations do not have numbers. It is not clear where the claim that a GAN with “ReLu activations can learn the distribution of the ...” is shown. This statement needs clarification. I would prefer citations of the final published work over arXiv versions, e.g., Uppal et al. [2019] vs. the publication at NIPS. Line 305: “estimator of 4.2”, what is 4.2 referring to? Section 5 is harder to understand than the other sections.

Review 2

Summary and Contributions: The paper discusses minimax risks for nonparametric density estimation with contaminated data under a large and common family of losses (integral probability metrics). The authors derive minimax rates and show that wavelet thresholding estimators are optimal.

Strengths:
- The paper is clearly relevant to the NeurIPS community, studying robustness of estimation under unknown contamination of the available data.
- In comparison with related works, the paper brings novel contributions in two directions: (i) finding optimal density estimators (with or without contamination) that depend only on the data and do not require any knowledge of the underlying distribution; (ii) extending the computation of minimax rates under contamination to wider classes of losses and contamination densities.
- Furthermore, implications of the presented results for ideal GANs are discussed.

Weaknesses: - Given my relative level of expertise, I did not find any serious weaknesses in the paper.

Correctness: - Given my relative level of expertise and time allowed, I did not check the proofs of the paper.

Clarity: - The paper is clear, including intuitions guiding the reader where possible.

Relation to Prior Work: - The relationship to prior work is clearly discussed in a dedicated section.

Reproducibility: Yes

Additional Feedback: Typo / form:
- It would be preferable to number all equations.
- Line 102: could the authors check the definition of the Haar wavelets scaled from the mother wavelet? There is no \lambda in the definition.
- Line 193: “the the”.
*******
I have read the other reviews and the authors' feedback; they confirm my positive assessment of the paper.

Review 3

Summary and Contributions: This paper studies convergence of minimax estimators for a large class of losses commonly used in machine learning (Besov integral probability metrics) under a model of contaminated data. The authors describe a data-dependent estimator (i.e., no information about the smoothness of the space is required), which they prove yields minimax optimal convergence rates. Finally, they describe the implications for GAN convergence.

Strengths: This paper has very strong theoretical results, several of which are genuinely surprising. First, the authors derive a minimax rate for the "unstructured" Huber contamination model, in which there is no smoothness restriction on the contaminating distribution (though it must be compactly supported). This rate, interestingly, is better than the rate achieved for density estimation at a point, a result that certainly merits more discussion in the paper; it would be nice to have a sense of why this is the case. The rates reveal some interesting conclusions as well. In particular, the authors observe that, asymptotically, it is primarily the boundedness of the contaminating distribution that matters, and that further smoothness assumptions do not improve the rate. This is a really nice result.
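For readers less familiar with the setting, the two objects referred to above can be written in their standard textbook form. This is a sketch only: the symbols \epsilon, f, g, and \mathcal{F} are notation chosen here and may differ from the paper's own.

```latex
% Huber contamination model: the sample is drawn from a mixture of the
% target density f and an outlier density g, with contamination level \epsilon.
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} (1 - \epsilon)\, f + \epsilon\, g,
\qquad \epsilon \in [0, 1].

% Integral probability metric (IPM) indexed by a function class \mathcal{F};
% a Besov IPM takes \mathcal{F} to be a ball in a Besov space.
d_{\mathcal{F}}(P, Q) = \sup_{h \in \mathcal{F}}
\left| \mathbb{E}_{X \sim P}[h(X)] - \mathbb{E}_{X \sim Q}[h(X)] \right|.
```

In the "unstructured" case discussed above, g is only assumed to be compactly supported, whereas the "structured" case additionally places a smoothness restriction on g.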

Weaknesses: Added after feedback: I do think that some numerical experiments on synthetic data could be illuminating, as indicated by the other reviewers. I maintain my rating of this very strong paper, but encourage the authors to consider designing experiments to probe the convergence rates. The authors could significantly improve the main-text discussions of the proofs. There is essentially no information about proof technique, whether or not the steps are standard, etc. Incorporating some description of this type would certainly make the paper more readable. Another issue I encountered was that I felt the description of the wavelet estimator (in particular how it differed from that of Donoho) was vague and not really addressed in the main text.
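As a starting point for the kind of synthetic experiment suggested here, a minimal sketch of drawing data from a Huber contamination mixture could look as follows. The function name, the Beta(2, 2) target density, and the uniform outlier density on [0.9, 1.0] are all illustrative assumptions, not choices made in the paper.

```python
import numpy as np


def sample_huber_contaminated(n, eps, rng):
    """Draw n points from the mixture (1 - eps) * f + eps * g on [0, 1].

    f is Beta(2, 2), a smooth stand-in for the target density, and g is
    uniform on [0.9, 1.0], a compactly supported stand-in for the outliers.
    Both are hypothetical choices made only for illustration.
    """
    is_outlier = rng.random(n) < eps          # Bernoulli(eps) outlier labels
    clean = rng.beta(2.0, 2.0, size=n)        # draws from f
    outliers = rng.uniform(0.9, 1.0, size=n)  # draws from g
    samples = np.where(is_outlier, outliers, clean)
    return samples, is_outlier


rng = np.random.default_rng(0)
x, flags = sample_huber_contaminated(10_000, 0.1, rng)
print(x.shape, flags.mean())  # sample size and empirical outlier fraction
```

Varying `n` and `eps` and measuring the error of an estimator fitted to `x` against the known density f would be one way to probe the convergence rates empirically.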

Correctness: Yes, the claims seem correct and the assumptions reasonable.

Clarity: This is a very well-presented paper. Clear, economical language and well-structured.

Relation to Prior Work: Yes, it's very well-contextualized within a somewhat narrow literature.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper considers density estimation under a family of loss functions called integral probability metrics, covering a wide range of smoothness conditions expressed through Besov spaces. The paper then details convergence rates for a purely data-dependent density estimator and nonparametric convergence guarantees for data that have been contaminated by random outliers.

Strengths: The paper reads very well. The authors first introduce a formal problem statement of the contamination density for both the unstructured and structured settings. Definitions of Besov spaces and linear estimators are then given. The authors discuss the related work well, which makes the contributions of this work clear. The related work section discusses the limitations and assumptions of current approaches in the literature. The authors provide theorems that establish the minimax rate for both unstructured and structured contamination.

Weaknesses: I like that the authors have given examples and practical implications of applying this approach to GANs. It would be great if the authors gave more examples or an empirical evaluation to demonstrate the capability. This would greatly improve the understanding of the theoretical contributions outlined in this paper.

Correctness: I do not have the technical expertise to comment in detail for this section. I have read the main submission, and the majority of the content appears to be correct to me.

Clarity: Overall, the paper is well structured and written. The structure of the paper is a bit unconventional in that it provides the technical background before highlighting the related work. But I quite like this, because it allows the authors to use mathematical notation in the related work section. The part I found most difficult to follow is Section 2.1, where I am unfamiliar with much of the mathematical notation. In the short timeframe I had to review this paper, I was unable to understand the meaning behind each piece of notation. That being said, the paper is out-of-area for me, so I am unable to appropriately assess whether this section is well written or not.

Relation to Prior Work: The authors have provided a detailed discussion of the prior work in both the introduction and Section 3 (Related Work). This work differs from previous contributions by providing convergence rates for purely data-dependent density estimation under Besov IPMs.

Reproducibility: Yes

Additional Feedback: Overall, the paper addresses a very important issue in density estimation when the data are contaminated by random outliers, which is encountered in many machine learning problems. The theoretical guarantees for the linear and non-linear convergence rates and the minimax bounds are extremely useful for designing robust machine learning models. Unfortunately, I do not have the technical expertise to comment on the correctness of this approach. But as an out-of-area reviewer, I can see that this paper is well motivated, well written, and well structured.
-------------------------------------------
Post-Author Feedback Comments:
-------------------------------------------
Thank you for your response. As mentioned in my original review, experimental results with synthetic data could strengthen the paper and make it easier to understand, but they are not critical. The authors have explained, using examples, potential applications of the theoretical results in Section 4.3, which seems good enough to me. Also, looking back, my original rating of 6 does seem a bit harsh for the minor criticism I have given. I have now increased it to 7.