| Paper ID: | 7112 |
|---|---|
| Title: | Minimum Stein Discrepancy Estimators |

In this paper the authors present a kernel-based framework that uses Stein estimators to fit normalized densities. The framework is quite general and covers several existing estimators. They also prove nice theorems on the finite-sample and asymptotic behavior of the estimator, as well as on how it approximates other estimators. Finally, they present evidence that the framework estimates better than standard score matching on some nastier toy datasets (heavy-tailed and non-smooth). Overall the contribution seems very much worth publishing, although this topic is well outside my area of expertise, so I cannot be sure how it fits into the existing literature or how novel it is.

My main issue with the paper is that it is very technically dense, perhaps unnecessarily so. As someone outside the field, I felt that the mathematical exposition introducing the framework (Section 2.0) could be significantly decompressed, with some additional intuition and guidance provided for the reader. To compensate, one could omit, shorten, or make vaguer some of the content in Sections 2.1-3.2. For example, Theorem 1 would definitely benefit from hiding more of the details under the hood, i.e. in the supplement. It would be better if the paper were more readable and made clear which additional details can be found in the supplement for the very interested reader. If this were improved, I could see giving the paper a 7 or 8.

Originality: The paper builds on existing literature to provide a unified framework for Stein discrepancies based on a generalized diffusion operator. Although the proposed generalization is expected, it is clearly beneficial to have such a complete and rigorous treatment of Stein discrepancies in the context of learning.

Significance: The paper exhibits and characterizes cases in which score matching behaves undesirably compared to KSD. This contribution is very valuable to the community. The new generalization also provides a whole new class of discrepancies that are better suited to heavy-tailed and non-smooth distributions.

Quality: The proofs are sound and the experiments are simple but convincing, which makes for a complete and insightful paper.

Clarity: The structure is very clear and the paper is pleasant to read.

I have two small questions:
- In Section 4.2, it seems that m was picked to somehow match the expression of the parametric model. What guides this choice? Is there a general guideline for picking such a function so as to improve the convergence properties of the loss (e.g. to convexify it)?
- The theory predicts that DSM is more robust than SM. Could this be illustrated with a simple example?

================================

After reading the other reviews and the authors' response, I still think this is a good submission and should be accepted. It would be great to incorporate the explanations provided in the response into the final version of the paper.

Update: Thanks for your feedback and additional experimental results. I still suggest acceptance.

============

As far as I know, the work's focus on incorporating a flexible metric (e.g. from information geometry) into probability discrepancies seems original. I did not go through the proofs, so I cannot judge their validity or the difficulty of adapting existing results (e.g. [21]), but the results seem reasonable. The provided results also cover a wide range of considerations on the proposed discrepancies.

Issues:
* On generalizing the metric in discrepancies, I recommend that the authors discuss the relation between DKSD and the Riemannian kernel Stein discrepancy proposed in Liu & Zhu (2018; arXiv:1711.11216).
* In Eq. (2), what does it mean that m is a diffusion matrix? Does it have to be symmetric positive definite? Referring to the operator in Theorem 2 of [21], if m is taken as the sum of the covariance and stream coefficients, it could be any matrix. Does the considered operator have this generality?
* For DKSD and DSM to be discrepancies constructed by Stein's method, the corresponding operators need to satisfy Stein's identity. I noted that the identity for DKSD is provided in Line 120, but did not find the one for DSM.
* Why does the matrix-valued kernel considered in DKSD have to take one of the two forms? Is it possible to extend the results to a general kernel? The corresponding vector-valued RKHS theory is known, e.g., [C. Micchelli & M. Pontil, 2003, On Learning Vector-Valued Functions].
* Is it possible to relate the experiments more closely to the theoretical analysis? For example, analyse the failure of SM in estimating symmetric Bessel distributions, or compare DSM with SM to demonstrate the benefit.

The presentation of the paper is basically clear, and I appreciate the notation system. Here are some additional concerns.
* In Line 76, "Stein's identity" is referred to before its definition in Line 80.
* The abbreviation "SDE" in Line 90 is not explained.
* The authors could use a symbol other than m for the dimensionality of \theta.
* Possible typos:
  * Line 177: "probality" -> "probability"
  * Line 214: no subject in the sentence
  * Line 283: "his" -> "This"