Review for NeurIPS paper: Stochastic Stein Discrepancies

NeurIPS 2020

Stochastic Stein Discrepancies

Review 1

Summary and Contributions: Update: I have read the rebuttal. I think the algorithmic contribution of this paper is limited, but I like the theoretical contribution more. Thus I incline to keep my score. ========= Stein discrepancy has been used in machine learning to measure distance between two distributions with a number of downstream applications such as in Bayesian inference. In practice, a stochastic version of the Stein discrepancy is typically used to derive algorithms for large datasets, without a formal theoretical justification. This paper fills this gap by proving theoretical justifications for the use of stochastic Stein discrepancy. The theory justify the reasonability of using SSDs for measuring convergence of two distributions and also for Bayesian inference. The main contribution of this paper lies in the theoretical side, as stochastic algorithms with Stein discrepancy have been widely used in practice.

Strengths: The theory is sound and seems correct. It proves the first guarantee for a practical use of SD.

Weaknesses: The results are not surprising, and have limited impacts in practice. The paper could be stronger if it can find out some distintive properties of SSD compared to SD.

Correctness: The claims, method and empirical results seem correct.

Clarity: The paper is written fairly well.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: I like the theoretical contribution of this paper, which fills the gap between theory and practical use of SD in previous literature. A downside of the theoretical results is that it is hard to see the advantage of SSD over SD, partly because the current theory is in terms of asymptotic convergence. I would recommend the authors the look at the non-asymptotic behavior of SSD with finite number of particles. I acknowledge that this could be challenging, as even the convergence of SD is hard to guarantee, except for a modified version by Zhang et al., 2020. There are some detailed comments: 1. It is better to define what "coercive" means in Proposition 5. 2. Theorem 6 seems to be limited to a special polynomial kernel. In practice, a Gaussian kernel is typically used. Does Theorem 6 applies to this case? 3. I don't understand Theorem 4. It is claimed before Theorem 4 that the theorem proves non-convergence property of SD in terms of if Q \not\to P, then S(...) \not\to 0. But Theorem 4 seems to prove S(...) \to 0. Am I missing something? Also, what is the point of introducing Theorem 4? It seems the theorem is about SD but not SSD. Zhang et al., stochastic particle-optimization sampling and the non-asymptotic convergence theory, 2020.

Review 2

Summary and Contributions: Post-rebuttal update: thank you for the response. I've updated my score given the updated proof. I still share the concern of R3 on the discrete case, and I agree with R1 that the results are to some extent unsurprising. ==== This work investigates a variant of Stein discrepancies where the target log density, log p, is replaced with its stochastic approximation. Such stochastic Stein discrepancies are shown to be able to detect commonly-used convergence and non-convergence properties. It also establishes convergence results for both SVGD and its stochastic variants under more realistic conditions, allowing the log density of the true posterior to have quadratic growth beyond a compact domain.

Strengths: The results are relevant since stochastic approximations of Stein discrepancies are widely used, yet previously the impact of stochastic approximation was not discussed. Moreover, improved analysis of SVGD might be of independent interest to the VI community.

Weaknesses: * In the context of posterior inference (i.e. p corresponds to a posterior), the stochastic Stein discrepancy investigated here is technically different from the common practice: here separate mini-batches of data points are drawn for each sample point, while in practice we usually use a fixed mini-batch for all sample points. The version discussed here is more difficult to implement. * I have concerns about the correctness of the proof; see below.

Correctness: I skimmed through Appendix A-D and didn't find any mistakes. Appendix E: where are the other two cases (A.2, A.3) discussed? Appendix F: The proof of Eq.(8) doesn't seem correct, specifically the last inequality below L512 doesn't follow from union bound.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Clarification of App.E-F would be most appreciated.

Review 3

Summary and Contributions: The Stein discrepancy (SD) serves as a measure between distributions for approximate inference while the exact computation of SD can be prohibitive. To mitigate the computation cost, the authors propose the stochastic Stein discrepancy that is easier to compute and serves as an estimate of the SD by subsampling the Stein operator. For a decomposable Stein operator, the coordinate computation for SSD involves only a subset of the Stein operator instead of the whole Stein operator as SD and therefore results in less computation cost. The authors also provide theoretical analysis showing that SSD has the same nice convergence properties as SD, that is, SSD can detect the convergence/non-convergence almost surely. Experiments show that SSD has similar performance to SD in various inference tasks.

Strengths: The SSD can be potentially useful for scaling up a bunch of SD-based inference methods since the decomposable Stein operators are widely used. It provides a tradeoff between efficiency and accuracy in estimating SD. Also, the SSD comes with convergence guarantees under some assumptions showing that the convergence properties of SSD are as nice as the corresponding SD.

Weaknesses: I found the word subsampling confusing because it usually refers to subsampling from the points. While the major computation bottleneck of SD is to compute the gradients of all $n$ points and this is especially true when the number of points is large, SSD still requires such computations over all $n$ points. What SSD reduces is the complexity term in the computation of each gradient, from $L$ the total number of base operators to $m$ the subset size of the base operators. I would suggest adding some comparison between subsampling points and subsampling the base operators to avoid confusion.

Correctness: Although theoretical analysis make sense to me, some missing definitions for nations make the paper less readable: line 111: p_{sigma_i} is not defined. It seems that this should be defined as a summation with index running over set sigma_i in similar way to the operator T_sigma. Algorithm 1: R is not defined. line 207: Q_n^m is not defined.

Clarity: The paper is generally well written. Still, there are some minor issues besides the missing definitions mentioned above. line 59: "discrete" is mentioned. However all the analysis on the paper are based on that the distribution being continuous. line 60: "will an employ" -> "will employ"

Relation to Prior Work: Previous work [1,2] are closely related where SD with coordinate-wise kernels is proposed to approximate SD. They subsampled base operators for each coordinate while SSD does so for each point. [1] Wang, D., Zeng, Z., and Liu, Q. Stein variational message passing for continuous graphical models.InInternational Conference on Machine Learning, pp. 5206–5214, 2018. [2] Zhuo, J., Liu, C., Shi, J., Zhu, J., Chen, N., and Zhang, B. Message passing Stein variational gradient descent. InInternational Conference on Machine Learning, pp. 6013–6022, 2018.11

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The authors study the theoretical convergence properties of stochastic Stein discrepancies based on subsampling proposed in other papers, and show that SSDs achieves the same convergence as standard SDs with probability 1. Moreover, they analyze the convergence properties for SVGD and its stochastic version by extending the results in [30] for SVGD from bounded domains to unbounded domains.

Strengths: The strengths of the work are: 1. rigorous analysis of the detect convergence and detect non-convergence of SSDs. 2. extension of the convergence analysis for SVGD in [30] to more general cases that hold for the commonly used kernels. 3. interesting and convincing demonstration by empirical examples.

Weaknesses: I do not see evident weaknesses of this work. The proof in the continuous case does not seem to be directly applicable to the discrete case.

Correctness: The convergence results, methods, and empirical methodology seem to be correct, though I did not check the proofs.

Clarity: The paper is well written and clear to follow.

Relation to Prior Work: The discussion of the difference between this work and prior contributions is clear.

Reproducibility: Yes

Additional Feedback: