NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 8670
Title: Projected Stein Variational Newton: A Fast and Scalable Bayesian Inference Method in High Dimensions

Reviewer 1


Convergence of existing Stein variational methods is known to suffer in high dimensions due to the locality of the kernel. The authors address this problem by exploiting the structure of the posterior distribution. Concretely, they propose to perform Stein gradient steps in a low-dimensional projection subspace. The basis of the projection subspace is derived from the expected Hessian of the log-likelihood, where the expectation is adaptively approximated by an empirical estimate. The introduced projection scheme and the corresponding Stein gradient steps are well motivated and presented. A theoretical analysis is presented to bound the bias introduced by the projection. Empirical experiments are performed to validate the effectiveness of the method for a linear and a non-linear inference problem.

Some remarks:
- The paper mentions a few important details without further discussing/studying them. For example, the authors assume that the update from the prior to the posterior in the subspace complementary to the projection subspace is negligible. However, there is no discussion or experiment on whether/when this assumption holds, nor on its impact on the introduced bias (i.e., Theorem 1). What if the approximation of the averaged Hessian is poor, e.g., because of a (too) uninformative prior? Similarly, Algorithm 2 (adaptive pSVN) is a bootstrapping version of pSVN which is presented with a motivating idea only; there is no further analysis or experiment.
- The presentation and theoretical analysis are restricted to Gaussian likelihoods, whereas the proposed method should be applicable to a broader class of densities (essentially all inference tasks with differentiable log-densities). The method could be presented for this more general case (keeping the analysis for the Gaussian likelihood).
- The notation and text do not clearly distinguish mappings from arguments; e.g., the Fréchet derivative and the preconditioner in Eq. 7 are mappings from R^d into R^d and R^(d x d), respectively; the gradient g (line 81) and the projection x^r (Eq. 13) are mappings, too. Similarly, mappings and operators are mixed; cf. Eqs. 5 and 6. It should also be noted that H_{mn} in Eq. 11 denotes a d x d matrix rather than a single entry of the matrix H. While for most parts of the text the meaning is clear from context, I would prefer more stringent notation.
- The experiments focus on a simple linear Gaussian inference problem and a non-linear inverse problem. It would be interesting to evaluate the method for other models such as hierarchical and mixture models. The method may also work for non-Gaussian likelihoods.
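As a reading aid, here is a minimal numpy sketch of how I understand the projection step: the expected Hessian of the negative log-likelihood is replaced by a particle average, its leading eigenvectors span the projection subspace, and a Stein-type update is then carried out in the r projected coordinates. This is my own simplification, not the authors' code; in particular, the kernel step below is a plain SVGD-style step with a fixed bandwidth rather than the Newton-preconditioned step of the paper, and all function names and defaults are hypothetical.

```python
import numpy as np

def hessian_subspace_basis(hessians, r):
    """Average per-particle Hessians of the negative log-likelihood and
    return the r leading eigenvectors as a d x r projection basis."""
    H_bar = np.mean(hessians, axis=0)          # empirical estimate of the expected Hessian
    eigvals, eigvecs = np.linalg.eigh(H_bar)   # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:r]        # indices of the r largest eigenvalues
    return eigvecs[:, idx]

def project_particles(X, Psi):
    """Coefficients of the N x d particle matrix X in the subspace span(Psi)."""
    return X @ Psi                             # N x r coefficient matrix

def stein_step_in_subspace(W, grad_logpost_w, step=1e-1, bw=1.0):
    """One plain SVGD-style step on the projected coordinates W (N x r);
    grad_logpost_w holds the projected log-posterior gradients (N x r)."""
    N = W.shape[0]
    diffs = W[:, None, :] - W[None, :, :]                    # pairwise w_i - w_j
    K = np.exp(-np.sum(diffs**2, axis=-1) / (2.0 * bw**2))   # Gaussian kernel matrix
    repulsion = np.sum(diffs * K[..., None], axis=1) / bw**2 # kernel-gradient (repulsive) term
    phi = (K @ grad_logpost_w + repulsion) / N
    return W + step * phi
```

The sketch only indicates where the empirical Hessian average and the low-dimensional update enter; the adaptive re-computation of the basis (Algorithm 2) and the treatment of the complementary subspace are exactly the points I would like to see discussed further.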

Reviewer 2


The author(s) propose a variational method, pSVN. Compared to its parent algorithms, SVGD and SVN, pSVN converges faster, has higher accuracy, and offers a complexity independent of parameter and sample dimensions. The authors do a very nice job of covering the basics of Stein variational methods before diving into the derivation and exposition of their algorithm. Overall, the theory is well presented and seemingly sound. I believe there is a small typo (an extra 'is') in lines 98-99. The experiments are well explained. Their results support the claimed advantages of accuracy, speed, and scalability that pSVN has over SVGD and SVN. I was impressed that the author(s) also experimentally characterized the speed gains from the parallelization available to their algorithm.

My only complaint with this paper is that the examples, particularly the nonlinear one, seem contrived and perhaps overly suited to the proposed method. The first example being linear makes sense, since the authors point out that the Hessian-based subspace is optimal for linear f (line 173). That said, from the conclusion it seems these kinds of broader applications are being actively worked on. Overall, I am quite impressed with the paper, but am left wondering about broader applicability given the limited experimental constructions.

Edit: I've read the response and the other reviews and look forward to this being accepted :)

Reviewer 3


The authors propose a projected Stein variational Newton (pSVN) method for high-dimensional Bayesian inference. They employ the Hessian of the log-posterior to explore the low-dimensional geometric structure of the posterior distribution and thereby address the curse of dimensionality. Experimental results on both linear and nonlinear synthetic data are presented. Overall, the paper is well organized and the presentation is clear. Detailed comments and suggestions are as follows.
- Using the Hessian to find the low-dimensional projection directions relies on a Gaussian assumption. Will this Gaussian assumption limit the applicability of the proposed method?
- In the nonlinear inference problem section, the dimension r of the Hessian-based subspace does not change as the sample dimension increases. Another possibility is that the data-generation model is too simple; would an experiment with a more complex generative model rule out this possibility?
- For these two examples, it may not be a bad idea to include MCMC methods as one of the baselines (or as ground truth); see the sketch after this review for the kind of baseline I have in mind.
Minor comment: I really like the right figure in Figure 3. This shows the potential of using the proposed method in real-world, large-scale problems.
I have read the authors' rebuttal and I will keep my original scores.
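For concreteness (purely as an illustration on my part, not something taken from the paper), even a baseline as simple as random-walk Metropolis, run long enough, would give a ground-truth reference in moderate dimensions; all names and defaults below are my own.

```python
import numpy as np

def random_walk_metropolis(log_post, x0, n_steps=50000, step=0.1, seed=0):
    """Minimal random-walk Metropolis sampler; log_post(x) returns the
    unnormalized log-posterior density at a parameter vector x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        proposal = x + step * rng.standard_normal(x.size)   # isotropic Gaussian proposal
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:            # Metropolis accept/reject
            x, lp = proposal, lp_prop
        samples[t] = x
    return samples
```

With sufficient steps (plus burn-in and thinning), the resulting samples would provide a reference against which the particle approximations of pSVN, SVN, and SVGD could be compared; for the higher-dimensional settings a gradient-based sampler would of course be preferable.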