NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4792
Title: Parameter elimination in particle Gibbs sampling

Reviewer 1

The marginalisation of variables within some steps of an MCMC algorithm is delicate. The main proposal here appears well justified, but it would have been nice to see the argument made a little more explicitly. The type of marginalisation described here seems to be more or less what would be described as a (partially) collapsed Gibbs sampler in the sense of [David A Van Dyk and Taeyoung Park. “Partially collapsed Gibbs samplers: Theory and methods”. In: Journal of the American Statistical Association 103.482 (2008), pp. 790–796] and I would have liked to see that connection made -- particularly as this is one of the arguments that can be used to justify the basic particle Gibbs sampler.
It was less clear to me exactly how the "blocking" strategy detailed in Section 4.1 would be justified from a formal perspective, and I do think that this needs clarifying. That is, the collection of variables to be sampled is divided into three parts (x', x~ and theta), and the decomposition of the kernel seems to involve sampling: (i) x~ from a kernel invariant with respect to its distribution conditional on both x' and theta (starting from the previous x~); (ii) x' from a kernel invariant with respect to its distribution conditional only upon x~ (starting from the previous x'); (iii) theta from its full conditional distribution. It is not completely transparent how one knows that this is invariant with respect to the correct joint distribution.
In the numerical study I would have liked to see some sort of indication of how algorithmic parameters (especially N) were specified, and some illustration of the dependence on N of the results obtained. I would have also liked to see some kind of explicit statement about computational cost, even if only in terms of wall-clock time, as there seem to be additional computations to do in the marginalized case: is the cost of running mPG the same as that of running PG for a given number of particles?
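One way to make the requested cost comparison concrete is to report effective sample size (ESS) per second of wall-clock time rather than autocorrelation alone. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: an AR(1) chain stands in for the output of a sampler, and the helper names (`ar1_chain`, `integrated_autocorr_time`, `ess`, `ess_per_second`) are hypothetical.

```python
import time
import numpy as np


def integrated_autocorr_time(chain, max_lag=None):
    """Estimate 1 + 2 * sum of autocorrelations, truncated at the first non-positive lag."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    var = np.dot(x, x) / n
    if var == 0.0:
        # A constant chain carries the information of a single draw.
        return float(n)
    max_lag = max_lag or n // 2
    tau = 1.0
    for lag in range(1, max_lag):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau


def ess(chain):
    """Effective sample size of a scalar chain."""
    return len(chain) / integrated_autocorr_time(chain)


def ess_per_second(sampler, n_iter, seed=0):
    """Run sampler(n_iter, rng) and return ESS divided by elapsed wall-clock seconds."""
    rng = np.random.default_rng(seed)
    start = time.perf_counter()
    chain = sampler(n_iter, rng)
    elapsed = time.perf_counter() - start
    return ess(chain) / elapsed


def ar1_chain(n_iter, rng, phi):
    """Stand-in for sampler output (assumption): stationary AR(1) with unit variance."""
    x = np.empty(n_iter)
    x[0] = rng.normal()
    noise = rng.normal(size=n_iter) * np.sqrt(1.0 - phi**2)
    for t in range(1, n_iter):
        x[t] = phi * x[t - 1] + noise[t]
    return x


if __name__ == "__main__":
    # Two hypothetical samplers with different mixing speeds.
    for phi in (0.1, 0.9):
        rate = ess_per_second(lambda n, r: ar1_chain(n, r, phi), 5000)
        print(f"phi={phi}: ESS per second = {rate:.1f}")
```

Repeating such a measurement over a grid of particle counts N, for each sampler under comparison, would give both the dependence on N and the cost-adjusted comparison asked for above.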
I was disappointed not to see PGAS featuring in the comparison because this is also known to dramatically reduce the autocorrelation in many settings; in particular it would have been useful to know (a) how the improvement arising from using mPG rather than PG relates to that obtained from using PGAS rather than PG and, especially, (b) whether, if one is already using PGAS, one observes a significant improvement by using mPGAS.
Details:
line 47: are particle Gibbs samplers really pseudomarginal algorithms? They don't seem obviously to fit the framework of [1], but certainly work as simple Gibbs samplers on a "demarginalisation" of the original target and indeed, wouldn't be expected to outperform the ideal algorithm.
lines 45-56: I find this discussion a bit misleading. It's possible to outperform what you call the "ideal" Gibbs sampler by implementing a sampler which draws iid samples from the posterior (sampling the state sequence from its marginal posterior and then the parameters from the full conditional distribution amounts to using one particular decomposition by which one could sample from the posterior distribution, at least in principle). It's this idealised algorithm which you seek to approximate with the methods described here, and presumably you would not expect to outperform that algorithm...
Section 3.1: It might help the reader if you explain what is meant by a "general SSM". In most of the literature the term is used to refer to Markovian models but that seems not to be what is intended here.
line 126: "se"
line 130: a conjugate prior is stated for computational convenience; it would be nice if any comment on the interpretation of this prior or its desirability for mode
Figure 4: it's not clear to me why a kernel density estimate is shown for one algorithm but a histogram is given for the other. This just seems to prevent direct comparison.
Note on 7, below: I understand that code will be provided should the manuscript be accepted.
-- I thank the authors for their response and, particularly, for the clarification of the justification of the blocking strategy. Together with the other referee reports this leaves me more strongly in favor of accepting the manuscript.
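For reference, the three-step blocking scheme queried in this review can be written out as below. This is only a restatement of the reviewer's description, not the authors' own formulation: the kernel names $K_1$ and $K_2$, the iteration index $(j)$, and the symbol $y$ for the observations are notation assumed here for illustration.

```latex
\begin{align*}
\tilde{x}^{(j)} &\sim K_1\bigl(\,\cdot \mid \tilde{x}^{(j-1)};\; {x'}^{(j-1)}, \theta^{(j-1)}\bigr),
  & K_1 &\text{ leaves } p(\tilde{x} \mid x', \theta, y) \text{ invariant},\\
{x'}^{(j)} &\sim K_2\bigl(\,\cdot \mid {x'}^{(j-1)};\; \tilde{x}^{(j)}\bigr),
  & K_2 &\text{ leaves } p(x' \mid \tilde{x}, y) \text{ invariant},\\
\theta^{(j)} &\sim p\bigl(\theta \mid {x'}^{(j)}, \tilde{x}^{(j)}, y\bigr).
\end{align*}
```

In the partially collapsed Gibbs framework of Van Dyk and Park the order of such steps matters. Here the step that marginalises out theta is followed immediately by a fresh draw of theta from its full conditional, before theta is conditioned on again in the next sweep.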

Reviewer 2

This is a nice paper, potentially giving very important improvements for certain models. Can the ideas also be implemented in a backwards-sampling implementation of the SMC algorithm, rather than ancestor sampling? Is there any understanding of how the degeneracy of the sufficient statistics (as reported in various previous attempts to marginalize out static parameters) affects the algorithm? Is it that this is essentially mitigated by the ancestor sampling? Having read the authors' response and the other reviews, I remain of the opinion that this is a nice contribution. There are always unanswered questions from good papers, especially when they are necessarily short, so it is not surprising that there are some here as well.

Reviewer 3

The paper introduces several novel variants of existing approaches that seem to be of high relevance to me. The description of the novel approaches is generally very clear. The paper is of high quality and I highly appreciate that the authors looked into shortcomings of their approach. It would be nice to extend the evaluation with more large models and on more datasets, which should be rather easy as the implementation is done in a PPL. Further, I would appreciate it if the authors would also compare the running times.
However, the paper sometimes lacks a good explanation of the math used. In particular:
(1) Before Equation 3 the authors state that the unnormalized target density can be factorized. However, the index k never appears in the equation. I suppose this is a typo.
(2) Equation 5 lacks sufficient detail to understand it well (at least in my case). What are h, theta, s, A and r? It would be good if the authors would improve this paragraph.
(3) In Equation 8 the weight tilde(w) is not introduced.
Further, the paper states several times that a certain property is illustrated in Figure 1. I was missing a proper explanation of what I should be able to see in Figure 1. I would suggest improving those parts.
And last but not least, the authors write that PGAS is not affected by path degeneracy (line 81). However, according to F. Lindsten, P. Bunch, S.S. Singh and T.B. Schön (2015), "Particle ancestor sampling for near-degenerate or intractable state transition models", this doesn't seem to be 100% correct.
---- I want to thank the authors for their response letter.