NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2732 Poisson-Minibatching for Gibbs Sampling with Convergence Rate Guarantees

### Reviewer 1

Summary: This paper introduces Poisson auxiliary variables to facilitate minibatch sampling. The key insight is with the appropriate Poisson parameterization, the joint distribution (Eq. (1)) only depends on factors if they are in the minibatch. The authors apply this insight to discrete-state Gibbs sampling (Algorithm 2), Metropolis Hastings (Supplement), and continuous-state Gibbs sampling (Alg 3. and 5). The authors also develop spectral gap lower bounds for all proposed Gibbs sampling methods, which provides a rough guideline for choosing a tuning parameter $\lambda$ and comparing the (asymptotic) per iteration runtime of the methods (Table 1). Finally the authors evaluate the Gibbs methods on synthetic data, showing that their proposed method performs similarly to Gibbs while outperforming alternatives. Quality: The submission appears to be technically sound with only a few minor typos in the supplement. The computation speed-up claims are supported by the theoretical results (Theorems 1-5). For Theorem 5, I believe $\delta_k$ should be proportional to $exp(L + \delta_m)$ instead of $exp(L)$ as it requires an upperbound on $\tilde{U}$ rather than $U$. The computation costs in Table 1 seem to mix worst-case running time with average-case runtime (used to evaluate Poisson Gibbs); isn't the average-case per iteration runtime of plain Gibbs sampling O(D * mean degree) rather than O(D * max degree)? This is a minor complaint: in many applications (including the experiments of the paper), the mean degree = max degree = number of states. Originality: To the best of my knowledge, the idea to using Poisson-minibatching to reduce calculating factors is novel. The paper adequately cites previous work and distinguishes it contributions compared to previous related work of De Sa et al. (2018) in Table 1. Clarity: The paper is well written and clearly organized. The result on Poisson Minibatching for Metropolis-Hastings (although very interesting) seems a bit out of place, as it isn't introduced or addressed anywhere else in the main paper. In addition, there are a few areas that I suggest additional clarification: -Line 78: Algorithm 1 should be noted to only apply for discrete state spaces -As stated above, Table 1 and line 102 should be careful about whether computation cost per iteration is for the worst-case or average-case. -Line 112: "this joint distribution achieves the minibatch effect automatically", what is the minibatch effect? -Line 129: "At each iteration we will first re-sample all the $s_\phi$", but Algorithm 2 only samples $s_\phi$ for $\phi$ in $A[i]$. -Lines on Figure 1c are difficult to read printed. Significance: The Poisson Minibatching trick seems to be a useful contribution to the literature on scaling MCMC by subsampling factors. As demonstrated in the paper, there are many applications and extensions by combining this trick with existing sampling methods. The spectral gap convergences bounds are also nice and allow users to compare the trade-off between the different sampling schemes. Figure 1 of the experiments plots the performance of the proposed methods vs the number of Gibbs iterations. The computation speed-up claims would be better supported with the addition of identical plots with the x-axis changed to measured runtime. Typos: -Line 294: "the average needed steps to be accept is" -The last line in the equation immediately following line 373 should have $\exp(U_\nu(x))$ instead of $\exp(U(x))$ as the sum is over factors in $A[i]$ not all factors. -As a result, the next equation (immediately following line 374) appears to be missing a factor of $\exp(U(x)-U_\nu(x))$, but the result should still hold as this factor does not depend on $x[i]$. -In line 426, I believe it should read $m = \Theta(L)$ instead of $m = \Theta(L^2)$.

### Reviewer 2

* Originality: The work builds off of previous work in a simple, but non-trivial way. The simplicity of the idea is appealing. Related work is adequately cited. * Quality: The proposed methods appear to be technically sound, and the bounds on the rates of convergence are useful. However, the experiments do little to shed light properties that the authors claim to be important. How does the compute/wall-clock time of the Poisson--Gibbs method compare to vanilla Gibbs and other mini-batched methods? How many potential functions are being evaluated at each iteration? Are there problems on which vanilla Gibbs would be prohibitively expensive for which Poisson--Gibbs would be useful (e.g., very large N for a highly connected graph)? Are there problems on which Poisson--Gibbs might fail (e.g., poor initialization and/or parameter settings; or strong dependence between variables in the graph)? These types of experiments are more useful than the current experiments. * Clarity: Overall, the main text of the paper is clearly written: the development is easy to follow, and the text proposes the main ideas in a simple manner that is consistent with the simplicity of those ideas. The text in the Sections 3 and 4 could use some cleaning/tightening. The proofs of the convergence rate theorems could be presented in a more reader friendly way, and should be checked for errors (for example, there appears to be a typo going from the equation after line 374 to that after line 378). * Significance: The ideas in the paper appear to be of moderate significance: the models to which the methods are applicable are somewhat limited, and no new technical or theoretical methods were introduced. * A few detailed comments on typos/technical points: - I did not review all of the proofs of the convergence rate theorems in detail, though on line 379 of the Supplemental Material (proof of Thm. 1) the authors suppose that the $s_{\phi}$ are iid Poisson random variables. Can the authors please explain why this supposition is ok/justified for the purposes of establishing the necessary bound? - The maximum local energy is a maximum taken over the entire set of variables, so 'local' seems a misnomer. I think the relevant property (as opposed to the total maximum energy is that it's not aggregated. - There's a stray ')' on line 82. Thanks to the authors for their thoughtful feedback to the initial reviews. In particular, the experiments showing wall clock and factor evaluation comparisons enhance the original submission. I have updated my review and score accordingly.

### Reviewer 3

Update: I have gone over the comments from the other reviewers and the authors' response to the same. The wall-clock experiments provided by authors only reinforce my opinion that this paper should be clearly accepted. Originality: The paper seems to be a clever idea for speeding up minibatching methods, although at some level it seems natural to consider a Poisson auxilliary variable instead of the usual Bernoulli variables. Quality: The results in the paper are impressive considering usually methods based on minibatching has asymptotic biased. The authors here present clear results on the rates of convergence of the proposed methods, and provide reasonable simulations. Clarity: The article is fairly well written, and the concepts are clearly described. Significance: If minibatching is possible and useful in any problem, I would think that this particular article would be very pertinent to practitioners. However, it is unclear if this article (and any minibatching techniques) are universally useful for all problems.