Reviews: Thermostat-assisted continuously-tempered Hamiltonian Monte Carlo for Bayesian learning

This paper presents a sampling method that combines Hamiltonian Monte Carlo (HMC), mini-batches, tempering, and thermostats, to more efficiently explore multimodal target distributions. It is demonstrated on a number of substantial neural network problems using real data sets. This is an interesting method, and the empirical results are quite substantial. Figure 2 does a nice job of demonstrating how the omission of any of the ingredients (e.g. the tempering, or the thermostat) is detrimental to the overall result, which is a nice illustration of how the combination works together well. This is followed by some substantial image classification examples. I have one major, and one minor issue with the work, which perhaps the authors can address in their feedback: 1. It is not clear to me that the work contributes any new ideas beyond those already introduced in the following paper (cited as [5]): Ding et. al. (2014). Bayesian Sampling Using Stochastic Gradient Thermostats. Unless I am missing something, the above work introduces a sampler that combines the same components as listed above, and it, too, offers substantial empirical results. The authors might like to clarify what they see as the contribution of their work. 2. Given that these are Bayesian methods, I would have liked to have seen comparison of the posterior distributions or empirical convergence metrics, and not merely predictive performance, as given in Table 1. Presumably there is an "honest" predictive performance that is obtained by the exact posterior distribution for the model and data under study, and it could be obtained (hypothetically, at least) with any of the sampling methods (not the optimizations methods obviously) with a sufficient number of iterations. If a method has not converged sufficiently close to the posterior distribution, the predictive performance may be higher or lower than the honest value. It would not necessarily be lower. So these results are not really definitive as to which method is working best at obtaining the posterior distribution. To show that, it would be useful to see some conventional convergence diagnostics such as autocorrelation plots, effective sample size, etc. Or, if the argument is that the Bayesian approach offers a regularization against overfitting, rather than us caring about having a good estimate of the posterior distribution, some uncertainty range could be reported in Table 1 (posterior variance of accuracy, or 95% credibility over posterior samples, etc), to calibrate the significance of the increase in predictive performance for the method that is presented. (I presume the predictive performance reported is the mean accuracy over the posterior samples, given the comment at the bottom of page 7, "For the sampler baselines, the final accuracy is calculated from the Monte Carlo integration on all sampled models...") AFTER AUTHOR FEEDBACK I thank the authors for their detailed response. The authors have addressed my concerns regarding novelty. Having had a better look at Ding et al. now I am satisfied that the addition of continuous tempering is indeed novel, and have increased my score accordingly. The authors have suggested that they will look at some additional metrics, which I would also very much like to see.

Paper ID:	6787
Title:	Thermostat-assisted continuously-tempered Hamiltonian Monte Carlo for Bayesian learning

Reviewer 1

Reviewer 2

Reviewer 3