NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2832
Title:Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model

Reviewer 1

Originality: The main novel contribution of this paper is showing how to exploit the non-convergence of MCMC to learn interesting models. Related work is well cited, and it is clear how the paper differs from previous approaches such as contrastive divergence and moment-matching GANs. There are also interesting connections between energy-based dynamics and models used in neuroscience.
Quality: The authors have come up with an interesting strategy of embracing what was seen as an undesired property (non-convergent, non-persistent MCMC) and have built a complete and interesting work exploring the results of that strategy. The paper appears technically sound, and it should be possible to recreate the model and results from the descriptions in the paper (even setting aside the fact that the code is already provided). The paper also does a good job of exploring the connection to the related work. Minor correction: Line 136 - "a deterministic" -> "made deterministic".
Clarity: The paper is clearly written and understandable. A minor improvement for Section 4 would be to start by clearly stating that the FRAME model will be examined first, as stronger statements can be made in that restricted regime.
Significance: The significance of the paper is potentially quite high, as it may unlock deeper connections between EBMs, MCMC, and more general CNN approaches. The algorithm is also quite simple, yet it still achieves competitive results. It might also yield insights into why existing methods like CD work despite not having convergence in their MCMC chains. See also the contributions section.

Reviewer 2

The highlighted phenomenon (the convergence of a short-run MCMC while training EBMs) seems to be novel and very interesting. The conventional wisdom is that a simple MCMC algorithm like Langevin dynamics takes a long time to converge close to the stationary distribution of the EBM when initialized far from it. The paper argues that if the EBM is trained by generating negative samples from a short-run MCMC, then the short-run MCMC chain in fact converges close to the data distribution (the authors argue that the "closeness" is related to moment matching). The theoretical argument for this phenomenon is suggestive, but ultimately did not convince the reviewer: even convergence of the algorithm seems to be unexplained, and Section 4.2 seems particularly weak, since it is not clear what the "generalized moment matching objective" is trying to achieve. However, the empirical evidence for the convergence of short-run MCMC in EBMs seems very compelling: the training procedure for the model is significantly simpler than other procedures used to train EBMs, yet it produces highly competitive results on several image datasets. There is some evidence for coverage of the distribution, which is a concern for models not trained with the MLE objective (the evidence is the reconstructions of held-out data points). It would be great to see actual log-likelihood estimates under the short-run MCMC model (not the EBM, but the directed model that is K steps of Langevin dynamics), obtained by training a recognition network and computing a lower bound on the likelihood, similar to "Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing, Burda et al, 2014". The submission is written very clearly, and the code is provided and easy to read. The reviewer believes the work is significant, as it highlights a new phenomenon in training energy-based models not yet explored in the relevant literature.
While the theoretical explanation did not convince the reviewer, the reviewer believes that future work will attempt to explain the phenomenon of short-run MCMC convergence more rigorously.
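For readers unfamiliar with the sampler under discussion, the "directed model that is K steps of Langevin dynamics" can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the quadratic toy energy, the function names `grad_energy` and `short_run_langevin`, and the values of `n_steps` and `step_size` are all assumptions chosen so the example runs without a trained network.

```python
import numpy as np

def grad_energy(x, mu):
    """Gradient of a toy energy U(x) = 0.5 * ||x - mu||^2.

    Assumption: this stands in for the gradient of a learned energy function.
    """
    return x - mu

def short_run_langevin(x0, mu, n_steps, step_size, rng):
    """Run a fixed number of Langevin steps starting from x0.

    Each step moves down the energy gradient and adds Gaussian noise:
        x <- x - (s^2 / 2) * grad U(x) + s * eps,  eps ~ N(0, I)
    The chain is stopped after n_steps whether or not it has converged.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size**2 * grad_energy(x, mu) + step_size * noise
    return x

rng = np.random.default_rng(0)
mu = np.array([3.0, -2.0])                    # hypothetical mode of the toy energy
x0 = rng.uniform(-1.0, 1.0, size=(1000, 2))   # chains start from uniform noise
samples = short_run_langevin(x0, mu, n_steps=100, step_size=0.1, rng=rng)
```

With these illustrative settings the chains move toward the low-energy region but are stopped well before reaching stationarity, which is the regime the reviews refer to as non-convergent.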

Reviewer 3

The paper provides an interesting statistical justification of short-run MCMC. It views short-run MCMC as a generative model, with the initial image as the latent variable, uniform noise as the prior, and the Langevin dynamics as the network. This is an interesting formulation. I have three major comments. The first is on why short-run MCMC can reconstruct the observed images and interpolate between different images. Is it the "short-run" property that allows good reconstruction? Is it due to the choice of a fixed K? How does the theoretical argument in the paper help justify this? Second, the theoretical argument says that short-run MCMC is preferable because it does not need to have the EBM as its stationary distribution. However, the empirical results in Tables 2 and 3 still favor large K. Does this mean short-run MCMC still favors the stationary distribution? Third, short-run MCMC appears closely related to Monte Carlo EM, where it is common to run MCMC for a fixed number of steps in the E-step without reaching a stationary distribution. Do the theoretical results for Monte Carlo EM directly apply to short-run MCMC here?
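The generative-model view raised in this review can be illustrated with a small sketch in which the initial point z is the latent variable and a fixed number of gradient steps plays the role of the network. Several simplifications here are assumptions made purely for illustration: the noise term is dropped so the map from z to x is deterministic, a one-dimensional double-well toy energy replaces a learned one, interpolation in z is linear, and the names `short_run_generator` and `interpolate` are hypothetical.

```python
import numpy as np

def grad_energy(x):
    """Gradient of a toy double-well energy U(x) = (x^2 - 1)^2 / 4, minima at -1 and +1."""
    return x**3 - x

def short_run_generator(z, n_steps=100, step_size=0.3):
    """Map a latent starting point z to an output x with K noise-free gradient steps.

    Assumption: noise is omitted so the same z always yields the same x.
    """
    x = np.array(z, dtype=float)
    for _ in range(n_steps):
        x = x - 0.5 * step_size**2 * grad_energy(x)
    return x

def interpolate(z_a, z_b, n_points=5):
    """Linearly interpolate between two latent points; returns shape (n_points, dim)."""
    weights = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1.0 - weights) * z_a + weights * z_b

z_a = np.array([-0.8])
z_b = np.array([0.8])
path = interpolate(z_a, z_b)           # latent path between the two starting points
outputs = short_run_generator(path)    # each latent point is pushed through the same K steps
```

In this toy setting the two endpoints land near the two different energy minima, and the intermediate latent points trace a path between them. Whether the same picture explains reconstruction and interpolation for a learned image model is exactly the question posed above.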