Paper ID: | 5875 |
---|---|
Title: | Exponential Family Estimation via Adversarial Dynamics Embedding |

Update: I thank the authors for their responses and think they did a good job of addressing the concerns raised by the reviewers. I therefore keep my original score; I believe the paper makes a strong contribution to the estimation of exponential family models and should be accepted at NeurIPS.

-----------------------------------------------------------

Overall: This paper proposes a novel Adversarial Dynamics Embedding (ADE) that directly approximates the MLE of general exponential family distributions while achieving computational and statistical efficiency. It not only provides theoretical justification for the proposed dynamics embeddings, but also shows how other estimators can be framed as special cases of the proposed framework.

Quality: The overall approach proposed by the paper seems technically sound. The paper does a good job of comparing the proposed method to other estimators and shows how it overcomes their limitations. The results on both synthetic and real data (MNIST and CIFAR-10) show the flexibility of the proposed dynamics embeddings without introducing extra parameters.

Clarity: The paper is well written.

Originality: The proposed Adversarial Dynamics Embedding is entirely novel to my knowledge.

Significance: This paper makes a good contribution to algorithms and methods for estimating general exponential family distributions. It overcomes the shortcomings of existing estimators and can directly approximate the MLE while improving computational and statistical efficiency.

# Originality

This is an original approach to MLE for exponential families, based on a representation of the intractable log-partition function via Fenchel duality. Fenchel duality introduces a dual sampler that needs to be optimized, resulting in an "adversarial" max-min problem. The authors propose to estimate an augmented model inspired by Hamiltonian Monte Carlo, which allows them to design dual samplers generated by Hamiltonian dynamics in the original space or in an embedding space.

# Quality

The quality of the paper is quite good but could be improved. For example, references need to be fixed in the main paper and the appendix. Tests reporting MLEs for some common exponential families would be useful.

# Clarity

Most parts of the paper are clearly written. Personally, I think the fragmented style, which splits the text into remarks, disrupts the flow of the text.

# Significance

I think this is a very nice and flexible framework for MLE of exponential families that should result in powerful estimators and provide a starting point for further algorithmic developments. What about exponential families defined on discrete spaces?
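
For reference, the Fenchel-dual representation of the log-partition function mentioned under Originality can be written out as below. This is a sketch in generic notation, which may differ from the paper's: it assumes a potential $f$ on a domain $\Omega$ with a uniform base measure, a family $\mathcal{P}$ of densities on $\Omega$, and an empirical average $\widehat{\mathbb{E}}_{\mathcal{D}}$ over the data. All of these symbols are assumptions made for illustration.

```latex
\begin{align}
\log Z(f)
  &= \log \int_{\Omega} \exp\bigl(f(x)\bigr)\,dx
   = \max_{q \in \mathcal{P}} \; \mathbb{E}_{q}\bigl[f(x)\bigr] + H(q),
  \qquad H(q) = -\int_{\Omega} q(x) \log q(x)\,dx, \\
\max_{f} \; \widehat{\mathbb{E}}_{\mathcal{D}}\bigl[f(x)\bigr] - \log Z(f)
  &= \max_{f} \, \min_{q \in \mathcal{P}} \;
     \widehat{\mathbb{E}}_{\mathcal{D}}\bigl[f(x)\bigr]
     - \mathbb{E}_{q}\bigl[f(x)\bigr] - H(q).
\end{align}
```

The inner maximum in the first line is attained at $q^{*}(x) = \exp\bigl(f(x) - \log Z(f)\bigr)$, i.e. at the model density itself, which is why $q$ plays the role of a dual sampler in the max-min problem.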

This paper starts by introducing a max-min formulation of the MLE obtained from the Fenchel dual of the log-partition function. The authors focus on providing a better solution to the min part (i.e., the dual sampler). They first construct an augmented model by introducing an auxiliary momentum variable; the essence is to augment the model with an independent Gaussian variable, see (6). Under this construction, the min problem is equivalent to (8). They then run T HMC-based steps, (13), to solve it approximately, and show how the output approximates the density value and helps with the max problem. They compare the proposed scheme with SM, CD, etc. to show its superiority.

The paper is readable and of good quality overall. The experimental results are significant. However, I have a few concerns:

1. The significance of the paper needs to be stated more clearly. In particular, the max-min problem comes from plugging in the Fenchel dual of the log-partition function, which is standard. As for the augmented MLE, since the authors later use HMC to represent the dual sampler and HMC has a natural augmented interpretation, (6) seems to follow from the specific HMC scheme adopted rather than being an original contribution.
2. A finer analysis of Algorithm 1 is lacking. Specifically, could the SGD updates for the max problem degrade the precision of HMC, so that the number of inner iterations T needs to adapt to the outer iteration? The theorem only shows that HMC can approximate the exponential family well, which I think is standard; its combination with SGD for the max problem needs to be discussed as well.
3. For the experiments, instead of using a fixed number of steps, selecting it by cross-validation (CV) for all methods would be preferable. I don't see why T = 5 for ADE corresponds to T = 15 for CD. Also, the leapfrog step size setting is not mentioned in the main paper.
4. Minor: should it be log(Z(f)) in (7)?
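
To make the role of T and the leapfrog step size concrete, the augmented construction and the HMC steps described above can be illustrated with a generic leapfrog integrator. This is a minimal sketch, assuming an augmented density proportional to exp(f(x) - 0.5 * ||v||^2) and a fixed step size. It is a textbook integrator, not the paper's Algorithm 1 or its learned dynamics embedding, and the names `leapfrog`, `hamiltonian`, `step_size` and `num_steps` are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def hamiltonian(x: Vector, v: Vector, f: Callable[[Vector], float]) -> float:
    """Energy H(x, v) = -f(x) + 0.5 * ||v||^2 of the assumed augmented model."""
    return -f(x) + 0.5 * sum(vi * vi for vi in v)


def leapfrog(
    x: Vector,
    v: Vector,
    grad_f: Callable[[Vector], Vector],
    step_size: float,
    num_steps: int,
) -> Tuple[Vector, Vector]:
    """Run `num_steps` leapfrog steps (the T in the review) from (x, v).

    Assumption: the dynamics are dx/dt = v and dv/dt = grad f(x), which
    follow from H(x, v) = -f(x) + 0.5 * ||v||^2.
    """
    x, v = list(x), list(v)
    if num_steps <= 0:
        return x, v
    # Initial half step for the momentum.
    g = grad_f(x)
    v = [vi + 0.5 * step_size * gi for vi, gi in zip(v, g)]
    for step in range(num_steps):
        # Full step for the position.
        x = [xi + step_size * vi for xi, vi in zip(x, v)]
        g = grad_f(x)
        # Full momentum step between position updates, half step at the end.
        scale = step_size if step < num_steps - 1 else 0.5 * step_size
        v = [vi + scale * gi for vi, gi in zip(v, g)]
    return x, v


# Illustrative potential (an assumption): f(x) = -0.5 * ||x||^2,
# i.e. an unnormalized standard Gaussian.
def f_gauss(x: Vector) -> float:
    return -0.5 * sum(xi * xi for xi in x)


def grad_f_gauss(x: Vector) -> Vector:
    return [-xi for xi in x]
```

With a small step size the energy is nearly conserved along the trajectory, and the map is reversible (negating the momentum and integrating again returns to the start). Both properties depend on `step_size` and `num_steps`, which is why the reviewer asks for these settings to be reported and tuned consistently across methods.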