Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
The paper presents an implicit variational inference method for likelihood-free inference. This approach builds on previous work and particularly on Hierarchical Variational Inference and Implicit Variational Bayes.
The key trick used in the paper is the subtraction of the log empirical distribution log q(x_n) and the transformation of the ELBO into the form given by eq. 4, which suggests the use of log density ratio estimation as a tool for likelihood-free variational inference. The remaining methodological details of the paper are based on standard tools, such as log density ratio estimation, reparametrization and hierarchical variational distributions.
While I found the trick to deal with likelihood intractability very interesting, it requires log density ratio estimation in high-dimensional spaces (in the joint space of data x_n and latent variable z_n). This is very challenging, since log density ratio estimation in high dimensions is an extremely difficult problem, and there is no clear evidence that the authors provide a stable algorithm to deal with it. For instance, the fact that the authors have not applied their method to a standard GAN (for generating high-dimensional data such as images) but have instead constructed this rather weird Bayesian GAN for classification (see page 7) indicates that the current algorithm is very unstable. In fact, it is hard to see how to stabilize the proposed algorithm, since initially the "variational joint" will be very different from the "real joint", and it is precisely this situation that makes log density ratio estimation completely unreliable, leading to very biased gradients in the crucial early iterations of the optimization.
The paper defines a class of probability models -- hierarchical
implicit models -- consisting of observations with associated 'local'
latent variables that are conditionally independent given a set of
'global' latent variables, and in which the observation likelihood is
not assumed to be tractable. It describes an approach for KL-based
variational inference in such 'likelihood-free' models, using a
GAN-style discriminator to estimate the log ratio between a
'variational joint' q(x, z), constructed using the empirical
distribution on observations, and the true model joint density. This
approach has the side benefit of supporting implicit variational
models ('variational programs'). Proof-of-concept applications are
demonstrated to ecological simulation, a Bayesian GAN, and sequence
modeling with a stochastic RNN.
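The central ingredient here, estimating a log density ratio with a discriminator, can be illustrated in one dimension. The following is a minimal sketch, not the paper's algorithm: it assumes two known Gaussians standing in for the two joints, a linear logit r(x) = w*x + b, and plain gradient descent on the logistic loss. The function name fit_log_ratio and all settings are hypothetical.

```python
import numpy as np

def fit_log_ratio(xp, xq, steps=2000, lr=0.5):
    """Fit r(x) = w*x + b by logistic regression (label 1 for xp, 0 for xq).

    With equally many samples from each side, the optimal logit of the
    classifier equals log p(x) - log q(x).
    """
    x = np.concatenate([xp, xq])
    y = np.concatenate([np.ones_like(xp), np.zeros_like(xq)])
    w, b = 0.0, 0.0
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-(w * x + b)))  # sigmoid of current logit
        w -= lr * np.mean((s - y) * x)          # gradient of mean log loss wrt w
        b -= lr * np.mean(s - y)                # gradient of mean log loss wrt b
    return w, b

# Assumed toy setup: p = N(1, 1), q = N(0, 1), so log p(x)/q(x) = x - 0.5.
rng = np.random.default_rng(0)
xp = rng.normal(1.0, 1.0, size=20000)
xq = rng.normal(0.0, 1.0, size=20000)
w, b = fit_log_ratio(xp, xq)
print(f"estimated log ratio: {w:.3f} * x + {b:.3f}")
```

For this toy pair the exact log ratio is x - 0.5, so the fitted (w, b) should land near (1, -0.5). Estimates of this kind can degrade when the two sample sets barely overlap.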
The exposition is very clear, well cited, and the technical machinery
is carefully explained. Although the application of density ratio
estimation to variational inference seems to be an idea 'in the air'
and building blocks of this paper have appeared elsewhere (for example
the Adversarial VB paper), I found this synthesis to be cleaner,
easier to follow, and more general (supporting implicit models) than
any of the similar papers I've read so far.
The definition of hierarchical implicit models is a useful point in
theoretical space, and serves to introduce the setup for inference
in section 3. However the factorization (1), which assumes iid
observations, is quite restrictive -- I don't believe it technically
even includes the Lotka-Volterra or stochastic RNN models explored in
the paper itself! (since both have temporal dependence). It seems
worth acknowledging that the inference approach in this paper is more
general, and perhaps discussing how it could be adapted to problems
and models with more structured (time series, text, graph)
observations and/or latents.
Experiments are probably the weakest point of this paper. The 'Bayesian GAN'
is a toy and the classification setup is artificial; supervised learning is
not why people care about GANs. The symbol generation RNN is not
evaluated against any other methods and it's not clear it works
particularly well. The Lotka-Volterra simulation is the most
compelling; although the model has few parameters and no latent
variables, it nicely motivates the notion of implicit models and shows
clear improvement on the (ABC) state of the art.
Overall there are no groundbreaking results, and much of this
machinery could be quite tricky to get working in practice (as with
vanilla GANs). I wish the experiments were more compelling.
But the approach seems general and powerful, with the
potential to open up entire new classes of models to effective
Bayesian inference, and the formulation in this paper will likely be
useful to many researchers as they begin to flesh it out. For that
reason I think this paper is a valuable contribution.
Misc comments and questions:
Lotka-Volterra model: I'm not sure the given eqns (ln 103) are
correct. Shouldn't the Beta_3 be added, not subtracted, to model the
predator birth rate? As written, dx_2/dt is always negative in
expectation which seems wrong. Also Beta_2 is serving double duty as
the predator *and* prey death rate, is this intentional? Most sources
(including the cited Papamakarios and Murray paper) seem to use four
line 118: "We described two classes of implicit models" but I only see
line 146: "log empirical log q(x_n)" is redundant
Suppose we have an implicit model, but want to use an explicit
variational approximation (for example the mean-field Gaussian in the
Lotka-Volterra experiment). Is there any natural way to exploit the
explicit variational density for faster inference?
Subtracting the constant log q(x) from the ELBO means the ratio
objective (4) no longer yields a lower bound to the true model
evidence; this should probably be noted somewhere. Is there a
principled interpretation of the quantity (4)? It is a lower bound on
log p(x)/q(x), which (waving hands) looks like an estimate of the
negative KL divergence between the model and empirical distribution --
maybe this is useful for model criticism?
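The hand-waving above can be made slightly more explicit. The following is a sketch under assumptions, not the paper's derivation: global latent variables are dropped, q(x) denotes the empirical distribution over the N observations, q(z_n | x_n) is the local variational factor, and log q(x_n) is treated as a finite constant, which is only a formal device for continuous data.

```latex
\begin{align*}
\log p(x_n)
  &\ge \mathbb{E}_{q(z_n \mid x_n)}\big[\log p(x_n, z_n) - \log q(z_n \mid x_n)\big]
  && \text{(standard ELBO)} \\
\log p(x_n) - \log q(x_n)
  &\ge \mathbb{E}_{q(z_n \mid x_n)}\!\left[\log \frac{p(x_n, z_n)}{q(x_n, z_n)}\right],
  && q(x_n, z_n) = q(x_n)\, q(z_n \mid x_n) \\
-\mathrm{KL}\big(q(x) \,\|\, p(x)\big)
  = \frac{1}{N} \sum_{n=1}^{N} \big[\log p(x_n) - \log q(x_n)\big]
  &\ge \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{q(z_n \mid x_n)}\!\left[\log \frac{p(x_n, z_n)}{q(x_n, z_n)}\right]
\end{align*}
```

Under these assumptions the ratio objective bounds -KL(q(x) || p(x)) from below, and the gap is the average KL divergence from q(z_n | x_n) to the model posterior p(z_n | x_n).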
Thank you for an interesting read.
This paper proposes a hierarchical probabilistic model using implicit distributions. To perform posterior inference, the authors also propose a variational method based on GAN-related density ratio estimation techniques. The proposed method is evaluated on a number of different tasks, including ABC, supervised learning and generative modeling.
I like the idea in general but I think there are a few points that need to be made clearer.
1. How is your method related to AVB and ALI? I can see these connections, but not all the readers you're targeting will see them easily.
2. The AVB paper mentions a crucial trick (adaptive contrast) for improving density ratio estimation in high dimensions. You only use a toy case (2D linear regression) to demonstrate the stability of your method, and your findings are essentially the same as in the toy example in the AVB paper (naive density ratio estimation works well in the low-dimensional case). It would be better if you could provide an analysis of a high-dimensional case, e.g. your BNN example.
3. Hinge loss: why is the optimal r the log ratio?
4. Generating discrete data: yes, using r(x, w) instead of r(x) could provide gradients; however, this means you need to input w to the discriminator network as well. Usually you need quite a deep (and wide) neural network to generate realistic data, so I presume w could be very high-dimensional. How scalable is your approach here?
5. I'm a bit worried about the lack of quantitative results for the sequence generation part. I don't think it is good practice to include only generated samples and let readers judge the fidelity.
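On point 3, a quick numerical check of the pointwise risk may help frame the question. This is a sketch under assumptions: it uses the textbook hinge loss max(0, 1 - m) and logistic loss log(1 + exp(-m)) on the margin m, which may differ from the exact variants in the paper, and eta stands for the class probability p(x) / (p(x) + q(x)) at a fixed x. The helper name pointwise_minimizer is hypothetical.

```python
import numpy as np

def pointwise_minimizer(loss, eta, grid):
    """Return the r on `grid` minimizing eta*loss(r) + (1 - eta)*loss(-r)."""
    risk = eta * loss(grid) + (1.0 - eta) * loss(-grid)
    return float(grid[np.argmin(risk)])

def hinge(m):
    return np.maximum(0.0, 1.0 - m)

def logistic(m):
    return np.log1p(np.exp(-m))

grid = np.linspace(-5.0, 5.0, 10001)  # candidate values of r, step 0.001
etas = [0.6, 0.9]
results = []
for eta in etas:
    r_hinge = pointwise_minimizer(hinge, eta, grid)
    r_logit = pointwise_minimizer(logistic, eta, grid)
    log_ratio = float(np.log(eta / (1.0 - eta)))
    results.append((r_hinge, r_logit, log_ratio))
    print(f"eta={eta}: hinge r*={r_hinge:.3f}, "
          f"logistic r*={r_logit:.3f}, log ratio={log_ratio:.3f}")
```

In this check the logistic minimizer tracks log(eta / (1 - eta)), while the plain hinge minimizer sits at +1 for any eta above 0.5, so a clarification of which hinge formulation recovers the log ratio would be useful.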
In summary I think this paper is borderline. I would be happy to see clarifications if the authors think I've missed some important points.