NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 322
Title: Importance Weighted Hierarchical Variational Inference

Reviewer 1

After feedback: I stand by my original review and think it's a worthy submission.

The authors propose a novel lower bound on the ELBO for variational approximations defined by auxiliary random variables. The quality and clarity of the paper are high and the method is interesting. I do, however, have a few comments:
- Lines 56-59: the discussion of maximizing the ELBO with respect to both sets of parameters was a bit unclear. Maximizing with respect to the variational parameters does minimize the KL, but since the KL also depends on the model parameters, it is not necessarily the case that you are maximizing the log marginal likelihood. Perhaps this was what the authors meant?
- Lines 62-65: the result on the connection between self-normalized importance sampling and the ELBO for a more flexible variational approximation was first proved in Naesseth et al. (2018): Naesseth et al., Variational Sequential Monte Carlo, AISTATS, 2018.
- Fig. 1: the legend in some of the plots obscures some information.
- Line 26: "on" -> "in"?
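The comment on lines 56-59 can be made concrete with the standard ELBO decomposition. The notation below (\theta for the model parameters, \phi for the variational parameters) is assumed for illustration and may differ from the paper's:

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]
  = \log p_\theta(x) - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)
```

For fixed \theta, the term \log p_\theta(x) is constant in \phi, so maximizing over \phi is equivalent to minimizing the KL. Both terms depend on \theta, however, so an increase of \mathcal{L} in \theta may come from a smaller KL rather than from a larger \log p_\theta(x).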

Reviewer 2

There are some minor issues that can be improved.

(1) The key idea of this work is not so clear. If my understanding is correct, the authors propose to use an importance weighted auxiliary variational distribution (see eq 1 & 4). I ignore the data x for notational simplicity in the following expressions. In Yin & Zhou 2018, a forward variational distribution is used, such as q(z|\psi_0) \prod_{i=0} q(\psi_i) (see eq 2). In this work, the authors would like to use a backward variational distribution such as q(z) \prod_{i=0} q(\psi_i|z) = q(z|\psi_0) q(\psi_0) \prod_{i=1} q(\psi_i|z). However, since computing q(\psi_i|z) is intractable, the authors propose to use \tau(\psi_i|z) to approximate q(\psi_i|z) (see eq 4). I think the backward approximation (in other words, the fact that \tau(\psi_i|z) depends on z) gives the improvement over SIVI. The key building block for using the importance weight is Thm 1 of Domke & Sheldon 2018, where Domke & Sheldon consider an auxiliary distribution for the model p, while in this work the authors apply a similar idea to the hierarchical variational model q, as in Yin & Zhou 2018 and Molchanov et al 2018.

(2) I think the proof of the Thm in Section 3 (C.1 in the appendix) is inspired by Molchanov et al 2018. If so, the authors should cite Molchanov et al 2018.

(3) The proof of Lemma C.2 in the appendix can be simplified. I think it is easy to show directly that \omega(\psi_{0:K}|z) is normalized. The proof sketch is given below. Let C = \sum_{k=0}^{K} q(\psi_k|z) / \tau(\psi_k|z). We have

\int \omega(\psi_{0:K}|z) d\psi_{0:K}
= (K+1) \int [q(\psi_0|z) / \tau(\psi_0|z)] \tau(\psi_{0:K}|z) / C d\psi_{0:K}
= \sum_{k=0}^{K} \int [q(\psi_k|z) / \tau(\psi_k|z)] \tau(\psi_{0:K}|z) / C d\psi_{0:K} (due to the symmetry of these K+1 integrals)
= \int [\sum_{k=0}^{K} q(\psi_k|z) / \tau(\psi_k|z)] \tau(\psi_{0:K}|z) / C d\psi_{0:K}
= \int C \tau(\psi_{0:K}|z) / C d\psi_{0:K}
= \int \tau(\psi_{0:K}|z) d\psi_{0:K}
= 1
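The normalization argument in (3) can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: it fixes z, assumes \tau(\psi_{0:K}|z) factorizes into K+1 independent copies of \tau(\psi|z), and uses hypothetical one-dimensional Gaussians for q(\psi|z) and \tau(\psi|z). Under those assumptions the integral of \omega equals the expectation of (K+1) w_0 / C under \tau, where w_k = q(\psi_k|z) / \tau(\psi_k|z), so a Monte Carlo average should be close to 1.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def estimate_omega_mass(K=4, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the total mass of omega(psi_{0:K} | z).

    Assumed toy densities for a fixed z (illustrative only):
      q(psi | z)   = N(1, 1)
      tau(psi | z) = N(0, 2^2)
    """
    rng = np.random.default_rng(seed)
    # Draw psi_0, ..., psi_K independently from tau for each Monte Carlo sample.
    psi = rng.normal(loc=0.0, scale=2.0, size=(n_samples, K + 1))
    # Importance ratios w_k = q(psi_k | z) / tau(psi_k | z).
    ratios = normal_pdf(psi, 1.0, 1.0) / normal_pdf(psi, 0.0, 2.0)
    # C = sum_k w_k, as defined in the proof sketch.
    C = ratios.sum(axis=1)
    # omega / tau = (K + 1) * w_0 / C, so its average under tau estimates the integral.
    return float(((K + 1) * ratios[:, 0] / C).mean())


if __name__ == "__main__":
    print(estimate_omega_mass())
```

Because the integrand is bounded between 0 and K+1, the estimate has finite variance for any choice of q and \tau with positive ratios, which is why a plain sample average suffices here.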

Reviewer 3

I have read the author response and have decided to raise my original score to a 6. Please do correct the erroneous statements regarding upper bounding the log marginal likelihood of the data, and also regarding the claim that maximizing the lower bound maximizes the log marginal likelihood of the data.

Originality: The proposed method builds marginally upon SIVI [1].

Clarity: Clarity is lacking. The abstract, for example, is obscure: what is the "marginal log-density" here? I believe it is the logarithm of q_{\phi}(z | x). Please make this clear at least once, because the term "marginal log-density" is used throughout the paper.

Quality: Quality is also lacking. For example, the related work section does not mention some related work (see [2] and [3] below). Furthermore, there is a very misleading statement on line 205 about the core contribution of the paper: "the core contribution of the paper is a novel upper bound on marginal log-likelihood". This is a false statement. The paper proposes an upper bound on the log density of the variational distribution, which leads to a lower bound (NOT an upper bound) on the log marginal likelihood of the data. Finally, the experiments don't include a comparison to UIVI [4], which also improves upon SIVI, and there is no report of computational time, which would put the test log-likelihood improvement into perspective.

Significance: I am not convinced the proposed method is significant.

Questions and Minor Comments:
1. Line 38 and elsewhere: it is "log marginal likelihood", not "marginal log-likelihood".
2. Lines 57-59 are not clear. What do you mean here?
3. What is the intuition behind increasing K during training? How was the schedule for K chosen?

[1] Semi-Implicit Variational Inference. Yin and Zhou, 2018.
[2] Nonparametric Variational Inference. Gershman et al., 2012.
[3] Hierarchical Implicit Models and Likelihood-Free Variational Inference. Tran et al., 2017.
[4] Unbiased Implicit Variational Inference. Ruiz and Titsias, 2019.
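The direction of the bound raised under Quality can be written out as a short sketch. Here U is a placeholder for any upper bound on the expected log-density of the variational distribution; it is an assumption for illustration, not the paper's specific bound:

```latex
\mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x)\big] \le U
\;\Longrightarrow\;
\log p(x)
  \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x, z)\big]
      - \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x)\big]
  \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x, z)\big] - U
```

The first inequality is the usual ELBO; the second follows because the upper-bounded term enters with a minus sign, so the result is a lower bound on the log marginal likelihood.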