NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5922
Title: Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting

Reviewer 1

Originality and Significance: In light of the submission with ID 6290, the method proposed in this paper is not novel. However, the applications are different: submission 6290 focuses on learning generative models that are not influenced by the bias in the data, whereas this submission focuses on reducing the mismatch between the trained generative model and the data distribution for Monte Carlo estimation at test time. Although the idea of importance weighting and estimating density ratios with a classifier is not new, I believe the applications in the two papers are potentially useful.

Quality and Clarity: The paper is well written. The related work section is good, but one line of related work is missing: the use of importance weighting in the context of variational inference. See for example [1] and many follow-up papers. The experiments section is nicely done.

Questions and Minor Comments:
1. How do you ensure that the generative distribution p_{\theta} has bigger support than p?
2. Is there a way to measure the bias introduced by p_{\theta}? What is this bias?
3. How did you deal with high variance in importance reweighting? What usually happens in importance weighting is that all the components of the importance weights collapse to 0 except for one component, which eats all the mass. This happens even with your proposed trick on line 135 (self-normalization), and I can imagine this would also happen for the solution on line 137 (flattening). Clipping (as proposed on line 141) might be promising in practice; however, you do lose the asymptotic unbiasedness of importance weighting in this case. Please clarify.
4. How did you choose the clipping threshold \beta?

[1] Importance-Weighted Autoencoders. Burda et al., 2015.
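
The weight-collapse concern in question 3 can be made concrete with a small diagnostic. The following is a minimal sketch, not the authors' implementation: it assumes self-normalization means dividing each weight by the batch sum and flattening means raising each weight to a power alpha in [0, 1], and it uses the standard effective sample size (sum w)^2 / sum w^2 as the collapse measure. The function names and the synthetic log-normal weights are hypothetical and chosen only for illustration. Clipping is left out because the review does not state its exact form.

```python
import numpy as np

def self_normalize(w):
    """Divide each weight by the batch sum so the weights sum to 1 (assumed form)."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def flatten(w, alpha):
    """Raise each weight to the power alpha in [0, 1] (assumed form of flattening)."""
    return np.asarray(w, dtype=float) ** alpha

def effective_sample_size(w):
    """(sum w)^2 / sum w^2: equals n for uniform weights, 1 if one weight holds all mass."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Synthetic heavy-tailed weights (illustrative only, not taken from the paper).
rng = np.random.default_rng(0)
raw = np.exp(rng.normal(0.0, 4.0, size=1000))

for alpha in (1.0, 0.5, 0.1):
    w = self_normalize(flatten(raw, alpha))
    print(f"alpha={alpha}: ESS={effective_sample_size(w):.1f}, max weight={w.max():.3f}")
```

Self-normalization rescales the weights but leaves the effective sample size unchanged, so it cannot by itself prevent a single weight from dominating. Lowering alpha raises the effective sample size at the cost of moving the weights away from the estimated density ratio.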

Reviewer 2

=Summary=
The paper proposes a method for correcting the bias in the outcomes of pretrained deep generative models. Given data from a generator distribution and the real distribution, the paper uses importance reweighting to up/down-weight the generated samples. The importance weights are computed using a probabilistic binary classifier that predicts the identity of the data distribution. Experiments on several tasks show that the importance reweighting improves the task performance.
----
=Originality= Medium. Importance weighting using binary classification is a well-known technique. However, its usage in pretrained generative models is interesting.
=Quality= Medium. The experimental section is well-written and the paper clearly points out potential drawbacks.
=Clarity= Low-Medium. The paper is in general easy to read. However, some design choices can be explained better. Please see detailed comments below.
=Significance= Medium. The proposed scheme can be a good addition to a deep generative model practitioner's toolkit. The framework has some stark limitations (e.g., the requirement regarding joint support of the real and generated data), as also pointed out in the paper. But it is still a useful addition to the growing literature on deep generative modeling research.
----
=Detailed comments and suggestions=
- Lines 135-143: It is not clear what the corrective measures proposed here do on an intuitive and theoretical level. For example, given very high and very low importance weights in a minibatch (corresponding to real and generated data), how does normalization help in terms of obtaining the "true" importance weights?
- Similarly, are there any guidelines on when to apply each of the above corrective measures and how the hyperparameters should be selected?
- Line 161: It is a bit surprising to see that the binary classifiers are calibrated by default, given that deep models are known to be very prone to miscalibration. How precisely is miscalibration measured (e.g., via expected calibration error as described in the reference above)?
- Line 168: For computing the reference scores in Table 1, how precisely is the real data split? Is it a 50-50 split?
- Lines 129, 304: The reviewer appreciates the fact that the paper is quite open about the potential limitations in the proposed methodology.
- One limitation that is worth mentioning is that the proposed method cannot be used to generate new unbiased samples.
- A question inspired by the closely related work of [45]: Ignoring the mode-dropping phenomenon of GANs (that is, assuming that the real and generated distributions have joint support), would the proposed importance weighting mechanism be obsolete if one were to train the GAN for a very long time?
-------
= Update after the rebuttal =
Most of my questions were addressed in the rebuttal. It is good to see the plot showing that the models are already well-calibrated. It would perhaps be helpful to add the plot (or at least the reference by Niculescu-Mizil and Caruana) in the final version so that the readers are aware of 1) the potential for miscalibration and 2) the fact that these models are in fact well-calibrated.
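
For readers unfamiliar with the calibration measure raised in the Line 161 comment, the following is a minimal sketch of a binned expected calibration error for a binary classifier. It is one common variant (comparing the mean predicted probability with the empirical positive rate in equal-width bins) and is not necessarily the measure used in the paper or the rebuttal. The function name, the choice of 10 bins, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for a binary classifier.

    probs:  predicted probability of the positive class, in [0, 1]
    labels: 0/1 ground-truth labels
    Returns the bin-size-weighted mean of |positive rate - mean predicted prob|.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1; prob == 1.0 lands in the last bin.
    idx = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(labels[mask].mean() - probs[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return ece

# Synthetic check: labels drawn from the predicted probabilities are calibrated
# by construction, while sharpening the same predictions makes them overconfident.
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=50_000)
y = (rng.uniform(0.0, 1.0, size=p.size) < p).astype(float)
overconfident = np.clip(0.5 + 2.0 * (p - 0.5), 0.0, 1.0)
print("calibrated ECE:   ", expected_calibration_error(p, y))
print("overconfident ECE:", expected_calibration_error(overconfident, y))
```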

Reviewer 3

The paper proposes an importance-weighting scheme to correct the bias in the distribution learned by a generative model. Specifically, the authors employ a classifier to distinguish between a true data sample and a sample from the generative model. This binary classifier is used to estimate the likelihood ratio when the true distribution is unknown and the model distribution is intractable. The authors also discuss practical techniques to address the issue of an imperfect classifier and to reduce the variance of the Monte Carlo estimates. These include normalizing the weights across a batch of data samples, smoothing, and clipping. Standard metrics for evaluating the sample quality of generative models, such as FID and Inception score, show improvement when the Monte Carlo estimates using the learned importance weights are used. This also appears to help on a downstream domain adaptation task on Omniglot.

Questions/Concerns: It is not clear how the learned classifier is calibrated. The calibration seems quite crucial, and this is often difficult, especially for a deep neural network classifier trained on high-dimensional data. The post-hoc normalization schemes are not well motivated. How do these interact with calibration? p, p_data, and p_theta need to be defined early on. It is not clear whether p_mix, eq (2), and the related discussion add much to the paper. Why does D_g + LFIW not show much improvement over D_g? In the toy experiment, how do the results shown in Figure 1 change as the two modes in the true distribution get closer and closer? Domain adaptation on standard benchmark datasets and comparison with baselines would strengthen the empirical results.

Typo: Policy evaluation numbers are missing in the introduction.

Update: Thanks for answering most of my questions and those of the other reviewers. I have raised my score.
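
The link between the classifier and the likelihood ratio described at the start of this review can be sketched in a toy setting. With equal numbers of real and generated samples, the Bayes-optimal classifier outputs c(x) = p_data(x) / (p_data(x) + p_theta(x)), so c(x) / (1 - c(x)) recovers p_data(x) / p_theta(x). The example below is an illustration under strong assumptions and not the paper's experiment: both densities are 1-D Gaussians picked arbitrarily, and the classifier is replaced by its closed-form Bayes-optimal output rather than a trained network. All names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def ratio_from_classifier(c, eps=1e-12):
    """Turn the probability that x is real, c(x), into the weight c / (1 - c)."""
    c = np.clip(np.asarray(c, dtype=float), eps, 1.0 - eps)
    return c / (1.0 - c)

rng = np.random.default_rng(0)

# Assumed toy setup: "data" is N(0, 1), the biased "model" is N(0.5, 1.5^2).
x = rng.normal(0.5, 1.5, size=20_000)      # samples from the model only
p_data = gaussian_pdf(x, 0.0, 1.0)
p_theta = gaussian_pdf(x, 0.5, 1.5)

# Stand-in for a trained classifier: the Bayes-optimal probability of "real".
c = p_data / (p_data + p_theta)
w = ratio_from_classifier(c)

# Estimate the data mean (true value 0) from model samples, without and with weights.
est_plain = x.mean()
est_weighted = np.sum(w * x) / np.sum(w)   # self-normalized importance estimate
print("unweighted estimate:", est_plain)
print("weighted estimate:  ", est_weighted)
```

In practice the classifier is learned and only approximately optimal, which is where the calibration question above becomes relevant: any miscalibration in c(x) propagates directly into the weights.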