| Paper ID | 7283 |
|---|---|
| Title | The continuous Bernoulli: fixing a pervasive error in variational autoencoders |

I have read the author response and am increasing my confidence: after the response and the other reviews, I believe I have understood everything correctly. I loved the analysis of the "concentration of mass at the extrema" for the CB versus the beta that the authors provided in their response. It is exactly this kind of careful study, and how it relates to what you saw in your MNIST experiments (and why it matters for the particular characteristics of that dataset), that makes me love a paper like this. I hope the authors add that analysis to the supplementary material at least. It almost sounds like your supplement could be a mini paper on such a study that is interesting in its own right (though please don't write another paper on it).

***

The paper is very classy and very well written. I find it very inspiring, and I will fight for it to be accepted if needed. I do believe in the significance and impact of the paper as a demonstration of how to carefully study your modeling choices, and the possible negative repercussions of some of those choices, from a technical perspective. I loved things like studying the shape of the normalising constant as a function of the parameter. I would like researchers to mimic such studies more often in their own work.

However, I don't think the paper is necessarily "seminal", as I can't see many people opting to use this new continuous Bernoulli distribution instead of nicer distributions on the unit interval, namely the beta distribution. With that said, why did Kingma and Welling not just use the beta distribution as the likelihood in the first place (defined in the same way as in your experiments section)? Empirically, the two appear to have comparable or mixed performance, while qualitatively (the digit samples) the CB is more satisfying. Is that correct? I see that this is covered, but I wish it were emphasised more in other parts of the paper, such as the introduction. I don't think it will weaken the paper.

I think the CB distribution could be interesting in its own right, but again it is rather ugly compared to the beta distribution. I think the way to go would be a study of the qualities of, and differences between, the beta and the CB, for example comparing the shapes of their normalising constants, or the merits of directly modelling the single parameter of the CB (and its interpretation). But I understand this is beyond the scope and needs of the paper. I can't say I found anything lacking in the paper.

Updates after rebuttal
----------------------------

I read the rebuttal and still think more evidence is needed to show that the continuous Bernoulli is the right choice. I agree with the other reviewers that this is an interesting and valid question to look into. However, in its current form the paper does not meet the requirements for publication at NeurIPS.

- The simplicity of the approach is not necessarily a problem, but in this case we would like to see more empirical analysis (ideally on real-world architectures in everyday use and on common benchmarks). This is clearly missing from the current paper.
- There is no clear evidence that the proposed continuous Bernoulli distribution outperforms other simple or widely used choices (e.g., Beta, Gaussian, 256-way softmax, discretized logistic). In the newly added experiments on CIFAR10, the proposed CB distribution is outperformed by the Beta, which raises concerns about the motivation.

Several comments:

1. There isn’t much information in Sections 4.1 and 4.3.
2. Section 4.5 mentions that applying the mean transformation of the continuous Bernoulli can help fix some problems of VAEs trained with improper Bernoulli likelihoods. The authors point out that the common reasoning is false but, interestingly, there are still improvements if we do this. Since this is more of an analysis paper, it would be interesting to see the reason.
3. The results on MNIST are good and clearly show improvements. But given that the idea is rather simple, I’m expecting more experimental evidence on SOTA architectures. The existing results on CIFAR are not satisfying.
4. The authors show in some experiments that the continuous Bernoulli performs better than the Beta distribution, which is a natural choice for a continuous distribution over [0, 1]. But there is no explanation for this. Why should we prefer the continuous Bernoulli over the Beta?

## The continuous Bernoulli: fixing a pervasive error in variational autoencoders

### Summary

The authors propose a new single-parameter distribution with support on the open interval (0,1). This distribution is the result of properly normalizing a Bernoulli distribution when the random variable is allowed to take values in the continuous interval (0,1) rather than just in the discrete set {0,1}. This so-called "Continuous Bernoulli" distribution belongs to the exponential family and has several nice properties, such as reducing to the classic Bernoulli distribution in the limit. In the experiments, the authors use it as the likelihood of a VAE to model image datasets where the pixel intensities are normalized between 0 and 1. Previous work in the literature has modelled this incorrectly by using a Bernoulli likelihood, and the authors show that using a proper distribution (the proposed continuous version of the Bernoulli distribution) improves performance.

### Details

The motivation of the paper is simple and the exposition is clear. In addition, it is really well written, which raises the quality of the paper. Some sentences, however, should be slightly toned down (a minor point).

The main idea of the paper is to propose a new distribution which is simply a continuous version of the Bernoulli distribution. This is done by assuming the same functional form as the Bernoulli distribution and then computing the right normalization constant in closed form. Surprisingly, and to the extent of my knowledge, this distribution has not been previously proposed in the literature. The authors provide a thorough analysis of this new distribution, computing the pdf as well as the first moment, and showing that it belongs to the exponential family, that it is amenable to the re-parameterization trick, and that in the limit it converges to the classical Bernoulli distribution (among other properties). One derivation that is missing, and that I think is important for better understanding the properties of a distribution, is the moment generating function, which has a closed form (see the derivation at the bottom).

In the experiments, the authors use it as the likelihood in the VAE to model an image dataset where the support is bounded. This was previously done by incorrectly using the Bernoulli distribution, ignoring the fact that the observations were not binary but rather continuous values between 0 and 1 after renormalizing the pixel intensities to fall in this interval. They show that the proposed continuous Bernoulli distribution outperforms the VAE with Bernoulli, Gaussian and Beta likelihoods.

Perhaps the weakest point of the paper is that the authors fail to analyse further why the Beta shows worse performance. The Beta distribution is a more flexible object than the proposed distribution in that it has two parameters and can model bimodal (U-shaped) densities. Given that, I would expect to see a trade-off between the two distributions (not one outperforming the other in all situations). The Beta distribution has the advantage of being more flexible, but at the same time it would be more prone to overfitting due to its larger number of parameters. Additionally, the Beta distribution is tricky to optimize due to the gamma functions in its normalization constant, and the proposed distribution may be better behaved. This analysis is missing in the paper.

Despite this, the paper is well motivated, proposes a simple approach to a prevalent problem in the previous literature, and has significant novelty in introducing a distribution that, despite its simplicity and to the extent of my knowledge, has not been proposed before.
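
To make the "properly normalized Bernoulli" idea concrete, here is a minimal Python sketch. It assumes the standard form of the density, p(x | λ) = C(λ) λ^x (1-λ)^(1-x) on (0,1) with C(λ) = 2 tanh^{-1}(1-2λ) / (1-2λ) for λ ≠ 0.5 and C(0.5) = 2, together with the corresponding closed-form mean. These formulas are not spelled out in this review, and the function names (`cb_log_norm`, `cb_pdf`, `cb_mean`, `integrate`) are purely illustrative. The sketch checks by simple numerical integration that the density integrates to one, while the unnormalized Bernoulli "density" does not.

```python
import math


def cb_log_norm(lam: float) -> float:
    """Log normalising constant log C(lam) (assumed standard CB form)."""
    if abs(lam - 0.5) < 1e-6:
        return math.log(2.0)  # limit of C(lam) as lam -> 0.5
    return math.log(2.0 * math.atanh(1.0 - 2.0 * lam) / (1.0 - 2.0 * lam))


def cb_pdf(x: float, lam: float) -> float:
    """Density C(lam) * lam^x * (1 - lam)^(1 - x) for x in (0, 1)."""
    log_unnorm = x * math.log(lam) + (1.0 - x) * math.log1p(-lam)
    return math.exp(cb_log_norm(lam) + log_unnorm)


def cb_mean(lam: float) -> float:
    """Closed-form mean of the CB (assumed; equals 0.5 at lam = 0.5)."""
    if abs(lam - 0.5) < 1e-6:
        return 0.5
    return lam / (2.0 * lam - 1.0) + 1.0 / (2.0 * math.atanh(1.0 - 2.0 * lam))


def integrate(f, n: int = 20000) -> float:
    """Midpoint rule on (0, 1); illustrative helper, not a library call."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))


if __name__ == "__main__":
    for lam in (0.1, 0.3, 0.5, 0.9):
        total = integrate(lambda x: cb_pdf(x, lam))
        unnorm = integrate(lambda x: lam ** x * (1.0 - lam) ** (1.0 - x))
        print(f"lam={lam}: CB mass={total:.6f}, "
              f"unnormalised Bernoulli mass={unnorm:.6f}, mean={cb_mean(lam):.6f}")
```

Under these assumptions the mean is generally not equal to λ, which is relevant to the distinction between the parameter and the mean of the distribution.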
## Moment generating function

Just applying integration by parts, you get

$$
\mathrm{MGF}(t) = \frac{2 \tanh^{-1}(1-2\lambda)}{2 \tanh^{-1}(1-2\lambda) - t} \cdot \frac{\lambda \exp(t) + \lambda - 1}{2\lambda - 1} \qquad (\lambda \neq 0.5).
$$
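
The closed form above can be checked numerically. The following is a minimal sketch, assuming the standard continuous Bernoulli density C(λ) λ^x (1-λ)^(1-x) with C(λ) = 2 tanh^{-1}(1-2λ) / (1-2λ); the function names are illustrative. It compares the closed-form expression against a midpoint-rule estimate of E[exp(tX)].

```python
import math


def cb_pdf(x: float, lam: float) -> float:
    """Continuous Bernoulli density for lam != 0.5 (assumed standard form)."""
    norm = 2.0 * math.atanh(1.0 - 2.0 * lam) / (1.0 - 2.0 * lam)
    return norm * lam ** x * (1.0 - lam) ** (1.0 - x)


def cb_mgf_closed(t: float, lam: float) -> float:
    """Closed-form MGF as stated in the review (lam != 0.5).

    The expression is 0/0 at t = 2 * atanh(1 - 2 * lam); that point is a
    removable singularity and is simply avoided in this sketch.
    """
    a2 = 2.0 * math.atanh(1.0 - 2.0 * lam)
    return a2 / (a2 - t) * (lam * math.exp(t) + lam - 1.0) / (2.0 * lam - 1.0)


def cb_mgf_numeric(t: float, lam: float, n: int = 20000) -> float:
    """Midpoint-rule estimate of E[exp(t X)] over (0, 1)."""
    h = 1.0 / n
    return h * sum(
        math.exp(t * (i + 0.5) * h) * cb_pdf((i + 0.5) * h, lam) for i in range(n)
    )


if __name__ == "__main__":
    for lam in (0.2, 0.7):
        for t in (-1.5, 0.3, 2.0):
            closed = cb_mgf_closed(t, lam)
            numeric = cb_mgf_numeric(t, lam)
            print(f"lam={lam}, t={t}: closed={closed:.8f}, numeric={numeric:.8f}")
```

As a further sanity check, the closed form evaluates to 1 at t = 0, as any MGF must.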