NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:8312
Title:Discrete Flows: Invertible Generative Models of Discrete Data

Reviewer 1

POST REBUTTAL

The rebuttal has cleared most of my concerns and I am happy to maintain my score.

---

This paper ranks high in novelty: it is the first paper to consider discrete flows, and it proposes the first discrete flow transformation layers (XOR and Additive). The experimental results are strong, especially on text modelling. Moreover, the proposed method is significantly more computationally efficient than competing approaches. The paper is very well written and easy to understand. However, it suffers from the following (few) shortcomings:

1. The capabilities of the proposed XOR and Additive flow layers are unclear (even in 2D). For example, Figure 2 shows results only on a discretised mixture of Gaussians, and the fit to even this distribution is not perfect. The authors should consider a wider array of distributions to demonstrate the capabilities of the proposed flow layers more convincingly, e.g. discretised versions of the distributions analysed in [1].

2. Some important details are unclear. For example, what is the base distribution for sampling? Is it the factorised marginal distribution? If it is, how is it estimated in high dimensions (given that the number of data samples needed for an accurate estimate would grow exponentially)?

[1] Invertible Residual Networks, Behrmann et al.

Reviewer 2

Originality: This paper is the first application of flow-based models to discrete data. As such, the work is fairly novel. The flow-based modeling community has been wondering how to model discrete data for some time, and this paper provides an answer to this question. That being said, the main technical contribution amounts to using a modulo operator (Eq. 5) and handling backpropagation through an argmax operator (Eq. 6) on top of the existing techniques of MAF and Real NVP. I view this simplicity as a benefit of the approach, but some may view this as a simple extension of existing techniques.

Quality: The technical and experimental aspects of the paper are well-executed. The authors provide multiple experiments to demonstrate autoregressive and bipartite flows. Within these experiments, various hyper-parameter settings are reported to give a better intuition for the performance of the models. Generation time is reported, helping to demonstrate the benefits of the models. For the most part, the technical ideas are fully developed and explored.

Clarity: The presentation of the approach is incredibly clear. Examples are given during the presentation, which help the reader gain intuition about when the approach is useful. The diagram in Figure 1 is helpful for unfamiliar readers. For the most part, the experiments section is also clear. Some details of the models and training set-up are unclear, particularly in the toy examples from Sections 4.1-4.3. Additional details in the supplementary material would help to clear up confusion.

Significance: Although the introduction of discrete flows is a significant contribution, the paper currently feels like more of a proof-of-concept than a competitive new approach. Demonstrations of new techniques are helpful, as other researchers will undoubtedly extend this technique to new settings. But additional experiments would help to complete this paper and broaden its impact. Many flow-based models have been applied to images, and it seems like discrete image datasets, e.g. binarized MNIST or Caltech-101 Silhouettes, would be a natural testbed. In fact, RGB images are already naturally discrete. Likewise, with recent interest in discrete latent variable models, e.g. VQ-VAE, applying inverse autoregressive flows for variational inference would be another natural choice.

--- Updates: Assuming the authors provide additional details on experiments in the supplementary, I will be happy with this aspect. I'm perplexed as to why the authors seem resistant to running experiments on a simple binary image dataset, e.g. binarized MNIST or Caltech-101 Silhouettes. With binary data, there wouldn't be any issues with the ordinality of the pixels. And these datasets are small enough that getting results should take a matter of hours or less. This just seems like an obvious experiment to try, to see how discrete flows compare with other families of generative models. It would also help to broaden the appeal of the paper to a wider audience.
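To make the modulo operator discussed above concrete, the following is a minimal sketch of a two-variable bipartite layer with a modular shift. It is not the authors' implementation: the function names, the hand-written `shift` rule (standing in for a learned network), the choice K = 5, and the factorised base probabilities are all assumptions made for illustration. The sketch checks that the layer is a permutation of the state space, that the induced model distribution sums to one without any Jacobian term, and that the K = 2 case matches XOR, which is the setting relevant to binary image data.

```python
import itertools

K = 5  # number of categories per variable (illustrative choice)

def shift(x1, K):
    # Hypothetical stand-in for a learned network: any map into {0, ..., K-1} works.
    return (3 * x1 + 1) % K

def forward(x, K):
    # Bipartite layer: x1 passes through unchanged, x2 is shifted modulo K.
    x1, x2 = x
    return (x1, (x2 + shift(x1, K)) % K)

def inverse(y, K):
    # Undo the shift; y1 equals x1, so the same shift can be recomputed.
    y1, y2 = y
    return (y1, (y2 - shift(y1, K)) % K)

def model_prob(x, base, K):
    # Change of variables for a discrete bijection: p_x(x) = p_y(f(x)), no Jacobian term.
    y1, y2 = forward(x, K)
    return base[0][y1] * base[1][y2]

states = list(itertools.product(range(K), repeat=2))
# Assumed factorised base distribution over (y1, y2); the numbers are made up.
base = ([0.1, 0.2, 0.3, 0.25, 0.15], [0.5, 0.2, 0.1, 0.1, 0.1])

# The layer permutes the K*K states, so it is invertible and mass is preserved.
assert sorted(forward(s, K) for s in states) == states
assert all(inverse(forward(s, K), K) == s for s in states)
total = sum(model_prob(s, base, K) for s in states)
print(abs(total - 1.0) < 1e-9)

# With K = 2 the modular shift coincides with XOR.
print(all(forward((a, b), 2)[1] == b ^ shift(a, 2) for a in (0, 1) for b in (0, 1)))
```

Because the shift depends only on the variable that passes through unchanged, inversion costs the same as the forward pass, which is the property bipartite flows rely on for fast generation.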

Reviewer 3

The paper highlights that, although it is not obvious at first, normalising flows are in principle available for discrete data as well. The key is to design invertible transformations between discrete spaces. We can think of such a bijection as a relabelling of the discrete sample space, so there is no need to compute Jacobian determinants. The authors also show that it is not difficult to design parametric invertible transformations, giving an example with XOR and a generalisation thereof based on mod K.

The real difficulty, and the paper could be more explicit here, is how to estimate the parameters of such discrete transformations. Their parameters are themselves discrete, so if a NN predicts them (for maximum flexibility), this network requires a nondifferentiable output activation. The approach taken at this point is to ignore the problem and employ the straight-through estimator (which the authors argue works well for problems where K is not too large).

The authors demonstrate that the technique is effective in controlled artificial tasks as well as in char-level density estimation for text (Penn Treebank and text8), showing both improved likelihoods and fast generation. The paper is clearly original, and I imagine it will be of great significance (it brings two interests together, namely, flexible flow-based density estimation and modelling discrete data). The paper is mostly quite clear; I only have a few remarks in the next box.
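The straight-through idea mentioned above can be sketched as follows. This is a generic illustration rather than the paper's exact estimator: the forward pass uses a hard one-hot argmax to obtain a discrete shift, while the backward pass substitutes the gradient of a temperature-scaled softmax. The function names, the temperature `tau`, and the example logits are assumptions introduced here.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def one_hot_argmax(logits):
    # Forward pass: a hard, nondifferentiable one-hot vector.
    out = np.zeros_like(logits)
    out[np.argmax(logits)] = 1.0
    return out

def straight_through_grad(logits, upstream, tau=1.0):
    # Backward pass: pretend the output was softmax(logits / tau) and
    # push the upstream gradient through that softmax's Jacobian.
    p = softmax(logits / tau)
    jac = (np.diag(p) - np.outer(p, p)) / tau
    return jac.T @ upstream

logits = np.array([0.2, 1.5, -0.3, 0.9])  # hypothetical network outputs, K = 4
values = np.arange(4.0)                   # category values 0, ..., K-1

# Discrete shift used in the forward pass (the value of the argmax category).
mu = one_hot_argmax(logits) @ values

# Surrogate gradient of mu with respect to the logits.
grad = straight_through_grad(logits, values, tau=0.5)
print(mu, grad)
```

The surrogate gradient is biased, since the forward and backward passes disagree. That mismatch is one plausible reason the estimator would be reported to work well only when K is not too large.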