Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
(Originality): This is the first paper to discuss flow-based models for discrete data, concurrent with [Tran et al. 2019] (see https://arxiv.org/abs/1905.10347). However, there is little overlap between the papers, as the other paper considers flows for nominal data and proposes a different set of flows. The paper also, to the best of my knowledge, is the first to use flow-based methods for lossless compression, concurrent with [Ho et al. 2019] (see https://arxiv.org/abs/1905.08500).

(Quality): The paper appears technically sound. Since the flows map integers to integers, the straight-through estimator (STE) is used to allow gradient-based optimization, and the authors are clear about the gradient bias the STE introduces.

(Clarity): I found the paper very readable and well structured. Background material is introduced, and the method is clearly explained and supported by several illustrative figures.

(Significance): This paper is among the first papers on two emerging lines of research:
- Flows for discrete data.
- Flows for lossless compression.
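For context on the STE point above: rounding has zero derivative almost everywhere, so the straight-through estimator replaces that derivative with the identity during backpropagation. A minimal NumPy sketch of this idea (the function names are illustrative, not the authors' code):

```python
import numpy as np

def round_ste_forward(x):
    """Forward pass: ordinary rounding to the nearest integer."""
    return np.round(x)

def round_ste_backward(grad_output):
    """Backward pass under the straight-through estimator: the true
    derivative of round() is 0 almost everywhere, so the STE simply
    passes the incoming gradient through unchanged (identity).
    This is biased, as the review notes, but gives a usable signal."""
    return grad_output

x = np.array([0.2, 1.7, -0.6])
y = round_ste_forward(x)                  # integer-valued outputs
g = round_ste_backward(np.ones_like(x))   # gradients pass through as 1s
```

In autodiff frameworks the same effect is usually obtained by treating `round(x) - x` as a constant, so the forward value is `round(x)` while the gradient flows through `x`.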
The paper proposes a discrete flow model (IDF) based on the ideas from NICE/Real NVP. It differs from the discrete flow model of Tran et al., though the differences in layer construction are not too big. I don’t know what the policy regarding similar ideas is, but this work has strong experiments achieving state-of-the-art results, it seems to offer a better design for the factor-out layers and the latent distribution, and the paper is well organised and clearly written. I think that (discrete) flow models are an important direction to work on and this paper will be used by others.

I think the paper could be much better if it gave intuition on why some choices were made. For example, there is a conditional dependence between the parts of z: how much worse would the model be with independent latents? Another question is how much bigger the model would need to be if only translations were used instead of modulo scaling and translation. Have you tried using a softmax instead of a discretised logistic distribution?

The results in Table 3 are surprising, and they relate to the questions above. Do you know why a simpler model with only translations performs better than Real NVP with both scaling and translation? Could the difference be due to the dependencies in z, or to a different latent distribution? Are the models in Table 3 comparable in terms of the number of parameters?

Section 2.2 says that the rANS entropy encoder was used in the experiments, so I assume the compression results in Table 1 are computed after applying rANS. Is that correct? If so, have you implemented an encoder-decoder pair that one can readily use as a replacement for lossless JPEG2000 or PNG? How would they compare with respect to computational cost?

Line 244 typo: ‘they are are focused’.

========== UPDATE ============

The rebuttal was sufficient and I'm happily increasing my score.
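Regarding the discretised-logistic question above: unlike a softmax over all symbol values, a discretized logistic assigns each integer the probability mass of its unit-width bin under a logistic CDF, so it needs only a location and a scale parameter. A hedged sketch of a generic discretized-logistic pmf (not the authors' exact parameterisation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discretized_logistic_pmf(x, mu, s):
    """Probability that a Logistic(mu, s) variable falls in the
    unit-width bin [x - 0.5, x + 0.5), i.e. 'rounds to' integer x.
    The logistic CDF is the sigmoid, so the bin mass is a difference
    of two sigmoids."""
    return sigmoid((x + 0.5 - mu) / s) - sigmoid((x - 0.5 - mu) / s)

# Bin masses over a wide integer range sum to ~1 (up to tail truncation).
total = sum(discretized_logistic_pmf(x, mu=0.0, s=1.5) for x in range(-50, 51))
```

A softmax would instead need one free parameter per symbol value, which is why the parametric choice matters for high-bit-depth data.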
The core contribution of this paper is the discrete coupling layer for discrete variables. This layer can be seen as a variant of the continuous coupling layer in Real NVP or Glow: the input is split into two parts, and the first part is used to generate transformation parameters for the second part. The main difference is that the discrete coupling layer rounds the bias so that the output stays integer-valued. The paper designs a discrete flow model based on the proposed discrete coupling layer, using a discretized logistic (DLogistic) distribution for the latent code. The paper applies the proposed discrete normalizing flow to lossless compression, and outperforms the current methods.

Pros:
1. The paper is well written and easy to follow.
2. The proposed discrete coupling layer is useful for discrete variables.
3. The application is new, since previous flow models have all been applied to generating synthetic images.
4. The experiments are good and demonstrate the model's ability to do lossless compression.

Cons:
1. I don’t quite understand why we should split the input into 75-25 parts. It seems arbitrarily set, so I think it needs more discussion, or some empirical results to justify it.
2. I think a better way to compare the proposed model with other flow models, e.g. Glow, is to compare the generated samples, and to use larger images, e.g., 64x64, 96x96 or 128x128. The images in Figure 7 are so small that they cannot demonstrate the model’s ability at image generation. With only the NLL on small datasets, it is hard to say that the discrete flow performs as well as the state-of-the-art continuous flows, e.g., Flow++ and Glow.
3. Maybe I missed something: to generalize the discrete flow to a continuous flow, I think we need to compute the determinant of the Jacobian matrix, but I did not see the authors mention it.

=================================

I have read the authors' responses and the other reviewers' comments. I did not change my score.
The reasons are as follows:
1. The authors did not answer my question about why we need to split the variable 75-25.
2. The authors did not provide an example of a 64x64 generated image in the author response, so I am not quite sure how good the generated images can be.
3. Overall, it is a good paper, and I tend to accept it.
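The discrete coupling layer discussed in the reviews can be sketched as an integer additive coupling: split the input (here 75-25, the ratio the reviews question), predict a real-valued translation from the first part, round it, and add it to the second part. The conditioning network below is an illustrative stand-in, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
split = 3 * n // 4                        # 75-25 split of the coordinates
W = rng.normal(size=(n - split, split))   # stand-in for a neural network

def net(x_a):
    """Illustrative conditioning network: maps the first 75% of the
    coordinates to a real-valued translation for the remaining 25%.
    Any deterministic function works for demonstrating invertibility."""
    return np.tanh(W @ x_a)

def forward(x):
    """Integer additive coupling: z_a = x_a, z_b = x_b + round(net(x_a)).
    Rounding the translation keeps the output integer-valued."""
    x_a, x_b = x[:split], x[split:]
    t = np.round(net(x_a)).astype(np.int64)
    return np.concatenate([x_a, x_b + t])

def inverse(z):
    """Exact inverse: z_a equals x_a, so the same rounded translation
    can be recomputed and subtracted. The map is a bijection on the
    integers, which is what makes lossless coding possible."""
    z_a, z_b = z[:split], z[split:]
    t = np.round(net(z_a)).astype(np.int64)
    return np.concatenate([z_a, z_b - t])

x = rng.integers(0, 256, size=n)          # e.g. 8-bit pixel values
z = forward(x)
```

This also illustrates the Jacobian point raised in Con 3: an additive coupling on integers is volume-preserving, so no Jacobian-determinant term appears in the discrete likelihood.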