NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Originality: This model is primarily composed of two main modifications to the original Glow model. The first modification is what makes it "masked": instead of modeling the entire image jointly as in Glow, MaCow builds a semi-autoregressive model by allowing dependencies only within a local region using masked convolutions. This allows more efficient inference (effectively O(h) or O(w), rather than O(hw)) than a fully autoregressive model and higher-quality modeling than a fully non-autoregressive model. Additionally, they tweak Glow to output more latent channels at every scale, as depicted in Figure 2c. Besides these two modifications, the model and setup greatly resemble those of the original Glow paper for image synthesis.

Clarity: The paper is clear. However, the fine-grained architecture and the dequantization could be explained significantly more clearly.

Significance: This work describes an intermediate between fully autoregressive and non-autoregressive flow models. Autoregressive models tend to be better at density estimation, so it is somewhat natural that this model achieves better results on density estimation.
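As a concrete picture of the masking described above (an illustration only, not necessarily the paper's exact kernel layout), a 3x3 kernel whose centre row and everything below it are zeroed out gives each output pixel a receptive field confined to the rows above it:

    import torch

    k = 3
    mask = torch.ones(k, k)
    mask[k // 2:, :] = 0.0   # zero the centre row and all rows below it
    print(mask)
    # tensor([[1., 1., 1.],
    #         [0., 0., 0.],
    #         [0., 0., 0.]])
    # Multiplying a convolution kernel elementwise by this mask restricts each
    # output pixel's dependencies to rows above it, so sampling can proceed one
    # row at a time (O(h) sequential steps) rather than one pixel at a time (O(hw)).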
Reviewer 2
UPDATE: Many thanks to the authors for the rebuttal, which clearly answers many of our questions. I have increased my score to 6 in response. I'm glad to see measurements of generation speed, and I think these will improve the paper. They experimentally confirm that generation speed scales linearly with the height (or width) of the image. My apologies for thinking that s() and b() are linear; of course they can contain masked convolutions as well as nonlinearities. The following statement in the paper, "The two autoregressive neural networks, s() and b(), are implemented with one-layer masked convolutional networks", gave me the impression that s() and b() are a single masked convolution. This misunderstanding demonstrates that the explanation of masked convolutions can be improved. That said, I hope the authors agree that masked convolutions are a specific way of implementing autoregressive convolutions (which have existed in various forms since, e.g., PixelCNN), rather than a new conceptual development, which I believe justifies my judgement of low originality. In general, I'd like to reiterate that the paper could be improved if it focused on its core contribution (the masked convolution) and explained it more clearly. Currently, the paper goes through a lot of material (such as variational dequantization) that is already known and orthogonal to the contribution of the paper. I sincerely hope the authors will take that into account when revising the paper.

Summary: The paper describes MaCow, a flow-based generative model for images. Like Glow, MaCow includes affine coupling layers, 1x1 convolutions, and actnorm layers. In addition to these, MaCow includes masked convolutional layers and a fine-grained variant of the multi-scale architecture of Glow. The model achieves state-of-the-art results among flow-based models for image generation, as measured by bits/dim on a test set.

Originality: The two original elements of this paper are the masked convolutional layers and the fine-grained multi-scale architecture. However, I would consider both of them rather incremental contributions, as they are variants of already existing architectures and not significantly novel. In particular, the masked convolution is essentially an autoregressive layer whose scale and shift functions are linear, share parameters, and have a restricted receptive field, whereas the fine-grained architecture is an almost trivial variation of the multi-scale architecture of Real NVP. There are other versions of invertible convolutions not discussed in the paper, for example: Hoogeboom et al., Emerging Convolutions for Generative Normalizing Flows, arXiv:1901.11137, January 2019. I think the paper would benefit from a discussion of how the masked convolutional layers differ from invertible convolutions such as the above.

Quality: The model is tested on four image datasets, two of which are high-resolution, and achieves state-of-the-art results in terms of bits/dim in some cases. The experimental evaluation is done well; I'm impressed by the ablation experiments and the level of detail provided in the supplementary material. The ablation experiments clearly evaluate the separate improvements due to the masked convolutional layers, the fine-grained multi-scale architecture, and the variational dequantization; in particular, we see that the masked convolutional layers and the fine-grained architecture yield small improvements, whereas the variational dequantization yields a larger improvement.
It would have been good to include error bars, though, so that we could confirm that the improvements are statistically significant (they probably are). My concern is whether the masked convolutional layers slow down image generation. The masked convolutional layers are essentially autoregressive layers with restricted receptive fields; as a result, inverting them requires H (or W) passes instead of the HxW passes of a fully autoregressive layer. For a 256x256 image, does that mean that the masked convolutional layers are 256 times more expensive to invert than to evaluate? If that's the case, then I don't see the usefulness of the masked convolutions; they provide too little benefit to justify such a reduction in generation speed. In line 128, it is stated that "the training procedure is unstable when modeling an extended range of contexts and stacking multiple layers". This is a surprising claim, and I'm not sure I believe it. My understanding is that the instability is entirely due to stacking multiple layers and has nothing to do with the size of the context. I think either more supporting evidence is required, or the claim should be amended.

Clarity: The paper is generally well written and easy to understand for someone familiar with the field. However, the writing is sloppy at times, and the paper could benefit from a revision with more attention to detail. Moreover, the explanation of masked convolutional layers in Section 3.1 is rather compressed and could be made clearer. Since masked convolutional layers are the main contribution of the paper, I think the paper would benefit significantly if Section 3.1 were expanded and clarified.

Significance: Even though the paper achieves state-of-the-art results, it does so by incrementally varying already existing architectures. For this reason, I don't consider the contribution of this paper to be particularly significant.
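To make the H-versus-HxW counting above concrete, here is a minimal sketch (an illustration under simple assumptions, not the authors' implementation) of an affine flow step whose scale s() and bias b() are produced by a row-masked convolution: the forward pass and its log-determinant take a single parallel pass over the image, whereas inversion has to sweep over the H rows sequentially.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RowAutoregressiveAffine(nn.Module):
        """y = x * exp(log_s(ctx)) + b(ctx), with ctx restricted to rows above
        each pixel. A sketch only: kernel size, parameter sharing, and the exact
        masking pattern are assumptions, not the MaCow architecture."""

        def __init__(self, channels, k=3):
            super().__init__()
            self.conv = nn.Conv2d(channels, 2 * channels, k, padding=k // 2)
            mask = torch.zeros_like(self.conv.weight)
            mask[:, :, : k // 2, :] = 1.0          # keep only rows above the centre
            self.register_buffer("mask", mask)

        def _scale_bias(self, x):
            h = F.conv2d(x, self.conv.weight * self.mask, self.conv.bias,
                         padding=self.conv.padding)
            return h.chunk(2, dim=1)               # (log_s, b)

        def forward(self, x):
            # One parallel pass: each pixel's (log_s, b) depends only on rows above it.
            log_s, b = self._scale_bias(x)
            y = x * torch.exp(log_s) + b
            return y, log_s.flatten(1).sum(dim=1)  # log|det Jacobian| per example

        def inverse(self, y):
            # H sequential passes: row i can only be recovered once rows < i are known.
            x = torch.zeros_like(y)
            for i in range(y.shape[2]):
                log_s, b = self._scale_bias(x)     # only already-filled rows matter for row i
                x[:, :, i] = (y[:, :, i] - b[:, :, i]) * torch.exp(-log_s[:, :, i])
            return x

A fully autoregressive layer would need the analogous loop over all HxW pixels, which is the comparison being drawn; evaluating the forward direction is a single pass in either case.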
Reviewer 3
Given the Glow and flow++ papers, it seems the biggest contribution of this work is the masked convolution coupling layers, which, on their own, improve upon Glow but fall behind flow++ for both uniform and variational dequantization. Given the paper's abstract, I would've expected performance to exceed flow++, but this does not seem to be the case for either uniform or variational dequantization. I would've liked to see more directly comparable experiments with flow++ (e.g., by simply augmenting the flow++ architecture with masked convolutions, since their code is publicly available), especially since MaCow has the exact same motivations (judging from the abstract) and also uses variational dequantization. I found the current comparisons a bit messy and difficult to understand. Given that Glow had to use a batch size of 1 per GPU and gradient checkpointing in order to train on CelebA-HQ, can the authors comment on how MaCow compares?

The paper also dedicates a page to discussing dequantization; however, it isn't clear to me how this is different from flow++'s ELBO. It seems the main equations simply re-derive this lower bound on the log density. I did not count this as a contribution of the paper, but it could be that I simply did not understand the message here. Note that the temperature trick for sampling used in Glow is only applicable when the change in log density is constant w.r.t. the sample, so they only applied it to additive coupling layers. Can the authors verify whether their experiments satisfy this property?

On another note, please be careful with link submissions and author identity. Many of the items on the reproducibility checklist seem wrong: e.g., how hyperparameters were chosen is not specified, the number of evaluation runs is not specified, and I did not see a single error bar or standard deviation in the paper.

--- I thank the authors for providing wallclock time for sampling. One remaining comment: the authors should clarify in the paper how their section on variational dequantization differs from flow++'s exposition, or I would recommend moving it to a background section and instead expanding the sections on the properties of masked convolutions for a clearer narrative.
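For reference, the flow++ variational dequantization bound in question (written here for context; the paper's notation may differ) replaces the uniform dequantization noise u with a learned distribution q(u | x) over the dequantization hypercube:

    \log P(x) \;=\; \log \int_{[0,1)^D} p(x + u)\, du
              \;=\; \log \, \mathbb{E}_{u \sim q(\cdot \mid x)}\!\left[ \frac{p(x + u)}{q(u \mid x)} \right]
              \;\ge\; \mathbb{E}_{u \sim q(\cdot \mid x)}\!\left[ \log p(x + u) - \log q(u \mid x) \right].

Uniform dequantization is recovered as the special case q(u | x) = Uniform([0,1)^D).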