NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
* Quality
Overall the paper is of high quality. My main remarks are:
1. A priori, the hybrid modeling task seems out of place for this paper. I have rarely seen generative models being evaluated on such a task, where the classifier is trained simultaneously with the generative model. Generally there’s some semi-supervised setup where the representations are learned, frozen and used to test their ability to model p(y | x). I can see that one could argue that the inductive bias of residual networks might be more suitable for discriminative tasks as well. It would be good to state the intention of the experiment in the context of the paper more clearly (currently there’s only L251-L252, which is quite vague).
2. The samples in Figure 3 don’t look as competitive as, e.g., the samples from Glow (in terms of sample quality). Since I find no mention of temperature scaling, my assumption is that it was not used. As also mentioned in the Glow paper, temperature scaling is important to get more visually appealing samples (Glow, e.g., uses T=0.7). Since this is a generative modeling paper on images, I would suggest exploring temperature scaling and choosing the temperature that works best (and reporting it as such). It would then also be important to show the effect of temperature scaling on the samples in the appendix (Glow has this in the main paper). If this results in much more competitive samples, consider moving some of the generalization to the induced mixed norms or the hybrid modeling task to the appendix and putting more emphasis (potentially with a figure on the first page) on the samples. Currently the paper is relatively technical and it would be great to balance that out more (in the context of an improved generative model of images).
3. For CelebA (is it CelebA-HQ?) there are no quantitative results anywhere (just samples). They should ideally also be reported in Table 2.
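As a rough illustration of the temperature scaling suggested in remark 2: for a flow with a standard normal prior, reduced-temperature sampling is commonly implemented by scaling the prior's standard deviation by T before inverting the flow. The sketch below assumes that implementation; `sample_latents` and `toy_inverse_flow` are hypothetical names, and the affine map is only a stand-in for a trained model's inverse, not the authors' code.

```python
import numpy as np

def sample_latents(num_samples, dim, temperature=0.7, seed=0):
    """Draw latents z ~ N(0, temperature^2 I) for a standard normal prior."""
    rng = np.random.default_rng(seed)
    return temperature * rng.standard_normal((num_samples, dim))

def toy_inverse_flow(z):
    # Hypothetical stand-in for the trained model's inverse map z -> x.
    return 2.0 * z + 1.0

# Sweep a few temperatures, as one would when picking T for a samples figure.
for t in (1.0, 0.9, 0.8, 0.7):
    x = toy_inverse_flow(sample_latents(1000, 3, temperature=t))
    print(f"T={t}: sample std {x.std():.3f}")
```

Lower temperatures concentrate samples near the mode of the prior, which is the usual trade of diversity for visual quality.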
* Originality
The related work section seems thorough and the positioning of the paper is very clear. It is clear how the paper overcomes certain limitations in previously introduced methods of interest to the field.
* Clarity
The paper is well-written, well-structured and reads fluently.
* Significance
There is a good argument for why invertible residual networks are interesting to people in the field of generative modeling / flow-based modeling, as also explained in the introduction of the paper: it uses a common supervised network architecture (ResNet) and transforms it into one that is tractable for flow-based modeling without (big) changes to the architecture. This seems to be a promising approach compared to previous methods which rely on specialized architectures (coupling blocks, etc.). This paper makes this approach more tractable and also shows quantitative improvements, even compared to the state of the art on some of the tasks. I consider this paper of significance to the field of generative modeling. The samples are less impressive, but I discuss a possible reason for this above.
---------------------------
I thank the authors for their response. I will leave my already high score as is; addressing the comments will improve the paper, but not significantly.
Reviewer 2
I appreciate the authors' response about generalization beyond normalizing flows. I would encourage the authors to add these generalizations in the conclusion or discussion section to help readers see how these results could be more broadly useful.
---- Original review ----
The novelty primarily comes from deriving an unbiased log density and log determinant estimator rather than using a fixed truncation of an infinite sum, which can be significantly biased. The tradeoff (which the paper mentions clearly) is variable memory and computation during training. Overall, I think this is a focused but solid contribution and is explained clearly.
The induced mixed norm section could be shortened to a paragraph and the details left in the appendix. I appreciate including the somewhat negative result, but I think a paragraph of explanation with the details in the appendix would do fine.
Overall, I didn't find any major weaknesses in the paper. One small weakness may be that this primarily builds off of a previous method and improves one particular part of the estimation process. This is definitely useful, but the contribution doesn't open up entirely novel models or algorithms, and it is not clear that it can be generalized to other related situations (i.e., can these ideas be used for models/methods other than invertible residual flows?).
Could you mention what distribution you used for p(N)? I might have just missed it but wanted to double check. Also, since any distribution with support on the positive indices allows for unbiasedness, why not choose one that almost always selects a very small number (e.g., a Poisson distribution with lambda close to 0)? Could you give some intuition on what reasonable choices for p(N) are?
Figure 1 should probably be moved closer to the first mention of it. I didn't understand it until near the end of the introduction.
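To make the question about p(N) concrete, here is a minimal sketch of a "Russian roulette" style estimator of an infinite series, applied to the scalar power series log(1 + x) = sum_k (-1)^(k+1) x^k / k. The geometric choice of p(N) and the function names are assumptions for illustration only and are not taken from the paper. Each retained term is divided by P(N >= k), which keeps the estimate unbiased for any p(N) with full support; a p(N) that concentrates on small N is cheaper per sample but puts larger weights on the rare higher-order terms.

```python
import numpy as np

def roulette_log1p(x, p=0.5, rng=None):
    """One unbiased sample of log(1 + x) for |x| < 1 (illustrative only)."""
    if rng is None:
        rng = np.random.default_rng()
    n = rng.geometric(p)                 # N in {1, 2, ...}
    k = np.arange(1, n + 1)
    terms = (-1.0) ** (k + 1) * x ** k / k
    survival = (1.0 - p) ** (k - 1)      # P(N >= k) for the geometric law
    return float(np.sum(terms / survival))

def estimate(x, p, num_samples, seed=0):
    """Average many single-sample estimates."""
    rng = np.random.default_rng(seed)
    return float(np.mean([roulette_log1p(x, p, rng) for _ in range(num_samples)]))

print(estimate(0.5, p=0.5, num_samples=20000), np.log(1.5))
```

In this scalar toy case the variance stays finite only if the reweighted terms decay, roughly x^2 / (1 - p) < 1, which hints at why p(N) cannot be made arbitrarily concentrated on small N without a cost.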
Reviewer 3
While the paper is structured as a grab bag of improvements to the i-ResNet model, the methods are original and well explained via text and ablation studies. The work will be helpful for increasing the maturity of these promising residual models. Overall the quality and clarity of the paper are good, although I think there are a couple of points in the paper that could use elaboration/clarification.
Regarding the inequality in the appendix following equation 16, I presume the reason why these are not equal is that the Jacobian matrix and its derivative might not commute? Maybe it would be worth mentioning this. If these are indeed not equal, have you observed a difference in the variance of the estimator going from one form to the other (irrespective of the difference in memory costs)?
For backward-in-forward, it's not clear to me where the memory savings come from. It is mentioned that precomputing the log determinant derivative reduces memory (used in the logdet computation) from O(m) to O(1), where m is the number of residual blocks. It seems that if these derivatives are stored during the forward pass, then the memory usage for doing so will still be O(m).
I think section 3.3 could use a little elaboration on why the LipSwish activation function better satisfies (2) than the softplus function.
For the minibatching, presumably the same N is chosen for all elements in the minibatch, but I didn't see this mentioned in the paper.
The appendix mentions that the number of iterations used in power iteration is adapted so that the relative change in the estimated norm is small, but doesn't state the exact threshold. This detail would be important to have.
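To illustrate why the stopping threshold matters for the last point, here is a minimal sketch of spectral-norm power iteration that stops once the relative change in the estimate falls below a tolerance. The function name, the default `rtol=1e-3`, and the iteration cap are assumptions for illustration; the paper's actual value is exactly the detail that is missing.

```python
import numpy as np

def spectral_norm(W, rtol=1e-3, max_iters=1000, seed=0):
    """Estimate the largest singular value of W; returns (estimate, iterations)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    sigma = 0.0
    for it in range(1, max_iters + 1):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        new_sigma = np.linalg.norm(v)    # current estimate of the spectral norm
        v /= new_sigma
        # Stop when the relative change in the estimate is small (assumed rule).
        if abs(new_sigma - sigma) <= rtol * new_sigma:
            return new_sigma, it
        sigma = new_sigma
    return sigma, max_iters

W = np.diag([3.0, 1.0, 0.5])
sigma, iters = spectral_norm(W)
print(sigma, iters, np.linalg.norm(W, 2))
```

The estimate approaches the true norm from below, so a loose tolerance can underestimate the Lipschitz constant; this is why reporting the tolerance used would help reproducibility.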