NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Originality: the techniques used in the method seem incremental, but it is exciting to see that a very deep latent variable model can solve problems that shallow ones cannot.
Contribution & significance: the method is evaluated on a wide range of tasks and the results are excellent both quantitatively and qualitatively, compared with a large family of deep generative models. However, I worry about reproducibility, since most of the results come from only a single run.
Clarity: the organization and writing of the paper are good and the related work is clearly discussed.
Reviewer 2
This is a good paper in general. It is well written and easy to follow. The motivation and insight are clear, the proposed method is reasonable and novel, and the experimental results are convincing. I have several small questions:
For the equation between lines 135 and 136 (why does it not have an equation number?): it says that z^{TD} is conditioned on z^{BU}_{>i}. However, it seems z^{TD} should only depend on the BU variables lower than it. Is this a typo?
The experiments stop at L = 20. What is the performance if we keep increasing L to 30, 40, or 100, 200, 1000?
While the results shown in Table 5 provide some insight, it would be good to understand more about the learned hierarchical representation. For example, what do z^1, z^2, ..., z^L represent, respectively?
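For reference, one consistent reading of the inference factorization the reviewer refers to, reconstructed only from the notation used in these reviews (the exact conditioning sets are an assumption, not a quote from the paper):

    q_\phi(z \mid x) = q_\phi(z_L \mid x) \prod_{i=1}^{L-1} q_\phi(z_i^{TD} \mid x, z^{BU}_{>i}, z^{TD}_{>i}) \, q_\phi(z_i^{BU} \mid x, z^{BU}_{<i})

On this reading, the question is whether the TD factor should condition on z^{BU}_{>i}, as written, or instead on the BU variables below layer i.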
Reviewer 3
The authors propose a variational autoencoder where inference is performed bidirectionally (top -> bottom and bottom -> top), with the intention of enhancing the flow of information and avoiding inactive units. This is achieved via multi-layered stochastic variables and a deterministic backbone network. The proposed inference model is akin to the ladder VAE but with stochastic layers. The proposed model does not contain autoregressive elements. The authors present extensive results on image datasets and also consider semi-supervised classification and outlier detection tasks.
It is not clear why p_\theta(x|z) is conditioned on the entire set of latent variables rather than z_1 as in Figure 1a, unless of course this is the case when accounting for the (deterministic network) skip connections in Figure 1d. In either case, it needs to be clarified. Similarly, why is q(z_i^{BU} | x, z^{BU}_{<i}, z^{TD}_{>i}) dependent on x, considering Figure 1c? From the description in Section 3.2 it seems clear that the model does not explicitly use skip connections as in Figure 1d, but the deterministic path implicitly acts as a skip connection (and conditioning on x), serving as a backbone for all the layers. That being said, Figure 1 does not seem to help explain the concept behind the model.
I enjoyed reading the paper (minus the graphical model); it is well written, well motivated, and the experiments are well thought out, extensive and convincing.
Post-rebuttal: the changes to the model description and graphical model, as well as the error bars on the results, are welcome additions to the revision.
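For illustration only, a minimal runnable sketch of the bidirectional inference structure described in the reviews above: a deterministic bottom-up backbone carrying x upward, BU variables drawn on the way up, and TD variables drawn on the way down. All layer sizes and the dense/sample helpers are hypothetical stand-ins for learned networks; this is not the authors' implementation.

    # Minimal numpy sketch of bidirectional (bottom-up / top-down) inference with a
    # deterministic backbone. Layer sizes and helpers are hypothetical stand-ins.
    import numpy as np

    def dense(h, out_dim, rng):
        # stand-in for a learned layer: fixed random projection + tanh nonlinearity
        W = rng.standard_normal((h.shape[-1], out_dim)) / np.sqrt(h.shape[-1])
        return np.tanh(h @ W)

    def sample(h, dim, rng):
        # draw a Gaussian latent from mean / log-variance heads computed from h
        mu, log_var = dense(h, dim, rng), dense(h, dim, rng)
        return mu + np.exp(0.5 * log_var) * rng.standard_normal(dim)

    def infer(x, L=3, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        # bottom-up pass: the deterministic path d carries x upward and each
        # z_i^BU is sampled from it, so every layer sees x through the backbone
        d, z_bu = x, []
        for i in range(L):
            d = dense(np.concatenate([d] + z_bu[-1:], axis=-1), dim, rng)
            z_bu.append(sample(d, dim, rng))
        # top-down pass: z_i^TD is sampled conditioned on the variables above it
        # (summarized here by the running state h plus the BU variable one layer up)
        z_td, h = [None] * L, z_bu[-1]   # the top layer has a single variable
        for i in reversed(range(L - 1)):
            h = dense(np.concatenate([h, z_bu[i + 1]], axis=-1), dim, rng)
            z_td[i] = sample(h, dim, rng)
        return z_bu, z_td

    z_bu, z_td = infer(np.ones(16))
    print([z.shape for z in z_bu], [z.shape for z in z_td if z is not None])

In this sketch every latent is conditioned on x only implicitly, through the deterministic path d, which matches the reviewer's reading of Section 3.2 (the deterministic path acting as a skip connection to all layers).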