__ Summary and Contributions__: Summary:
Variational auto-encoders (VAEs) are a powerful approach to probabilistic modeling. However, they are not directly amenable to heterogeneous data. This paper proposes to adapt VAEs to these types of data. The idea is to use a two-stage procedure. First, fit a VAE to each dimension of the data. Then, capture the dependencies by fitting a new VAE to the individual latents. Interestingly, this two-step procedure optimizes a lower bound on the log marginal likelihood of the data. The proposed method is tested on image generation on five different datasets. The method is then extended to handle missing data, and this extension is tested on conditional data generation and sequential information acquisition problems. The results show the proposed method avoids the problems of VAEs in handling heterogeneous data.
Stated Contributions:
1) Proposes a new family of VAEs for heterogeneous data.
2) Extends the family proposed in 1) to handle missing data imputation, with applications to conditional data generation and sequential information acquisition.
-----
After rebuttal: I really enjoyed reading the paper. I am keeping my decision to accept this paper. Good work.

__ Strengths__: Very well motivated paper. Very well written. Nice applications for the empirical study.

__ Weaknesses__: Needing to fit D different VAEs might seem suboptimal, especially when the data is very high-dimensional. But this should not be a big problem since the different VAEs are fit to one-dimensional variables.

__ Correctness__: I didn't see any incorrect statements or flaws in the methodology.

__ Clarity__: The paper is very well written, very well motivated, and very easy to follow and understand. It was a really great read.

__ Relation to Prior Work__: Very nice related work section.

__ Reproducibility__: Yes

__ Additional Feedback__: Questions & Remarks:
1--What happens when D is very large (high-dimensional data)? Are you going to train D different "marginal VAEs", or one VAE for each modality? That seems like a lot. Have you tried fitting simpler models to each dimension instead and using the dependency network on the resulting latents?
2--It looks to me as though the way you model the latent space relates to copulas. With copula modeling you model the marginals separately and then capture all the correlation in a covariance matrix. If there is indeed such a connection, could you make it explicit?
3--The reward function R_I defined right after line 161 is the mutual information between x_i and the target, conditional on the observations. So maximizing this reward is equivalent to selecting the next point that maximizes its mutual information with the target. That's the intuition I believe.
4--Suggestion: maybe you can call your method mixVAE? I am not sure whether that name is taken already but it is catchier than VAEM.
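
To make the copula analogy in remark 2 concrete, here is a minimal sketch of a Gaussian copula over one continuous and one categorical variable. It is only an illustration of the copula idea, not the paper's model: the function names, the exponential and categorical marginals, and the correlation value are all assumptions chosen for the example.

```python
import math
import random
from statistics import NormalDist

def sample_gaussian_copula(rho, n, seed=0):
    """Draw n pairs (u1, u2): uniform marginals, dependence set by correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Correlate z2 with z1 (Cholesky factor of [[1, rho], [rho, 1]]).
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The standard normal CDF maps each coordinate to a uniform on [0, 1].
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs

def to_mixed_marginals(pairs, rate=2.0, probs=(0.2, 0.5, 0.3)):
    """Push the uniforms through inverse CDFs of hypothetical marginals."""
    samples = []
    for u1, u2 in pairs:
        u1 = min(u1, 1.0 - 1e-12)  # guard against log(0)
        x_cont = -math.log(1.0 - u1) / rate  # inverse CDF of Exponential(rate)
        # Inverse CDF of a categorical: first class whose cumulative mass exceeds u2.
        x_cat, cumulative = len(probs) - 1, 0.0
        for k, p in enumerate(probs):
            cumulative += p
            if u2 < cumulative:
                x_cat = k
                break
        samples.append((x_cont, x_cat))
    return samples

pairs = sample_gaussian_copula(rho=0.8, n=2000)
samples = to_mixed_marginals(pairs)
```

The marginals are specified separately in `to_mixed_marginals`, while all dependence lives in the single parameter `rho`; this is the separation the remark alludes to.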

__ Summary and Contributions__: - This paper proposes a new VAE-based approach (dubbed “VAEM”) to handle datasets that contain variables of different types (categorical and continuous).
- The authors show why training VAEs naively (without any adjustment to the likelihood model) can result in bad data generation.
- The authors then perform a set of experiments to show that VAEM can successfully address this issue to some extent.

__ Strengths__: I find the problem stated in this paper to be a novel one that has not received much attention. Most of the work on VAEs and other deep generative models is on images, where the community fights to improve the generative model by making the architecture or the variational distributions more complex. However, as the authors point out, many real-world datasets contain variables of different types, so I think it is valuable to design models to handle these cases.
The extensions to the VAEM model, such as the new reward estimator and training another network with eq. 8, seem to involve a significant amount of work.

__ Weaknesses__: Could the authors clarify something? According to Figure 2.b, VAEM trains a VAE for EVERY dimension of x. While the authors are correct that in most real-world datasets the data contains variables of different types, I would argue that most real-world datasets are also high-dimensional. The datasets chosen for the experiments contain no more than 20 variables. Training one VAE per dimension is certainly not feasible when the dimensionality is much higher. Are these models trained in parallel? Or do you group all variables of the same type together and train a single VAE on those (for the 1st stage)?

__ Correctness__: Yes, the objective makes sense.

__ Clarity__: This paper reads well overall.
Maybe you can just add a normal legend for Figure 4? The figure is hard to read when the legend is split across the top of all six plots.
Typo: “Optimization objective” on page 4

__ Relation to Prior Work__: I’m not very familiar with this line of work. However, two sets of models that seem relevant to me are:
1) VAEs with categorical variables [1,2,3,4]: these models also deal with continuous and categorical variables, but this is done in the latent space, which differs from the proposed model.
2) Multimodal representation learning [5,6]: The data in these settings also come in different forms, e.g. (image, caption) or (video, audio). Similar to this paper, these models also have a shared network and a network for each modality.
Could the authors comment on these?
[1] Dupont, Emilien. "Learning disentangled joint continuous and discrete representations." Advances in Neural Information Processing Systems. 2018.
[2] Esmaeili, Babak, et al. "Structured disentangled representations." The 22nd International Conference on Artificial Intelligence and Statistics. 2019.
[3] Kingma, Durk P., et al. "Semi-supervised learning with deep generative models." Advances in Neural Information Processing Systems. 2014.
[4] Siddharth, Narayanaswamy, et al. "Learning disentangled representations with semi-supervised deep generative models." Advances in Neural Information Processing Systems. 2017.
[5] Shi, Yuge, et al. "Variational mixture-of-experts autoencoders for multi-modal deep generative models." Advances in Neural Information Processing Systems. 2019.
[6] Wu, Mike, and Noah Goodman. "Multimodal generative models for scalable weakly-supervised learning." Advances in Neural Information Processing Systems. 2018.

__ Reproducibility__: Yes

__ Additional Feedback__: ========
Update
========
I thank the authors for their response. My concern regarding dimensionality has been addressed. I'm still happy to recommend this paper for acceptance.

__ Summary and Contributions__: This paper proposes VAEM, a two-level hierarchical VAE designed to handle heterogeneous data (e.g. mix of categorical and continuous features). Specifically, the first layer of VAEM consists of D univariate VAEs (which the paper calls "marginal VAEs") each modeling a single dimension of the D-dimensional input independently from the rest. The second layer (called the "dependency network") is another VAE that further models the distribution of the D latent variables from the first level.
The paper also discusses the application of VAEM to sequential active information acquisition (SAIA), the task of determining which variable should be observed next, given a partial observation, to maximize the information gain as measured by a predefined "information reward" function. Experimental results on density estimation and SAIA on heterogeneous data are provided.
---------
Post-rebuttal: I have reviewed the author feedback, and many of my concerns were addressed. I'm increasing the score to 5.

__ Strengths__: Motivation and Significance
- The paper is well-motivated, as it tackles the problem of training a VAE on a heterogeneous mix of variables, which many real-world applications involve. Much of the existing work on VAEs (and more broadly generative modeling) focuses on homogeneous data, and there is certainly a need for methods that can model heterogeneous data.

__ Weaknesses__: Formulation & Novelty
- Architecture: I am a bit concerned about the novelty of the proposed model. The only difference between a vanilla hierarchical VAE and VAEM is in the first layer, which contains an individual "marginal VAE" for each of the dimensions. This approach of having a separate network for each dimension is also similar to the architecture proposed in HI-VAE [1], as acknowledged in Section 4.
- Training procedure: Since the two layers are trained in distinct stages, this raises the question of whether the procedure leads to a suboptimal decoder and encoder upon completion, compared to joint training. Appendix A.2 has a brief discussion on this and argues that jointly training both layers may lose the "uniformity" of each latent variable z_{nd}, but I'm not sure what "uniform properties" the authors are referring to.
Empirical evaluation
- Datasets: As the highlight of the paper is VAEM's ability to model heterogeneous data, it'd be very helpful to mention what kind of heterogeneity is present in the datasets used for empirical evaluation.
- Likelihood scaling: In the introduction, the paper claims that naively training a VAE on heterogeneous data can be difficult because "the contribution that each likelihood makes to the training objective can be very different, leading to challenging optimization problems in which some data dimensions may be poorly-modeled in favor of others". This naturally brings up the question of whether careful tuning of the scaling coefficient for the likelihood function of each dimension could ease the aforementioned optimization difficulties. The "VAE-adaptive" baseline seems to be a data-dependent attempt at this, but I'm not convinced that a single minibatch is sufficient for computing the coefficients for each data type (as described in Appendix C.1.2). In particular, it'd be interesting to see if VAEM would outperform a (possibly hierarchical) VAE with more carefully tuned scaling factors for each dimension to rule out the possibility that the poor performance of vanilla VAE baselines is simply due to hyperparameter tuning.
- Ablation study: To empirically verify that the proposed approach indeed improves performance of other VAE models, it would have been nice to see an ablation test on a few different architectures. For example, a more powerful VAE model (e.g. Ladder VAE) could be used to model the data with or without the first layer containing marginal VAEs.
- Choice of evaluation task: It's unclear why SAIA was chosen as an evaluation task when the focus is on handling the heterogeneity in data. Could the authors elaborate further on the motivation for using SAIA?
[1] Nazabal, Alfredo, et al. "Handling incomplete heterogeneous data using vaes." Pattern Recognition (2020): 107501.
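
The likelihood-scaling concern above can be illustrated with a small numerical sketch. It compares the magnitude of a Gaussian log-density term with that of a categorical log-probability term and then rescales each dimension by a coefficient estimated from a minibatch. The rescaling rule used here (inverse mean absolute log-likelihood) is a hypothetical heuristic chosen for illustration; it is not claimed to be the paper's "VAE-adaptive" baseline, and the toy numbers are assumptions.

```python
import math

def gaussian_log_density(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def categorical_log_prob(k, probs):
    """Log probability of class k under a categorical distribution."""
    return math.log(probs[k])

def minibatch_scales(per_dim_log_liks):
    """Mean absolute log-likelihood of each dimension over a minibatch."""
    n = len(per_dim_log_liks)
    dims = len(per_dim_log_liks[0])
    return [sum(abs(row[d]) for row in per_dim_log_liks) / n for d in range(dims)]

# Toy minibatch: dimension 0 is continuous with a small decoder std,
# dimension 1 is categorical with a fairly confident prediction.
batch = [
    (gaussian_log_density(0.05, 0.0, 0.01), categorical_log_prob(0, (0.9, 0.1))),
    (gaussian_log_density(-0.04, 0.0, 0.01), categorical_log_prob(1, (0.3, 0.7))),
]

scales = minibatch_scales(batch)      # the continuous term dominates in magnitude
weights = [1.0 / s for s in scales]   # hypothetical per-dimension coefficients
rescaled = [[w * t for w, t in zip(weights, row)] for row in batch]
```

With only two rows, the estimated `scales` depend heavily on which examples happen to be in the minibatch, which is the worry raised about computing the coefficients from a single minibatch.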

__ Correctness__: Experiment setup
- Baseline architecture: The comparison made between VAEM and vanilla VAE does not seem fair, given that VAEM has two layers of hierarchical latent variables whereas both VAE and VAE-extended have a single layer of latent variable. It would be nice to see how VAEM compares to a two-layer hierarchical VAE with matching latent dimensions.

__ Clarity__: The overall writing quality of the paper is good, although there are a few typos throughout the paper. These typos are very minor, however, and do not obscure the exposition.

__ Relation to Prior Work__: The paper properly discusses how it differs from existing work.

__ Reproducibility__: Yes

__ Additional Feedback__: - In Table 1, why are some negative log-likelihood values negative? For discrete variables (i.e. categorical and ordinal), the corresponding marginal probability mass should be at most 1. So the only way for the NLL to be negative would be if the model assigned a very high density on some values for a continuous variable (which is definitely possible when the model overfits, but also possible even when it doesn't). This may not be an issue, just wanted to double check that these numbers made sense.
- For the VampPrior used in the models, how were the pseudo-inputs u_k chosen? Specifically, the last sentence of Section 2.2 (line 102) states that they are a subset of data points. But how were those data points chosen?
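
The point about negative NLL values in the first bullet can be checked numerically. The sketch below is not tied to the paper's models; the Gaussian standard deviations and the categorical probabilities are arbitrary assumptions. It shows that a density can exceed 1 for a continuous variable, so its negative log is negative, whereas a probability mass never exceeds 1.

```python
import math
from statistics import NormalDist

def nll_continuous(x, mean, std):
    """Negative log density of x under a univariate Gaussian."""
    return -math.log(NormalDist(mean, std).pdf(x))

def nll_discrete(k, probs):
    """Negative log probability mass of class k."""
    return -math.log(probs[k])

sharp = nll_continuous(0.0, 0.0, 0.01)     # narrow Gaussian: density at the mean is above 1
broad = nll_continuous(0.0, 0.0, 1.0)      # unit Gaussian: density at the mean is below 1
discrete = nll_discrete(0, (0.99, 0.01))   # probability mass is at most 1

print(sharp < 0.0, broad > 0.0, discrete >= 0.0)
```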

__ Summary and Contributions__: The authors propose a deep generative model to handle heterogeneous data (as in different likelihood models) that uses a two-stage procedure: first train a series of standard VAEs, one for each univariate marginal, in parallel, then combine them by training a global VAE on their latent variables. In this way the first stage just performs Gaussianization of the marginals, delivering a homogeneous space for the global, second VAE to perform (approximate) density estimation.

__ Strengths__: The major strength of the paper is providing evidence that the two-stage approach in VAEs can help alleviate the disproportionate contributions of the different likelihood models.

__ Weaknesses__: One weak point is the presentation: the first-stage idea is not really dependent on VAEs and can be realized in several ways, for instance the one in [1], which provides a more principled way to deal with heterogeneous likelihood models. In fact, the current first/local stage is just a means of Gaussianization, but it is agnostic to the statistical data type of the input distribution and assumes inputs are only Reals with infinite support or Categoricals.
[1] - Valera, Isabel, and Zoubin Ghahramani. "Automatic discovery of the statistical types of variables in a dataset." International Conference on Machine Learning. 2017.

__ Correctness__: The claims and derivations I checked are correct. More discussion about related approaches should be carried out.
On the motivational side, it is not clear why one should adopt a VAE, which is very limited in performing inference (MAP, marginals, etc.) compared to other tractable alternatives. Indeed, the proposed two-stage VAE reuses the same heuristics as the Partial VAE for handling missing values, which provide no guarantees on inference. I would say the advantage of using a VAE could lie in reusing the latent embeddings, but this does not seem to be the case for the paper.
Concerning experiments, see more comments below.

__ Clarity__: The paper is well-written, clear and easy to follow.

__ Relation to Prior Work__: The paper correctly points to Gaussianization for the role of the first-stage VAEs. I believe the overall presentation would benefit from a deeper discussion on the topic, showing how one does not necessarily need a VAE for transforming single variables into Gaussians. A Flow could do the job, or even the transformations in [1].
Furthermore, if one used deterministic transformations, only density estimation in the joint latent space would be required [2, 3] (or [4], if one only applies the procedure in a two-stage fashion).
[2] Kumar, Abhishek, Ben Poole, and Kevin Murphy. "Regularized autoencoders via relaxed injective probability flow." arXiv preprint arXiv:2002.08927 (2020).
[3] Brehmer, Johann, and Kyle Cranmer. "Flows for simultaneous manifold learning and density estimation." arXiv preprint arXiv:2003.13913 (2020).
[4] Ghosh, Partha, et al. "From variational to deterministic autoencoders." arXiv preprint arXiv:1903.12436 (2019).
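
As an illustration of the remark that a VAE is not strictly needed to Gaussianize a single variable, here is a minimal sketch using a rank-based transform (empirical CDF followed by the inverse standard normal CDF). This is a deliberately simple stand-in, not a flow and not the transformations of [1]; the function name and the toy data are assumptions.

```python
from statistics import NormalDist

def gaussianize(values):
    """Map a 1-D sample to approximately standard normal scores via its ranks."""
    n = len(values)
    # Indices sorted by value; ties keep their original order (stable sort).
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order):
        u = (rank + 0.5) / n               # empirical CDF, kept strictly inside (0, 1)
        scores[i] = std_normal.inv_cdf(u)  # inverse standard normal CDF
    return scores

# Skewed, positive toy data (hypothetical values).
values = [0.1, 5.0, 0.3, 2.2, 0.05, 9.7, 1.1]
scores = gaussianize(values)
```

The transform is monotone, so the ordering of the data is preserved. Unlike a marginal VAE it is deterministic, so in this setting only density estimation over the joint space of scores would remain.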

__ Reproducibility__: Yes

__ Additional Feedback__: The assumption of categorical data for discrete variables is a limitation. What happens when the discrete feature comes from a distribution with a larger (even infinite, e.g. Poisson) support than the one observed during training?
Are integer/count data in the UCI datasets considered to be categorical or real?
Why not consider a variant where the two stages are trained end-to-end? I can imagine that this can be highly unstable, but it would be a useful baseline in the experiments.
Likelihoods can be highly misleading w.r.t. sample quality when dealing with continuous data, even more so in the heterogeneous setting. What about checking the quality of generated samples with a statistical test? Or even missing-value imputation as in [1]? This is partially done in the active learning scenario, but the effect of performing MAP inference correctly is harder to disentangle from the effect of the online learning.
--------- UPDATE
I thank the authors for their answers. I believe the paper is worthy of acceptance, and the additional work on presentation can be done in the camera-ready version.