__ Summary and Contributions__: This paper proposes a novel VAE-based stochastic video generation model that decomposes the latent space into object-specific latent variables that encode appearance, pose, and missingness (i.e., whether the object is present in the frame).
The authors compare their approach with a variety of other methods on the Moving MNIST dataset and a grayscale/binarized version of the MOTS dataset. They demonstrate that their method accurately infers the binary missingness variable and effectively predicts future frames under different settings.

__ Strengths__: This paper is well written and well motivated, for the most part. The model is a sensible extension of previous approaches and performs adequately on simple datasets.

__ Weaknesses__: This paper would significantly benefit from more extensive experimental evaluation and analysis. Some standard video prediction datasets are missing from the comparison, and there are many ablation studies / analyses of the different model components that I'd encourage the authors to consider. (More detailed recommendations below.)

__ Correctness__: Yes.

__ Clarity__: Easy to follow, well written.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: - The authors refer to DrNet as a SOTA stochastic video generation method. However, DrNet is not a stochastic video prediction method, making it a somewhat inappropriate comparison for the stochastic dataset.
- The authors say "we enforce separate representations for each object." -- could you specify how this is done? Is the number of objects specified for the model a priori?
- Given that the key novelty in this method arises from the missingness variable, I would suggest exploring datasets (natural or artificial) that better test the limits / effectiveness of this new component. For example, a dataset in which one object occludes another would strengthen the paper.
- It would be good to see an ablation study where the exact same model architecture *without* the missingness variable is compared. More generally, this model has several different components and additional analysis of the strengths / importance of each of these components would strengthen the paper.
- The ability of this model to track multiple objects is a strength; however, this currently feels underexplored. How many objects can the model handle, and at what point does it break down? It would also be good to evaluate the method on a well-used video prediction dataset with multiple objects (but not necessarily occlusion) to see if it performs as well as (or better than) current SOTA approaches. For example, the BAIR pushing dataset, while having few moments of occlusion, does have multiple moving objects and is a well-used dataset, and a comparison on it could strengthen the results.

__ Summary and Contributions__: This paper presents a deep generative model of video sequences
that decomposes the latent representations of videos into factors while accounting
for the ability to reason about objects that are missing in videos or occluded.
The key idea of the DIVE model proposed herein is to:
1) factorize the representation (of each frame) into appearance, pose and missingness
2) impute data when missing
3) use the model for video prediction by modeling the static and dynamic objects separately
A bidirectional LSTM is used to encode each frame into a representation. A separate representation
is inferred for each object.
Then, a univariate Normal distribution is used to decide whether the object is missing or
present: a sample from this distribution is passed through a step function to obtain a binary
indicator of whether the object is missing.
If the object is missing, its hidden state is set to either the current representation
or the previous representation of the object, depending on a hyperparameter.
This "missingness corrected" representation is passed through an LSTM to obtain the pose representation.
Time-varying dynamic and static representations of the appearance are also inferred.
Using these disentangled representations of each object in each frame, two different LSTMs are used
to parameterize the reconstruction and the prediction of future frames. The decoding process uses a spatial transformer and the inferred pose variable to rescale and place the object into the frame.
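The missingness step in this summary could be sketched roughly as follows. This is a hypothetical illustration, not the authors' code: the FC weights `w`, `b` and the unit variance of the Normal are assumptions made only for the sketch.

```python
import numpy as np

def infer_missingness(h_y, w, b, rng):
    """Hypothetical sketch of the missingness inference described above:
    an FC layer maps the object's hidden state to the mean of a
    univariate Normal, and a sample from that distribution is passed
    through a step function to obtain a binary missingness indicator."""
    mu = h_y @ w + b                   # FC(h_y): mean of the Normal
    sample = rng.normal(mu, 1.0)       # draw from N(mu, 1); unit variance assumed
    return (sample > 0).astype(float)  # step function -> binary indicator

# Toy usage with a strongly positive mean, so the indicator is 1.
rng = np.random.default_rng(0)
m = infer_missingness(np.ones(4), np.full((4, 1), 10.0), 0.0, rng)
```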

__ Strengths__: I think the idea of explicitly modeling the missingness process is an important one
which this work makes use of to good effect. The neural architecture here is designed to make use of fine-grained knowledge of video semantics and consequently, the model compares well against several baselines with good experimental results (particularly, those in Figure 3/4/5).

__ Weaknesses__:
I have a lot of unanswered questions about how this model was trained, as well as about which hyperparameters the learning algorithm was sensitive to.

__ Correctness__:
This is a little difficult to ascertain. There is a gap in this manuscript regarding details on how the model is trained.
Line 153 says that the model maximizes a variational lower bound
and that details on this were in the supplement, but I could not find them. This is the sort of detail
that should be in the main paper. Is there a prior distribution? If so, what is it?
What kind of distribution is the one in Equation (5)? Does it represent a variational distribution? I am assuming the use of "q" indicates a variational approximation, but this is not evident from what is written in the paper. This detail is relevant to understanding how the KL divergence (or entropy) term that typically appears in the variational bound is evaluated.

__ Clarity__:
* DIVE is presented as a deep generative model. However, I found the translation of the writing in Section 3.1-3.3 into a generative process quite difficult.
* It is not clear how to backpropagate through FC(h_{iy}) in equation (4), since the Heaviside step function is not differentiable. Did this pose an issue, and if so, how was it handled?
* How is p set in practice (equation (2))?
* Line 140 talks about covariate shift but this is never elaborated on, where does the covariate shift come from?
* In the synthetic examples and datasets, it is possible to know N (the number of objects) in a video. What happens when you do not know this number a priori?
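On the Heaviside point above: one common workaround (an assumption about what might have been done, not a claim about this paper) is a straight-through estimator, where the forward pass uses the hard step but gradients flow through an identity path. A framework-agnostic sketch, with `stop_gradient` as a placeholder for e.g. `Tensor.detach` in an autograd framework:

```python
def heaviside_ste(x, stop_gradient=lambda t: t):
    """Straight-through estimator sketch: the returned value equals
    step(x), but because the hard part is wrapped in stop_gradient,
    the gradient w.r.t. x would be that of the identity path.
    `stop_gradient` is a placeholder (identity by default)."""
    step = 1.0 if x > 0 else 0.0
    return x + stop_gradient(step - x)
```

With the identity placeholder, `heaviside_ste(0.7)` returns 1.0 and `heaviside_ste(-0.3)` returns 0.0, matching the step function's forward pass.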

__ Relation to Prior Work__: Yes, there is a discussion of prior work.

__ Reproducibility__: No

__ Additional Feedback__: ** Post Discussion: Please incorporate all the writing changes and the additional experiments to the manuscript.

__ Summary and Contributions__: The paper introduces a novel method for learning a video representation that disentangles pose, missingness, and appearance.
The novelty lies in the missingness latent variable, which is used to potentially impute the pose and appearance variables. Dynamic appearance is also introduced.

__ Strengths__: The method is mathematically sound, and the experimental results seem to show that the proposed approach outperforms previous works.

__ Weaknesses__: The main weakness lies in the evaluation.
There is no ablation study.
The model is more complicated than DDPAE and, without an ablation study, it's hard to tell if the method is better because of the better disentanglement or just because the architecture is bigger.
I would recommend at least the following experiments:
- measure the impact of dynamic appearance vs. static appearance: with an LSTM, with a short time window, using a constant appearance
- the overall same model but without the imputation described in 3.1
- what is the impact of the mixture in Eq. 2? What happens with different values of gamma?
In Figs. 4 and 5, if we compare the results of DDPAE with the results reported in the DDPAE paper, the images are much worse. Why?
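For the second of the suggested ablations, the imputation being toggled is roughly of the following form. This is a hedged sketch only; the exact expression in Eq. 2 and the precise roles of gamma and p are not reproduced from the paper.

```python
import numpy as np

def impute(h_t, h_prev, missing, gamma=0.5):
    """Hypothetical imputation sketch: when the object is flagged
    missing, fall back to a gamma-weighted mix of the previous and
    current representations; otherwise keep the current one.
    `gamma` stands in for the mixture weight discussed above."""
    if not missing:
        return h_t
    return gamma * h_prev + (1.0 - gamma) * h_t
```

The proposed ablation would compare this against a variant that always returns `h_t`, i.e. no imputation.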

__ Correctness__: Everything seems correct.

__ Clarity__: Overall, the paper is clear.
Figures 3, 4, and 5 are not very clean: there are black rectangles around the text.

__ Relation to Prior Work__: The presentation of related works is quite clear.

__ Reproducibility__: Yes

__ Additional Feedback__: The Heaviside step function is not differentiable. I don't see how backpropagation is performed.
*****Post rebuttal******
I thank the authors for their very clear answers. They answered all my points. I raise my rating.

__ Summary and Contributions__: The paper deals with video prediction (on 2-digit Moving MNIST and MOTS data) in scenarios with "missing" data: 1) occluded pixels, 2) missing digits, 3) missing frames and deformed objects. It builds on top of the model presented in DDPAE [14] by including additional latent variables to account for and compensate for missing data. It shows the effectiveness of the model compared to previous variants in the above scenarios of missing data.

__ Strengths__: The paper presents an effective way of tackling missing information, by referencing previous works that have done the same but perhaps not in the context of video prediction. The overall model is soundly designed to handle missing information, by both reconstructing the missing information in the input video as well as predicting for future frames. It is justified that the handling of missing data happens in the latent space rather than pixel space, and for the case of video most of the proposed ideas (distributions of latent variables, their connections, etc.) seem appropriate. The explanations of the method are quite clear and easy enough to understand.

__ Weaknesses__: The proposed model was built on top of a model (DDPAE [14]) that was designed for and tested on simplistic datasets of Moving MNIST and Bouncing Balls. Hence, it is very effective in the simpler case of well-defined individual components on a dark background. While it is encouraging that the model was able to achieve good results in this setting, it remains to be seen how well it can perform on more complex datasets, such as those with natural images. The paper presents some motivating results on the MOTS dataset to address this very concern; however, the method has the potential to work in more complex scenarios.
In 3.2 Missingness Inference, the use of a normal distribution for sampling before the Heaviside step function is not quite justified.

__ Correctness__: The claims of the paper are quite justified so far as the experiments designed are concerned. Three scenarios are proposed to check the missing value imputation the paper has introduced, and in all three the paper's method seems to work better than those methods that have not actively compensated for missing information.

__ Clarity__: The paper is very well written, the explanation of the method is quite clear, the figures are very helpful.

__ Relation to Prior Work__: Yes, it is clearly discussed how this paper is different from previous works, and it is mentioned that this paper builds on top of DDPAE [14].

__ Reproducibility__: Yes

__ Additional Feedback__: ------------------------
Here are my impressions from the author feedback:
1) I agree with R1 that DrNet should not have been mentioned as a stochastic method. Perhaps it is best to mention DrNet as a good deterministic model to compare with (notwithstanding the hierarchical deterministic methods that came after it). The authors have addressed this in their feedback.
2) R1:“we enforce separate representations for each object.” In lines 80-81 it is mentioned that “we assume that each video has a maximum number of N objects”. In the code, the option `n_components` seems to describe this max number of objects, which is used while initializing all the model priors in models/DiveModel.py. I agree with R2 that this is a convenient option though, so an ablation study on the number of objects that can be handled is useful. However, it is encouraging to see that the redundant objects are learned to be empty.
3) I agree with the authors that DDPAE can be considered the “exact model without missingness”. They provided the required experiment in the feedback.
4) The authors have clarified the issue with differentiating through the Heaviside function, as well as the nature of the prior distributions. Since they followed DDPAE, I assumed that they used the same prior distributions as in DDPAE.
5) It is important to note the values of `p`, which the authors have clarified in their feedback. It should be included in the final draft.
6) I believe the authors have sufficiently answered all of R3’s clarification points in their feedback.
7) I agree with the other reviewers that the method has the potential to be used effectively on more complex datasets, such as KTH at the least. However, this does not take away from the introduced novelty or effectiveness of the method. I believe it has been sufficiently captured for the purpose of this submission.
For now, I will keep my (top) rating as is.