NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6171 Variational Temporal Abstraction

### Reviewer 1

Weaknesses: My doubts lie mainly in the experiments section.

1. There is not enough quantitative evaluation of the model. The authors claim that the proposed framework captures long-term temporal dependencies, which should result in a higher generative likelihood. However, the quantitative evaluation and comparisons are insufficient to back up this claim.
2. The latent space, especially at the temporal abstraction level, is not investigated enough. Since the proposed framework should be able to learn high-level hierarchical temporal structure, it would be interesting to traverse the temporal abstraction latent variable and visualize what happens. Does it encode different information from the observation abstraction, or are the two entangled? Such an investigation would provide more insight into the learned hierarchical latent space.
3. Binary indicators with Gumbel-softmax relaxation have been used in many previous works. Since the technique works and forms only part of the contribution, I do not see this as a big issue.
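
For readers unfamiliar with the relaxation mentioned in point 3, a minimal sketch follows. It is not the authors' implementation: the function name `relaxed_bernoulli_sample`, the example logits, and the temperature are assumptions chosen for illustration. It uses the standard binary form of the Gumbel-softmax trick, in which the difference of two Gumbel noises is a single logistic noise.

```python
import numpy as np


def relaxed_bernoulli_sample(logit, temperature, rng):
    """Draw relaxed binary samples in (0, 1) via the binary Gumbel-softmax trick.

    `logit` is the log-odds of the indicator being 1 (hypothetical input here).
    """
    # Logistic noise, i.e. the difference of two independent Gumbel samples.
    u = rng.uniform(low=1e-6, high=1.0 - 1e-6, size=np.shape(logit))
    logistic_noise = np.log(u) - np.log1p(-u)
    # A sigmoid with temperature replaces the hard threshold at zero.
    return 1.0 / (1.0 + np.exp(-(logit + logistic_noise) / temperature))


rng = np.random.default_rng(0)
logits = np.array([-2.0, 0.0, 2.0])  # illustrative boundary log-odds
soft_m = relaxed_bernoulli_sample(logits, temperature=0.5, rng=rng)
# Thresholding the relaxed sample recovers an exact Bernoulli(sigmoid(logit)) draw.
hard_m = (soft_m > 0.5).astype(np.float64)
```

The relaxed values are differentiable with respect to the logits, which is what makes the trick usable with backpropagation; the price is that gradients are taken through the relaxed rather than the discrete variable.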

### Reviewer 2

1) My main point of criticism is the experimental validation of the proposed model.

1.1) Sec 5.1: Bouncing balls (BBs) dataset

1.1.1) I think it is indeed good practice to test the algorithm on a simple dataset, but this version of BBs seems quite tailored to the algorithm, as the balls change color on collision. Does the segmentation still yield interpretable results without the color change?

1.1.2) There is no quantitative comparison to a reasonable baseline model (e.g., one matched for network size or similar). This would be required to convince the reader that the inference and learning algorithms are able to identify the model. It would also be good to see a sample from the baseline.

1.2) Sec 5.2: 3D Maze

1.2.1) My main quibble with this experiment is that the true segmentation is almost explicitly given by the available actions, e.g., if TURN-LEFT is executed, a new segment is allocated. This points to the basic dilemma of hierarchical reinforcement learning: if I know good high-level options (here: always follow corridors to the next intersection), then learning the right high-level state abstraction is easy, and vice versa. Learning both at the same time is hard. I would be more convinced by these experiments if the authors ran an experiment with, e.g., a model that is not conditioned on actions, to see whether the segmentations still coincide with intersections.

1.2.2) How is the baseline RSSM defined here? How much do training curves vary across runs (let alone hyperparameters)?

2) Smaller comments:

2.1) Sec 2.3, l102-l103: This prior is quite odd, as the last segment differs from the others. I don't really see the reason for this design choice, as the posterior inference does not make use of the maximum number of segments.

2.2) l135-l138: The assumption of independence of the $m_t$ under the posterior seems quite weak. In the BBs dataset without color change, for instance, it could be quite hard to determine exactly where the change point (collision) is, while we can be very certain that there is only one. This situation could not be represented well with an independent posterior.

2.3) l40-l41: Clearly there have been earlier "stochastic sequence model(s) that discover(s) the temporal abstraction structure", e.g., any semi-Markov or Markov jump process. However, I agree that this particular version with NN function approximators and amortized inference is novel and a worthwhile contribution.

2.4) The notation in eqn (1) and (2) looks a bit broken, e.g., there seems to be an $s^i$ missing on the lhs.

2.5) Below l81: This process is not exactly the same as the one from eqn (1) and (2), as here the length of the sub-sequence depends on the state via $p(m_t\vert s_t)$ and not just on $z_t$.
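
Comment 2.2 can be made concrete with a small calculation. The setup is an assumption made for illustration and is not taken from the paper: suppose the true posterior places exactly one change point, uniformly, among five candidate time steps, and the factorized posterior matches the resulting marginals $q(m_t = 1) = 1/5$. The hypothetical helper `count_distribution` below enumerates all boundary patterns under the independent model and sums the probability by number of boundaries.

```python
from itertools import product


def count_distribution(marginals):
    """Distribution over the number of boundaries under independent Bernoulli factors."""
    dist = [0.0] * (len(marginals) + 1)
    for bits in product([0, 1], repeat=len(marginals)):
        prob = 1.0
        for bit, p in zip(bits, marginals):
            prob *= p if bit else 1.0 - p
        dist[sum(bits)] += prob
    return dist


num_candidates = 5  # illustrative number of plausible collision frames
marginals = [1.0 / num_candidates] * num_candidates
boundary_count_probs = count_distribution(marginals)
# boundary_count_probs[k] is the probability of sampling exactly k boundaries.
```

Under these assumptions the independent model samples exactly one boundary with probability $(4/5)^4 \approx 0.41$ and no boundary at all with probability $(4/5)^5 \approx 0.33$, even though the assumed true posterior is certain that there is exactly one.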

### Reviewer 3

Reviewer knowledge: I am very familiar with variational inference and variational autoencoders. I followed the derivations (in the main paper and appendix) and I believe they are correct. However, I am not very familiar with the specific application to temporal data and its further use in reinforcement learning. Someone familiar with that area should perhaps comment further on connections to prior work (I skimmed the RSSM and the other cited papers for this review).

Review summary: Interesting and novel work, proposing a new temporal variational model that learns to segment data into subsequences. The experimental work shows the learned subsequences, the model's ability to predict future frames, and RL performance. My concerns are the forcing of the UPDATE function during testing, as well as the rather limited evaluation and comparison with other methods (details below). There is also more analysis that could be done, including showing the beneficial effect of modelling uncertainty.

Significance: A temporal variational model that learns sequence structure through latent variables and learned sequence segmentation. The proposed model is principled (it can be derived as a bound on the data log likelihood). The idea of learning representations for sequences is very powerful and can be used for video generation, reinforcement learning, etc. In addition, the paper proposes learning to segment the sequential data into independent chunks, which can be useful for action recognition in videos, text segmentation, or reinforcement learning. The proposed method uses the Gumbel-softmax trick to learn the binary segmentation variables. Note that since this is a biased gradient estimator, the model cannot converge to the true posterior.

Originality: I am not an expert on hierarchical sequence modelling, but from what I can tell the proposed method does introduce a novel way to learn hierarchical structure in a probabilistic fashion.
Other works either learn the structure in a deterministic fashion (Chung et al., 2016), hardcode the structure, or avoid learning the hierarchy altogether (Krishnan et al., 2017; Buesing et al., 2018a; Chung et al., 2015). The authors focus their empirical comparison on RSSM. Unlike RSSM, the proposed work learns hierarchical structure through the binary latent variables m, which are subsequence delimiters, and the latent variables z, which encode relevant information about a subsequence. The learned variables m and z are missing in the RSSM formulation. However, the RSSM authors specifically train the model for long-term prediction via latent overshooting. This is not the case in this work, which instead prohibits the use of the COPY operator (lines 113-114), forcing the model to produce a new subsequence at each time step. I am concerned that this is a less principled way to do jumpy imagination, since the model is forced to operate in a situation which it has not seen during training. What if the model would definitely not have started a new sequence at the given time point? In that case, it is forced to generalize outside of its training distribution. There is, however, an advantage to the way the jumpy navigation is implemented here, and that is efficiency. The proposed method can do jumpy imagination faster than prior work by forcing the UPDATE operation. Another similar model is VHRED, but this model also does not learn the hierarchical structure (it uses a given one instead).

Figure 1 is greatly beneficial for understanding the proposed work. It would also be beneficial to have similar figures for other methods (perhaps in the appendix), as done in Figure 2 of RSSM [Hafner et al.].

Experimental results: The experimental work shows how the model can learn to segment sequences (Figure 3).
Since the model learns uncertainty over the learned segments, q(m_t|X), it would have been nice to also see the uncertainty at which the model operates. To show the effect of sequence modelling on navigation, the next experiment shows the ability to model a maze and compares against RSSM. The authors show the effect of jumpy navigation in Figure 5. The last set of experiments shows how the learned model can be used for search in model-based RL. The results are obtained only on a simple environment (goal search navigation). Figure 8 shows that the model performs better than the RSSM baseline. Here, it would have been nice to see other baselines as well as more complex environments. Note that the model is not trained jointly with the agent (based on the algorithms provided in the appendix), but rather from experience gathered by a random agent. I am concerned that this approach will not scale to hard exploration problems, where a random agent will not be aware of large parts of the environment.

Experiments I would have wanted to see:

* Something outside the image domain (this has been done in other related works, e.g., for sequence modelling; see Chung et al., 2016 for examples).
* Experiments which exhibit the importance of stochasticity (through a direct comparison with HMRNN).
* RL experiments which are directly comparable with prior work. The goal-oriented navigation task is not present in the RSSM paper, so there is no direct baseline to compare to in order to assess whether the baseline was tuned correctly.
* Further RL experiments on standard, well-known tasks.

Nice extra experiments to have:

* Showing the generalization effect of the model when forcing the UPDATE operation during jumpy evaluation on more than one domain (beyond Figure 5).
* [Qualitative and quantitative analysis] What is the effect of the Gumbel-softmax annealing on the learned subsequence delimiters?

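
One simple way to quantify the annealing question above is to track how far relaxed boundary samples stay from {0, 1} as the temperature decreases. The sketch below is only an illustration: the geometric schedule, its endpoints (1.0 to 0.1), the zero logit, and the names `exponential_temperature` and `mean_binarization_gap` are all assumptions and do not come from the paper.

```python
import numpy as np


def exponential_temperature(step, total_steps, tau_start=1.0, tau_end=0.1):
    """Hypothetical schedule: geometric interpolation from tau_start to tau_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac


def mean_binarization_gap(logit, temperature, num_samples, rng):
    """Average distance of relaxed binary samples from the nearest of {0, 1}."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=num_samples)
    noise = np.log(u) - np.log1p(-u)  # logistic noise
    m = 1.0 / (1.0 + np.exp(-(logit + noise) / temperature))
    return float(np.mean(np.minimum(m, 1.0 - m)))


rng = np.random.default_rng(0)
total_steps = 4
gaps = []  # list of (temperature, mean gap) pairs along the schedule
for step in range(total_steps + 1):
    tau = exponential_temperature(step, total_steps)
    gaps.append((tau, mean_binarization_gap(0.0, tau, 20000, rng)))
```

A shrinking gap indicates that the relaxed delimiters become closer to hard decisions as training proceeds; reporting such a curve alongside segmentation quality would address both the qualitative and the quantitative part of the question.
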
Readability: The paper is readable and clear, but I believe certain parts of the appendix should be moved into the main paper. I am not referring to the derivations themselves, but to the overall structure of the KL term. In that derivation in the appendix, shorthand notation such as q(s_t) is introduced. I suggest that the authors keep the original notation, such as q(s_t| z_t, s_{t-1}, m_t). While more verbose, it is clearer. The equations can be split over multiple lines. Figure 6 would benefit from further description.

Reproducibility: The paper does not provide code but provides the required details in the appendix. I believe the paper is clear enough that the model could be reproduced. However, there are not many details about the provided baseline, and details of the hyperparameter tuning procedure are lacking.