Review for NeurIPS paper: Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model

NeurIPS 2020

Review 1

Summary and Contributions: The paper presents a method for estimating the underlying POMDP of an RL problem and subsequently solving it with a parametrised policy. The model of the POMDP is used to obtain a state estimator/filter, which can then be used to train both a critic and an actor with samples from the filter as inputs, resulting in a model-free algorithm. The method is heavily based on the family of sequential latent variable models trained with variational inference.

Strengths: The paper is solid. The writing is done well and the method is theoretically sound. The results are convincing. The ablation studies are tremendously useful for practitioners and have helped me drive design decisions.

Weaknesses:
- The paper's narrative is based around POMDPs, but the experimental evaluation does not really stress the capability of the method in that respect. Evaluation is done on pixel-based control, which is PO of course, but we know that a lagged observation of a few time steps can make the state fully observable quickly (see the appendix of [1]). Hence, we do not know how the method fares in environments where the state uncertainty has to be actively reduced by the agent. Therefore I think the paper overstates the results. It is easy to get out of this, however, since one can just drop the POMDP claim.
- The justification of the overall approach could have been improved. For me personally (and the optimal control community) it is obvious that we want some kind of state estimation when we use control, as most, if not all, practical problems are PO. But the paper could have done a much better job at its justification. E.g. a very noisy sensor that requires waiting a few time steps to correctly estimate a quantity makes such approaches necessary. The authors suffer from the fact that the RL community is somewhat focused on Mujoco-like benchmarks, which are representative of only a very small fraction of practical optimal control problems. But the authors could have chosen to use a different suite of environments, such as EscapeRoomba or MountainHike, which would illustrate this. If the authors had chosen to conduct experiments that tackle much more relevant POMDP problems, I'd have given an increased score.
- I would have enjoyed an ablation on whether AISOC/MaxEnt is necessary.

[1] **CURL: Contrastive Unsupervised Representations for Reinforcement Learning** Michael Laskin*, Aravind Srinivas*, Pieter Abbeel. Thirty-seventh International Conference on Machine Learning (ICML), 2020.
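The point about lagged observations can be illustrated with a minimal sketch. It is not taken from the paper or from [1]; the names `FrameStack` and `estimate_velocity` are hypothetical, and the one-dimensional point mass is an assumed toy setting. A single position reading hides the velocity, but two consecutive readings already determine it, so no active uncertainty reduction is needed.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Hypothetical helper: keeps the k most recent observations."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the buffer with copies of the first observation.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(np.asarray(first_obs, dtype=float))
        return self.stacked()

    def push(self, obs):
        # Append the newest observation; the oldest one is dropped.
        self.frames.append(np.asarray(obs, dtype=float))
        return self.stacked()

    def stacked(self):
        # Concatenate oldest-to-newest along the first axis.
        return np.concatenate(list(self.frames), axis=0)


def estimate_velocity(stacked_positions, dt):
    """Finite-difference velocity from the two most recent positions."""
    return (stacked_positions[-1] - stacked_positions[-2]) / dt


# Toy usage: position-only observations of a point mass, sampled every dt.
stack = FrameStack(k=2)
stack.reset(np.array([0.0]))
obs = stack.push(np.array([0.5]))
velocity = estimate_velocity(obs, dt=0.1)
```

In environments such as those the reviewer suggests, a short window of this kind would presumably not suffice, which is the distinction the comment draws.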

Correctness: The paper claims SOTA at various points. I know this was the case during the first submissions of this paper and at the time of writing, but right now I think one cannot ignore [1, 2, 3]. I feel sorry for the authors, since this is only due to publications in the meantime, but as of now the claims in the paper are wrong. The manuscript has to correct this, as it clearly stands in the way of publication. (I am willing to increase my score radically if this point is addressed. I don't think SOTA results are very relevant for the publication; I think factual correctness is.)

[1] **CURL: Contrastive Unsupervised Representations for Reinforcement Learning** Michael Laskin*, Aravind Srinivas*, Pieter Abbeel. Thirty-seventh International Conference on Machine Learning (ICML), 2020.
[2] **Reinforcement Learning with Augmented Data** Michael Laskin*, Kimin Lee*, Adam Stooke, Lerrel Pinto, Pieter Abbeel, Aravind Srinivas
[3] Kostrikov et al. [Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels](https://arxiv.org/abs/2004.13649). arXiv 2020.

Clarity: Yes, very well.

Relation to Prior Work: [4] is missing from the related work. It was one of the first works (if not the first) to use an amortized variational sequence model in a control context. [4] Karl, Maximilian, et al. "Unsupervised real-time control through variational empowerment." arXiv preprint arXiv:1710.05101 (2017).

Reproducibility: Yes

Additional Feedback: It feels as if the work was merely resubmitted. I think it could be improved in the light of recent findings, and would still be relevant. E.g. an experimental evaluation on environments that [1, 2, 3] clearly cannot solve would greatly increase the relevance of SLAC.

Review 2

Summary and Contributions: The authors propose learning a (generative) latent variable model alongside the model-free Q-based approach. The authors show that this leads to improved performance on standard robotic benchmarks.

Strengths:
- The writing is very good, including the theoretical work and the description of the experiments.
- The experiments, and especially the ablation studies, make sense given the research question.

Weaknesses:
- A comparison to E2C would have been nice.
- See my general comments under Additional Feedback about separating the policy learning from the embedding / state transition learning.
- The results of the proposed method appear a bit unstable, e.g. for Walker or Hopper.
- From the results on Cheetah or Ant-v2, I get the impression that the results can diverge and that "early stopping" may be necessary. Why might this be?

Correctness: yes

Clarity: yes

Relation to Prior Work: The authors did a great job covering related work

Reproducibility: Yes

Additional Feedback: I have some doubts about the idea of separating the RL task from the 'learning an embedding and model' part. Isn't one of the main advantages of model-free RL over model-based RL the (for lack of a better word) task efficiency or attention mechanism, in the sense that only those parts of the system behavior are learned that are useful in solving the task (compared to model-based RL, which tries to learn the whole dynamics)? As it is right now, the latent variable model (the embedding + state transition model) does not receive any feedback about how its learning benefits the general task. Isn't the "correct" latent representation task dependent?

Review 3

Summary and Contributions: This paper proposes an approach for solving POMDPs with rich / high-dimensional observations, e.g., images. The approach frames the policy optimization problem as an inference problem whose goal is to maximize the joint likelihood of the optimal policy and the observations, over an explicit latent state model and a policy model. This directly draws inspiration from structured variational inference and RL as probabilistic inference. The approach is instantiated and evaluated on continuous control tasks, and the results show that the new method outperforms some popular model-free and model-based RL algorithms.

Strengths: The new method achieves convincingly good results. The experimental results are comprehensive. Besides showing that the new method achieves superior or similar performance compared to some popular RL algorithms, the authors conducted an ablation analysis, which is very helpful for understanding the effect of the design choices. The paper contains some useful lessons learned, which provide interesting insights for making design choices in the future, e.g., Line 228, Line 320.

Weaknesses: Algorithmic contribution: The contribution in terms of the algorithm is limited. The two main components of the algorithm can be drawn from existing methods in a straightforward way: the actor-critic part (Line 201) seems to be the same as Levine [40], except that the state becomes the latent state and the state-dependent policy becomes an observation-dependent policy. And the latent variable model part (Line 190) has been proposed before.

Overly optimistic / risk seeking: As mentioned in Levine [40], treating both the policy and the transition model as variational parameters can lead to risk-seeking behavior. The authors argue that having policies that depend on the history of observations instead of the latent state can mitigate this issue. I am not fully convinced that this is the case, because a latent dynamics model and a latent representation are learned anyway, and furthermore the goal is to maximize p(x, O | a). In addition, the experimental results (Figure (b)) seem not to support the authors' claim.

Correctness: I think the empirical methodology is correct. However, I have some questions with respect to the method.
1. I am a bit confused by Algorithm 1. What is the difference between t and \tau? What do \theta_1 and \theta_2 represent?
2. In Equation (9), on the left, V_\theta(z_{\tau+1}) is a function of z_{\tau+1}. However, the right-hand side also depends on x_{1:{\tau+1}}.
3. The likelihood in Line 178 is for a single \tau. Are the likelihoods for different \tau summed up for optimization?

Clarity: Yes, overall. However, the layout seems overly tight; a looser layout would make reading easier. Furthermore, I think some potentially useful information is missing, e.g., in Eq. (5), is p(z_{t+1} | z_t, a_t) the true unknown transition probability or a model? During pre-training, can the true state be accessed? Is that the "supervision signal"?

Relation to Prior Work: Yes, except that the difference between the new method and Hafner et al. [26] in Line 87 is not well addressed.

Reproducibility: Yes

Additional Feedback: == After Rebuttal == I appreciate the authors' thoughtful response and additional experimental results. However, I am still not fully convinced that the proposed method will not suffer from risk seeking, since in each iteration the model is updated (Eq. 7) and is used to forecast in order to update the policy (Eq. 10), if I am not mistaken. Furthermore, as pointed out by R1 and acknowledged by the authors, the POMDP narrative lacks support. But I think this paper does have interesting metrics. Good luck with the paper.

Review 4

Summary and Contributions: The authors propose a new actor-critic method where the critic operates on the learned latent space but the policy still operates on the state space by taking in past history (though the policy is trained using this "latent space critic"). This latent representation is learned by training a fully stochastic sequential latent model with a VAE loss for maximizing observation likelihoods. The method shows very good performance on standard benchmark pixel-based Mujoco environments.

Strengths: The experimental results shown in this paper (Figure 4: HalfCheetah, Walker2d, Hopper, Ant) are quite strong compared to other recent methods in this domain, such as PlaNet. The paper also includes a neat derivation of the proposed SLAC objective using the control-as-inference framework for POMDP settings.

Weaknesses: Since I don't have expertise in this area and particular task domain, I cannot comment on the novelty of the method.

Correctness: Yes. The empirical methodology is correct. I appreciate the ablation studies on key design choices (latent vs. history conditioning for actor and critic) in Figure 9 of the Appendix. They show that using the history (instead of the latent vector) performs competitively with the latent actor, with the added benefit of fast test-time policy deployment.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: ---- Post rebuttal ---- I've read the rebuttals. Thank you for the added clarifications and the positioning of this paper (e.g. not claiming SOTA as the main contribution, and that the true state of the environment is never available). The empirical results shown in the paper are still quite strong.