NeurIPS 2020

MOReL: Model-Based Offline Reinforcement Learning

Review 1

Summary and Contributions: This paper presents an interesting approach to offline MBRL using a pessimistic MDP (P-MDP). The authors provide an approximate lower bound on the performance in the real environment of policies learned in the P-MDP. Some insights about the claims and the approach for learning the P-MDP are provided through numerical experiments. ------- Based on the authors' response and my overall evaluation, I recommend that the paper be accepted.

Strengths: I think the main strength of the paper is the theoretical results provided by the authors in Section 4.1. The result of lower bounding the performance in the environment via the learned P-MDP is novel and interesting. I also like that the authors were able to illustrate the transfer from the P-MDP to the environment.

Weaknesses: The idea of offline model learning is related to system identification approaches in control theory. However, model learning still remains difficult for high-dimensional systems, and the authors do not provide a novel fix for this issue. Instead, they use an ensemble of parameterized models to quantify uncertainty, and then use a rather ad-hoc way to define their USAD via a hyperparameter. This makes the algorithm susceptible to instability depending on the choice of this hyperparameter. It would have been nice if the authors had explored the effect of model learning on the performance of the controller. Would it be possible to demonstrate the effect of this hyperparameter on the policy? While the authors provide the intuition that the data-logging policy does affect the final policy, is there an intuition about how to collect good data from a system? Suppose we now interact with a new system: how should we log data in order to apply this method to it? When a purely random policy is used for logging, the performance of the controller seems to drop quite a lot in the proposed method. It is also not clear to me how much the policy performance depends on the choice of the method used to compute the policy. Why was NPG chosen for policy computation? Did the authors explore any other method?
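To make the concern concrete, a minimal sketch of the kind of threshold-based detector being discussed: flagging a (state, action) pair as "unknown" when the disagreement among an ensemble of learned dynamics models exceeds a threshold. The function and its arguments here are illustrative (the ensemble `models` and `threshold` stand in for the paper's learned models and the hyperparameter the review questions):

```python
import numpy as np

def usad(models, state, action, threshold):
    """Flag (s, a) as unknown when ensemble disagreement exceeds a threshold.

    `models` is a list of learned dynamics models, each mapping
    (state, action) -> predicted next state; `threshold` is the
    hand-tuned hyperparameter this review is concerned about.
    """
    preds = np.stack([m(state, action) for m in models])
    # Maximum pairwise L2 disagreement between ensemble members.
    disagreement = max(
        np.linalg.norm(preds[i] - preds[j])
        for i in range(len(preds))
        for j in range(i + 1, len(preds))
    )
    return disagreement > threshold  # True => treat (s, a) as unknown
```

The instability the review worries about follows directly: a threshold set too low marks most of the state space unknown, while one set too high lets the policy exploit model errors.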

Correctness: I like the theoretical results that the authors provide in Section 4.1, and they show, for the examples demonstrated in Section 5, that the P-MDP provides a lower bound on the performance in the environment. I think this might be the strongest part of the paper.

Clarity: The paper is very well written. The theoretical results are well described as well as the results are well explained.

Relation to Prior Work: yes, the authors have pointed out relevant differences from related work in this field.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper proposes a model-based offline RL algorithm based on tracking the uncertainty in the learned dynamics model and making uncertain states transition to a negative-reward absorbing state. It presents some theoretical analysis of performance and good results on MuJoCo-based offline RL benchmarks.

Strengths: The paper is well written with a good description of related work and a good empirical comparison with positive results.

Weaknesses: My enthusiasm for the paper is limited by the fact that these kinds of ideas have already been well explored in the historical model-based RL literature. The real contribution of the paper is mostly to carry the ideas forward to modern offline RL setups and benchmarks. On the plus side, the empirical results are good. Theorem 1 is okay but doesn't provide much novel insight: if a policy doesn't drive toward uncertain states very quickly, its value in the original and the pessimistic MDP is similar, which is as expected. Analyzing this via hitting times is interesting, though.

Correctness: The empirical results show standard deviations over 5 random seeds, which is a nice improvement over previous work that did not report them. However, the appropriate quantity to report would be the standard error. In particular, if your intent is to indicate results where you do not have the best average performance but are within the uncertainty of the evaluation (as you do), then the standard error should be used.
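To illustrate the distinction, a minimal sketch with made-up per-seed returns (these numbers are illustrative, not taken from the paper): the standard error of the mean shrinks with the number of seeds, so it is the right scale for comparing average performance across methods.

```python
import numpy as np

# Hypothetical per-seed returns over 5 random seeds (illustrative values).
returns = np.array([3100.0, 3250.0, 2980.0, 3400.0, 3150.0])

std_dev = returns.std(ddof=1)              # sample standard deviation
std_err = std_dev / np.sqrt(len(returns))  # standard error of the mean

print(f"mean = {returns.mean():.1f}, std dev = {std_dev:.1f}, std err = {std_err:.1f}")
```

With 5 seeds the standard error is the standard deviation divided by sqrt(5), i.e. roughly 2.2x smaller, which matters when deciding whether two methods' averages are within each other's uncertainty.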

Clarity: In Proposition 4, why is gamma constrained to [0.95, 1)? I'm assuming 0.95 itself is not important; you just need some specific value not too close to zero?

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: I might have missed it, but I believe you did not define D_TV before using it in Equation 2.

Review 3

Summary and Contributions: This paper presents a model-based offline reinforcement learning algorithm, MOReL. MOReL constructs a pessimistic MDP (P-MDP) using an unknown state-action detector (USAD), which heavily penalizes the 'unknown' state-action regions. Then, a policy is optimized in the constructed P-MDP. Theoretically, it is proven that MOReL is nearly minimax optimal. Empirically, MOReL is competitive with or outperforms state-of-the-art offline RL algorithms.

Strengths: Offline reinforcement learning is an important problem, which can enable RL to be applied to many real-world problems where the data-collection cost or safety is crucial. This work provides the first (deep) model-based offline RL algorithm that has a theoretical guarantee as well as strong empirical performance, which are novel contributions of the work.

Weaknesses: I didn't find a significant weakness of the paper.

Correctness: I didn't read the full proof, but the statements look sound and correct. The proposed algorithm is sound and well-backed by the theorem.

Clarity: The paper is well-written and easy to follow.

Relation to Prior Work: Overall, the contributions of the work are clearly discussed and compared to other offline RL works.

Reproducibility: Yes

Additional Feedback: Most recent offline RL algorithms rely on policy regularization, where the policy being optimized is prevented from deviating too much from the data-logging policy. In contrast, MOReL does not directly rely on the data-logging policy but instead applies pessimism within a model-based approach, providing another good direction for offline RL. Overall, I lean toward acceptance of the paper.

- MOReL constructs a pessimistic MDP in which all unknown (state, action) pairs are penalized equally. However, it would be more natural to penalize more uncertain states more heavily. For example, one classical model-based RL algorithm (MBIE-EB) constructs an optimistic MDP that rewards uncertain regions with a bonus proportional to 1/sqrt(N(s,a)), where N(s,a) is the visitation count. In contrast, but in the same spirit as MBIE-EB, one could consider a pessimistic MDP that penalizes uncertain regions with a penalty proportional to 1/sqrt(N(s,a)).
- It seems that Theorem 1 always encourages using the smallest alpha, i.e., alpha = 0. However, when alpha = 0, the USAD of Eq. (2) will always output TRUE, in which case MOReL would not work properly. How is using alpha greater than zero for the USAD justified?
- It would be great to see how sensitive the performance of the algorithm is with respect to kappa in the reward penalty and the threshold in the USAD.
- To claim SOTA performance, MOReL may have to be compared to more recent works on offline RL, such as [1] and [2].

[1] Siegel et al., Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning, 2020
[2] Lee et al., Batch Reinforcement Learning with Hyperparameter Gradients, 2020

== Post rebuttal: I have read the author response, and I still remain in favor of accepting the paper.
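The count-based alternative suggested above can be sketched minimally as follows. This is the reviewer's suggestion, not MOReL's actual mechanism (MOReL applies a fixed penalty to detected unknown pairs); `beta` is a hypothetical penalty coefficient and the state-action counts are illustrative:

```python
import math
from collections import Counter

def pessimistic_reward(r, count, beta=1.0):
    """Count-based pessimistic reward, an inverted analogue of the MBIE-EB bonus.

    Subtracts a penalty proportional to 1/sqrt(N(s, a)); `beta` is a
    hypothetical penalty coefficient (not from the paper).
    """
    return r - beta / math.sqrt(max(count, 1))

# Usage: rarely visited pairs are penalized more than well-covered ones.
counts = Counter({("s0", "a0"): 100, ("s1", "a1"): 1})
r_known = pessimistic_reward(1.0, counts[("s0", "a0")])  # small penalty
r_rare = pessimistic_reward(1.0, counts[("s1", "a1")])   # large penalty
```

Such a graded penalty would interpolate smoothly between known and unknown regions, instead of the hard binary split that MOReL's USAD induces.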