Review for NeurIPS paper: A Unifying View of Optimism in Episodic Reinforcement Learning

NeurIPS 2020

A Unifying View of Optimism in Episodic Reinforcement Learning

Review 1

Summary and Contributions: This theoretical paper uses duality to show that model optimistic algorithms have a much more practical value optimistic analogue. While calculating the value optimistic exploration bonuses turns out to be difficult as well, a much easier upper bound can be used to derive an approximation whose regret bound is within a constant of the original algorithm.

Strengths: Model optimistic and value optimistic algorithms are considered to be fundamentally different. By showing that one can approximately reduce one to the other, this paper helps unify these approaches. It is possible that these results could help us design algorithms that are of practical use and retain some theoretical guarantees, or at least allow us to advance the field without further complicating analysis.

Weaknesses: The main paper and appendix total 34 pages. It's not clear that this makes sense for a conference paper. It would have been better to focus on the tabular case in this paper and defer results for factored linear MDPs to a followup. That way the paper could have been restricted to a more reasonable length, and essential results could have been included in the main body of the paper. The contribution of this paper is purely theoretical. Neither the original model optimistic algorithms, nor the easier to implement value optimistic duals are mature enough to be used in a real application. While a step in the right direction, I expect the audience of this paper to be limited to a handful of authors who may build upon these results. The motivation provided in the paper is that model optimistic algorithms are easier to analyze, while value optimistic algorithms are easier to implement. This may be more a reflection of the authors' background, rather than a universal truth. Some authors find thinking in terms of value functions more intuitive. Nevertheless the paper should prove useful to authors from both backgrounds.

Correctness: While the claims in the main body of the paper seem reasonable, I did not carefully go through the proofs in the 23 page appendix, so I cannot confidently confirm correctness.

Clarity: The paper is well written and organized.

Relation to Prior Work: The paper does a good job relating its results to the state of the art, as well as rederive some recent results using the proposed framework.

Reproducibility: Yes

Additional Feedback: Reference 42 is missing venue and year. Post-rebuttal: Thank you for your responses. After some discussion I do understand R2's concern about the applicability of the duality result, though I find it to be of interest in and of itself. It would be great if you could provide some suggestions on how the duality result could be used to improve the theory or practice of exploration in RL.

Review 2

Summary and Contributions: This paper uses Lagrangian duality to show that every existing model-optimistic algorithm has an equivalent representation as a value-optimistic algorithm, which is computationally efficient. Based on this connection, this paper derives a class of algorithms that are modifications of existing model-optimistic algorithms but can be implemented by DP. This paper further studies the linear function approximation setting and obtains matching bounds as in previous works.

Strengths: 1. The duality between model-optimistic algorithms and value-optimistic algorithms is an interesting finding. 2. This paper is very well-written.

Weaknesses: While I like the duality result, I find this paper is not substantial enough that merits acceptance. This paper shows a class of model-optimistic algorithms can be implemented efficiently (with minor modifications). However, none of state-of-the-art algorithms is model-optimistic algorithms. This is somehow inherent with this class of algorithms because the transition model scales with $S^2$ but the optimal bounds scale linearly in $S$ via value-optimistic algorithms. Value-optimisic algorithms are not only more computationally efficient but also more statistically efficient. So making model-optimistic algorithms more efficient is not a very significant result. I would be happy to recommend acceptance if this paper contains any of the following results either 1) A substantially simpler analysis of existing SOTA algorithms, like UCB-VI-Bernstein, EULER (Zanette et al. 2019), ORLC (Dann et al. 2019), UCB-Q-Bernstein the framework in this paper. 2) A new model-optimistic-inspired algorithm which achieves SOTA sample complexity ($H\sqrt{SAT}$). Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds Andrea Zanette, Emma Brunskill Policy Certificates: Towards Accountable Reinforcement Learning Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill

Correctness: Yes.

Clarity: Yes. This paper explains this its idea very well.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I've read the rebuttal and decided to keep my score. The response does not address my concern about lack of substantial application of the duality result. --------------------------------------------------------------- Minor comments: 1. line 86-86 missing summations. 2. UCB-VI only matches the lower bound in the regime $SA \ge H$ and $T$ is sufficiently large. EULER and ORLC match the lower bound and do not require $SA \ge H$, though they still require $T$ sufficiently large. 3. There are some typos in references, e.g., [42] only has a page number.

Review 3

Summary and Contributions: The paper proves that any model-optimistic algorithm can be transformed to an equivalent value-optimistic algorithm. This enables cleaner theoretical analysis of an optimistic algorithm through its model-optimistic formulation while allowing easier implementation through its value-optimistic formulation.

Strengths: While I could not check proofs of all claims, they seems to be valid. The result that any model-optimistic algorithm can be expresses as both value- and model-optimistic algorithms is, I think, novel. Also this result is very significant because it enables cleaner theoretical analysis of an optimistic algorithm while allowing its easier implementation by dynamic programming.

Weaknesses: I do not find any particular weakness of the paper.

Correctness: I did not check proofs in the appendix, but Proposition 1, which is the main result showing an equivalence of model-optimistic algorithms to value-optimistic algorithms, looks correct (as far as I can tell from its proof sketch in the main paper).

Clarity: Yes, it is clearly written and easy to follow.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I like the paper very much. I have two question to the authors. Regarding Table 1, is it possible to find an optimal D such that the regret bound is tightest? If no, what makes find an optimal D difficult? If yes, what does it look like? Is there a hope that model-optimistic algorithms can archive a minimax regret bound with such an optimal D? Regarding Theorem 4, is it possible to get something similar for PAC, or uniform-PAC bounds (https://arxiv.org/pdf/1703.07710.pdf)? It would make the paper stronger, I think. ----- After Reading Author Feedback ----- I read the author feedback and other reviews. The author feedback answered my questions very well. I agree with other reviewers that the paper would be stronger if it has either a substantially simpler analysis of existing SOTA algorithms or a model-optimistic algorithm with a SOTA sample complexity, as an application of the duality result. However, I think the duality result itself is interesting and has potential to be used in other papers to improve algorithms. Therefore, I keep my score as is.

Review 4

Summary and Contributions: The paper proposes a new framework for designing and analyzing optimistic RL algorithms in episodic tabular MDPs. The paper also proposes a few other technical proofs that might be of independent interest. --- post rebuttal --- I am not an expert in this area and I do not feel adequate to judge this work. I have deferred judgement to the remaining reviewers.

Strengths: The paper is very clearly written and organized. I am not an expert in this area and therefore not able to judge the significant and novelty of the contribution.

Weaknesses: The paper is, if I am not mistaken, entirely theoretically. It would be insightful to have experiments, even on toy environments where the optimal policies can be computed, to understand to what extent the theoretical analysis transfer to experiments.

Correctness: I did not get the chance to go over the proof in the appendix.

Clarity: Yes. The paper is very well written and organized.

Relation to Prior Work: I would have liked to see more discussions on the connections between the paper, its theoretical results and some of the more experimental papers, especially RL method that uses optimism for exploration.

Reproducibility: Yes

Additional Feedback: