NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1035
Title: Better Exploration with Optimistic Actor Critic

Reviewer 1

This work tackles the problem of sample efficiency through improved exploration. Off-policy actor-critic methods use the estimate of the critic to improve the actor. As they ignore the uncertainty in the estimation of the Q-values, this may lead to sub-optimal behavior. This work suggests focusing exploration on actions with higher uncertainty (i.e., those which the critic is less confident about). The authors suggest an Optimistic Actor-Critic (OAC) approach, which is an extension of Soft Actor-Critic, and evaluate it across various MuJoCo domains.

This approach seems like a promising direction, the paper is well written, and the motivation is clear. However, I have several issues/questions:

1) Can this approach be applied to actor-critic schemes in discrete action spaces? The exploration scheme would not end up as a shifted Gaussian, as in the continuous case, but rather as a probability distribution over the simplex bounded by distance \delta from the original policy.

2) Extension to non-convex action-value functions. This approach focuses on Gaussian policies; however, as the action-value function may be non-convex [1], considering Gaussian policies may result in convergence to a sub-optimal local extremum. Can this approach be extended to more general policy distributions, such as a mixture of Gaussians (which exists in the SAC source code and in a previous revision of that work)?

3) This work builds on top of SAC, which is a maximum-entropy approach. Since this is a general approach that can be applied to any off-policy actor-critic scheme, you should also show experiments with OAC applied to DDPG/TD3.

4) Evaluation: the results are nice, but not groundbreaking. The main improvement lies in the Hopper domain and the 4-step version of Humanoid (in the rest it is unclear due to the variance of OAC and SAC).

5) The results for TD3 seem very low. Based on my experience (with TD3 in the v2 domains), it attains ~10k average return on HalfCheetah, 4k on Ant, and 2.5k on Hopper with the official implementation (averaged over 10 seeds).

[1] Actor-Expert: A Framework for using Q-learning in Continuous Action Spaces

Small issues:

1) I wouldn't exactly call what TD3 does a lower bound. As the TD3 authors show, the basic approach results in over-estimation due to the self-bootstrap, and their minimum approach simply overcomes this issue (similar to Double Q-learning). This is different from an LCB, which takes the estimated mean minus an uncertainty estimate.

2) Line 162: you refer to Eq. (12), which is defined in the appendix. This should refer to (5) instead.

3) At each step, computing the exploration policy requires calculating a gradient. It is fine for an approach to be more computationally demanding, but this should be stated clearly.

Note to the authors: I think this work identifies a real problem and the approach looks promising. However, I believe there is additional work to do for this to become a full conference paper, and this paper should definitely be accepted in the future.

------------- POST FEEDBACK ---------------

The authors' response satisfied most of my "complaints". Thank you for taking the time to provide full answers and additional experiments. I have updated my score accordingly. I urge the authors to continue working on this direction, expanding to discrete action spaces and other areas of interest which may benefit greatly from this approach and those similar to it.
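The distinction drawn in small issue 1) between TD3's twin-critic minimum and a genuine lower confidence bound can be sketched numerically. This is a toy NumPy illustration, not code from the paper; the critic values and the confidence parameter beta are made up for the example:

```python
import numpy as np

# Hypothetical twin-critic estimates for three state-action pairs.
q1 = np.array([3.0, 5.0, 2.0])
q2 = np.array([4.0, 4.0, 6.0])

# TD3-style pessimistic estimate: element-wise minimum of the two critics.
td3_min = np.minimum(q1, q2)

# An explicit lower confidence bound: mean minus a scaled uncertainty term.
# Here the spread between the two critics stands in for epistemic uncertainty.
mean = (q1 + q2) / 2.0
std = np.abs(q1 - q2) / 2.0
beta = 2.0                      # illustrative confidence parameter
lcb = mean - beta * std

# With exactly two critics, the minimum coincides with mean - 1.0 * std,
# i.e. an LCB whose beta is fixed at 1; a genuine LCB lets beta control
# the degree of pessimism independently.
assert np.allclose(td3_min, mean - 1.0 * std)
```

The last assertion shows why the two views are related but not identical: the min-of-two-critics rule is a special case with an implicit, fixed uncertainty scaling, whereas an LCB exposes that scaling as a tunable parameter.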

Reviewer 2

Originality: of course, exploration and actor-critic architectures are not new in reinforcement learning, but the authors start from an original point of view, namely the combination of two main drawbacks of RL approaches: (i) pessimistic under-exploration and (ii) directional uninformedness.

Quality: intuitions and ideas are both well developed and illustrated. The proposed algorithm is interesting as well, and definitely reaches good performance. However, the experimental results show that, despite improvements over other methods, millions of environment interactions are still required. So, even if the proposed approach reaches state-of-the-art performance, the main problem, which is stated by the authors in the first paragraph of the introduction, still remains (I quote): "millions of environment interactions are needed to obtain a reasonably performant policy for control problems with moderate complexity". Of course, this remains one of the biggest challenges of (deep) RL. The paper is well positioned with respect to the literature.

Clarity: the paper is very well written and organized. Intuitions and ideas are well presented and illustrated.

Significance: the topic is indeed interesting, and I think other researchers in the field will definitely be interested in the results of the paper.

*** Post feedback ***

Thanks to the authors for the feedback and additional explanations. I have updated the confidence score accordingly.

Reviewer 3

The paper proposes an original idea: bringing the upper confidence bound estimate, typically used in bandit algorithms, to improve exploration in actor-critic methods.

The claim on line 33 that "If the lower bound has a spurious maximum, the covariance of the policy will decrease" is not substantiated or explained.

The authors mention the problem of directional uninformedness and how SAC and TD3 sample actions in opposite directions from the mean with equal probability, but their proposal (5) is still a Gaussian distribution, just with a different mean and variance, so samples would still lie in opposite directions from the (different) mean. Please clarify.

The impact of the level of optimism is not analyzed in the experiments.

In Proposition 7, equation (7), it is not clear why the covariance matrices are the same.

In Section 4.1, why not use max{Q^1, Q^2} as the upper confidence bound Q_{UB}, instead of the mean plus the standard deviation?

The claim that OAC improves sample complexity does not seem to follow from the experiments or Figure 3. Can you add a figure comparing the number of steps needed to reach a given performance, as a way to show this?

Minor comments:
- In line 73, do you mean "Actor-Critic" instead of "Policy gradient"?
- Line 80: \hat{Q}^1_{LB} -> \hat{Q}_{LB}
- Line 162: there is no formula (12).
- In Figure 3, add the 1-training-step graph for Humanoid-v2 next to the 4-training-step one to make the comparison easier.
- The paper uses an unusual format for references that makes them harder to navigate.

------------- POST FEEDBACK ---------------

Thanks to the authors for the explanations, for adding extra ablation studies, and for measuring the effect of the directionality. I have updated my overall score to reflect this.
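The directionality concern raised above can be made concrete in a one-dimensional toy version of the optimistic mean shift: the exploration policy stays a symmetric Gaussian, merely centred on a shifted mean. Everything below is illustrative (the quadratic critic, the parameter values), and the closed-form shift is a simplified 1-D analogue of the paper's KL-constrained solution, not the exact multivariate formula:

```python
import numpy as np

def q_ub(a):
    # Toy concave upper-bound critic with its maximum at a = 1.5.
    return -(a - 1.5) ** 2

def grad_q_ub(a, eps=1e-5):
    # Finite-difference gradient of the upper-bound critic.
    return (q_ub(a + eps) - q_ub(a - eps)) / (2.0 * eps)

# Illustrative target-policy mean, variance, and KL radius.
mu, sigma2, delta = 0.0, 0.25, 0.5

# 1-D analogue of the KL-constrained optimistic shift: move the mean in the
# direction of the upper-bound gradient, by an amount set by delta and the
# policy variance.
g = grad_q_ub(mu)
mu_explore = mu + np.sqrt(2.0 * delta * sigma2) * np.sign(g)

# The exploration distribution is still a symmetric Gaussian: samples fall on
# both sides of mu_explore, just no longer centred on the target-policy mean.
rng = np.random.default_rng(0)
actions = rng.normal(mu_explore, np.sqrt(sigma2), size=1000)
```

This makes the reviewer's point visible: the shift changes *where* the symmetric sampling happens, not the fact that actions are drawn with equal probability on either side of the (new) mean.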