Reviews: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Reviewer 1

The proposal to restrict the learned policy to lie within the support of the data distribution, instead of mimicking the data distribution, is very good. Proposing this, and especially making it work, is to my understanding novel (I will have to defer to the other reviewers, who are hopefully deeper experts on Q-learning, to comment on past work). The theoretical analysis that supports the idea is also well made: there is an appropriate balance between complexity and usefulness in the analysis. The empirical results are convincing, support the theory, and are in line with what one would expect. Overall, the paper seems solid and well written, the method is relevant, and the results are good.

Edit: Thank you for your response, and for adding the humanoid experiments; they add value to the paper. Although I am happy with the response and believe the paper got even stronger, I would say that my score pre- and post-update would round to the same fairly favorable number.

Reviewer 2

Summary: This paper proposes a new algorithm that helps stabilize off-policy Q-learning. The idea is to introduce approximate Bellman updates based on constrained actions sampled only from the support of the training data distribution. The paper shows that the main source of instability is bootstrapping error: the bootstrapping process may use actions that do not lie in the training data distribution. This work shows a way to mitigate the issue, and provides both theoretical results and a practical algorithm. The experiments demonstrate some aspects of the idea.

Overall, the paper studies an interesting problem in RL. It comes with many theoretical results, and the practical algorithm seems to work, with promising experimental results. However, the writing is quite hard to follow: the notation and explanations are not good enough to make the paper easily readable. Both the idea of using constrained actions for approximate Bellman updates and the theoretical results make sense. The paper looks like premature work and needs substantial improvement to become a solid and stronger paper.

Some of my major comments are:

* Background section:
  - Inconsistent notation: P(s'|s,a), T(s'|s,a), p(s'|s,a); P^{\pi}, p_{\pi}.
  - This section should at least include a description of SAC and TD3 before they are discussed in Section 4, especially in the example.
* Sections 4 & 5:
  - Example 1: the description is quite unclear. It is hard to understand the task and the results. In particular, the results here use SAC as a baseline, but all of the preceding discussion uses the standard Q-learning algorithm.
  - The results in Section 4 still lack intuition and, especially, connections to the proposed practical algorithm, BEAR. BEAR uses many approximation techniques, e.g. Dirac policies and the conservative estimate of Q. None of these technical steps comes with a good explanation, especially the policy improvement step: e.g. why the estimate's variance is used, \Pi_\epsilon, the definition of \Delta, {\cal D}(s).
  - About the arguments in lines 127-138: is the support different from the behaviour distribution/policy? The paper does not clarify this difference. How should the set \Pi_\epsilon be understood: does it include all policies that have probability of choosing action a smaller than epsilon? The meaning of the MMD constraint in Eq. 1 is consistent with the discussion in the paper that the policy should select actions lying in the support of the \beta distribution, but is it consistent with the definition of \Pi_\epsilon? In addition, the MMD constraint in Eq. 1 looks similar to the constraint used by BCQ, so is the discussion in lines 127-138 not quite misleading? I think this constraint also requires the learned policy to be somehow close to the behaviour policy?
  - No definitions for k, \var_k {Q_k}, and \var_k {E[Q_k]} in lines 210 and 211.
  - The choice of conservative estimate for the policy: what is the intuition behind it? It is not an upper or lower confidence bound, which is a form of exploration bonus.
  - Algorithm 1: the sampling is a mini-batch, but the descriptions of the Q-update and the policy update do not say that the updates use a mini-batch. The update of the ensembles in steps 4-6 is also not explained: why is the next Q value y(s,a) computed as in step 5, especially with regard to \lambda?
* Section 6: I wonder how this should be compared to [1], and whether the related work should include a discussion of methods like [1].
  - The results in Fig. 3: it seems BC can also be very competitive, e.g. on HalfCheetah and Ant. I wonder why the performance of BEAR degenerates on Hopper?
  - Figure 4 shows that BEAR might be competitive on average for both settings. That means BEAR is not favorable in any case. Given optimal data, any method can be competitive, while with random data, either Naive RL or BCQ is very competitive.

[1] Todd Hester et al. Deep Q-learning from Demonstrations

-----------------

The rebuttal has addressed some of my concerns. The only remaining concern is about clarity and novelty.
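
Reviewer 2's questions about the MMD constraint in Eq. 1 are easier to discuss with a concrete estimator in view. The sketch below computes a sample-based squared MMD between two sets of actions. It is a minimal illustration, not the paper's implementation: the Gaussian kernel, the bandwidth sigma, the biased (V-statistic) estimator, and all sample values are assumptions introduced here.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two action vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y), including x == y pairs."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased sample estimate of squared MMD between two sample sets."""
    return (
        mean_kernel(xs, xs, sigma)
        + mean_kernel(ys, ys, sigma)
        - 2.0 * mean_kernel(xs, ys, sigma)
    )


# Hypothetical one-dimensional actions, written as 1-tuples.
behavior_actions = [(0.0,), (0.1,), (0.2,)]
near_policy_actions = [(0.05,), (0.15,)]   # lie among the behaviour actions
far_policy_actions = [(5.0,), (5.1,)]      # lie far from the behaviour actions

near = mmd_squared(near_policy_actions, behavior_actions)
far = mmd_squared(far_policy_actions, behavior_actions)
print(near, far)
```

On these hypothetical samples the estimate is close to zero when the policy's actions sit among the behaviour actions and grows when they move away. How such a quantity relates to the definition of \Pi_\epsilon is the point the reviewer asks the authors to clarify.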

Reviewer 3

This work aims to resolve the off-policy problem. The authors discuss a variant of the value iteration operator that is induced by a set of constrained policies. They further identify two kinds of errors and show that a trade-off between them can improve performance. Finally, they leverage this intuition to develop a practical algorithm. Their observation is very interesting and the solution makes sense.

Clarity: the bootstrapping error is a central notion in this work. However, the authors do not define bootstrapping error before discussing how to deal with it in line 127. Although I understood their results through their theorem, I could not figure out what exactly bootstrapping error is.

Empirical evaluation: the proposed algorithm seems to work well in practice. However, two baselines are not sufficient, which weakens this work somewhat.
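
The notion the reviewer asks about can be illustrated with a toy tabular backup. The following is a minimal sketch, not the paper's definition or algorithm: the action names, Q-values, behaviour probabilities, and the threshold eps are all hypothetical, and the support test (behaviour probability at least eps) is an assumed reading of the constrained policy set. It shows how a max over all actions can bootstrap from an action the data never contains, while a max restricted to supported actions cannot.

```python
def backup_target(reward, gamma, next_q, behavior_prob=None, eps=0.0):
    """One-step Bellman target r + gamma * max_a Q(s', a).

    If behavior_prob is given, the max ranges only over actions whose
    (hypothetical) behaviour probability is at least eps.
    """
    if behavior_prob is None:
        allowed = list(next_q)
    else:
        allowed = [a for a in next_q if behavior_prob.get(a, 0.0) >= eps]
    if not allowed:
        raise ValueError("no action in the data support")
    return reward + gamma * max(next_q[a] for a in allowed)


# Hypothetical next-state Q estimates. "jump" never appears in the data,
# so its large value is an unchecked extrapolation of the Q-function.
next_q = {"left": 1.0, "right": 2.0, "jump": 10.0}
behavior_prob = {"left": 0.6, "right": 0.4, "jump": 0.0}

# Unconstrained backup: the max picks the out-of-distribution action.
naive = backup_target(0.0, 0.9, next_q)

# Support-constrained backup: "jump" is excluded before taking the max.
constrained = backup_target(0.0, 0.9, next_q, behavior_prob, eps=0.05)

print(naive, constrained)
```

In the unconstrained case the target inherits whatever error the Q estimate has on the unseen action, and repeated backups propagate that error to other states. Restricting the max removes this source of error at the cost of never considering actions outside the data, which is one way to read the trade-off between the two kinds of errors mentioned above.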

Paper ID:	6282
Title:	Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
