NeurIPS 2020

Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration

Review 1

Summary and Contributions: This paper propose a new RL algorithm that addressed the setting of low inherent Bellman error, which significantly improves the computational complexity of the earlier work in this setting [Zenette et al. 2020b]. This setting is also more general than the earlier line of work on linear MDP or low-rank MDP. This paper also present a reward-free version of the result where reward is not required in the phase of exploration

Strengths: The problem is important in the setting of linear function approximation. This result is solid, the contribution is significant, and the paper is well-written and easy to follow.

Weaknesses: This paper assumes the explorability (Definition 2). It seems to the lower bound requires the learning agent with no knowledge of reward vector, which is only justified in the reward-free setting? Can author comment on if this is necessary for the non-reward-free setting, and what is the analog of such condition in earlier result of linear MDP, such as LSVI-UCB, and if they require that?

Correctness: I have not fully checked the details. The high-level argument looks correct to me.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper introduces a new provably efficient RL algorithm with PAC guarantees for linear MDPs in the low inherent Bellman error regime. The paper achieves this performance guarantee by a very simple variant of LSVI algorithm which basically relies on some approximation of G-optimal design to uniformly reduce the maximum uncertainty over all the features. This is simply done by keeping track of the online covariance matrix of the features and using that to generate reward from the corresponding Gaussian distribution. I believe if this result is proven to be true it is a novel and significant contribution to the literature of RL with function approximation and can be definitely of the great interest for NeurIPS community.

Strengths: The main strength of paper is that it tackles one of the major open problem in RL namely online RL with function approximation by producing a simple and computationally tractable algorithm that enjoys PAC guarantees in the agnostic reward setting. Beyond that the paper also considers the effect of approximation error, quantified by inherent Bellman error on the quality of results. I think the sheer fact that the paper has attempted on addressing such a difficult problem, which has eluded many top researchers in the past is admirable. Also although some of the assumptions in particular the explorability assumption looks very strong, the paper provides complementary results which shows these assumption in general are unavoidable. Overall I think a lot of work has gone into this paper and I believe this work definitely contributes to the literature of RL theory, even if not every result in the paper can be verified.

Weaknesses: The main difficulty is evaluating and verifying the results of the paper. In my belief The paper suffers from lack of proper and in-depth motivation for its theoretical and algorithmic arguments. This is the case both in description of Algorithm in Sect. 3 in which the paper fails to motivate the reasoning behind the proposed algorithmic steps. More importantly it is also the case in the sketch proof of section 5 which could significantly benefit from providing some high-level structure of proofs. In fact I found the current sketch proof very brief and unintuitive, which makes the job of evaluating the paper very difficult. Although by browsing through the sketch proof I couldn't find any error I was not able to reach to any overall understanding of the analysis of this paper based on the current version of sketch proof.

Correctness: I checked the sketch proof and the lower-bound on the exploitability and I couldn't find any error, I did not have time go through the full proof in the Appendix.

Clarity: As I mentions the lack of clarity is the main drawback of this work. I would suggest that authors t provide better motivations for their proposed approach and the analysis by describing the high level ideas in a more clear way.

Relation to Prior Work: The prior work in the literature of RL theory have been properly and extensively covered.

Reproducibility: No

Additional Feedback:

Review 3

Summary and Contributions: This paper provides a reward-agnostic pure exploration RL algorithm for systems with low-inherent Bellman error. The paper provides strong PAC guarantees on learning a near-optimal value function provided that the linear space is sufficiently “explorable”. The paper also shows a lower bound showing that the "explorable" condition is somewhat necessary. === After rebuttal: I do think the paper has merits in providing a reward free algorithm for the low-bellman error setting. But I still feel the reachability assumption is a bit too strong.

Strengths: (1) Low-inherent Bellman error is stronger than linear MDP or low-rank MDP. (2) Matching upper and lower bound. (3) Computational efficient algorithm.

Weaknesses: (1) The requirement for the "explorable" does not seem convincing for the following reasons: a. Theorem 4.1 requires that eps < O(\nu_min/sqrt{d}) b. The lower bound requires very small eps as well c. The lower bound appears to be built on a finite states MDP, in which case [Jin et al] provides an upper bound that has no dependence on \nu_min. Any explanation on this? d. Intuitively the \nu_min is not necessary in the sense that if a direction is hard to reach, then it is not important for any reward function. e. The algorithm is based on a layer-by-layer learning structure, which seems intrinsically requires a "explorable" condition, which is also observed in [Du et al. 2019]. f. When reduce to the tabular setting, it seems that this result does not recover [Jin et al] due to the requirement of \nu_min. (2) The layer-by-layer learning structure does not seem very natural. Such an algorithm may be loose on the H dependence. (3) There is a lack of motivation of discussing the low-Bellman error model and reward-free/agnositc exploration in this setting. (4) There is lack of experiment, although it seems fine for a theory-oriented paper.

Correctness: I believe so.

Clarity: It is okay written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The authors present new theoretical results for RL with linear function approximation, a setting for which there is still a lack of good results despite its seeming simplicity. The paper builds on results for batch learning RL algorithms for the low-inherent Bellman case as well as on recent work in G-optimal design. The main result is for reward-agnostic exploration where a linear reward function is revealed after data has been collected and a value function will be estimated in batch mode from the gathered experience. A conceptually simple algorithm is introduced for which they prove PAC bounds for how many steps are needed before one can estimate a near optimal policy for any linear reward function.

Strengths: The paper presents a new form of result that is quite interesting. The authors does a good job of explaining the context of other relevant literature.

Weaknesses: While it is not uncommon for this kind of work to not have even illustrative forms of experiments, it would still have been nice to see.

Correctness: I have not properly checked the proofs, but I am not aware of anything incorrect.

Clarity: The paper is well written.

Relation to Prior Work: The authors does a good job of discussing relevant literature.

Reproducibility: Yes

Additional Feedback: