NeurIPS 2020

### Review 1

Summary and Contributions: This paper studies the value based RL and makes progress beyond the tabular and linear setting. The assumptions about the value function class are: (i) it satisfies closedness for all possible value function, and (ii) bounded covering number. The regret is poly in Eluder dimension, log covering number, and has square root in dependence in T. This paper mostly extends [44] (about Eluder dimension) from the bandit case to the RL setting. Although the idea of the algorithm is simply to do regression, the authors construct a very delicate bonus term. The main contribution of this work is to introduce the sensitive sampling [32,20,21] to do universal approximation of the bonus term and significantly reduce the sample size. On the downside, its computation doesn’t seem to be too generally efficient, which might limit the potential impact. Some part of the writing is not clear and can be improved.

Strengths: The understanding beyond the tabular setting and linear value function approximation is limited in theoretical RL field. It's great to see this paper makes progress in this direction and provides a good regret bound. The paper also introduces the sensitive sampling to do universal approximation of the bonus term and at the same time significantly reduce the size of the dataset. Such technique might be useful for the RL community.

Weaknesses: The proposed algorithm doesn’t seem to be computationally efficient in general. It might limit its potential impact.

Correctness: I went into the detail of the proof a little bit. It looks good in general and I have some minor questions (see below).

Clarity: The paper is generally well written and the main text is easy to follow. I think some writing and discussion can be improved to make the paper clearer. The authors didn't mention the assumption 1 (kind of closedness assumption of the function class) in the abstract or in the introduction. I feel it's a very important assumption and it's better to emphasize that at the beginning. It's unclear to me whether the algorithm is really computationally efficient or not. See details below.

Relation to Prior Work: Most of the existing theoretical results (PAC or regret bound) are about the tabular or linear setting. A few algorithms were proposed for more general setting but suffer from the computationally inefficiency. This paper mostly extends [44] (about Eluder dimension) from the bandit case to the RL setting. The idea of the algorithm is simple but the construction of bonus term is quite complicated and technical. Some part of the proof (Lemma 9 and 10) are adapted from the Proposition 3 and Lemma 2 prior work [44]. I think the connection need to be discussed more. It’s unclear to me how the sensitive sampling part differs from the prior work [32,20,21] and I feel more discussions is needed.

Reproducibility: Yes

### Review 2

Summary and Contributions: The authors studied an LSVI algorithm in which a general function class is used. They gave performance guarantees which depend on the log covering number and the eluder dimension and has \sqrt{K} dependence. ** I read the author's response and decided to keep my score. Thanks for the clarifications.

Strengths: Generalizes previous algorithms which study LSVI with linear function approximation to a general class of (Q) function approximations. Furthermore, the authors suggest and analyze algorithm 2, which allows them to reduce the computational complexity of the bonus term.

Weaknesses: There are two main weaknesses in my opinion. 1) Assumption 1. The function class needs to be closed w.r.t. every possible value is a harsh assumption. For example, a weaker assumption is closeness w.r.t. to the bellman operator applied on the function class. 2) To calculate D_k one needs to calculate V^k(s_\tau) for all \tau\in[K], i.e., the computational complexity of calculating line 8 is O(K) (and cannot be calculated adaptively using Sherman Morrison formula).

Correctness: I haven't found a problem in the proofs.

Clarity: The text is clearly written.

Relation to Prior Work: Relation to previous works are clearly discussed.

Reproducibility: Yes

Additional Feedback: *) Computational complexity. The algorithm needs to calculate the current value function V^k on the entire historical data. This makes the algorithm run in O(T). Think it is something that is not stressed enough. *) Proof of Lemma 9. In the proof you bound the bonus using properties of the function class F (i.e, its eluder dimension). How the bonus is related to the properties of the function class? I think an explanation would make this proof much more readable. *) Is there a reason not to use the doubling trick to reduce the computational complexity? Is there something that prevents it? I think the authors should comment about it in the paper.

### Review 3

Summary and Contributions: The paper extends the low-rank idea recently introduced by Jin et al '20 to more general, i.e., non-linear predictors, showing 1) connection to the Eluder dimension and 2) reducing the complexity of the bonus to reduce the size of the cover by selecting the most relevant data. ======================= After discussion with the other reviewers and area chair, I raised my score with the understanding that the authors will better clarify the relations with the literature, the assumptions, and other concerns other reviewers may have

Strengths: Point 1 and 2 described in the summary above are (indeed very good) strengths of this submission.

Weaknesses: I report these under weaknesses but it's more like observations - Assumption: This is an extension of the low-rank setting in that it maps any value function to a prescribed functional space (which is the key property used in the linear MDP setting of Jin '20, regardless of the decomposition into a low-rank MDP). This should be better emphasized by the author(s): the reason why they are able to proceed in the analysis using traditional methods is because they can add bonuses and still be able to represent the value function as the Bellman operator automatically maps everything back to the prescribed functional space. The assumption is certainly strong (as is the linear MDP assumption in the linear case); while some assumptions have to be made, it should be motivated better - The paper operates in a very general setting in terms of predictors, where we don't normally have confidence intervals available, so the authors just assume that. Owning to the strong assumption (see bullet points above), the authors create a bonus equal to the confidence intervals. This effectively solves most of the exploration problem by assumption (the estimation part is often difficult, and the construction of the bonus might be very hard). Not much can be said in such a general setting, but all this bypasses most of the difficulties.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: The works makes two claims which I think sends the wrong message - please either elaborate on these by including a proof in the manuscript or don't include them unless absolutely certain. 1) around line 272. "By a more refined analysis specialized to the tabular setting, the regret bound of our algorithm can be improved using techniques in [6]" I'm not sure this is the case. The work [6] uses an exploration bonus adapted to the variance of the system dynamics. Works prior to [6] have been unable to obtain the same regret bound as [6] without this. This suggest that your algorithm for tabular setting might need to be modified, and the bound does not improve merely by improving the analysis as you claim - Line 284. "Finally, we would like to stress that our algorithm and analysis tackle the case that F is a general function class which contains the linear case studied in [27, 57, 62] as a special case." Reference [62] does not make the assumption that any value function comes out in the prescribed functional space (linear in their case), so I think the two are incomparable works in terms of assumptions.

Reproducibility: Yes

### Review 4

Summary and Contributions: The paper studies exploration-exploitation dilemma in episodic reinforcement learning with general function approximation. The proposed algorithm, F-LSVI, is an optimistic modification of least square value iteration with general function class F. Under some conditions on F, the algorithm enjoys a sublunar regret that depends on the Eluder dimension and the log-coving number of F.

Strengths: It is a new step towards sample-efficient model-free reinforcement learning with general function approximation. It certainly provides some new intuition about how to achieve sample efficiency in challenging setting beyond tabular representation or linear MDP. The proposed algorithm enjoys nice theoretical guarantees. The sensitivity sampling looks interesting approach and it is a key element of the algorithm to control its regret.

Weaknesses: Assumption 1 seems to induce implicit condition on the model of the environment. The choice the function class F would be intimately related to the structure of the environment at hand, for example in linear MDP, the function class would be the linear function with the same features that defines the low rank structure of the transition model. So I am not sure about the author ’s claim that they don’t make any assumption on the model of the environment. I have some concerns about the computational efficiency. Beyond the linear case (which is already addressed by prior works), the bonus function which is defined by the width function doesn’t look something that we can’t estimate easily and naive approach would scale with the covering number of F. The same holds for determining whether z is independent of a given set with respect to F. The definition of beta in the confidence region looks very pessimistic. I think it would lead to over-exploration in practice. For example, the covering number of the class of Lipschitz functions is doubly exponential in the dimension which lead to value of beta exponential in the dimension. It would be interesting to have some empirical results on simple environment to highlight advantages and disadvantages of some components of the algorithm (the importance of sensitivity sampling for example)

Correctness: I went through the poof but not in detail so I can’t really judge the correctness of every steps of the analysis but it looks overall sounded. I encourage the author to improve the clarity of the appendix, there are many typos (eg: \bar{Z} in line 578 is not defined before, the inequality after line 532 should be the inverse..). In addition, in the proof of lemma 1 as well as in the algorithm 4 in appendix, It is not clear at all how Z^\alpha_j are constructed, Z^\alpha_j are set to {} but it is not mentioned how they are updating. Could you please clarify this?

Clarity: The paper is well written.

Relation to Prior Work: Discussion with the recent closely related work (Ayoub et al 2020) is missing. In fact, beyond model-free and model-based differences, (Ayoub et al 2020) study the setting of general function approximation and I think their assumption is very related to the one made in this paper and prove similar regret bound in their theorem 1. Model-Based Reinforcement Learning with Value-Targeted Regression, 2020 Alex Ayoub , Zeyu Jia , Csaba Szepesvari , Mengdi Wang and Lin F. Yang

Reproducibility: Yes