NeurIPS 2020

### Review 1

Summary and Contributions: This paper studies the exploration problem in episodic reinforcement learning with kernel and neural network function approximation. The proposed algorithm is an optimistic version of least-squares value iteration (LSVI), in which the standard LSVI solution is augmented with a bonus function for exploration. Under assumptions on the underlying RKHS or NTK function classes, the proposed algorithms are shown to achieve an $H^2 \sqrt{T} \delta_F$ regret, where $\delta_F$ depends on the effective dimension of the RKHS or NTK.

—————— Updated after author response and reviewer discussion: After reading the authors' response and discussing it with another reviewer, I would suggest that the authors revise the paper in the following respects regarding presentation and discussion. First, state clearly in the introduction (and perhaps also the abstract) that this paper assumes the transition model is characterized by the RKHS class — I think you already did, but it doesn't hurt to emphasize it. Also, revise the sentence "propose the first provable efficient RL algorithm [...] without any additional assumptions on the sampling model" (lines 44-46), e.g., by changing the term "sampling model" to "generative model" or "simulator", as the current term is ambiguous. Second, discuss clearly the assumption that the transition model is characterized by the RKHS class, in particular how it covers linear MDPs and Lipschitz MDPs as special cases through different choices of kernel. The other reviewer suggests adding the $T$ regret under model misspecification, but I don't think that is necessary — it is up to you to decide. Also, give concrete examples of such kernels, as promised in the authors' response. In summary, the other reviewer and I did not find any technical issues. The main debate is about presentation and discussion, which I believe is easily fixable. Given the significance and depth of the results (especially the theory for deep RL), I keep my score and advocate for acceptance (as do the other two reviewers).

Strengths: 1. The theoretical results seem solid. 2. RL with nonlinear function approximation is known to be challenging in both statistical and computational respects. Statistically, DRL methods can be sample inefficient in practice. Computationally, these methods may fail to converge. More importantly, RL with nonlinear function approximation has known divergent counterexamples. This paper addresses the statistical side of this fundamental problem by proposing novel LSVI-type algorithms for both the kernel and the neural network setting with provable sample efficiency. Moreover, these algorithms are also computationally efficient because (1) the kernel version involves kernel ridge regression and (2) the neural version relies on the NTK optimization theory for overparameterized networks. Thus, the proposed algorithms might be applicable to real-world deep RL problems, where heuristic exploration schemes are currently widely used.
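To make the algorithmic template concrete, the update I have in mind is the standard optimistic LSVI backup with a kernel ridge regression fit (this is my own paraphrase and notation, not necessarily the authors' exact formulation): at step $h$ of episode $t$, the algorithm solves
$$\hat{f}_h^t = \operatorname*{argmin}_{f \in \mathcal{H}} \sum_{\tau=1}^{t-1} \bigl( r_h^\tau + V_{h+1}^t(s_{h+1}^\tau) - f(s_h^\tau, a_h^\tau) \bigr)^2 + \lambda \| f \|_{\mathcal{H}}^2,$$
and then sets $Q_h^t(\cdot,\cdot) = \min\{ \hat{f}_h^t(\cdot,\cdot) + \beta\, b_h^t(\cdot,\cdot),\ H \}$ and $V_h^t(s) = \max_a Q_h^t(s,a)$, where $b_h^t$ is a kernel-based bonus (width) term for exploration.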

Weaknesses: The limitation of this work is that it only considers RKHSs with exponentially decaying eigenvalues. Although the theoretical framework is general and the extension to polynomially decaying kernels seems to involve only direct computations, it would be nice to have a general result that holds for all kernels.

Correctness: I didn't check the proof line by line but the theory seems reasonable and the proof seems correct.

Clarity: The paper is overall very well written. However, as I mentioned above, the presentation could be further improved by adding: (1) a general result that holds for general RKHSs and does not depend on the eigenvalue decay — this would strengthen the paper's argument; (2) a discussion of other eigenvalue decay conditions, such as finite-rank kernels and polynomially decaying kernels; (3) a concrete example of both an RKHS and an NTK kernel satisfying the assumptions.

Relation to Prior Work: Yes. This paper discusses and distinguishes from previous work thoroughly.

Reproducibility: Yes

Additional Feedback: My main concerns are stated in the Clarity section. I elaborate on them as follows.

[1] It would be nice to have a general result that depends on the intrinsic complexity of the RKHS, rather than focusing only on exponentially decaying kernels, which form only a small class of kernels.

[2] In the assumption, do you also require that the magnitudes of the eigenfunctions do not grow exponentially? Does this hold for the Gaussian RBF kernel, which is the example you gave in the paper for $\gamma = 1/d$? Moreover, are there any NTKs that satisfy this assumption? It would be nice to provide an example of such an NTK. For example, does the ReLU NTK satisfy the assumption?

[3] When the kernel is finite rank, say with only $D$ nonzero eigenvalues, does the algorithm reduce to the LSVI algorithm of Jin et al.? If so, how do the regret bounds compare? (See the sketch after this list.)

[4] It would be nice to give a concrete example, say the Gaussian RBF kernel. In this case, it seems that the regret is $H^2 \sqrt{T} (\log T)^{2 + 1.5d}$. Why is this regret still "sample efficient" when it is exponential in $d$? It seems that the authors assume that $d$ is fixed. Then it would be nice to clearly define what a "sample efficient algorithm" means.
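For reference, regarding [3]: the identity I have in mind is a sketch under the assumption of a finite-rank kernel, in my own notation. If $k(z, z') = \phi(z)^\top \phi(z')$ for some feature map $\phi : \mathcal{Z} \to \mathbb{R}^D$, then kernel ridge regression coincides with linear ridge regression over the features $\phi$, and the kernel bonus should reduce to
$$ \beta\, b_t(z) = \beta \sqrt{\phi(z)^\top \Lambda_t^{-1} \phi(z)}, \qquad \Lambda_t = \lambda I + \sum_{\tau=1}^{t-1} \phi(z^\tau) \phi(z^\tau)^\top, $$
which, if I am not mistaken, is exactly the bonus used in LSVI-UCB of Jin et al., so I would expect the regret bounds to be directly comparable in this case.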

### Review 2

Summary and Contributions: This paper considers the problem of low-regret policy optimization in H-stage episodic MDPs with large state and action spaces. The main assumption is the availability of a kernel that exactly captures the dynamics of the MDP. Under this assumption, the paper shows that "kernelized" versions of existing LSVI-based algorithms with optimistic exploration achieve $O(H^2 \sqrt{T})$ regret, up to factors depending on various characteristics of the kernel space. An extension of this idea shows similar results for sufficiently large neural networks, using the "neural tangent kernel".

Strengths: This paper shows that the essential ideas of fitted Q-learning with linear function approximation continue to work in the kernel setting, and that the UCB-style reward bonus for exploration can also be expressed using kernel functions. Moreover, it identifies the characteristics of the kernel space (i.e. a particular l∞ covering number and an eigenvalue decay constant) that replace the dimensionality of a linear feature representation in the regret bounds. In the case that the kernel is actually constructed from a finite-dimensional linear function representation, the regret bounds match known results.
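For concreteness, my understanding is that the kernelized bonus is the usual kernel ridge regression width (my own notation; the paper's exact normalization may differ):
$$ b_t(z) = \sqrt{ \lambda^{-1} \bigl( k(z, z) - k_t(z)^\top (K_t + \lambda I)^{-1} k_t(z) \bigr) }, $$
where $z = (s, a)$, $K_t$ is the Gram matrix of the previously visited state-action pairs $z^1, \dots, z^{t-1}$, and $k_t(z) = (k(z, z^1), \dots, k(z, z^{t-1}))^\top$; optimism is then obtained by adding $\beta\, b_t(z)$ to the regressed Q-values, exactly as in the linear case.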

Correctness: I did not have time to thoroughly check all the proofs, but I skimmed through some of them and did not find any errors; it is quite possible that I missed something. The claims made in the introduction about computational complexity are not really borne out later in the paper, as noted above. To set more realistic expectations for the reader, the nature of the assumptions (perfect realizability, small covering number, an optimization oracle for potentially large neural networks) should be mentioned along with, or instead of, those claims.

Clarity: On the whole, the paper is fairly clear, apart from the misleading claims I have already described above.

Relation to Prior Work: The prior-work section in the appendix is brief but serviceable. However, the comparison of this work's strengths and weaknesses with the existing literature is unfortunately one-sided. In particular, the claim that prior algorithms are computationally challenging (l. 496) applies to this work as well, and the following claim about not requiring additional assumptions on the transition model is wrong; Assumption 4.1 is most certainly a (strong) additional assumption.

Reproducibility: Yes

### Review 3

Summary and Contributions: The authors propose a least-squares value iteration algorithm with an exploration bonus and derive regret bounds under both kernel and neural approximation schemes.

Strengths: The results are mainly theoretical. They provide insight into the conditions needed for RL algorithms to converge to near-optimal solutions.

Weaknesses: It is not clear how much impact the proposed algorithm will have on RL in practice.

Correctness: I did not check the proofs but the results seem plausible.

Clarity: The paper is easy to read, assuming that the reader is already familiar with most of the ideas in the literature this work is building on.

Relation to Prior Work: The authors did a decent job connecting the present work to prior works.

Reproducibility: Yes

Additional Feedback: It would be great if the authors could provide more insight into how the proposed algorithm (in particular, the use of a proper exploration bonus) can be applied in practice. For example, if the regret bound for NOVI is always worse than that of KOVI, why would one use NOVI?

===== Post-Rebuttal ===== I have read the authors' feedback and I maintain my score. Thank you.

### Review 4

Summary and Contributions: The authors propose two provably efficient reinforcement learning algorithms, Kernel Optimistic least-squares Value Iteration (KOVI) and Neural Optimistic least-squares Value Iteration (NOVI), for episodic Markov Decision Processes with horizon H; both achieve $O(H^2 \sqrt{T})$ regret bounds. Compared to previous work with either tabular learning or linear function approximation, the submission's novelty lies in the use of powerful approximators such as Reproducing Kernel Hilbert Spaces (RKHSs) and overparameterized neural networks.

Strengths: The contribution of the submission is sufficient in the sense that it presents the first provably efficient reinforcement learning methods with kernel and neural network function approximation.

Weaknesses: Although the algorithm is shown to be computationally and statistically efficient, it seems difficult to use in practice in the long-horizon setting (due to the $H^2$ dependence). However, the submission is highly meaningful since it can provide theoretical motivation for later work.

Correctness: Derivations and claims seem to be correct. This is a theoretical work without empirical studies.

Clarity: The submission is clearly written with clear mathematical definitions and notations. Sufficient explanations are given for all assumptions and theorems with the intuitive meaning of each term.

Relation to Prior Work: The relation to prior work is clearly specified in the Appendix.

Reproducibility: Yes

Additional Feedback: The submission is extremely well written, so there are not many comments I can make. The reason I score "accept" is my low confidence in my own evaluation. One simple question concerns Assumptions 4.2 and 4.5 (which deal with assumptions on the eigenvalues) and how they affect the proof.

--- After reading the other reviewers' comments and the author response, I decided to keep my score.