NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 8184 Worst-Case Regret Bounds for Exploration via Randomized Value Functions

### Reviewer 1

Post author response: I thank the author(s) for their response and commenting on my discussion points. As those would need additional work, I for now keep my original score: this is a solid paper. ---------------------- Clarity: The paper is very well written and generally easy to understand, given the technical nature of the contribution. While the proof for Lemma 4 & 5 is described very well in the main text, it would be helpful to have a short explanation how this is used to achieve Lemma 6. If necessary, I suggest to drop the proof of Lemma 3 from the main text as this result is standard. Quality: I have verified the proof in the main text and individual lemmas in the appendix. All results that I checked appear to be correct. The authors do discuss the main limitations of their analysis but it would be nice to discuss the following points at least briefly: - The authors chose the setting with time-dependent dynamics which is a little less common than the default setting where dynamics and rewards are identically distributed across time steps within the episode. Would the current analysis be able to go through even if samples are shared across time steps? I suspect it would be difficult to achieve a tighter bound in H (e.g. due to Lemma 8) but I wonder whether the same bound can be proved. - How does this setting for \beta affect the empirical performance of the algorithm. To what degree does it make the algorithm conservative compared to empirically tuned parameters? - This is an expected regret bound (given any fixed true MDP) but is there any indication that a similar analysis would not yield a high-probability regret bound or PAC-style bound (as in [11] after adjusting k --> n_k(h,s,a) in beta)? Originality: This is the first worst-case analysis for RLSVI and even though it might not yield novel insights beyond the analysis itself, is an important technical contribution. The proof relies to large parts on existing strategies (which is not surprising) but contains novel technical insights such as lemma 5. The paper does discuss the relevant links to existing literature but it would have been helpful to give a more direct comparison of the result here to the lower bound and upper bounds of other algorithms (e.g. OFU based methods) in this setting. Significance: I think this is a solid technical contribution to the field and even though it is not surprising or particularly tight, it is absolutely non-trivial and still an important result. As mentioned in the introduction of the paper itself, worst-case analyses of randomized algorithms in the RL setting are far from trivial and so this relatively simple analysis has a good chance of being used as a template and improved upon in future work. Minor: - Line 361: hat M_k \in Mcal_k should be \notin - Line 264 missing closing )

### Reviewer 2

Originality: while the principle is certainly common in the literature, this is the first paper to demonstrate frequentist regret guarantees for perturbation induced exploration. Proof techniques do not appear to be terribly different than the prior arm, with the key differences appearing in Lemmas 4-5. Quality: The work appears generally correct, and the proofs are readable and succinct. Clarity: the proofs appear generally correct however, I believe that the third inequality beneath line 360 in the appendix could be explained more thoroughly, since justifying this statement crucially relies on conditioning on H_{k-1}, which is not indicated in the notation. Moreover, I think in the description of the algorithm, it should be stressed that the data augmentation is solely for planning, and not for estimation of transition probabilities. Significance: I think the significance of the paper lies in the argument that optimism is not a realistic framework for more complicated RL settings, and that randomized data augmentation gives a more plausible approach which is still amenable to theoretical analysis. However, I do not know if the fact that randomized value functions do in fact ensure frequentist regret guarantees is terribly surprising. In sum, my inclination to accept the paper is not so much for its technical novelty, as for bringing to light the fact that there are other algorithm design paradigms for RL that may generalize better than optimism in more complex settings, but which are nevertheless theoretically sound.

### Reviewer 3

RLSVI belongs to a family of algorithms that can be used with function approximation to encourage exploration. While the method enjoys some theoretical guarantees in the tabular setting, previous analyses provide Bayesian regret bound, and this paper addresses the frequentist (worst-case) regret. The writing is clear and the text provides interesting discussions and insights around the problem. In general I like the paper, and I have several clarification questions for the authors: 1. Regarding worst-case regret bounds for Thompson-sampling-like methods, it reminds me of the work of Agrawal and Jia [3]. If you look at their algorithm, it's actually two-phased: they use optimism when the sample size is small, and only switch to legit Thompson sampling in the large sample regime. What's interesting to me about the current paper is that there are no such two-phased structure, which looks a little bit too good and I will explain why: RLSVI is similar and closely related to bootstrapping, and we know that bootstrapping only works when the data somewhat resemble the true distribution. This requirement can be difficult to satisfy in MDPs: unlike bandits where you can just sample each action a small number of times to achieve full coverage, in MDPs we often find ourselves not having any data at all from certain states in the middle of exploration. Now I guess this is circumvented by choosing \beta_k to be large enough in the beginning---in fact I was expecting \beta_k to start with O(H) value, the magnitude of the reward function, only to find that it starts with O(H^3)... but this does not fully answer my question: high noise would simply induce uniformly random exploration, which is for sure inefficient (e.g., in combination locks). Can you give some insights on this matter? 2. Algorithm 1 is very nice in the sense that one can almost directly use it in the function approximation setting (e.g., there is no mention of the visitation counters), except for Line 6 that samples the \tilde{Q} from a tabular prior. Some discussions about what's the counterpart of this step in the function approximation setting can be interesting. ----------------Update after rebuttal---------------- Thanks for the response, and it makes a lot of sense. I keep my original recommendation.