Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper gives a frequentist regret bound for the RLSVI algorithm. While the bound is not minimax optimal (and potentially can be improved), this is the first frequentist guarantee for this algorithm and the proof contains some new technical insights, which may be useful in future work. Further the result demonstrates that other algorithmic strategies/paradigms (besides say optimism) may yield provably sample-efficient RL methods. Thanks for notifying us about a bug that you found in the proof! I discussed this with the reviewers and we all decided it was not a deal breaker for us. Anyway the regret bound here is suboptimal, so the fact that the actual result is slightly worse is fairly insignificant.