NeurIPS 2019
Sun, Dec 8 through Sat, Dec 14, 2019, Vancouver Convention Center
Paper ID: 3007
Title: Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model

### Reviewer 2

**Originality:** The method is new compared with previous bounds: the paper uses a non-uniform sampling scheme.

**Quality:** The paper provides a detailed theoretical analysis of its method. However, no empirical experiments are provided. It would be better if even a toy example were included that showed a significant improvement in sample complexity over previous methods.

**Clarity:** I struggled somewhat with the writing. The paper is notation-heavy, and it is hard to check all the proofs. I would suggest emphasizing the notation $n_{sa}^k$, which I think is the key difference from previous methods; more space should be devoted to it. I did not understand the intuition for why problem-dependent structure can remove the dependence on the horizon for suboptimal actions. Another issue is the terminology 'horizon', which is usually used to refer to the maximum number of time steps in an MDP; in this paper it refers to the factor $(1-\gamma)^{-1}$, which is not fully explained.

**Significance:** To be honest, I am not familiar with the field of PAC-RL, but I believe this is an important result that could inspire policy optimization algorithms built on the proposed algorithm.

**Overall:** I believe the paper could be written much better by emphasizing its main idea and explaining the intuition behind it. But I still think this paper is above the borderline because of its theoretical quality, and the idea of non-uniform sampling could inspire other policy optimization algorithms.
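As a side note on the 'horizon' terminology: in discounted MDPs, $(1-\gamma)^{-1}$ is commonly called the *effective horizon*, by the standard identity (stated here in my own notation, not necessarily the paper's):

```latex
\sum_{t=0}^{\infty} \gamma^{t} r_t \;\le\; r_{\max} \sum_{t=0}^{\infty} \gamma^{t}
\;=\; \frac{r_{\max}}{1-\gamma},
```

so rewards more than roughly $(1-\gamma)^{-1}\log(1/\epsilon)$ steps ahead contribute at most an $\epsilon$ fraction of the total return, making $(1-\gamma)^{-1}$ play the role that the episode length plays in finite-horizon MDPs.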

### Reviewer 3

The authors provide a new sample complexity analysis for identifying a near-optimal policy of a tabular MDP, assuming access to a generative model. They provide a problem-dependent bound, in contrast to the worst-case bounds present in the literature. The analysis technique is novel in the sense that the bounds depend on the gaps between the optimal action-value function and the values of suboptimal actions. Similar bounds are well known in the multi-armed bandit literature and are shown to be recoverable from the proposed bound.

I am not very familiar with the relevant RL literature and hence cannot make detailed comments. The bound improves over previous results, and a high-level proof sketch is present in the paper. However, the paper refers to several lemmas in the supplementary material, which makes it a difficult read. Given the short amount of time, I was not able to go through the supplementary material, but I assume the proofs are correct.

One novel idea behind the proof is the use of the variance of the optimal value function under the next-state distribution, which effectively helps to improve over the previous analysis. I believe this new technique may be helpful to the community.

---

**Post-rebuttal update:** The authors mentioned some ideas for numerically evaluating their method. I confirm my original view and vote for acceptance.
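For comparison, the classical gap-dependent guarantee in $K$-armed best-arm identification (in my notation; the paper's exact quantities and constants may differ) scales as

```latex
\sum_{a \,:\, \Delta_a > 0} \frac{1}{\Delta_a^{2}} \log\frac{1}{\delta},
\qquad \Delta_a \;=\; \mu^{*} - \mu_a ,
```

and the MDP analogue replaces the mean-reward gaps with value gaps $\Delta_{sa} = V^{*}(s) - Q^{*}(s,a)$. The variance quantity exploited in the proof is presumably $\operatorname{Var}_{s' \sim P(\cdot \mid s,a)}\!\left[V^{*}(s')\right]$, which permits Bernstein-type rather than Hoeffding-type concentration and is what drives the improvement over the previous worst-case analysis.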