NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal

Reviewer 1

In this paper the authors present PAC-RL results for finite-spaces discounted MDP with a generative sampling model (which can generate in O(1) time a state transition for each given state-action pair). They introduce a value-iteration type algorithm for finding an $\epsilon$-optimal policy, and prove that its running time and sample complexity are nearly the best possible and match, up to logarithmic factors, the lower bounds established previously in [AMK13]. These results are significant: They improve over a recent result in [SWWY18] by a factor of $(1 - \gamma)^{-1}$ ($\gamma$ is the discount factor), and they also fill a known gap in the earlier results of [AMK13] regarding the sample complexity of finding a near-optimal policy. Both the algorithm and its analysis are very interesting. They build upon the prior work [AMK13] and [SWWY18], and combine the algorithmic ideas and proof techniques from both papers in a highly non-trivial and original way. In addition, as an extension to these results, the authors also present a similar near-optimality result for finite-horizon MDP (the extension is given in an appendix). This is an exciting paper. The results are very strong, and the paper is very well-written. While the analysis is intricate and subtle, the authors have structured the paper well to give the reader the big picture, and they have succeeded in explaining the key ideas and proof steps very clearly. Detailed comments, mostly about minor presentation issues: 1. line 157-176, the last paragraph on p. 4: This discussion on the "Bernstein Technique" is a little confusing. The authors talk about the variance of the total return of a policy $\pi$ and the variance of $V^\pi(S_1)$, but the authors' algorithm is a value iteration algorithm and it does not evaluate any policy. The name "Bernstein Technique" seems to suggest that the technique is based on Bernstein's inequality. But the key inequality in this technique is the inequality in the authors' Lemma C.2 (Lemma 8 in [AMK13]), applied to the optimal policy. This inequality is a property of MDP, not a statistical property derived from the Bernstein inequality. I suggest the authors revise the last paragraph on p. 4 to explain more precisely the third technique in their work. 2. line 214-215, p.5: It would be better if the authors can say something about the nature of these constants c_1, c_2, c_3, especially c_2. 3. p. 7: In the for-loop of Algorithm 2, the subroutine HALFERR (Algorithm 1) also needs the values of $\delta$ and $u$ as inputs. The authors need to specify these values in Algorithm 2. 4. p. 480, p. 15: The two equations in this line are strange; the proof of the second inequality of Lemma C.3 doesn't seem valid. Of course, this error doesn't affect other results, since Lemma C.3 is the authors' alternative proof of Lemma C.2, which was already proved in [AMK13, Lemma 8]. On the other hand, the last step of the proof of [AMK13, Lemma 8] given in that reference skips some details and isn't obvious, so it will be good if the authors can fix Lemma C.3 and give the proof details of [AMK13, Lemma 8] (the authors' Lemma C.2). 5. p. 17: It seems in this proof of Lemma 4.4, the authors also need the argument in the proof of Lemma E.3. Also, in the statement of Lemma 4.4 on p. 6, should $1 - 2 \delta$ be $1 - R \delta$ instead? Other minor comments related to typos etc.: 1. Lemma 4.4, p. 216-219, p. 6: Should the definitions of $\pi^*$ and $P^\pi Q$ be given outside the lemma? Also, no need to repeat the definition of $Q^*$. 2. Table 2, p. 11: The first two rows are the same. 3. Before line 480, p. 15: It seems this part of the proof for the first inequality of Lemma C.3 can be simplified by applying Jensen's inequality to $(1 - \gamma)(I - \gamma P)^{-1} \sqrt{v}$.) 4. Lemma D.1, p. 15: Should the conclusion of the lemma include the words "with high probability"? 5. p. 19: The proof of Proposition E.3 is misplaced near the top of this page, before the proposition is stated. The proof of Theorem 4.6 is also misplaced and should be moved to Section E.2. 6. Some typos: -- $\gamma$ is missing in (3.1)-(3.3), p. 3-4. -- What is $\beta$ in line 224, p. 6? -- In the displayed equation just above line 507, p. 15, $n$ should be $m$. -- p. 17: Should $\epsilon$ in the proof be $u$ instead?