Paper ID: 258
Title: Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

Summary:
The paper considers the complexity of policy iteration (PI) for finite MDPs
with n states and m actions per state in a number of settings.
The following results are derived:
For Howard's PI, a previous bound is improved by removing a log(n) factor.
For Simplex PI, a previous bound is improved by halving its multiplicative constant.
When the MDP is deterministic, a previous bound for Simplex PI
is improved by decreasing the power of
dependence on n by one.
This result is generalized to stochastic MDPs under additional restrictions on the MDP.

Originality and significance:
The complexity of PI is of major interest to the RL community. Even though the finite
setting is not practical, it is important to have a good understanding of this
simple case before moving to more complicated settings. This paper essentially picks
up the previous line of research that started with a paper by Yinyu Ye in 2011 and
was continued by Post and Ye, as well as by Hansen, Miltersen and Zwick.
Some of the improvements are marginal and the core ideas of the proofs go back to the
2011 work of Ye. However, the proofs are much more streamlined and far more
accessible to RL researchers than those of Ye (who prefers the language
of linear programming). The extension of strongly polynomial results to stochastic MDPs
under structural assumptions is the most important novel part of the paper.

Clarity:
The paper is relatively well-written, though the authors should check it carefully
as it contains a few typos (lines 285, 292). Also, at times fluency could be improved
(e.g., "However, on the other hand" on line 303 is awkward).


Soundness:
I believe that the results are sound (I did not have time to check all the details).

Minor issues:
- line 367: specify the lemma used
Q2: Please summarize your review in 1-2 sentences

The paper mildly improves previous results on the time complexity of PI,
while making the proofs more accessible to RL researchers.
Perhaps more importantly, the paper shows that the class of MDPs
for which simplex PI is strongly polynomial
is much larger than what was known beforehand.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This is a nice paper that tightens previous bounds on policy iteration, a problem on which recent progress has been made after decades of stagnation. The results are nice, though not shockingly impressive in themselves. Importantly, however, the paper provides short proofs of some key results and good, short summaries with real insight in cases where the proofs aren't short. The paper is extremely well written.

Minor comments:

Line 97.5: at least on state -> at least one state

line 106: all iteration k -> all iterations k

line 114.5: convergen often occur -> convergence often occurs

line 203: on path -> on paths

Q2: Please summarize your review in 1-2 sentences
Modest improvement on previous policy iteration results. Very nicely written paper, a pleasure to read.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper provides a theoretical analysis of two variants of policy iteration.
The analysis improves the known worst-case iteration complexity by a log(n) factor in
one case, and by a constant factor of 2 in the other. They generalize their
analysis for one variant run on deterministic MDPs, improving its iteration
complexity by a factor of n.

I cannot recommend this paper be accepted in its current form. The exposition
is extremely rushed, and the improved bounds in themselves are not impressive.
It is possible that the proof techniques provide some scientific value (as the
authors claim), but this is incredibly difficult to distill from the paper.

The proof techniques that improve the first two bounds follow prior work
closely. The proofs in the deterministic MDP setting appear more novel, but
(a) key terms, like transient and recurrent, are not defined and (b) the
authors state that the assumptions required for their proofs are overly strong.
That is, their correctness and applicability cannot be evaluated.

The authors need to better motivate this work and more clearly
describe/highlight their contributions.

Minor comments:
- please spell check
- line 85, \pi' is not used inside the max
- 115, convergen
- 125, observtation
- 134, mentionned
- footnote 6, not immediate put in appendix
- 177, Sherman-Morrisson
- 180, this statement is not immediate, though I believe it
- 182, explaination for conjectures?
- 233, transient and recurrent need to be defined
- 277, appdx
- 292, mentionned and developped

After viewing author feedback and discussions with other reviewers, it is apparent the initial quality score was overly harsh. I suggest drawing more attention to the contribution contained in the proofs themselves in the next revision to aid less astute readers like myself.
Q2: Please summarize your review in 1-2 sentences
I recommend this paper be rejected, as it does not appear to provide significantly better bounds or any obviously insightful proof techniques.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents bounds on the complexity of policy iteration for two versions of the algorithm: one which changes the policy at all states where there is an action better than the current policy (which the authors call Howard's PI), and one which only changes the policy at states with a maximal advantage (which corresponds to running Simplex on the linear programming formulation of the problem). The paper re-derives some existing bounds in a simpler way and also improves some existing bounds.
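To make the distinction between the two update rules concrete, here is a minimal NumPy sketch reconstructed from the description above. It is illustrative only, under standard tabular-MDP assumptions (an array P of shape (m, n, n) of transition probabilities, an array r of shape (m, n) of rewards, and a discount gamma < 1); the function names are ours, not the paper's:

import numpy as np

def policy_value(P, r, policy, gamma):
    # Evaluate a deterministic policy by solving (I - gamma * P_pi) v = r_pi.
    n = P.shape[1]
    P_pi = P[policy, np.arange(n), :]   # row s: transition probs under policy[s]
    r_pi = r[policy, np.arange(n)]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_iteration(P, r, gamma, rule="howard"):
    # P: (m, n, n) transitions, r: (m, n) rewards, for n states and m actions.
    m, n, _ = P.shape
    policy = np.zeros(n, dtype=int)
    while True:
        v = policy_value(P, r, policy, gamma)
        Q = r + gamma * P @ v           # (m, n) action values
        adv = Q - v                     # advantage of each action over the policy
        if adv.max() <= 1e-12:          # no improving switch: policy is optimal
            return policy, v
        if rule == "howard":
            # Howard's PI: switch at every state that has an improving action.
            improving = adv.max(axis=0) > 1e-12
            policy = np.where(improving, Q.argmax(axis=0), policy)
        else:
            # Simplex PI: switch only the single state-action pair of maximal advantage.
            a, s = np.unravel_index(adv.argmax(), adv.shape)
            policy[s] = a

Both variants evaluate the current policy exactly at each iteration; they differ only in how many switches they apply per iteration, which is precisely what the bounds under review quantify.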

The paper presents an interesting contribution and the proofs seem correct. The novelty is not very high, as the paper builds on existing results, but nonetheless, there are improvements on these results. The writing in the paper is clear.

Small comments:
- line o43: When at state -> At state
- The definition of advantage (line 085) is missing a \prime; otherwise it's trivially zero (see the note after this list)
- line 097: on state -> one state
- line 112: and that -> that
- line 195: the the
- line 217: Material and
- line 367: "this lemma" - which lemma do you mean?
- In the proof of Lemma 3, line 72, you have a \max_{s,\bar{\pi}}. Here, the order in which you take the max matters, so you should write \max_s \max_{\bar{\pi}} to clarify (I am assuming this is what is done).
- It would be very useful to have some examples of types of MDPs in which Corollary 2 holds.
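A note on the missing \prime (line 085): assuming the paper uses the standard definition (the paper's line 085 is not reproduced in this review, so the exact notation below is an assumption), the intended quantity is presumably the advantage of a policy \pi' relative to \pi,

$$ A_{\pi}(s, \pi') \;=\; [T_{\pi'} v_{\pi}](s) - v_{\pi}(s), $$

and without the prime it degenerates to the Bellman residual of \pi at its own value function, $[T_{\pi} v_{\pi}](s) - v_{\pi}(s) = 0$ for every state s, hence "trivially zero".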
Q2: Please summarize your review in 1-2 sentences
The paper presents improvements on the current bounds of policy iteration, making progress on existing bounds. It is a nice contribution, though it does not depart in a major way from existing proof techniques.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thanks for the reviews.


A few comments about the last review (Reviewer_6), which is the only negative one.

<< (a) key terms, like transient and recurrent, are not defined >>
We believe transience and recurrence are standard notions for the Markov chains that underlie MDPs.

<< (b) the authors state that the assumptions required for their proofs are overly strong. >>
This is not true. The only assumption that we describe as restrictive is the one used for the last theorem. All other assumptions and definitions are, in our opinion, quite natural. In particular, please reread the end of the paper.

<< It is possible that the proof techniques provide some scientific value (as the
authors claim), but this is incredibly difficult to distill from the paper. >>
Note that the proof of the bound for Howard's PI (the version of PI that most RL researchers think of) with gamma<1 (Appendix A + Appendix C) is small and simple enough to be included in any introductory course on MDPs. This is one of the motivations for including it in extenso in the paper, while putting it in a much broader perspective (recent known results, improvements, and remaining important problems).
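For readers wanting the flavor of that gamma<1 analysis, here is a minimal sketch of the classical contraction step for Howard's PI (the standard textbook step, not the paper's full argument): since \pi_{k+1} is greedy with respect to v_{\pi_k}, policy improvement gives

$$ v_{\pi_{k+1}} \;\ge\; T v_{\pi_k}, \qquad\text{hence}\qquad \|v_* - v_{\pi_{k+1}}\|_\infty \;\le\; \|T v_* - T v_{\pi_k}\|_\infty \;\le\; \gamma\, \|v_* - v_{\pi_k}\|_\infty, $$

using $T v_* = v_*$ and the gamma-contraction of the Bellman operator T. So Howard's PI converges at least as fast as value iteration; combining this geometric decrease with the finiteness of the action set is the route to iteration bounds of the kind discussed in the reviews.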