NeurIPS 2020

Efficient Planning in Large MDPs with Weak Linear Function Approximation

Review 1

Summary and Contributions: This paper considers the planning problem in discounted MDPs with linear function approximation. Under uniform approximation error (epsilon_approx) for only optimal value function (i.e. weak LFA), a (size m) set of core states, and generative oracles, the proposed algorithm needs poly(1/epsilon, m, A, 1/(1-gamma) queries from the generative oracle to achieve O(epsilon_approx/(1-gamma)^2 + epsilon) value loss. This work extends the previous work [26] by turning the error bound of relaxed ALP to an efficient planning algorithm. Specifically, this work follows a standard approach by proposing a stochastic saddle-point problem and then applies stochastic mirror-prox to solve it. The authors also improve the statistical and computational result in [26].

Strengths: Under the extra assumption (core states), this paper solves the efficient planning problem under weak LFA. This paper also improves value loss by 1/(1-gamma) the computational efficiency in [26].

Weaknesses: The algorithm requires additional core states assumption and this paper only studies the planning problem (i.e. requires generative oracle). The core states idea, the optimization problem, and the error bound are analyzed or hinted in [26]. This work extends on that and proposes a stochastic saddle-point problem and uses stochastic mirror-prox to solve it. This method seems to be a standard approach.

Correctness: I check the proof in high level due to the time limit. The results and proof seem to make sense to me. I can take a closer look if needed.

Clarity: This paper is clearly written and easy to follow. The pros and cons of the proposed approach is discussed in detail.

Relation to Prior Work: This paper has a thorough and clear discussion about the previous work with the contributions well-placed in the literature.

Reproducibility: Yes

Additional Feedback: After rebuttal: I read the author response and other reviews. Thanks the authors for addressing my concerns! It would be great to see more discussions in the next version. I think this paper has enough contribution and I will keep my original rating for acceptance. -------------------------------------------------------------------------------------------------------------------------------- People have been studying the weaker notion of LFA from different angles (see Table 1 in [13] for reference). This paper studies this problem under additional core states assumption. This paper is well written, and easy to follow. The main text is clean and the proof is deferred to the appendix. The authors also have an extensive discussion about the related work, which is quite useful. I have some detailed comments as below: - This paper mostly builds on [26]. The main idea and the error bound are discussed in [26] and this paper further proposes an efficient planning algorithm to solve the problem. In addition, this work improves the value loss by 1/(1-gamma) compared with [26]. I think the proposed approach is somewhat standard in optimization and is not very novel. - The efficient algorithm under weak LFA is an interesting and unsolved question. This paper proceeds in this direction. The proposed algorithm is computationally efficient and is suitable for stochastic environment without gap assumption. On the downside, the algorithm requires core states assumption and generative oracle. It seems that the generative oracle is required for the LP type of algorithm and the learning problem is beyond the scope of this paper. It would be interesting to see future work that further removes the core state assumption and solves the learning problem. - Currently the approximation error is defined in a uniform way. Do you think there are other meaningful ways to define the approximation error? I'm also wondering whether the approximation error can be defined in other norms. - As discussed around line 247, achieving O(epsilon_approx) error requires exp(H) queries while O(sqrt{d} H^2 epsilon_approx) optimal is possible in polynomial queries. It seems that relaxing to O(sqrt{d} epsilon_approx) is enough? Moreover, in Corollary 4, it is shown that the algorithm only needs poly(1/epsilon, m, A, 1/(1-gamma)) queries to get O(epsilon_approx/(1-gamma)^2 + epsilon). Does it mean that sqrt{d} blowup in the approximation error can be avoided (may be the price here is dependence on m, A and worse 1/(1-gamma)? I think such sqrt{d} has shown up in many related works that have DP flavor. Maybe it is avoided because of using LP? - Is it possible to extend the analysis in this paper to handle the case that the core states only approximately represent all the features? In other words, some features can lie outside the convex hull expanded by the core states. It is definitely possible to include those states in the core states, but I'm imagining the case that doing so may significantly increase the size of the core states and people prefer to use a small core states set.

Review 2

Summary and Contributions: This paper on the theory of MDP planning positively answers the open question of whether there exists a polynomial-time planner using a generative model of the MDP. The solution depends on two key assumptions: (1) having a known feature set, and mapping from states to that feature set, and (2) existence (and knowledge) of a small number of “core states”, such that the feature vector of every other state can be written as a positive linear combination of core features. Beyond the theoretical analysis, the paper provides an algorithm that achieves the result, and a thorough discussion of related work. No experimental results are provided.

Strengths: The paper identifies a very precise question, explains it clearly and concisely, and situates it well within the relevant literature. The core question is relevant to the MDP planning literature for large MDPs, it is based on a recent conjecture. The theoretical results and algorithmic techniques can be useful to solve further problems.

Weaknesses: Taken together, assumptions 1 and 2 seem quite strong, and hard to satisfy in practice. Another weakness of the paper is the lack of any empirical results, even simple demonstrations. Most of the claims and contributions of a theoretical nature, so it is not a fatal flaw. But it would have been nice in particular to see an empirical verification of what happens when the choice of feature space, or choice of core states deviates from the assumptions. I expect that satisfying both assumptions can be difficult in practice, so having some sense of robustness to misspecification would be interesting, even on simple domains.

Correctness: The results appear correct, though I did not carefully check the proofs.

Clarity: The paper is clear and well organized. The core technical section (on CoreStoMP) may be dense for the casual reader, but the language is precise, terms are defined and the authors take care in explaining several design choices and trade-offs in the algorithm and analysis.

Relation to Prior Work: The section on Related Work is particularly nice, not only does it cover relevant literature, but it carefully explains the salient differences from the current contributions. Several very recent works are discussed, but older references are also included for a more complete picture.

Reproducibility: Yes

Additional Feedback: - It would be interesting to comment further on what techniques might be used to discover or select the set of core states, and how to jointly discover the features space and the core states. - The style of the references is not consistent and should be cleaned up. Some use conference name abbreviation, others not, etc. Some ([8], [31]) don’t say where the work was published.

Review 3

Summary and Contributions: In this paper, the authors study the problem of designing an efficient RL algorithm when weak features are available. The authors propose the CoreStoMP algorithm, which guarantees near-optimal performance under two more assumptions (core states, simulator) with polynomial running time and the running time doesn't depend on the number of states. The CoreStoMP algorithm chooses an action by using Stochastic Mirror-Prox to solve the constructed CoreLP problem, and leveraging the fact that if at every state, the expected reward of the policy is near optimal, then the policy is near optimal.

Strengths: This paper answers an open problem by Du et al. in part. Although the problem is not completely solved, this proposed algorithm is definitely a step towards solving it.

Weaknesses: The assumption of Core States seems to be strong, which might prevent further application of the algorithm. For example, to my understanding, the algorithm can potentially work with continuous state space, but in that case, the number of core states might be exponential in d or even infinite.

Correctness: I don't check the math very carefully, but the theorem looks reasonable.

Clarity: The writing quality is good. Notations are well defined and the organization of the paper makes it easy to follow.

Relation to Prior Work: The authors discuss related work in Section 5 in details, which is quite useful to see how the work differs from prior work.

Reproducibility: Yes

Additional Feedback: Post Rebuttal: I've read the rebuttal. I didn't have many questions and would like to keep my score.