Review for NeurIPS paper: A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

NeurIPS 2020

A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Meta Review

This is a borderline paper. The paper is technically sound, and addressing OPE in the average-reward setting is an important problem. Although the work extends Duan and Wang (discounted setting) to the average-reward setting, the algorithm is somewhat different: Duan and Wang use FQE, whereas the current paper performs stationary-distribution estimation. That said, there are a few weaknesses that the paper should address or at least discuss:

1. The entropy maximization is a novel algorithmic element that does not appear in previous approaches in the discounted setting. Naturally, multiple reviewers questioned its necessity. In the rebuttal, the authors clarified that it helps with empirical performance and is justified under additional assumptions, but it does not seem necessary for the core theory. Since this paper is positioned mainly as a theory paper, this point should be carefully clarified so that theory readers come away with the correct takeaway messages, and the authors should possibly de-emphasize the role of max-ent in the paper (or even in the title).

2. Strength of result: While the paper (L33) mentions that the bound scales similarly to that of Duan and Wang, the reviewers noted in the discussion that the results are comparatively weaker and require stronger assumptions: the bound is not adaptive to how close the target and behavior policies are, and it requires the additional assumption that the target policy also induces an exploratory distribution (L118). Is this a weakness of the analysis, or is it fundamentally difficult to prove such a result in the average-reward setting?

3. The estimation procedure in Eq. 11 and 12 is highly similar to LSTD, which is generalized by many recent methods that approach OPE from LP and duality. It is therefore no surprise that the authors find a similarity with DualDICE in L196, as many of these recent methods collapse to some form of LSTD in the linear case [1,2]. Moreover, the authors should also discuss existing finite-sample analyses of the LSTD family; see, e.g., [3,4].

[1] Uehara et al. '20. Minimax Weight and Q-Function Learning for Off-Policy Evaluation.
[2] Yang et al. '20. Off-Policy Evaluation via the Regularized Lagrangian.
[3] Lazaric et al. '10. Finite-sample analysis of LSTD.
[4] Liu et al. '15. Finite-Sample Analysis of Proximal Gradient TD Algorithms.
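
For reference, the LSTD form mentioned in point 3 can be sketched as follows. This is the textbook discounted-setting LSTD estimator, not the paper's Eq. 11 and 12 (which target the average-reward stationary distribution). The function name `lstd`, the `ridge` parameter, and the two-state toy chain are assumptions made purely for illustration.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma, ridge=0.0):
    """Textbook discounted LSTD (illustrative sketch, not the paper's estimator).

    phi      : (n, d) features of the sampled states (or state-action pairs)
    phi_next : (n, d) features of the successors under the target policy
    rewards  : (n,)   observed rewards
    Solves A theta = b with A = Phi^T (Phi - gamma * Phi') and b = Phi^T r.
    """
    phi = np.asarray(phi, dtype=float)
    phi_next = np.asarray(phi_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    # Optional ridge term (hypothetical knob) for ill-conditioned A.
    A = A + ridge * np.eye(A.shape[0])
    return np.linalg.solve(A, b)

# Hypothetical two-state chain with one-hot features:
# state 0 -> state 1 (reward 1), state 1 -> state 1 (reward 2).
phi = np.eye(2)
phi_next = np.array([[0.0, 1.0],
                     [0.0, 1.0]])
rewards = np.array([1.0, 2.0])
theta = lstd(phi, phi_next, rewards, gamma=0.5)
```

With one-hot features the solution coincides with the exact value function of the toy chain, V(1) = 2 / (1 - 0.5) = 4 and V(0) = 1 + 0.5 * 4 = 3, which is the sense in which linear-case estimators reduce to solving a single linear system.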