NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:1361
Title:DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Reviewer 1

Originality: I find this work to be original and the proposed algorithm to be novel. The authors clearly state what they contributions are and how their work differs itself from the prior works. Clarity/Quality: The paper is clearly written and is easy to follow, the authors do a great job stating the problem they consider, explaining existing solutions and their drawbacks, and then thoroughly building up the intuition behind their approach. Each theoretical step makes sense and is intuitive. I also appreciate the authors taking time to deriving their method using a simple convex function and then demonstrating that it is possible to extend the method to more general set of functions. At the end, the authors arrive at a quite simple objective function that produces unbiased estimates and is straightforward to optimize. In addition, the proposed objective is well-behaved and allows to establish some theoretical guarantees on the convergence. Significance: The proposed method definitely has a potential to be an important step towards learning robust policies from offline data. This problem has wide practical applications in areas of robotics, natural language, medicine and more, where a lot of offline data has been collected over time. Being able to learn from such off-policy data and then maybe fine-tune the model on-policy would be a significant improvement. Comments: Line 122: It might be very hard to make sure these assumptions hold in practise, especially the coverage part, e.g. d^{\pi}(s, a) > 0 implies that d^D(s, a) > 0. How is this handled practically?

Reviewer 2

The paper proposes an algorithm for off-policy policy estimation based on a new way of estimating discounted stationary distribution correction ratios. The problem of estimating the correction ratios is reformulated as optimization of a convex objective function whose optimum can be used to obtain the correction ratios. A change of variables is used to exchange dependence on the discounted stationary distribution of the target policy for dependence on the initial state distribution. Fenchel duality is applied to overcome the multiple sample problem and obtain a linear-concave saddle-point optimization problem whose solution directly yields the correction ratios. A bound on the mean squared error when using the learned estimates for off-policy policy estimation is presented, as are some experiments. I strongly recommend accepting this paper for publication; it introduces a new method for off-policy estimation that is likely to be significant, explains the method well, and tests it on several problems with promising results. Comments on originality: The derivation of the proposed algorithm is similar to an existing method (ratio matching), but the paper takes two additional steps: exchanging the objective function's dependence on the target policy's discounted stationary distribution for dependence on the starting state distribution, and the application of Fenchel duality to overcome the double sampling problem. The latter technique has been used in other RL papers before (SBEED), but to my knowledge has not been applied to this specific problem. Existing work that I'm aware of has been adequately cited. Comments on quality: The paper's quality is good. The series of steps in the derivation are explained and motivated well, and the paper's claims are supported with reasonable experiments and a proof bounding the error of the proposed method. Comments on clarity: The paper is very clear, with enough details given in the main body to help the reader understand, and exact details needed for reproduction included in the appendix. Comments on significance: The paper seems very significant; it provides a method for doing off-policy policy estimation that can be used by practitioners and can be built upon by researchers. Questions: -What are the x-axes in figure 2 and figure 4? -Where did the convex problem from section 3.2 "A Technical Observation" come from? -Why report the median? Are there outliers (such as catastrophic failure) in the distributions of performance for some of the methods? Are the distributions of performance skewed for some other reason? Similarly, why report 25% and 75% percentiles? -The proposed objective function is related to the Bellman Error objective function (with zero rewards and an extra term that depends on the initial state distribution). Sutton and Barto (2018) show in section 11.6 (specifically example 11.4) that the Bellman Error is not learnable with function approximation: two MDPs that generate the same data distribution have different optimal Bellman Errors with different optimal parameters. How do you justify using a Bellman Error-based objective function with function approximation? Notes: -It would be good to mention that correcting updates by importance-weighting the entire trajectory does not require the Markov assumption, whereas importance-weighting stationary distributions does.

Reviewer 3

The paper is well-written, with an accurate notation. The contents are well-organized and the proofs look sound from a high level inspection. I think that the paper provides nice contributions by extending the applicability of stationary state distribution ratios to very general off-policy policy evaluation settings, in which the off-policy samples are potentially drawn from heterogenous and unknown sources (such as multiple unknown behavior policies). Adapting the same idea to off-policy policy optimization, which is proposed as future work, would make this contribution even more significant. The paper does not display any major issue. I report below a couple of detailed comments. - The figures are overall too small, the tick labels are not really easy to read. In Figure 2 the x axis label is missing. The authors might improve the figure display in a final version of the paper. - Before turning to the appendix, it was not completely clear to me what were the actual TD and IS algorithm that the authors compared to DualDICE in the different experiments. I suggest to add a few lines in the main text to clear this point. --- Thank you for the detailed answers. I am looking forward to future developments of this work.