NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 7406
Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes

### Reviewer 1

#### Originality

Although some initial work [1, 2, 3] on combining causal inference and RL has been explored, this paper makes another attempt to combine the two from the perspective of DTRs, learning a policy for an unknown DTR by leveraging confounded observational data. This is a genuinely novel and promising direction worth exploring further.

#### Quality

The claims presented in the paper are supported both in theory and in practice (i.e., on simulated and real-world data). I do have a question about the assumption of finite states: in Algorithm 1, the defined event counts must be computed over states and treatments. What if the state/action space is high-dimensional? In that case it would be difficult to obtain accurate estimates of the two terms in Step 3. Any idea?

#### Clarity

The paper is well organised and easy to follow.

#### Significance

I believe this work will be of great interest to both the RL and causal inference communities, and could have many applications in the real world, in particular in healthcare/medicine, education, sociology, etc. As mentioned above, however, some limitations of the proposed algorithm (e.g., its handling of high-dimensional state/action spaces) might restrict its applicability in many real-world settings.

[1] Dasgupta et al. Causal Reasoning from Meta-reinforcement Learning. 2019.
[2] Lu et al. Deconfounding Reinforcement Learning in Observational Settings. 2018.
[3] Zhang et al. Learning Causal State Representations of Partially Observable Environments. 2019.
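To make the scalability concern concrete, here is a minimal back-of-the-envelope sketch (my own illustrative function, not the paper's Algorithm 1): a tabular count N(s, a) must cover one cell per (state, treatment) configuration, so its size is the product of the variable cardinalities.

```python
def count_table_size(state_dims, action_dims):
    """Number of (state, treatment) cells a tabular count N(s, a) must cover.

    state_dims / action_dims: cardinalities of the individual state and
    treatment variables. Illustrative only, not the paper's algorithm.
    """
    size = 1
    for d in list(state_dims) + list(action_dims):
        size *= d
    return size

# Even modest binary covariates blow the table up: 20 binary state
# variables and 2 binary treatments already give 2**22 cells, so most
# empirical counts stay at zero and the frequency estimates in Step 3
# become unusable without some form of function approximation.
print(count_table_size([2] * 20, [2] * 2))  # 4194304
```

This is why the reviewer's question about high-dimensional states matters: the sample complexity of any count-based estimate scales with this table size.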

### Reviewer 2

(Originality) Although the use of partial identification bounds for policy learning is not new (Kallus and Zhou 2018; the authors should cite this), its use in online decision making is innovative. In particular, the proposed combination of partial identification bounds and upper confidence bounds is elegant.

(Quality) The proposed method is well developed and the theoretical analysis is thorough. Regarding the empirical evaluation, the experiment description is quite terse (especially for the cancer-treatment task), and it may be hard to reproduce the results from the descriptions alone. The appendix adds some details, but making it more complete and clear would be very helpful (e.g., an introduction of the dataset, definitions of the treatments, how confounding is introduced, etc.). Moreover, I suggest the authors also evaluate the Causal UC-DTR algorithm with $M_t^c$ replaced by confidence bounds learnt from the observational data alone (as opposed to the bounds from the observational data that incorporate both partial identification and the confidence bound). This would demonstrate the benefit of taking the confoundedness of the observational data into account.
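For concreteness, the kind of combination the reviewer praises can be sketched as intersecting two intervals per arm: a causal partial-identification bound from confounded observational data and an empirical confidence bound from online samples. The Manski-style bounds and Hoeffding radius below are my own illustrative choices under a bounded outcome Y in [0, 1], not necessarily the paper's exact construction.

```python
import math

def manski_bounds(mean_y_given_a, p_a):
    """Manski-style partial identification bounds on E[Y | do(a)] for an
    outcome Y in [0, 1], computed from confounded observational data:
    lower = E[Y|a] P(a), upper = E[Y|a] P(a) + (1 - P(a))."""
    lo = mean_y_given_a * p_a
    return lo, lo + (1.0 - p_a)

def hoeffding_interval(emp_mean, n, delta=0.05):
    """Hoeffding confidence interval from n online (experimental) samples."""
    if n == 0:
        return 0.0, 1.0
    rad = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, emp_mean - rad), min(1.0, emp_mean + rad)

def combined_upper_bound(mean_y_given_a, p_a, emp_mean, n):
    """Optimistic value for a UCB-style selection rule: the tighter of the
    causal upper bound and the empirical upper confidence bound."""
    _, causal_hi = manski_bounds(mean_y_given_a, p_a)
    _, conf_hi = hoeffding_interval(emp_mean, n)
    return min(causal_hi, conf_hi)

# With no online samples the causal bound dominates (0.9 here); once
# enough samples accrue, the confidence bound becomes the tighter one.
print(combined_upper_bound(0.8, 0.5, 0.6, n=0))    # 0.9
print(combined_upper_bound(0.8, 0.5, 0.6, n=500))  # below 0.9
```

Running the suggested ablation with `combined_upper_bound` replaced by `hoeffding_interval` alone would isolate exactly the benefit the reviewer asks the authors to quantify.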