NeurIPS 2020

Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning


Review 1

Summary and Contributions: The authors propose a method for estimating the partial identification set of the value of a policy in an infinite-time-horizon MDP when the observed policy depends on unobserved confounders, in the special case where the distribution of unobserved confounders does not depend on the previous state. In this setting, the authors propose a bounded sensitivity model, similar to bounded-odds-ratio sensitivity models for treatment assignment in causal inference. They then characterize, in terms of a linear program, the partial identification set of importance weights that would yield policy value estimates consistent with the observed data. To map this to a partial identification set for the scalar value of the policy, the authors discuss an impractical direct optimization approach and a more practical non-convex optimization approach. The authors also establish that the estimating equations they derive are consistent. Finally, they demonstrate their method on two simple examples.

Strengths: Partial identification techniques are much-needed in policy evaluation and causal inference contexts to provide more honest uncertainty quantification in settings where assumptions sufficient for identification are implausible. There is a real need in the machine learning community to outline problems where the observed data provide no single “right” answer. This paper is a great contribution to address this need in a setting that could have real practical implications. The derivations in the paper are thorough, and the set of non-trivial steps necessary to make the problem tractable are clearly described.

Weaknesses: Although the paper is generally clearly written, a number of the steps are difficult to parse. A bit more prose description of the mathematical objects, for example the marginal weights g_k(a | j), would go a long way toward building intuition for the formulation of the method.

Correctness: They appear to be so, although I did not have time to go through all of the derivations in detail.

Clarity: Generally, yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I really enjoyed this paper, so my comments mostly have to do with making the derivations a bit more readable. The main steps that I got hung up on were the marginalization step, moving from the weights beta to the weights g, and the step where the matrix A(g) is defined. In both cases, I think some prose description of exactly what the transformation is would be helpful.

For the weights g, I think the direct interpretation (the last expression in the line defining g_k(a | j)) is more intuitive than the definition in terms of beta. It is not obvious how one moves from one to the other (especially with the inverse migrating out of the summation). It might make sense to define g_k directly as the weights of the marginalized policy first, and then to show the correspondence to beta in a small proposition. I think a slightly extended discussion of how the weights g_k compare to pi_b^{-1} would provide some more insight into how to interpret these weights. It seems clear to me that when transitions do not depend on u, all of the g_k end up being equal to pi_b for each k, so the confounding introduces an interesting non-equivalence here. Understanding how to interpret the dependence on k under confounding would provide some useful insight, I think.

One thing that made this section difficult to parse was the use of generic indices for current and next states that differed from other parts of the paper. I think the authors should standardize on (s, s') or (j, k) (personally, I prefer s, s').

For the non-convex method, I think it might be useful to describe the set of constraints represented by the matrix A(g), rather than describing its construction so explicitly. Specifically, restating equation (5) in terms of g, describing the set of individual constraints this imposes, and then stating that these can be arranged in a matrix might make this more readable. The actual matrix construction could be left for the appendix.
Some nitpicks on the experiment plots. For the consistency plot, having a horizontal line at zero would be useful. For the plots of bounds, it might make sense to focus on one particular set of bounds and plot the others with a lower alpha value (that is, less opacity), so that it is easier to parse the focal example while the general trend is visible in the background. Clarifying that the left-hand side of the plot corresponds to the estimate assuming no confounding could also be useful. Finally, a question about identifiability: would it be possible to construct an example in the spirit of, say, the Tennenholtz et al. paper where the value of the policy is identifiable, and to show that the estimated bounds collapse? How do the assumptions from that setting map onto the one that you propose? Would they contradict the sensitivity model directly, or would the consequences be observable after computing the bounds?
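To make the plotting suggestions concrete, here is a minimal matplotlib sketch (the bound curves, setting names, and shapes below are entirely made up, not taken from the paper): one focal set of bounds is drawn at full opacity, the others at low alpha, with a horizontal reference line at zero.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical bound curves as a function of the sensitivity parameter
# Gamma.  The left edge, Gamma = 1, corresponds to the unconfounded
# point estimate.
gammas = np.linspace(1.0, 3.0, 20)
bounds = {f"setting {i}": (1.0 - 0.3 * i * (gammas - 1.0),
                           1.0 + 0.3 * i * (gammas - 1.0))
          for i in range(1, 4)}

fig, ax = plt.subplots()
for name, (lo, hi) in bounds.items():
    focal = (name == "setting 2")
    alpha = 1.0 if focal else 0.25          # de-emphasize non-focal bounds
    ax.plot(gammas, lo, color="C0", alpha=alpha,
            label=name if focal else None)  # only the focal set is labeled
    ax.plot(gammas, hi, color="C0", alpha=alpha)
ax.axhline(0.0, color="gray", linewidth=0.8)  # reference line at zero
ax.set_xlabel(r"$\Gamma$ (left edge: no confounding)")
ax.set_ylabel("policy value bounds")
ax.legend()
fig.savefig("bounds.png")
```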


Review 2

Summary and Contributions: This paper studies the problem of partial identification of causal effects in settings with an infinite horizon. In particular, the authors consider a canonical Markov decision process model with non-time-varying unobserved confounders. The reward function at each stage of intervention is presumed to be known. However, the transition distribution is affected by an unobserved confounder whose values are drawn independently at each timestep. Let S_t, X_t, Y_t denote, respectively, the observed state, action, and reward realized at time t. We denote by do(pi) an intervention following a conditional policy pi: S_t -> X_t. This paper focuses on the average causal effect over the infinite horizon, i.e., R(pi) = lim_{T -> infty} 1/T E[sum_{t=1}^{T} Y_t|do(pi)]. The authors also assume that the Markov process converges and leads to a stationary distribution P(s_t, x_t, s_{t+1}). The goal is to estimate the unknown causal effect R(pi) for an arbitrary policy pi from the stationary distribution P(s_t, x_t, s_{t+1}).

Strengths: In the general setting, the causal effect R(pi) is not uniquely discernible from the observational data, i.e., it is not identifiable. An alternative solution is partial identification, where the goal is to narrow the parameter space of R(pi) from the observational data. This paper follows the partial identification framework of (Kallus and Zhou, 2018) (for short, KZ18), which assumes access to a sensitivity parameter that characterizes the strength of unobserved confounding in the data-generating policy that determines the observed action X_t. The partial identification problem is thus reducible to a series of linear programs. Unfortunately, the size of the resultant linear programs could be super-exponential. The authors then propose an alternative formulation as a non-convex optimization program and discuss practical methods to approximate the bounds. As far as I am aware, this paper is the first work that attempts partial identification of causal effects in settings with an infinite horizon. In many decision-making settings, unobserved confounding is common, and the agent has to make infinitely many decisions. Therefore, the results presented in this paper could find wide applications in learning policies with suitable performance. While the primary bounding strategy follows (KZ18), non-trivial generalization is required in the infinite-horizon setting.
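To illustrate the reduction to linear programs on a toy instance (this is not the paper's actual program; every quantity below is hypothetical), one can bound a policy value that is linear in inverse-propensity weights, where a sensitivity parameter Gamma confines each weight to an interval around its nominal value:

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (all numbers hypothetical): n observed (s, a) cells with
# empirical frequencies p, rewards r, and nominal inverse-propensity
# weights w_nom.  A sensitivity model with parameter Gamma constrains
# each weight to the interval [w_nom / Gamma, w_nom * Gamma].
rng = np.random.default_rng(0)
n = 8
p = rng.dirichlet(np.ones(n))            # empirical cell frequencies
r = rng.uniform(-1.0, 1.0, size=n)       # cell rewards
w_nom = 1.0 / rng.uniform(0.2, 0.8, n)   # nominal weights, 1 / pi_b
w_nom /= p @ w_nom                       # self-normalize so the nominal
                                         # point satisfies sum_i p_i w_i = 1
Gamma = 1.5

# The policy value is linear in the weights, R(w) = sum_i p_i w_i r_i,
# so its min and max over the constraint polytope are two linear programs.
c = p * r
A_eq = p.reshape(1, -1)
b_eq = np.array([1.0])
box = [(w / Gamma, w * Gamma) for w in w_nom]

lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=box)    # lower bound
hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=box)   # upper bound
print("identified interval:", (lo.fun, -hi.fun))
```

At Gamma = 1 the box degenerates to the nominal weights and the interval collapses to the point estimate; the interval widens monotonically in Gamma.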

Weaknesses: While I understand that one has to constrain the time-dependence of unobserved confounders to obtain reasonable bounds, I still have some questions regarding Assumption 2, summarized as follows. 1. Why does the existence of baseline UCs permit Assumption 2? More specifically, could you please show that one could derive the memoryless UC property from the graphical assumptions in Fig1 (b)? 2. While Lemma 1 allows one to derive Assumption 1, its condition is a bit overly restrictive. It seems to suggest that the exogenous variables U_t at each timestep are drawn uniformly at random. This condition is somewhat unrealistic in practice. It also appears that Assumption 2 could hold in more general settings, e.g., Fig1(a). Could the authors explain the rationale behind Lemma 1?

Correctness: I haven't checked the details of the proof. However, the main results seem to be reasonable.

Clarity: This paper is clearly-written and well-organized.

Relation to Prior Work: The references and discussion of the related work are sufficient.

Reproducibility: Yes

Additional Feedback: -- POST REBUTTAL -- I have read the authors' responses and the other reviewers' comments. Overall, I enjoyed reading this paper but was confused by some technical aspects of the results. In particular, the relationship between Assumption 2 and the graphical models in Fig 1(a,b) was not clear to me. After checking the proofs in the Appendix, I can now see how Assumption 2 is entailed by the causal models of Fig 1(a-b).

I think the primary source of confusion is Lemma 1, which does not seem to be applicable in either Fig 1(a) or (b). To see this, in Fig 1(a), the condition in Lemma 1 does not hold if u' \neq \tilde{u}'. While the condition of Lemma 1 could hold in Fig 1(b), the deterministic relationship among the Us implies that some P(s',u'|s,u,a) = 0. Such zero values could call the validity of Lemma 1 into question, since its proof involves the ratio of a product of P(s',u'|s,u,a) terms.

Nevertheless, the issue with Lemma 1 is a minor one and should be easily fixable. For example, the authors could rephrase the condition in Lemma 1 as the variables U_i being mutually independent, which covers Fig 1(a). The authors could then have a separate discussion of the validity of Assumption 2 in Fig 1(b): e.g., once U_0 = U_1 = … = U_n is fixed, the environment reduces to a standard MDP; thus, Assumption 2 is entailed. Overall, I believe that the explanation of Assumption 2 and Lemma 1 could be improved, but this does not seem to affect the soundness of the main results in this paper. Having said that, I am willing to increase my score by 1 point, provided that the authors resolve the issues with Lemma 1 in a future draft.


Review 3

Summary and Contributions: This paper studies the extent of non-identification in off-policy infinite-horizon RL when certain state space variables are unobserved.

Strengths: Understanding the robustness of infinite-horizon MDP methods to missing data is an important topic, given the popularity of these methods in recent years. This paper has significant technical depth and appears to be addressing a challenging problem.

Weaknesses: The final proposed approach to bounding R_e produces results that appear difficult to interpret, for a few reasons.

One, a common difficulty in partial identification, is choosing the parameter \Gamma to appropriately reflect reality. While this is a fundamental difficulty, and not unique to this paper, the paper lacks any significant discussion of how one might go about choosing the level \Gamma, or what a reasonable range might be. What side information would an analyst need in order to choose an appropriate value?

Two is the set of assumptions made in place of assuming that all state variables are observed. Both Assumptions 1 and 2 appear to be untestable, as they depend on unobserved variables. The paper provides two examples of when Assumption 2 might hold, but does not discuss the implications of Assumption 1 when certain states are unobserved. In particular, Assumption 1 imposes conditions that jointly depend on the unobserved state and the evaluation policy. Can these be checked in any way? Do they put significant constraints on what the evaluation policy can be?

Three, the method partially identifies R_e using a non-convex optimization problem that cannot be solved both efficiently and exactly. Because the paper suggests solving the problem with approximate projected gradient descent, there is no guarantee that the resulting solution actually covers the entire partially identified region at termination. Therefore, the output may be an upper bound on the lower bound on R_e, or possibly a real (but loose) lower bound on R_e. This significantly weakens the practicality of using the method to draw truly robust conclusions.
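The concern about projected gradient descent can be seen in a generic sketch (this is not the paper's estimator, just a standard box-constrained PGD on a made-up non-convex objective): different starting points land in different local minima, so the value returned as a "lower bound" depends on initialization.

```python
import numpy as np

def project(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]."""
    return np.clip(x, lo, hi)

def pgd(f, grad, x0, lo, hi, lr=0.05, steps=500):
    """Projected gradient descent: gradient step, then project."""
    x = project(np.asarray(x0, dtype=float), lo, hi)
    for _ in range(steps):
        x = project(x - lr * grad(x), lo, hi)
    return x

# Non-convex 1-D objective with two basins on [-2, 2]: a tilted
# double well, f(x) = (x^2 - 1)^2 + 0.3 x.
f = lambda x: (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]
grad = lambda x: np.array([4.0 * x[0] * (x[0] ** 2 - 1.0) + 0.3])

x_left = pgd(f, grad, x0=[-1.5], lo=-2.0, hi=2.0)
x_right = pgd(f, grad, x0=[1.5], lo=-2.0, hi=2.0)
# The two starts converge to different local minima; only the smaller
# f value is the true (here, global) lower bound over the box.
print(f(x_left), f(x_right))
```

If the optimizer happens to return the right-hand minimum, the reported "lower bound" sits strictly above the true minimum, which is exactly the interpretability issue raised above.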

Correctness: The mathematical results appear to be correct. The empirical results would benefit from verifying both Assumptions 1 and 2 in each setting.

Clarity: The paper is very dense and makes many large mathematical jumps that are only explained in the appendix. It was difficult to follow the main idea due to the multiple reformulations of the optimization problem.

Relation to Prior Work: The related work section suggests that all previous work on unobservables in RL has introduced assumptions that allow identification. It would be appropriate to clarify that Zhang, J. and Bareinboim, E., 2019 (already cited) and Namkoong, H., et al., 2020 (currently missing; title is "Off-policy Policy Evaluation For Sequential Decisions Under Unobserved Confounding") have developed partial identification results in what appear to be finite-horizon settings, as readers may be interested in these settings as well.

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: The paper studies off-policy policy evaluation under unobserved confounding, where the policy value cannot be point-identified. The authors propose to compute upper and lower bounds on the target policy value by computing the support function of the partially identified set, which consists of all valid state-action occupancy ratios that agree with the sensitivity model they consider. They also propose a computationally efficient approximation to their method based on non-convex projected gradient descent. Finally, they provide some empirical results on gridworld domains to illustrate the proposed method.

Strengths: I am not familiar with this area of research, so I might not be qualified to judge the contribution of this paper. But overall, the proposed method sounds correct to me. The authors first state their main assumption on the unobserved confounding, which is satisfied for stationary or baseline confounders. Then, they introduce the sensitivity model. By marginalization, they reparametrize the sensitivity model in terms of the marginalized weights g, which seems to decrease the number of constraints as they marginalize over the confounders. They then characterize the "plausible set" of state-action occupancy ratios in terms of a linear program that involves an optimization over g. Finally, the proposed method involves a bilevel optimization problem that they approximately solve with non-convex projected gradient descent.

Weaknesses: I think the main limitation of this work is that it relies on the discrete nature of S (which might be okay), and that the proposed method necessarily scales with |S|^2 |A|, which we cannot afford even for moderately large MDPs. I also think it would be great if the authors stated the computational complexity of their method more clearly.

Correctness: The method looks sound to me.

Clarity: I am not really familiar with causal inference terminology, and some parts are not very clear for uninitiated readers like me. However, the paper looks overall in good shape.

Relation to Prior Work: The connection to related work in both the OPE and causal inference literatures seems well discussed in the introduction as well as in the related work section.

Reproducibility: Yes

Additional Feedback: