NeurIPS 2020

A Class of Algorithms for General Instrumental Variable Models

Review 1

Summary and Contributions: The authors study partial identification of causal effects, which provides bounds on the causal effect of a treatment variable on an outcome variable. They consider the case in which an observable instrumental variable exists and the treatment and outcome are continuous. The proposed method considers a parametric family of response functions. The causal effect, written as an integral over the distribution of possible responses to the treatment, is then approximated, so that efficient gradient-based optimization techniques can be used to find lower and upper bounds on the causal effect.

Strengths: The majority of work on the instrumental variable framework considers additive noise models to enable identification, and work on partial identification usually does not consider continuous distributions. Hence this work provides an efficient method for the missing case of partial identification in continuous models. This makes the work interesting to the causal inference community. The optimization approach seems sound and interesting.

Weaknesses:
- Since the proposed optimization is non-convex, there is no guarantee for the correctness of the bounds.
- Perhaps the most important missing result in this work is confidence intervals for the bounds.
- In some parts, for instance the choice of function family for p_\eta, it seems that the only criterion for the modeling choices is to make the optimization task efficient, and no other justification is provided.
- Are there any intuitions or guidelines for choosing the response functions? I thought an MLP would be a good choice, but the resulting bounds seem to be loose.
- In the experiments, only two cases are considered: the linear Gaussian case and a second case in which the treatment is again linear and the outcome is generated by 0.3X^2−1.5XC+e. It seems necessary to consider other non-linear instances as well.
- The choice of the outcome equations (X-6C+e and 0.3X^2−1.5XC+e) looks arbitrary. Was there any specific reason for this choice?
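For reference, the target quantity implied by the two outcome equations quoted above can be worked out directly. This is only a sketch under an assumption not stated in the review, namely that the confounder C and the noise e both have mean zero and are unaffected by intervening on X:

```latex
% Assumption (illustrative): E[C] = 0 and E[e] = 0, with C and e unaffected by do(X = x).
\begin{align*}
\text{Linear case: } \quad
  \mathbb{E}[Y \mid do(X = x)] &= x - 6\,\mathbb{E}[C] + \mathbb{E}[e] = x, \\
\text{Non-linear case: } \quad
  \mathbb{E}[Y \mid do(X = x)] &= 0.3x^2 - 1.5x\,\mathbb{E}[C] + \mathbb{E}[e] = 0.3x^2 .
\end{align*}
```

Under that assumption the confounder drops out of the interventional mean in both cases, even though it biases the observational regression of Y on X.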

Correctness: The method used in this work seems correct to me.

Clarity: Overall the paper is easy to read, yet some details, such as those of subsection 3.2, could have been explained more.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: After rebuttal: I thank the authors for their responses. My score remains the same.

Review 2

Summary and Contributions: The paper presents a response function [Balke & Pearl] approach for bounding the causal effect in instrumental variable settings with continuous treatment and response variables. Their approach optimizes over a parametric response function space via an augmented Lagrangian procedure (to deal with the constrained optimization).
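As a generic illustration of what an augmented Lagrangian procedure does, the minimal sketch below solves a toy equality-constrained problem. It is not the authors' implementation: the objective, the constraint, the function name `augmented_lagrangian`, and all hyperparameters are assumptions chosen for illustration, whereas in the paper the objective would be the causal effect and the constraints would tie the model to the observed data.

```python
# Toy problem (illustrative assumption):
#   minimize   f(theta) = theta_1 + theta_2
#   subject to h(theta) = theta_1^2 + theta_2^2 - 2 = 0
# The constrained minimum is theta = (-1, -1) with multiplier 0.5.

def augmented_lagrangian(theta, rho=1.0, lr=0.05, outer=20, inner=200):
    """Minimize f subject to h = 0 using L = f + lam*h + (rho/2)*h^2."""
    lam = 0.0  # Lagrange multiplier estimate
    for _ in range(outer):
        # Inner loop: plain gradient descent on the augmented Lagrangian.
        for _ in range(inner):
            h = theta[0] ** 2 + theta[1] ** 2 - 2.0
            w = lam + rho * h  # effective multiplier on grad h
            grad = [1.0 + 2.0 * w * t for t in theta]
            theta = [t - lr * g for t, g in zip(theta, grad)]
        # Outer step: update the multiplier with the remaining violation.
        h = theta[0] ** 2 + theta[1] ** 2 - 2.0
        lam += rho * h
    return theta, lam


theta_opt, lam_opt = augmented_lagrangian([-0.2, -0.8])
print(theta_opt, lam_opt)
```

The penalty term keeps the iterates near the constraint set, while the multiplier update removes the residual violation that a pure penalty method would leave at a finite penalty weight.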

Strengths: Overall I enjoyed this paper - it demonstrates how we can leverage modern ideas such as the reparameterization trick to address the computational challenges associated with bounding the causal effect in continuous settings. Perhaps more importantly, it complements the ever-expanding toolkit of machine learning methods for IV to include a partial identification method: addressing uncertainty from potential lack of identification is an important topic for practical causal inference.

Weaknesses: My biggest concern is the sensitivity of the method to parametric assumptions. This is obviously unavoidable --- as you grow the space of possible models, you also worsen the identification problem --- but I would have liked to see some discussion of the tradeoffs here. The paper points out the limitations of the y = f(x) + e_y approach to achieving identification; but then isn't as explicit about the implied limitations of different parametric assumptions. Section G in the appendix deals with some of this, but I still have a hard time thinking about how an analyst might reason about the tradeoffs of different parameterizations of the models and their associated assumptions about the ways that u can affect f in the structural equation.

Correctness: As far as I can tell - yes. Both the methods and the empirical evaluation appear sound.

Clarity: The paper is very clear.

Relation to Prior Work: Most of the relevant work is cited, but I would include Bennett, Kallus & Schnabel [2019] and Lewis & Syrgkanis [2018] among the deep network approaches.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The authors are interested in deriving upper / lower bounds for causal effects under the assumption that an instrumental variable exists, by maximizing / minimizing the causal effect estimate over all IV models compatible with the observed data distribution. They propose to build on work from [Balke&Pearl (1994)] - who describe constraints for the compatibility of the model with the observed data distribution in a sharper way than most recent works - and solve the intractable optimisation problem that arises using recent advances in SGD/Monte-Carlo methods.

Strengths: The article is clearly written and the authors are very pedagogical in explaining their contributions and how they relate to prior work. The authors notably build on the pioneering work of Balke and Pearl to define the constraints that characterise the “models compatible with the observed data” using marginals, making a clear difference with most recent work on this matter. The proposed method has two main advantages: (1) it deals with continuous treatment and (2) it allows for the use of recent stochastic optimisation algorithms - both of which are, as far as I know, novel. This is done by parametrising the causal function space in a simple (yet expressive) way that allows constraints on the underlying model to be incorporated. The experiments are very thorough. The authors experiment with both a linear additive case and a non-linear non-additive case, each time with varying levels of confounding and instrument strength. They compare their bounds with the true causal effect, but also report results from 2SLS and the recent KIV method.

Weaknesses: The authors make choices regarding the parametrisation of the various distributions at play, which are consistent with experimentation and implementation choices. Although this is understandable in the case of such a complex problem, some minor comments/questions remain. The response function space is modelled as the space of linear combinations of basis functions. While the authors argue that the proposed method works for any differentiable parametrisation, it isn’t clear how the optimisation algorithm would behave if the response function space was not parametrised as such. Such a parametrisation is indeed very expressive (as explained in the appendix), and is valid if we have prior knowledge of the response function form (which seems to be what is assumed in the experiments in line 246). However, one may wonder how to choose such basis functions in practice, when one has no prior knowledge of the confounders, which might have any type of complex influence on the other variables at play. In such a case, the proposed parametrisation could lead to overlooking part of the (valid) response function space, possibly invalidating the optimisation result. The authors propose to build a grid for variable Z, notably enabling simpler matching of p(y|z). As mentioned in the article, such an “approximation can only relax the constraints”, and therefore cannot invalidate the bounds, although the extent to which the bounds might be loosened isn’t clearly discussed. In Section 3.2, the authors propose to “bake in” one of the constraints on the marginal p(x|z) by directly using the p(x|z) identified from the observed data (line 154), although it isn’t clear how this is done exactly. As far as I understand, in the experiments the authors refer to each point z of the Z grid, and consider the corresponding observations of X to estimate p(X|Z=z): wouldn’t cases where there are few values of X for a given value z be problematic? Wouldn’t identifying such a distribution from the data require a prior model for this distribution (e.g. linear Gaussian)?

Correctness: Yes

Clarity: Yes, the paper is very clearly written.

Relation to Prior Work: Yes, relation to prior work is nicely discussed and very pedagogic.

Reproducibility: Yes

Additional Feedback: ---EDIT AFTER AUTHOR RESPONSE--- After reading the other reviews and the authors' response, my grade remains unchanged, and I think the paper should be accepted.

Review 4

Summary and Contributions: This paper studies bounding causal effects from data collected by randomized experiments contaminated with non-compliance. In particular, the authors assume the instrumental variable (IV) model (Pearl, 2000, Sec 8.2). The authors improve over the existing results by considering a generalized setting where the domains of the treatment and outcome are continuous. To address the challenges of continuous domains, the authors consider a family of IV models where the functions are parametrized by a linear combination of non-linear basis functions (kernels). The basis functions are presumed to be known while the coefficients are drawn from a multivariate Gaussian distribution. The primary bounding strategy follows the methods of (Balke & Pearl, 1994) (for short, BP94). That is, the authors (1) formulate the causal bounding problem as a series of optimization programs and (2) obtain the bounds by solving these programs. The authors also discuss some practical considerations for deriving the bounds. For instance, domains are discretized to estimate the observational distribution. A stochastic gradient descent algorithm is employed to solve the formulated optimization programs.

Strengths: The experiments are really comprehensive. Future work on bounding causal effects in IV models should follow a similar framework. I like the idea of using two coefficients a, b to categorize instances based on the strength of instrument and confounding. The proposed method is verified in each of these categories. I also appreciate the fact that the authors report both positive and negative cases.

Weaknesses: The authors study a critical problem in causal inference and make some interesting progress. However, I do have some concerns. I am particularly curious about how sensitive the derived bounds are to the parametric assumptions on the underlying functions. For instance, in the plot of Fig 2, Row 2, Column 1, the actual causal effect seems to lie outside the derived bounds. This suggests that when the instrument is weak and the unobserved confounding is strong, the proposed methods may not lead to valid bounds. Is there any practical method to test the strength of the instrument and confounding from the observational data? Otherwise, this result seems to suggest that the validity of the proposed method relies on untestable parametric assumptions, which makes the significance of this work somewhat limited. By contrast, the universal partitioning model introduced in (BP94) (i.e., the discretization of the latent space based on the response functions) is robust for any IV model with discrete observed variables. That is, one could always obtain a valid bound in the discrete domains using the linear program formulation of (BP94). This leads to my next question. Is it possible to (1) discretize the observational data into different bins, and (2) obtain a causal bound using (BP94)? It seems that this somewhat naive approach is guaranteed to lead to valid bounds. How does this approach compare to the authors' method? It would be appreciated if the authors could provide some insights.
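One way to picture the binning baseline raised above is the following minimal sketch. It splits each variable at its median and then computes the simple "natural" bounds on the average effect of the binarized treatment, which are valid for binary IV models but generally looser than the sharp linear-programming bounds of (BP94); the sharp LP itself is not reproduced here. The treatment equation, the function names, and the median cutoff are illustrative assumptions, and the resulting interval is only meaningful if the instrumental variable assumptions still hold after coarsening, which is itself open to question.

```python
import random


def binarize(values):
    """Split a continuous sample at its (upper) median into {0, 1}."""
    cutoff = sorted(values)[len(values) // 2]
    return [1 if v >= cutoff else 0 for v in values]


def natural_bounds(z, x, y):
    """Bounds on P(Y_{x=1}=1) - P(Y_{x=0}=1) for binary Z, X, Y.

    Uses P(Y=1, X=x | Z=z) <= P(Y_x=1) <= 1 - P(Y=0, X=x | Z=z) for every z.
    """
    def p(yv, xv, zv):
        # Empirical P(Y=yv, X=xv | Z=zv).
        n_z = sum(1 for zi in z if zi == zv)
        n = sum(1 for zi, xi, yi in zip(z, x, y) if (zi, xi, yi) == (zv, xv, yv))
        return n / n_z

    lo, hi = {}, {}
    for xv in (0, 1):
        lo[xv] = max(p(1, xv, zv) for zv in (0, 1))
        hi[xv] = min(1 - p(0, xv, zv) for zv in (0, 1))
    return lo[1] - hi[0], hi[1] - lo[0]


# Simulated data. The outcome equation X - 6C + e is the one quoted in Review 1;
# the treatment equation X = Z + C + noise is an assumption made for this sketch.
rng = random.Random(0)
n = 2000
z_raw = [rng.gauss(0, 1) for _ in range(n)]
c_raw = [rng.gauss(0, 1) for _ in range(n)]
x_raw = [zi + ci + rng.gauss(0, 1) for zi, ci in zip(z_raw, c_raw)]
y_raw = [xi - 6 * ci + rng.gauss(0, 1) for xi, ci in zip(x_raw, c_raw)]

lower, upper = natural_bounds(binarize(z_raw), binarize(x_raw), binarize(y_raw))
print(lower, upper)
```

The interval refers to the effect of the binarized treatment on the binarized outcome, so it is not directly comparable to bounds on the continuous causal effect.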

Correctness: This paper is technically sound. The empirical methodology is sound and comprehensive.

Clarity: This paper is clearly-written and well-organized.

Relation to Prior Work: The references and discussion of the related work are sufficient.

Reproducibility: Yes

Additional Feedback: -- POST REBUTTAL -- I have read the authors’ responses and other reviewers’ comments. Unfortunately, they did not address my concerns regarding this paper. In particular, the authors claim that the inconsistent bounds in Fig 2 (Row 3, Column 1) are due to finite-sample issues, and that “higher sample size will control this error”. This comment is curious since the error in Fig 2(R3, C1) appears to be quite significant. If that is really due to insufficient sample size, the authors may want to develop confidence bounds that control the uncertainties of finite samples. Insufficient sample size may indeed serve as an alternative explanation. Still, it does not rule out the possibility that the proposed parametric assumptions are incorrect, which is more likely to be the cause of the bounding errors in Fig 2. In the end, accepting/rejecting comes down to how general the required parametric assumptions are. While I could see the point of letting practitioners evaluate these assumptions in practice, I am afraid that such a decision may lead to more misuse than is intended. Due to the nature of causal inference studies, the target causal effect often remains unknown to the investigators. Bounding errors, if they exist, are not likely to be caught during the study. Some investigators may follow up and revisit these assumptions, but there is a chance that their concerns could go unnoticed. For these reasons, I would like to maintain my original score. Having said that, I believe this work would be most improved by a discussion of the robustness of the required parametric assumptions. A sensitivity analysis of these assumptions is also encouraged.