NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6430
Title: Debiased Bayesian inference for average treatment effects

Reviewer 1

The paper presents a Bayesian framework for causal inference in observational studies, in the potential outcomes setup. When studying average treatment effects, results based on Gaussian process priors may be biased, and the paper proposes a correction.

I have the impression that the paper does not pursue a fully Bayesian approach, since \pi does not seem to be estimated within the model and its prior distribution is never defined. In particular, contrary to a standard Bayesian approach, it is estimated separately and its estimate is plugged into the estimation procedure for m(x,r). A comparison with a fully Bayesian approach is missing.

The bias the authors want to correct is said to exist, but neither a reference nor an analytical proof is given.

I think one of the advantages of this approach over a fully Bayesian approach is computational efficiency. A comparison of computational times in the simulated and semi-simulated examples would be useful.

A description of the organisation of the paper is missing at the end of the Introduction.

Table 1: I am a bit surprised that the coverage of the proposed method is larger than that of the other methods, while the size of its credible intervals is similar to or smaller than that of other methods with low absolute error. The credible interval coverage should be similar to the frequentist level (since the credible intervals are of level 0.95, the coverage should be around 0.95), and a larger coverage should mean larger intervals. I think a better explanation of the Table is needed.

Line 103: saying that f, \pi, and m are independent under the posterior is imprecise, because they are conditionally independent (for example, they all depend on X).

Line 106: since the posterior distribution on \psi depends on the full posterior distribution, which, I believe, is approximated by using full conditionals, I think you still need to define a prior for \pi, as is also shown by Equation (4).
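
The Table 1 comment turns on how empirical coverage relates to interval width. The following is a minimal sketch of how both quantities could be computed over repeated simulations. It uses a hypothetical toy problem (intervals for a Gaussian mean with known standard deviation), not the paper's models or data; the function names and the `scale` and `bias` parameters are illustrative assumptions.

```python
import numpy as np

def coverage_and_width(lower, upper, truth):
    """Empirical coverage and mean width of a collection of intervals."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    covered = (lower <= truth) & (truth <= upper)
    return covered.mean(), (upper - lower).mean()

def simulate_intervals(n_reps=2000, n=50, truth=1.0, scale=1.0, bias=0.0, seed=0):
    """Toy setting (an assumption, not the paper's): intervals for a Gaussian
    mean with known sd = 1. `scale` stretches the nominal 95% half-width and
    `bias` shifts the interval centre away from the sample mean."""
    rng = np.random.default_rng(seed)
    data = rng.normal(truth, 1.0, size=(n_reps, n))
    centre = data.mean(axis=1) + bias
    half = scale * 1.96 / np.sqrt(n)
    return centre - half, centre + half

# Same centres, different widths: coverage grows with width.
for scale in (0.5, 1.0, 1.5):
    lo, hi = simulate_intervals(scale=scale)
    cov, width = coverage_and_width(lo, hi, truth=1.0)
    print(f"scale={scale}, bias=0.0: coverage={cov:.3f}, width={width:.3f}")

# Same width, shifted centres: coverage drops although the width is unchanged.
lo, hi = simulate_intervals(bias=0.3)
cov, width = coverage_and_width(lo, hi, truth=1.0)
print(f"scale=1.0, bias=0.3: coverage={cov:.3f}, width={width:.3f}")
```

In this toy setting, wider intervals around the same centres give higher coverage, as the reviewer expects. Coverage also depends on where the intervals are centred, so two methods with similar widths can differ in coverage if one is more biased. Whether that accounts for Table 1 is for the authors to clarify.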

Reviewer 2

In their manuscript entitled "Debiased Bayesian inference for average treatment effects", the authors present a new class of Bayesian model (or, as they phrase it, choice of [stochastic process] priors) for estimating average treatment effects (across a population) given only observational data not from an RCT (i.e., the canonical causal inference setting in social science or population health). The insight brought to this problem concerns structuring the model to introduce a posterior de-biasing correction based on theoretical insights from the Bayesian asymptotics literature. From my point of view, having worked extensively with Gaussian process & Dirichlet process models and having read widely on the asymptotics of these processes, I was very pleasantly surprised to see here: (i) the identification of a connection between the ATE estimation problem in a semi-parametric Bayesian setting and this particular branch of the asymptotics literature, and (ii) that the authors were able to successfully transfer the insights from the asymptotics back to the practical problem to achieve a more effective model. Moreover, the presentation is very clear (modulo my concern that I sometimes find the interchangeable use of the terms "model" and "prior" a little confusing at first). Although I found no errors in the text, I felt that perhaps some discussion of how this model might interact with the (seemingly) increasingly common prior step of automatic causal feature selection could be warranted: e.g. presumably this method suffers from a curse of dimensionality if too many unimportant variables are simply thrown into the design matrix, but likewise performance will presumably suffer if important covariates are omitted. Is there any way within this model class to diagnose either of those problems?

Reviewer 3

** Update after author response**

I'd like to thank the authors for their very detailed response to my and the other referees' comments. In particular, I welcome the comparison to BCF and find it quite interesting that it performs better on the semi-synthetic data but not the synthetic! On re-reading the paper and supplement, as well as the response, I do think the way you have incorporated the propensity score is quite clever. I'm quite happy to revise my score up to 7.

-----

The authors consider the important problem of heterogeneous treatment effect estimation. They specifically propose a non-parametric Bayesian procedure, placing a Gaussian process prior on the potential outcome function m(x,r). They note that the natural approach (e.g. placing a "vanilla" GP prior on this function) can result in substantial bias in the estimate. The authors propose re-parametrizing m(x,r) to include an estimate of the propensity score and note that such a correction yields better performance than state-of-the-art methods.

While I generally liked the approach of the paper, I cannot help but draw parallels to earlier work by Hahn and colleagues, which the authors have referenced. Indeed, Figure 1 displays a type of "regularization-induced confounding" described by this group in the context of linear models (reference 15 in the present manuscript) and tree-based methods (reference 16 in this paper). Moreover, the modified prior bears some similarity to the parametrization of m(x,r) used in reference 16, in that both decompose m into a term depending on an estimated propensity score and a term that does not include a propensity score estimate. I would note, however, that the proposed modification is arguably somewhat more principled than the one in reference 16, and I think some discussion comparing and contrasting these approaches is necessary.

In a similar vein, I think a direct comparison with the Bayesian causal forest (bcf) method from reference 16 is warranted. Adding such a comparison should be straightforward, as the method is currently implemented in the "bcf" package available on CRAN. This comparison is perhaps fairer than simply running BART with the propensity score included as a covariate.

Beyond this, I have a few additional concerns:

- Estimation of F: I understand that to provide inference for a population average treatment effect, one needs to accurately model the distribution of X. As the authors note in passing, this can be quite difficult, especially when the covariates are high-dimensional and include both continuous and discrete variables. In general, posterior inference about the population ATE proceeds by repeatedly (i) drawing F* ~ F | D, a posterior sample of the distribution of X, (ii) drawing X* ~ F*, a new set of covariates, and (iii) computing m(X*,1) - m(X*,0). The authors propose using a Dirichlet Process prior on F with base measure equal to the empirical distribution of the observed covariates. The resulting posterior distribution of F places all of its probability mass on distributions that are supported only on the covariates observed in the sample. As a result, all inference about the population ATE is based on evaluating the potential outcome function on covariates already observed in the sample. This seems like a "population effect" only in the narrowest sense: it implicitly assumes that within the sample, one has observed all possible sets of covariates.

Originality & Quality: To the best of my knowledge, the proposed representation of the function m(x,r) is new and I rather liked the motivation for it (provided in the supplement). It is especially promising that the proposed method yields performance similar to the state-of-the-art. However, the general phenomenon described and the overall approach bear a striking similarity to existing work, which must be acknowledged.

Clarity: The paper is well written.

Significance: I do think the results are important and this paper opens up the possibility for better Bayesian non-parametric estimation of heterogeneous treatment effects.
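
The sampling scheme in the "Estimation of F" concern can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that posterior draws of m(X_i,1) and m(X_i,0) at the observed covariates are already available as arrays, and that a posterior draw of F amounts to Dirichlet weights on the observed covariates (a Bayesian-bootstrap-style simplification). The function name `population_ate_draws`, the `conc` parameter and the toy inputs are all hypothetical.

```python
import numpy as np

def population_ate_draws(m1, m0, conc=1.0, seed=0):
    """Posterior draws of a population ATE under a discrete posterior for F.

    m1, m0 : arrays of shape (S, n) holding S posterior draws of m(X_i, 1)
             and m(X_i, 0) at the n observed covariates (assumed given).
    conc   : hypothetical Dirichlet concentration per observed point;
             conc=1 corresponds to Bayesian-bootstrap weights.
    """
    m1 = np.asarray(m1, dtype=float)
    m0 = np.asarray(m0, dtype=float)
    n_draws, n_obs = m1.shape
    rng = np.random.default_rng(seed)
    # One weight vector per posterior draw: F* puts mass w_i on observed X_i only.
    w = rng.dirichlet(np.full(n_obs, conc), size=n_draws)
    # Averaging m(X*,1) - m(X*,0) over X* ~ F* is a weighted sum over observed X_i.
    return (w * (m1 - m0)).sum(axis=1)

# Toy inputs (not from the paper): effect 2 + x, with noisy "posterior" draws of m.
rng = np.random.default_rng(1)
n, S = 200, 1000
x = rng.uniform(size=n)
m0 = rng.normal(0.0, 0.1, size=(S, n))
m1 = m0 + (2.0 + x) + rng.normal(0.0, 0.1, size=(S, n))

draws = population_ate_draws(m1, m0)
lower, upper = np.quantile(draws, [0.025, 0.975])
print(f"posterior mean ATE: {draws.mean():.3f}, 95% interval: ({lower:.3f}, {upper:.3f})")
```

The sketch makes the reviewer's point visible: every draw only reweights m evaluated at the observed X_i, so covariate values outside the sample never enter the estimate.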