NeurIPS 2020

### Review 1

Summary and Contributions: Update after rebuttal: I remain at weak accept because I'm not sure that identification with proxies is carefully discussed in the paper; modelling uncertainty for causal inference requires specific assumptions and associated methods. Further, I would need to see the additional evaluation and ablation studies to judge the merit of the proposed method. ============================================================== The paper presents an effect estimation method that relies on the Robinson decomposition plus KL balancing to improve causal estimation. With the proposed method, either (strong) proxies of the confounder or direct confounders can be used.

Strengths: The strengths of the paper are the experimental improvements on the IHDP and ACIC benchmarks. I suspect this is due to the ease of KL balancing and the special form of the loss functions. Lemma 3.3 is a nice touch in the vein of the Johansson et al. generalization bounds.

Weaknesses: The idea of latent variable identification is very limited and should not be relied on to show causal identification directly. I think Miao et al. give a general setup for causal identification with proxies under a completeness assumption, and even then they address average effects, not conditional effects. Beyond this, identification requires specific conditions, and these conditions need to be discussed. I think the experiments could include more ablation studies that show how much noise in the confounder given the proxy helps or hurts estimation. For example, increasing noise does seem to help, but why? Is it just ease of training, or is it something more fundamental about the data generating process? Please comment on these issues.

Correctness: The claims of robustness to unmeasured confounding are imprecise. Such robustness can only occur in certain cases, such as when the noise in the confounder is averaged out in the estimates. Lemma 3.3 seems to have an issue: the supremum M should appear in the numerator. Otherwise the claim seems to be that an unbounded loss means balance does not matter.

Clarity: The paper is not very hard to read. However, the narrative can be simplified. To me, the narrative is that the Robinson decomposition plus a KL balance condition improves effect estimation, and that identification then relies on identification with proxies. The latter requires more discussion.

Relation to Prior Work: The discussion of prior work is extensive, but it is not presented in a helpful way and does not add insight. I think giving examples of where causal identification holds is important.

Reproducibility: Yes

Additional Feedback: The authors should clarify their discussion about identification. They should provide clear examples of where latent variable identification is possible, which would then lead to causal identification. Without this, the claims in the paper are at risk of misleading readers. Please clarify the phrase "Robust to unmeasured confounding". This is a bit vague. The following is my reasoning: If your proxies determine the unobserved confounders, you immediately have effect identification. If they don't, there are conditions under which the effect is identified, such as when the outcome model is additive in T and Z. For a counterexample, imagine Y = T + 1[cos(k \pi Z) > 0] with large scalar k, treatment T and confounder Z, and X = Z + normal noise. The noise in Z | X would smooth out the discontinuities, which means the error in effect estimates increases with noise.
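The counterexample can be probed with a small simulation. The sketch below is only illustrative and rests on assumptions that are not in the review: Z is drawn uniformly on [0, 1], treatment is assigned with probability 0.2 + 0.6 * 1[cos(k \pi Z) > 0], and the effect is estimated by a naive difference in means within narrow bins of the proxy X. The function name `stratified_effect_error` and all parameter defaults are hypothetical, and this estimator is not the method of the paper under review.

```python
import math
import random


def stratified_effect_error(sigma, k=10, n=20000, bin_width=0.01, seed=0):
    """Absolute error of a proxy-stratified effect estimate (true effect is 1)."""
    rng = random.Random(seed)
    # bins[b] = [sum of Y for treated, count treated, sum of Y for control, count control]
    bins = {}
    for _ in range(n):
        z = rng.random()  # ASSUMPTION: confounder Z ~ Uniform(0, 1)
        g = 1.0 if math.cos(k * math.pi * z) > 0 else 0.0
        # ASSUMPTION: treatment is more likely when g(Z) = 1 (not specified in the review).
        t = 1 if rng.random() < 0.2 + 0.6 * g else 0
        y = t + g  # outcome model from the review, so the true effect of T is 1
        x = z + rng.gauss(0.0, sigma)  # noisy proxy X = Z + normal noise
        cell = bins.setdefault(math.floor(x / bin_width), [0.0, 0, 0.0, 0])
        cell[0 if t else 2] += y
        cell[1 if t else 3] += 1
    weighted_sum, total = 0.0, 0
    for sum_y1, n1, sum_y0, n0 in bins.values():
        if n1 and n0:  # a stratum contributes only if it contains both arms
            weighted_sum += (n1 + n0) * (sum_y1 / n1 - sum_y0 / n0)
            total += n1 + n0
    return abs(weighted_sum / total - 1.0)


if __name__ == "__main__":
    for sigma in (0.0, 0.02, 0.5):
        print(sigma, stratified_effect_error(sigma))
```

Under these assumptions, the error should be close to zero when X equals Z and should grow as the proxy noise becomes large relative to the spacing 1/k of the discontinuities, because stratifying on X then removes less of the confounding that enters through Z.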

### Review 2

Summary and Contributions: This paper takes a generative modelling approach towards addressing the problem of causal inference. The proposed algorithm uses the Robinson residual decomposition to derive a reformulated variational bound which is designed to explicitly estimate the causal effects rather than individual potential outcomes.

Strengths: + The claims are sound theoretically and empirical evaluations are adequate. + Causal inference is a quite relevant subject to the NeurIPS community.

Weaknesses: I have my concerns regarding the significance and novelty of this work, and I think it is not enough for publication in NeurIPS. Specifically, this work provides an improvement over a previous work, namely CEVAE [46], by adding a penalty term for learning balanced representations -- see Eq. (5). The idea of adding this penalty is not novel either, as many works (originated by [32]) have adopted and incorporated this idea into their algorithms. Moreover, the text promises to accommodate counterfactual validation in lines 62-63; however, I could not find it addressed later in the paper.

Correctness: + I acknowledge that there exists literature on estimating the causal effects directly rather than indirectly from individual potential outcomes; but I’m not familiar with it. However, I’m not sure if the proposed objective function could be optimized given the observed data; as it appears to require \tau values which are never observed. Could the authors please elaborate on how they provide the \tau(x) values for training their model? This comment also holds for m(x) as we never observe both \mu_0 and \mu_1 for the same subject. + Eq. (12): The reason why many works use integral probability metrics is that KL-divergence measure of discrepancy between two probability distributions is rather unstable numerically. Please comment on why you think this won’t be an issue with your method.
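For context on the question about unobserved \tau values, the standard Robinson decomposition behind the R-learner is sketched below. This is the textbook form under unconfoundedness given x, not a reproduction of the paper's equations; the hats denote fitted nuisance estimates, and whether the paper's variational objective obtains m and e in this way is for the authors to confirm.

```latex
% Nuisance functions, both estimable from observed (x, t, y) alone:
%   e(x) = P(T = 1 \mid X = x), \qquad m(x) = E[Y \mid X = x]
\begin{align}
  m(x) &= e(x)\,\mu_1(x) + \bigl(1 - e(x)\bigr)\,\mu_0(x),
  \qquad \tau(x) = \mu_1(x) - \mu_0(x), \\
  y - m(x) &= \bigl(t - e(x)\bigr)\,\tau(x) + \epsilon,
  \qquad E[\epsilon \mid x, t] = 0, \\
  \hat{\tau} &= \arg\min_{\tau} \frac{1}{n} \sum_{i=1}^{n}
  \Bigl[ \bigl(y_i - \hat{m}(x_i)\bigr) - \bigl(t_i - \hat{e}(x_i)\bigr)\,\tau(x_i) \Bigr]^2 .
\end{align}
```

In this form, \tau enters the loss only as the function being fitted, and m(x) is a regression of the observed outcome on x, so neither \tau(x) labels nor both potential outcomes for the same subject are needed.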

Clarity: This is a very well-written paper.

Relation to Prior Work: Yes, the related work section fully discusses how this work differs from previous contributions.

Reproducibility: Yes

Additional Feedback: + Line 147: did you mean “exclude” instead of “preclude”? + There are a couple of mistakes in Eq. (3); the correct versions are: y - m - (t - e) \tau; and y - {t \mu_1 + (1 - t) \mu_0}. Please verify whether these were only typos. + Lines 157-158: My understanding is that the factorization in Eq. (4) is a direct result of the assumed graphical model (Fig. S1.c). The authors however state that plugging the result of Eq. (3) into ELBO yields this factorization. I think what they meant to say was that in their implementation, they substitute the \tau-loss term with its \epsilon equivalent. Please clarify. ===== post-rebuttal ===== The authors have addressed many of my concerns in their rebuttal; however, I still have my concerns regarding novelty and the claim that the existing generative objectives do not account for selection bias. I have updated my score accordingly.

### Review 3

Summary and Contributions: The authors use deep generative models for causal effect estimation from a Rubin causal inference perspective. They attempt to adjust variational inference to account for balancing. They introduce a validation technique called "counterfactual validation." These two things, (1) adjusting the objective for causal constraints and estimation procedures and (2) validation procedures that incorporate causal semantics, are high impact.

Strengths: I've seen work that tries to use deep generative modeling to get parametric identifiability under confounding (e.g. using IV) with VAEs. This is the first one I've seen that takes practical estimation concepts from the Rubin causal inference literature and ties them to the deep generative objective function. I also

Weaknesses: Works in this vein lack proofs of identifiability. That is a weakness here as well, though the authors do a good job addressing it.

Correctness: I found no issues with the claims.

Clarity: Yes. Extremely so.

Relation to Prior Work: I am satisfied with the discussion of prior work. Especially when it comes to identifiability.

Reproducibility: Yes

Additional Feedback: Regarding the claim "And in our follow-up investigations, we have found that variant of the proposed variational framework shows robustness against the algorithmic biases towards the minority populations, a major issue that draws criticism for machine learning applications." That's waaaay too hand-wavy. Back that up in the supplement. Please don't just pay lip service to these issues. Unlike the broad ML community, causal inference research has mathematically concrete things to say about fairness/bias/discrimination.

### Review 4

Summary and Contributions: This manuscript proposes a novel computational approach to ITE estimation, based on Bayesian models, variational inference and domain adaptation. First, it uses a known decomposition of the ITE to formulate the estimation as an appealing Bayesian inference problem, solved with variational inference. Second, it proposes an original, theoretically justified penalty for dealing with imbalance. Then, the authors present a set of experiments in which their method outperforms the state of the art.

Strengths: The paper is extremely comprehensive. Indeed, it might even get confusing at some points, because many topics are brought up and it is not always clear whether the paper at hand is attempting to solve them or not (more details in the Clarity section). Unifying VI and the R-learner is an elegant paradigm that should be relevant to the NeurIPS community. ITE estimation is a specific flavor of counterfactual inference; it would be interesting to contextualize it for other offline counterfactual problems (batch learning from bandit feedback, for example). It is an appealing point of the method that the variational formulation of the KL divergence (using the Fenchel dual form) allows for reduced complexity compared to Wasserstein and MMD.
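For readers unfamiliar with the Fenchel dual form mentioned here, two standard variational representations of the KL divergence are recalled below. Which of them (if either) matches the paper's implementation is an assumption left open; the critic function f is a generic symbol, not the paper's notation.

```latex
% f-divergence (Fenchel conjugate) form, using the conjugate of u \log u:
\begin{equation}
  \mathrm{KL}(p \,\|\, q) \;=\; \sup_{f}\;
  E_{x \sim p}\bigl[f(x)\bigr] \;-\; E_{x \sim q}\bigl[e^{f(x) - 1}\bigr],
  \qquad f^{*}(x) = 1 + \log \frac{p(x)}{q(x)} .
\end{equation}
% Donsker--Varadhan form:
\begin{equation}
  \mathrm{KL}(p \,\|\, q) \;=\; \sup_{f}\;
  E_{x \sim p}\bigl[f(x)\bigr] \;-\; \log E_{x \sim q}\bigl[e^{f(x)}\bigr] .
\end{equation}
```

Both expressions involve only sample averages under p and q, so they can be estimated from minibatches by maximizing over a parametric critic f.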

Weaknesses: ITE estimation is a well-established line of work. Although this paper is comprehensive and has the merit of laying out where this field may progress, as well as making some contributions, there are some limitations to the novelty. Regularizing variational inference with KL divergence has a Bayesian motivation that is discussed in this recent manuscript [1]. Proposing a new regularization scheme for ITE estimation (or another counterfactual problem) along with a tractable approximation has been investigated in many research works. For example, [2] presents the chi-square divergence as the variance of importance sampling (in batch learning from bandit feedback) and proposes to minimize it with a variational formulation (f-divergence). [1] https://arxiv.org/pdf/1806.11500.pdf [2] http://proceedings.mlr.press/v80/wu18g.html I do not know whether this paper was intended to discuss the problem of identifying latent confounding variables (cf. clarity), but the experiments do not highlight any of that. Similarly, the paper briefly mentions counterfactual cross-validation in the introduction, but this is not explored ("normal" cross-validation is used in the experiments).

Correctness: To the best of my knowledge, all the presented claims and methods seem reasonable. I have some minor points. Line 230: The estimate of the causal effect makes use of the variational distribution. First, this can be biased. Second, VI is known to underestimate the variance of the posterior distribution. There might be some improvement from using importance weighted variational inference with a self-normalized importance sampling estimator, especially in scenarios where the uncertainty in tau is important. Line 248: The square root of a KL divergence usually comes from a Pinsker inequality (and that is the case here). However, the total variation distance is symmetric while KL is not, so you could get a similar bound with KL(q_1 || q_0). Why not use a symmetrized version (sum of reverse and forward KL)? Table 1: CFR has extremely poor performance. This is mentioned somewhere in the Appendix, but because the datasets are the same, and because CFR outperforms BART, OLS, etc. in its original paper (and, to my knowledge, in some others), I am intrigued. Is this something the authors kept working on after the submission deadline? It would be good to know why it doesn’t work.
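The symmetry argument about Line 248 can be spelled out with Pinsker's inequality. The block below states the standard inequality (KL in nats) for two generic distributions q_0 and q_1; it does not reproduce the paper's lemma, and the constants in the paper's bound may differ.

```latex
\begin{align}
  \mathrm{TV}(q_0, q_1)
    &\le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(q_0 \,\|\, q_1)}
    && \text{(Pinsker)} \\
  \mathrm{TV}(q_0, q_1) = \mathrm{TV}(q_1, q_0)
    &\le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(q_1 \,\|\, q_0)}
    && \text{(TV is symmetric)} \\
  \mathrm{TV}(q_0, q_1)
    &\le \sqrt{\tfrac{1}{2}\,\min\bigl\{\mathrm{KL}(q_0 \,\|\, q_1),\, \mathrm{KL}(q_1 \,\|\, q_0)\bigr\}}
    \le \sqrt{\tfrac{1}{4}\,\bigl[\mathrm{KL}(q_0 \,\|\, q_1) + \mathrm{KL}(q_1 \,\|\, q_0)\bigr]} .
\end{align}
```

The last step uses the fact that the minimum of two numbers is at most their average, so a bound in terms of the symmetrized sum follows directly from either one-sided bound.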

Clarity: The paper is compelling and well written. I have some minor comments. + Lines 52-54: the way it is written, it seems like this manuscript is attempting to solve points (i), (ii) and (iii). However, only (ii) is addressed, and this is already a well-studied problem. + The paper does not make it clear whether the latent variable z accounts for hidden confounding factors, or whether it just helps in predicting the ITE and its noise, etc. + The sigma introduced in line 174 is not explained. The statement "V-NICE also approximately recovers the R-learner for sigma -> 0" is therefore not clear. Should this be added as a proof somewhere? I could not find it in the SM after a quick glance. + After a bit of time, it seems like the Fenchel dual form of the KL is used twice, once for the ELBO and once for the balancing term. However, this is not explained at lines 197-204 and needs to be better explained. The notation theta prime in Alg 1 is not used anywhere else in the manuscript. + I found that the number of baselines was low. However, I found more results in the supplements. So, there should be more links between the main text and the supplementary files. Typos: missing a space at line 11; green in line 74; word missing at line 100; RHS not explained at line 156. The list of citations must be worked through: some entries are missing and some others are poorly formatted.

Relation to Prior Work: Related work is clearly discussed in a very nice section (except for the citations I proposed earlier).

Reproducibility: Yes

Additional Feedback: I would like to thank the authors for their clarification on counterfactual cross-validation and other questions I had. I maintain my score; I think this is a good paper.