Paper ID: 1443

Title: Adapting Neural Networks for the Estimation of Treatment Effects

The authors present a deep learning methodology for estimating the nuisance parameters of causal estimation (the mean outcome and propensity score functions). It is built around learning a shared low-dimensional representation of the confounders, regularized for good finite-sample performance, together with a training algorithm based on their proposed concept of "targeted regularization" (a regularization scheme inspired by TMLE). With these novel methods they achieve state-of-the-art performance on standard benchmark datasets for high-dimensional causal inference. Their methodology combines multiple ideas in causal inference (multi-headed deep learning models and targeted learning) in a novel way, and the empirical evaluation seems very strong, so I recommend the paper for acceptance. Some issues with the paper are as follows:

- They claim that their methodology is stable because it does not involve any propensity terms in denominators. As far as I can tell this is false, because their prediction is based on the learned \tilde{Q} function, whose second term (the one weighted by \hat{\epsilon}) involves propensities in denominators.
- The baselines in the evaluations are not completely clear. In particular, the "baseline (TARNET)" method should apparently be the same as "TARNET (Sha+16)" in Table 1, yet the two rows report different numbers. Perhaps the second is the number reported in past work and the first was produced with the authors' code, possibly with different details in the exact model architecture and learning hyperparameters, but this is not made explicit.
- The authors claim that part of the model's strength is insensitivity to very low/high propensity scores, owing to the lack of propensity scores in denominators. However, their evaluations exclude data points with extreme propensity scores, which makes this claim difficult to verify. In addition, since different methods estimate propensity scores differently, it is not clear which data points are removed, and whether all methods are evaluated on the same data.
- Equation 2.2 has no hyperparameter controlling how the two loss terms are weighted. Is there a reason why no such term is included?
- They claim that the third head regularizes the model such that finite-sample performance should improve, but no part of their experiments evaluates this claim. It would be good to see an experiment testing it, even on synthetic data (e.g., by fixing a synthetic data distribution and comparing model performance with 2 versus 3 heads at small n versus very large n).
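To make the denominator concern concrete: in the standard TMLE-style fluctuation (I am assuming the paper's parameterization is similar), the perturbed outcome model is \tilde{Q}(t,x) = Q(t,x) + \hat{\epsilon} H(t,x) with "clever covariate" H(t,x) = t/g(x) - (1-t)/(1-g(x)), so the propensity g(x) sits in both denominators. A minimal sketch (function names are mine, for illustration only):

```python
def clever_covariate(t, g):
    """H(t, x) = t/g(x) - (1 - t)/(1 - g(x)); note the propensity
    g(x) in both denominators."""
    return t / g - (1 - t) / (1 - g)

def q_tilde(q, t, g, eps):
    """TMLE-style perturbed outcome model: Q~ = Q + eps * H."""
    return q + eps * clever_covariate(t, g)

# A moderate propensity gives a modest correction ...
print(q_tilde(q=0.5, t=1.0, g=0.5, eps=0.1))    # 0.5 + 0.1 * 2 = 0.7
# ... but an extreme propensity inflates the estimate dramatically.
print(q_tilde(q=0.5, t=1.0, g=1e-4, eps=0.1))   # 0.5 + 0.1 * 10000 = 1000.5
```

Unless \hat{\epsilon} is driven exactly to zero, extreme propensity scores propagate into \tilde{Q}, which is why excluding them from the evaluation leaves the stability claim untested.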

The draft is basically well written, except that Section 3 seems somewhat disorganized. Estimation of treatment effects from observational data is an important topic in causal inference, and a substantial line of research has addressed it in recent years. This work uses neural networks for the estimation of treatment effects from observational data in the “no hidden confounding” setting. The authors propose two methods, corresponding to two stages. First, they propose Dragonnet, a three-headed architecture that provides an end-to-end procedure for predicting the propensity score and conditional outcomes from covariates and treatment information; if the propensity-score head is removed from Dragonnet, the resulting architecture is the TARNET architecture of Shalit et al. Second, they modify the objective function used in training, the main inspiration being targeted minimum loss estimation (TMLE). My main concern is novelty: both methods feel like "we have A and B, so let us combine them and see how it works."
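As a sketch of the three-headed idea (the dimensions and single-layer structure here are illustrative, not the paper's actual architecture): a shared representation of the covariates feeds two outcome heads and one propensity head, and dropping the propensity head recovers a TARNET-style network:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions (not the paper's): 10 covariates, 5-dim representation.
d_in, d_rep = 10, 5
W_rep = rng.normal(size=(d_in, d_rep))   # shared representation layer
w_q0 = rng.normal(size=d_rep)            # outcome head for t = 0
w_q1 = rng.normal(size=d_rep)            # outcome head for t = 1
w_g = rng.normal(size=d_rep)             # propensity-score head

def dragonnet_forward(x):
    """One forward pass of a toy three-headed network."""
    z = relu(x @ W_rep)                  # shared confounder representation Z(x)
    q0 = z @ w_q0                        # Q(0, x)
    q1 = z @ w_q1                        # Q(1, x)
    g = sigmoid(z @ w_g)                 # g(x) = p(T = 1 | x)
    return q0, q1, g

x = rng.normal(size=d_in)
q0, q1, g = dragonnet_forward(x)
print(q0, q1, g)  # g is a probability in (0, 1)
```

The coupling is entirely through the shared representation z: gradients from the propensity head shape the features that the outcome heads also use.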

Summary: In this paper, the authors address the problem of estimating treatment effects from observational data when all covariates are measured (the ‘no hidden confounding’ assumption). The estimation proceeds in two stages: in the first stage, a model for the expected outcome, i.e., Q(t, x) = E[Y | t, x], and one for the propensity score, i.e., g(x) = p(T = 1 | x), are fitted; in the second stage, the average treatment effect is derived from the previously computed fits of Q(t, x) and g(x). The authors focus on improving the models estimated in the first stage, with the ultimate goal of improving the treatment effect estimation in the second stage. For this purpose, they propose a neural network architecture called Dragonnet, in which the outcome models Q(0, x) and Q(1, x) are tightly coupled with the propensity score g(x). The authors then propose a procedure called targeted regularization for improving the asymptotic properties of the neural network-estimated functions in terms of estimating the average treatment effect, at the expense of predictive performance.

Detailed comments:
• Originality: The two methodological contributions appear to be original. Relevant related work is adequately cited in a separate section on page 5, but more references could be added. For instance, the paper “Representation Learning for Treatment Effect Estimation from Observational Data” by Yao et al. is probably highly relevant for this work.
• Quality: The ideas proposed appear sound, but are validated only through a limited number of experiments. I have doubts that asymptotic properties like double-robustness are achieved so easily for targeted regularization. It seems that coupling the estimators for Q and g will in general lead to loss of consistency and of the double-robustness property. The authors claim that “consistency is plausible – even with the addition of the targeted regularization term” because “the model can choose to set epsilon to zero”. However, the non-parametric estimating equation needed to achieve the good asymptotic properties is satisfied only at the value of epsilon that (locally) minimizes the modified objective, and this value will not be zero in general. I also did not find the empirical study particularly convincing. For instance, the authors fail to explain how they combined targeted regularization with TMLE for the experiments described in Table 2 (page 7) and Table 4 (page 8).
• Clarity: The submission is well-structured and easy to read for the most part. However, most of the figure and table captions are extremely bare and not self-contained. What’s more, the authors tend to overuse hedge words and rhetorical questions in their argumentation.
• Significance: The two methodological contributions, Dragonnet and targeted regularization, improve on state-of-the-art approaches like TARNET and TMLE, respectively, but only incrementally. It is hard to accurately evaluate how significant these contributions are based on the limited number of experiments. A theoretical analysis of the newly proposed estimators, coupled with a more comprehensive experimental section, would go a long way towards shedding light on the significance of these ideas.
• Minor comments:
  • The legend is missing in Figure 2. The x-axis scale should be removed.
  • Is equation (3.5) missing a factor of (-2) on the right-hand side?
  • On page 7, the authors claim that “targeted regularization essentially never hurts”, yet in Table 3 targeted regularization degrades the performance of the simple baseline estimator (row 1) in half of the cases. Could there be a typo in the table?
  • Have the authors verified that “in cases where the targeted regularization loss term is large” the model responds “by setting the parameter epsilon to 0 and recovering the baseline”?
  • The terms ‘semi-parametric’ and ‘non-parametric’ are used throughout the paper as if interchangeable. For example, at the beginning of Section 3 (page 3): “This modified objective is based on non-parametric estimation theory.” and later “We review some necessary results from semi-parametric estimation theory”.
  • The footnote explanation for calling the architecture “Dragonnet” is curious, as dragons typically have just one head (contrary to hydras).
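For reference, a minimal sketch of how the second stage could combine the first-stage fits of Q and g, using the plug-in estimator and the standard doubly-robust (AIPTW) estimator. The function names and toy numbers are mine, not the paper's; note that the propensity again appears in the denominators of the correction term:

```python
import numpy as np

def ate_plugin(q0, q1):
    """Plug-in ATE estimate: the average of Q(1, x) - Q(0, x)."""
    return np.mean(q1 - q0)

def ate_aiptw(y, t, q0, q1, g):
    """Doubly-robust (AIPTW) ATE estimate; the propensity g appears
    in the denominators of the correction term."""
    correction = t * (y - q1) / g - (1 - t) * (y - q0) / (1 - g)
    return np.mean(q1 - q0 + correction)

# Tiny worked example with hand-picked first-stage fits.
y  = np.array([1.0, 0.0, 1.0, 1.0])   # observed outcomes
t  = np.array([1.0, 0.0, 1.0, 0.0])   # observed treatments
q0 = np.array([0.2, 0.1, 0.3, 0.6])   # fitted Q(0, x_i)
q1 = np.array([0.9, 0.4, 0.8, 0.9])   # fitted Q(1, x_i)
g  = np.array([0.7, 0.3, 0.6, 0.4])   # fitted g(x_i)
print(ate_plugin(q0, q1))             # 0.45
print(ate_aiptw(y, t, q0, q1, g))
```

A theoretical comparison of the paper's estimators against these standard second-stage estimators (which enjoy the double-robustness property under well-known conditions) would help situate the contribution.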