NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6256
Title: The Case for Evaluating Causal Models Using Interventional Measures and Empirical Data

Reviewer 1

Update: I thank the authors for addressing my comments. I increased my recommendation from 6 to 7 as a result.

Originality: Perhaps the biggest problem with the paper. The point it is making is a good one, one that needs to be heeded by researchers in the field. But it is not a *new* contribution. The paper seems to ignore previous work on the topic in the statistics literature (Cook, Shadish and Wong 2008; Zhao, Keele and Small 2018, to name just a few). There are a number of references in the statistics literature that make a similar point, with more details.

Clarity: The paper is generally clear and well written.

Quality: The paper makes good points, but has a somewhat limited scope. In particular, it does not provide any guidance on how to conduct a convincing empirical evaluation. This limits the potential impact of the paper, in my opinion. It would have been great if the authors had dedicated a couple of pages to reviewing some of the methods for empirical evaluation, or to case studies of empirical evaluation done right.

Reviewer 2

Although the paper is a good attempt in this space, and its messages should be echoed widely in the community, the paper could benefit from various improvements. Specifically, I am unsure whether some of the experiments performed support the claims made in the paper. Details are as follows:

Line 79: The authors discuss evaluating the interventional distribution. But if the structure learning part is correct, then the learned distribution will also be correct as long as the parameterization is known or the variables are discrete. Am I missing a point here? After reading the rest, I guess the authors are concerned about learning the structure only approximately, in which case such an evaluation can determine whether strong or weak edges were omitted. It may help to expand this discussion here a bit too.

Line 176: Can you elaborate a bit more on untested influences?

Line 245: The data proposed in the machine learning challenges mentioned here is already used in the cause-effect pair dataset of Mooij et al.

Section 4.4: Please explain this experiment of generating synthetic data on the learned network in more detail. How many samples were in the real data, and how many samples did you generate synthetically? The mentioned algorithms can perform poorly if the number of samples is small, which is a different problem from using synthetic data.

"Structural measures also implicitly assume that DAGs are capable of accurately representing any causal process being modeled, an unlikely assumption": This issue is much more complicated than the authors imply. Once we remove the assumption that the underlying graph is acyclic, the modeling changes drastically. So, if an algorithm that is based on a set of well-defined assumptions, including the assumption that the underlying graph is acyclic, outputs a cyclic graph, it is a clear error and should be avoided. It is a different point to encourage assuming cyclic models and developing algorithms for them, but that happens at the modeling phase, well before evaluation. Please elaborate on this part, as the quoted sentence can be misleading and understates the significant difference between modeling cyclic and acyclic systems.

The TVD vs. structural distance plot is interesting. Is the TVD calculated only on the observational distribution?

AFTER REBUTTAL: I would like to thank the authors for their detailed response. Although it clarified many points, I still believe the reason for seeing multiple outcomes from the causal inference algorithm is probably simply an insufficient number of samples, rather than synthetic vs. real data. I hope the authors investigate this point further.

Reviewer 3

This clearly written and highly novel paper describes a critical gap in the causal inference literature. While inference methods have advanced, our evaluation techniques have not. As the authors show, this means that our ability to predict which methods will translate successfully to practice is limited. The paper contains a thorough survey of inference methods and evaluations, which I have not seen before. This is a valuable contribution to the literature. While the paper is not perfect (see improvements section), I believe the significant novelty and potential impact on the community outweigh these weaknesses and that it is a significant contribution. Figure 2 is especially striking.

Questions:
- The authors discuss the significant limitations of synthetic data. However, only simulations using the target structures (e.g., DAG) seem to be considered. What about using domain-specific simulation systems? These are totally independent of the methodological assumptions and approaches. Do you believe the results would be closer to interventional measures?

I appreciated the responses to the reviews, and maintained my high score, as the weaknesses pointed out by the other reviewers will be addressed in revision.