Review for NeurIPS paper: Active Structure Learning of Causal DAGs via Directed Clique Trees

NeurIPS 2020

Active Structure Learning of Causal DAGs via Directed Clique Trees

Review 1

Summary and Contributions: -considers intervention design for orient essential graph into a DAG -active: the design is not fixed at once but sequentially taking into account -proves that the number of interventions necessary is at least the sum of half of the size of the largest cliques in each chain components - presents an algorithm for intervention design -which matches the bound up to a factor -which scales up higher and shows better worst case performance

Strengths: -mix of novel theoretical results and readily usable algorithm -quite clear simulations -new theory for orientation of DAGs from interventions

Weaknesses: -considering DAGs without latent confounders or cycles in the causal context is limiting -also only single node interventions are considered -simulation results do not show much improved average case performance

Correctness: The claims seem correct.

Clarity: The writing of the paper is relatively good, despite the easily fixable issues in additional feedback. Examples on the theoretical graph concepts are given as figures.

Relation to Prior Work: Relation to previous is described well in the paper: -extends central node algorithm of Greenewald NIPS 2019 that is for limited graph types to general DAGs -present all cases theoretical results in contrast to previous worst case results

Reproducibility: Yes

Additional Feedback: Minor writing issues: Terms noiseless and noisy interventions (Sec 1) are used before they are defined (Sec 2). Directed cycle is defined as sequence of edges with at least on directed edge, this is non-standard terminology that should be changes to something else. It is also troublesome since it does not correspond to acyclicity, one of the main assumptions of the paper. Intervention policy definition is unclear: a map rom EG to interventions? Is this now EGs for I-EGs? If it is just a map with EG as input, how can previous interventions affect it? Def 2 Is unclear and informal: 1) it seems to produce * which do not exist in Fig 1, 2) Fig 2 includes bidirected edges. Better to formulate in another way. Intuition on ic-ratio is missing, this should be added. Hard to interpret it as a reader. When is the result optimal? ----- AR acknowledged. This is good work but due to the clear limitations a higher grade is not warranted.

Review 2

Summary and Contributions: The paper studies intervention design for learning causal relationships between observed random variables. In brief, the problem setting is such that we are given a mixed graph G on node set V representing the Markov equivalence class of the observed distribution, and we want to design a sequence of interventions that allow us to infer the (causal) directions of the undirected edges (which are guaranteed to form a DAG); an intervention is an operation on node subset I that, from a purely combinatorial perspective, gives us the directions of the boundary edges of I. The task is then to find a shortest sequence of interventions that allows us to infer all edge directions. The paper focuses on the case of single-node adaptive intervention strategies, i.e. all interventions operate on single node at a time, and the selected interventions can depend on the results of prior interventions. The main results are (a) a lower bound for the number of single-node interventions required to fully orient a graph, and (b) an algorithm for generating a sequence of interventions with guarantees on the upper bound of interventions assuming structural conditions on the underlying true graph. The algorithm is implemented and compared against various prior proposals.

Strengths: Generally, this paper is quite far outside my own area, so I can't fully evaluate its significance in the context of causal inference community. The experimental results indicate that the proposed algorithm is somewhat better than the best prior proposal, and the lower bound results is reasonable.

Weaknesses: The results seem fairly incremental compared to prior work, and it's not clear if the results are interesting outside a small sub-community.

Correctness: The results seem be reasonable likely to be correct, however, due to the dense and non-self-contained presentation, I was not able to fully verify the details of the proofs within the review window. There seem to be various minor mistakes in the proofs, so thorough checking might be warranted. In the experimental section, a more direct comparison between DCT and Coloring would be appropriate, since they have at least on average very similar behaviour in terms of competitive ratio.

Clarity: While this paper is certainly outside my area, I found it to be not well written, and I had to frequently refer to prior work to make sense of the paper. On general level, the paper does not provide a good precise overview of the results, or provide context in terms of what theoretical bounds are obtained in prior work and how they compare to the present work. Some key details about the setting are omitted or mentioned only in passing. For example, the fact that all of the results consider only single-node interventions is mentioned only in a single sentence in the introduction, and is not reflected e.g. in Definition 7, which seems to define m(D) in terms of unbounded interventions. Likewise, the paper does not explain what information about edge directions an intervention I provides. The theoretical results are presented in a rather dense, front-loaded style, which requires the reader to keep track of lots of notation and abbreviations. However, the actual proofs don't seem very deep, and the presentation could probably be significantly streamlined.

Relation to Prior Work: Relevant related work seems to be cited sufficiently, but not sufficiently discussed to provide context for the results; in particular, the paper does not discuss what (if any) upper and lower bounds are provided in the prior work in a way that could be compared with the results in the present paper.

Reproducibility: Yes

Additional Feedback: - Line 42: "MEC" used before defined - Line 63: Definition of directed cycle looks weird, possibly should be *-> instead of *-*? (By this definition, e.g. 234 is a directed cycle in Fig1(a)) - Line 70: Parent set notation is not defined. - Section 5: How is m(D) computed in the experiments? I.e. is it the actual m(D), or the lower bound provided by Theorem 2? - Appendix, lines 591-593: Please elaborate on the clique intervention lower bound, or provide a reference. - Appendix, lines 613: The cited reference does not seem to have Lemma 1. Should be 2014 instead? Please include full details. -- update after rebuttal -- After the rebuttal and discussion, I'm slightly adjusting my score upwards. The lower bound is indeed kind of nice, but I still disagree with the authors on the clarity of presentation. The claim itself can be presented as a simple combinatorial statement, and the proof does not use any advanced techniques. In particular, I would encourage the authors to make sure that the proofs in the main paper can be followed without reference to the appendix or prior work. In the experiments, it looks like COLOURING and DCT both have very similar performance, and DCT does not (substantially) improve over COLOURING in practice (though DCT has theoretical guarantees in the restricted case). This is where I would have wanted to see the "more direct comparison" on instance-by-instance basis; since the average ic-ratio is generally slightly better for COLOURING, but maximum ic-ratio is worse. Is the bad maximum ic-ratio due to single outliers? Does COLOURING sometimes also produce much better ic-ratios than DCT to counterbalance the bad outliers? (I.e., I would want to see the ic-ratios of COLOURING and ICT plotted against each other for all instances in a certain size category.)

Review 3

Summary and Contributions: The authors considered the problem of experiment design to fully identify the causal structure by performing minimum number of interventions. They showed that for full identification of the causal graph, it is required to orient some specific induced subgraphs of the causal graph, which they called "residuals". Based on this observation, they provided a lower bound on the number of interventions necessary for full identification. Moreover, a two-phase intervention policy was proposed where in the first phase, Central Node algorithm (Greenewald et al., 2019) have been utilized to reduce the problem to identify orientations within the residuals. In the second phase, the algorithm orients the edges in each residual.

Strengths: - The authors proposed a universal lower bound on the number of interventions necessary for full identification of the causal graph. In particular, they showed that this bound is equal to the sum of half the size of the largest cliques in each chain component of the essential graph. Comparing with the result in previous work, this bound holds true for any DAG in the MEC which is an interesting result. - Under some assumptions on the causal graph, the two-phase intervention policy uses at most O(\log(|C_max|) where C_max is the highest number of maximal cliques in any chain component.

Weaknesses: ===After rebuttal=== I read the reviews and the rebuttal. I think the "intersection-incomparability" is a restrcitive assumption in analyzing the proposed algorithm. Although the paper has a nice result for universal lower bound on the number of interventions necessary for full identification of the causal graph. In overall, I decided to keep my score unchanged. ========= - I think it is required to discuss in which causal graphs, the "intersection-incomparability of the clique graph" holds true. Is it a restrictive assumption? - About C_max: It seems that the definition of C_max in line 54 is different from the one in Theorem 3. It would be great to clarify this issue. Moreover, the value of C_max can be in the order of graph size. I am not sure whether there exists other related work with the approximation factor of log(n). - In the experiments, the structure of large graphs is close to trees. It would be great if the authors can report running times of the proposed algorithm in structures other than trees.

Correctness: It seems that the claims and the proposed algorithm are correct. However, I did not go in details of the proofs.

Clarity: The paper is generally well written.

Relation to Prior Work: The authors mentioned previous work in Section 6.

Reproducibility: Yes

Additional Feedback: