__ Summary and Contributions__: The authors provide a score-based algorithm for learning causal graphs using interventional data. They first construct a score function and show that it achieves its optimum at an I-Markov equivalent graph. They then relax this score/loss so that it can be optimized differentiably with neural nets, using approaches from the existing literature.

__ Strengths__: The paper proposes a score-based approach for learning causal graphs with interventions which is supported by theory. I think this is a very important direction.

__ Weaknesses__: The unknown-intervention section is unclear to me. It is not clear why the additional mask for learning intervention targets would work. For me, this section only makes the paper weaker. I recommend that the authors mention it only in the experimental section, since I do not think there is enough motivation to present it as a sound method.

__ Correctness__: Yes, the proposed theory (Theorem 1) is correct. The methodology of Section 3.3 about unknown interventions, however, is not clear.

__ Clarity__: Yes, the paper is mostly well written.

__ Relation to Prior Work__: The related work section is missing some references. The authors cite "Joint causal inference from multiple contexts" by Mooij et al. in the bibliography but not in the text of the paper. Another related work that is not included is "Characterization and learning of causal graphs with latent variables from soft interventions" by Kocaoglu et al.

__ Reproducibility__: Yes

__ Additional Feedback__: POST-REBUTTAL FEEDBACK
Thank you for your response. I do think it's a good idea to add the hinted theorem on operating under unknown intervention targets. I recommend that the authors emphasize the assumptions even more for this one, i.e., causal sufficiency and Assumption 2, which make this possible. Clearly, there are some proof details the authors will need to work out, but the proof sketch seems reasonable and I will be excited to read the details once they are available in the camera-ready. Thank you for the good work.
The authors show how they use NNs to model the conditionals in (7). It might be better to explicitly state Assumption 1 here, since it restricts the considered interventions in this way.
For I-Markov, please cite related work in the main text. There are multiple definitions with the same name in the literature, which can be confusing.
I believe the finite entropy assumption should be decoupled from Assumption 1, which is currently described only as the "sufficient capacity neural nets" assumption. It might be good to provide intuition on why it is necessary.
Assumption 2 seems tied to the intervention targets, which could have created problems with perfect interventions. However, as long as observational data is assumed to be available, this is simply the original faithfulness assumption. This point could be emphasized.
line 575: "This means that KL>0". For clarity, please consider adding the argument that a differing CI statement means we cannot fit the exact distribution, and that KL=0 iff the distributions are the same.
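For reference, the standard fact this suggestion appeals to (Gibbs' inequality, written in generic notation rather than the paper's) is:

```latex
% Gibbs' inequality: for distributions p and q,
\mathrm{KL}(p \,\|\, q) \;=\; \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \;\ge\; 0,
\qquad \mathrm{KL}(p \,\|\, q) = 0 \iff p = q .
% So if a differing CI statement rules out q = p for every candidate q,
% then KL(p || q) > 0.
```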
I want to say that the argument for Case 5 is very well done.
Proof of Theorem 1 looks correct.
The proposed approach in Section 3.3 seems a bit arbitrary. Moreover, there are many existing works that can be used even when intervention targets are unknown; see Table 4 of Mooij et al., "Joint causal inference".
The experiments seem okay, but the authors definitely could have provided comparisons with the other existing work detailed in the paper by Mooij et al. mentioned above.
Minor comments:
line 116: the use a->the use of a
in (7), f is used for the density, whereas p is used in (8).
line 564: their immoralities implying ->their immoralities including

__ Summary and Contributions__: This paper proposes a neural-network-based method for causal discovery that can leverage interventional data.
- A causal discovery method using continuous constrained optimization that could utilize interventional data.
- A score for interventional data with theoretical justification of its validity
- Applicable to both imperfect and perfect interventions.
- Provides theoretical identifiability results for the method.

__ Strengths__: - This work is novel. It proposes the first causal discovery method using continuous constrained optimization that considers interventional data.
- On the theoretical side, it provides sufficient results to justify the validity of the score used in the proposed method.
- In the empirical evaluation, it conducts extensive synthetic experiments which provide a comprehensive understanding of how various factors (e.g., type of intervention, graph size, density, type of mechanism) affect the proposed method.

__ Weaknesses__: - The assumptions used in Theorem 1 are not included in the main text. As manuscripts should be as self-contained as possible, it may be better to move Assumptions 1 and 2 from Appendix A.2 to the main text. Furthermore, if possible, it would be better to add intuitive explanations of these assumptions to help readers understand their limitations.
- The idea behind the proof of Theorem 1 is not mentioned. Similar to the comment above, a general description of the proof (e.g., a proof sketch) in the main text may help readers accept the theorem more easily.
- The empirical results could be analyzed further. For example, what contributes to the good performance of DCDI in cases with a higher average number of edges?

__ Correctness__: Both the theoretical result and empirical methodology are technically correct.

__ Clarity__: This paper is well written. It gives sufficient background knowledge in the Appendix and references the Appendix in the main text.

__ Relation to Prior Work__: Yes, previous works are carefully reviewed and the references are sufficient. It is a meaningful extension of existing works using continuous constrained optimization.

__ Reproducibility__: Yes

__ Additional Feedback__: - The reference for RMSprop in line 189 is missing.

__ Summary and Contributions__: This paper works on causal discovery in the presence of interventional data. The proposed algorithm is in line with recent work on differentiable score-based learning. The score is a penalized interventional log-likelihood of the data, where the likelihood is either 1) a Gaussian parameterized by nonlinear functions of the parents, or 2) a nonparametric tractable density modeled by DSF. Theorem 1 justifies this kind of score. The DAG constraint is enforced by introducing a stochastic masking matrix that is learned with Monte Carlo gradient estimates. Furthermore, the paper also introduces a way to deal with unknown interventions by estimating the binary intervention target matrix in a similar stochastic manner.
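As a reader's aid, the stochastic-mask-with-Monte-Carlo-gradients idea summarized above can be sketched with a score-function (REINFORCE) estimator. This toy code is my own illustration with a made-up score, not the authors' implementation:

```python
import numpy as np

# Toy sketch (NOT the paper's code): learn a binary edge mask
# M_ij ~ Bernoulli(sigmoid(L_ij)) via Monte Carlo (REINFORCE) gradient
# estimates of an expected score. The "score" here is a made-up stand-in
# for the paper's regularized interventional log-likelihood.

rng = np.random.default_rng(0)
d = 3
true_mask = np.array([[0., 1., 0.],
                      [0., 0., 1.],
                      [0., 0., 0.]])  # toy "ground-truth" edges

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(mask):
    # Higher is better: reward agreement with the toy ground truth.
    return -np.abs(mask - true_mask).sum()

logits = np.zeros((d, d))
lr, n_samples = 0.5, 64
for _ in range(300):
    p = sigmoid(logits)
    masks = (rng.random((n_samples, d, d)) < p).astype(float)
    scores = np.array([score(m) for m in masks])
    # REINFORCE with a mean baseline; the gradient of the Bernoulli
    # log-probability w.r.t. the logit is exactly (m - p).
    centered = scores - scores.mean()
    logits += lr * np.mean(centered[:, None, None] * (masks - p), axis=0)

learned = (sigmoid(logits) > 0.5).astype(float)
print(learned)  # recovers true_mask on this toy problem
```

The same estimator extends directly to a second Bernoulli matrix over intervention targets, which is the "similar stochastic manner" the summary refers to.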
----------------
Thanks for the clarifications. Hopefully this can be incorporated into the final version.

__ Strengths__: - To my knowledge, the first differentiable causal discovery method for interventional data.
- Extensive experiments and detailed description.

__ Weaknesses__: - The algorithmic idea is not new.

__ Correctness__: Did not check the proof for Theorem 1. Methodology seems correct.

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: The algorithm itself is perhaps less exciting since most of the techniques (differentiable causal discovery, neural network likelihood models, stochastic mask and gradient estimates) have existed in the literature. However, the paper did a great job executing these ideas and recording all the details.
Additional questions:
- Would it be interesting to try a linear Gaussian likelihood on a linear Gaussian model, just as a sanity check?
- It seems that on ANM data DCDI-G should in principle outperform DCDI-DSF. However, this is not always the case. Is there an explanation for this?

__ Summary and Contributions__: The authors extend a continuous optimization technique for causal structure discovery to handle a combination of observational data and perfect and imperfect interventional data, rather than observational data alone.

__ Strengths__: The paper addresses an important problem, and builds off of recent advances using continuous optimization for causal structure discovery. The authors provide a thorough description of necessary background and related work on approaches for causal structure discovery with interventional data.

__ Weaknesses__: The paper has two major weaknesses, (1) the description of the methodology lacks clarity and (2) the empirical results do not provide compelling evidence that the methodology is effective.
Methodology:
The paper does not clearly explain what I understand as a central methodological contribution, section 3.1. Several questions remain unanswered.
1. Why is each intervention set parameterized by independent sets of neural network weights?
2. How is background knowledge about the intervention assignment incorporated into the likelihood function? For example, with an encouragement design we know that a random variable's distribution will place higher density on larger values than the observational distribution.
3. What is the score function when interventions are perfect? It is not enough to say that “the idea is simple” without providing a mathematical expression for the likelihood.
Empirical Results:
In addition to the structural measures you report, it is important to report on some metric of the accuracy of effect estimates for some previously unseen set of interventions. See (Gentzel et al., NeurIPS 2019) for a discussion of using interventional distributions to evaluate causal discovery algorithms. This should be straightforward, as DCDI jointly learns structure and conditional probability distributions.
I strongly disagree with moving the cytometry evaluation to the appendix. When making parametric and semi-parametric assumptions (such as a particular NN architecture), it is important to understand the algorithms’ performance when those assumptions are violated. The synthetic experiments provide useful “knobs” to twist, but they do not provide insight into the key question, “how will this work in the real world?”
The visual presentation of the results makes it difficult to determine where DCDI outperforms the alternative approaches.
See the additional feedback below for additional details.

__ Correctness__: Given that the empirical evaluation does not provide metrics of interventional distributions and focuses on synthetic data, it is difficult to determine whether the claims and methods are correct.
On the real-world dataset, the empirical results appear to be inconclusive. As the authors note, CAM performs better than DCDI on several metrics.

__ Clarity__: The paper is mostly well written, although some additional editorial review would have strengthened the submission. See the additional feedback below.

__ Relation to Prior Work__: Yes, the authors appear to have touched on the relevant areas of prior work. In particular, prior approaches either (1) don’t account for interventional data or (2) don’t use a continuous-optimization formulation of the structure discovery problem.

__ Reproducibility__: Yes

__ Additional Feedback__: Line 36: “Constrained-based” -> “constraint-based”
Line 63: we call interventional target -> we call the interventional target
Line 82: “identifiable (more severe without interventional data)” -> “identifiable, which is more severe without interventional data.”
Line 98: “space of DAGs is enormous” -> “space of DAGs is super-exponential in the number of variables”
Line 112: “which serves as basis “ -> “which serves as the basis”
Line 157: “Intuitively, this score favors graphs in which a conditional p(x_j | \pi_j^G) is invariant across all interventional distributions in which x_j is not a target, i.e. j \not\in I_k.” -> This appears to be one of the central claims in the paper. As written, the paper does not clarify why this is true or why it is important.
Line 246: “although performance is sensible to this hyperparameter” -> “although performance is sensitive to this hyperparameter”
Line 257: “how two DAGs differ with respect to their causal inference statements” -> This needs to be made more precise.
Figure 2, 3, 4 Captions: This will be clearer at first glance if you spell out Structural Hamming Distance and Structural Interventional Distance.
Section 3.3: Please clarify exactly what is meant by “unknown interventions”. Is the set of random variables intervened upon unknown? The intervention assignment? Whether it is a perfect or imperfect intervention?
Equation 8: My understanding is that the \lambda |G| term induces an acyclicity constraint. Is there any reason not to include any other regularization terms? This seems important when each interventional dataset is relatively small.
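On the acyclicity question above: in this line of work acyclicity is typically enforced not by the sparsity penalty but by a separate NOTEARS-style trace-exponential constraint; I assume (this is my assumption, not something the quoted equation confirms) the paper does the same. A minimal sketch of that constraint:

```python
import numpy as np

def matrix_exp(A, terms=30):
    # Truncated power series for exp(A); adequate for small matrices.
    E = np.eye(A.shape[0])
    T = np.eye(A.shape[0])
    for k in range(1, terms):
        T = T @ A / k
        E = E + T
    return E

def acyclicity(W):
    # NOTEARS-style constraint h(W) = tr(exp(W * W)) - d,
    # which is zero iff the (weighted) adjacency W encodes a DAG.
    return np.trace(matrix_exp(W * W)) - W.shape[0]

dag = np.array([[0., 1.],
                [0., 0.]])
cyc = np.array([[0., 1.],
                [1., 0.]])
print(acyclicity(dag))  # 0.0 (nilpotent => no cycles)
print(acyclicity(cyc))  # ≈ 1.086 (2·cosh(1) − 2 > 0)
```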
POST REBUTTAL FEEDBACK:
Based on the authors' response I have increased my score from a 4 to a 6.
The authors' response effectively addresses my concerns about clarity with a thorough discussion of how background knowledge is incorporated and some more intuition about the score for unknown targets. These concerns are now completely addressed. Thank you!
Regarding empirical evaluation: I am very glad to see that the authors have included an additional experiment evaluating DCDI's ability to produce accurate interventional distributions. However, I still would have liked to see more emphasis placed on real or semi-synthetic benchmarks.