{"title": "Permutation-based Causal Inference Algorithms with Interventions", "book": "Advances in Neural Information Processing Systems", "page_first": 5822, "page_last": 5831, "abstract": "Learning directed acyclic graphs using both observational and interventional data is now a fundamentally important problem due to recent technological developments in genomics that generate such single-cell gene expression data at a very large scale. In order to utilize this data for learning gene regulatory networks, efficient and reliable causal inference algorithms are needed that can make use of both observational and interventional data. In this paper, we present two algorithms of this type and prove that both are consistent under the faithfulness assumption. These algorithms are interventional adaptations of the Greedy SP algorithm and are the first algorithms using both observational and interventional data with consistency guarantees. Moreover, these algorithms have the advantage that they are nonparametric, which makes them useful also for analyzing non-Gaussian data. 
In this paper, we present these two algorithms and their consistency guarantees, and we analyze their performance on simulated data, protein signaling data, and single-cell gene expression data.", "full_text": "Permutation-based Causal Inference Algorithms\n\nwith Interventions\n\nYuhao Wang\n\nLaboratory for Information and Decision Systems\n\nand Institute for Data, Systems and Society\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nyuhaow@mit.edu\n\nLiam Solus\n\nDepartment of Mathematics\n\nKTH Royal Institute of Technology\n\nStockholm, Sweden\n\nsolus@kth.se\n\nKarren Dai Yang\n\nCaroline Uhler\n\nInstitute for Data, Systems and Society\nand Broad Institute of MIT and Harvard\nMassachusetts Institute of Technology\n\nLaboratory for Information and Decision Systems\n\nand Institute for Data, Systems and Society\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nkarren@mit.edu\n\nCambridge, MA 02139\n\ncuhler@mit.edu\n\nAbstract\n\nLearning directed acyclic graphs using both observational and interventional data is\nnow a fundamentally important problem due to recent technological developments\nin genomics that generate such single-cell gene expression data at a very large\nscale. In order to utilize this data for learning gene regulatory networks, ef\ufb01cient\nand reliable causal inference algorithms are needed that can make use of both\nobservational and interventional data. In this paper, we present two algorithms\nof this type and prove that both are consistent under the faithfulness assumption.\nThese algorithms are interventional adaptations of the Greedy SP algorithm and\nare the \ufb01rst algorithms using both observational and interventional data with\nconsistency guarantees. 
Moreover, these algorithms have the advantage that they\nare nonparametric, which makes them useful also for analyzing non-Gaussian data.\nIn this paper, we present these two algorithms and their consistency guarantees,\nand we analyze their performance on simulated data, protein signaling data, and\nsingle-cell gene expression data.\n\n1\n\nIntroduction\n\nDiscovering causal relations is a fundamental problem across a wide variety of disciplines including\ncomputational biology, epidemiology, sociology, and economics [5, 18, 20, 22]. DAG models can\nbe used to encode causal relations in terms of a directed acyclic graph (DAG) G, where each node\nis associated to a random variable and the arrows represent their causal in\ufb02uences on one another.\nThe non-arrows of G encode a collection of conditional independence (CI) relations through the so-\ncalled Markov properties. While DAG models are extraordinarily popular within the aforementioned\nresearch \ufb01elds, it is in general a dif\ufb01cult task to recover the underlying DAG G from samples from the\njoint distribution on the nodes. In fact, since different DAGs can encode the same set of CI relations,\nfrom observational data alone the underlying DAG G is in general only identi\ufb01able up to Markov\nequivalence, and interventional data is needed to identify the complete DAG.\nIn recent years, the new drop-seq technology has allowed obtaining high-resolution observational\nsingle-cell gene expression data at a very large scale [12]. In addition, earlier this year this technology\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fwas combined with the CRISPR/Cas9 system into perturb-seq, a technology that allows obtaining\nhigh-throughput interventional gene expression data [4]. 
An imminent question now is how to make\nuse of a combination of observational and interventional data (of the order of 100,000 cells / samples\non 20,000 genes / variables) in the causal discovery process. Therefore, the development of ef\ufb01cient\nand consistent algorithms using both observational and interventional data that are implementable\nwithin genomics is now a crucial goal. This is the purpose of the present paper.\nThe remainder of this paper is structured as follows: In Section 2 we discuss related work. Then\nin Section 3, we recall fundamental facts about DAG models and causal inference that we will use\nin the coming sections. In Section 4, we present the two algorithms and discuss their consistency\nguarantees. In Section 5, we analyze the performance of the two algorithms on both simulated and\nreal datasets. We end with a short discussion in Section 6.\n\n2 Related Work\n\nCausal inference algorithms based on observational data can be classi\ufb01ed into three categories:\nconstraint-based, score-based, and hybrid methods. Constraint-based methods, such as the PC\nalgorithm [22], treat causal inference as a constraint satisfaction problem and rely on CI tests to\nrecover the model via its Markov properties. Score-based methods, on the other hand, assign a\nscore function such as the Bayesian Information Criterion (BIC) to each DAG and optimize the\nscore via greedy approaches. An example is the prominent Greedy Equivalence Search (GES) [14].\nHybrid methods either alternate between score-based and constraint-based updates, as in Max-Min\nHill-Climbing [26], or use score functions based on CI tests, as in the recently introduced Greedy SP\nalgorithm [23].\nBased on the growing need for ef\ufb01cient and consistent algorithms that accommodate observational and\ninterventional data [4], it is natural to consider extensions of the previously described algorithms that\ncan accommodate interventional data. 
Such options have been considered in [8], in which the authors\npropose GIES, an extension of GES that accounts for interventional data. This algorithm can be\nviewed as a greedy approach to (cid:96)0-penalized maximum likelihood estimation with interventional data,\nan otherwise computationally infeasible score-based approach. Hence GIES is a parametric approach\n(relying on Gaussianity) and while it has been applied to real data [8, 9, 15], we will demonstrate via\nan example in Section 3 that it is in general not consistent. In this paper, we assume causal suf\ufb01ciency,\ni.e., that there are no latent confounders in the data-generating DAG. In addition, we assume that the\ninterventional targets are known. Methods such as ACI [13], HEJ [10], COmbINE [25] and ICP [15]\nallow for latent confounders with possibly unknown interventional targets. In addition, other methods\nhave been developed speci\ufb01cally for the analysis of gene expression data [19]. A comparison of the\nmethod presented here and some of these methods in the context of gene expression data is given in\nthe Supplementary Material.\nThe main purpose of this paper is to provide the \ufb01rst algorithms (apart from enumerating all DAGs)\nfor causal inference based on observational and interventional data with consistency guarantees.\nThese algorithms are adaptations of the Greedy SP algorithm [23]. As compared to GIES, another\nadvantage of these algorithms is that they are nonparametric and hence do not assume Gaussianity, a\nfeature that is crucial for applications to gene expression data which is inherently non-Gaussian.\n\n3 Preliminaries\nDAG models. Given a DAG G = ([p], A) with node set [p] := {1, . . . , p} and a collection of arrows\nA, we associate the nodes of G to a random vector (X1, . . . 
, Xp) with joint probability distribution P. For a subset of nodes S ⊂ [p], we let PaG(S), AnG(S), ChG(S), DeG(S), and NdG(S) denote the parents, ancestors, children, descendants, and nondescendants of S in G. Here, we use the typical graph-theoretical definitions of these terms as given in [11]. By the Markov property, the collection of non-arrows of G encodes a set of CI relations Xi ⊥⊥ XNd(i)\Pa(i) | XPa(i). A distribution P is said to satisfy the Markov assumption (a.k.a. be Markov) with respect to G if it entails these CI relations. A fundamental result about DAG models is that the complete set of CI relations implied by the Markov assumption for G is given by the d-separation relations in G [11, Section 3.2.2]; i.e., P satisfies the Markov assumption with respect to G if and only if XA ⊥⊥ XB | XC in P whenever A and B are d-separated in G given C. The faithfulness assumption is the assertion that the only CI relations entailed by P are those implied by d-separation in G.\n\nFigure 1: A generating DAG (left) and its GIES local maxima (right) for which GIES is not consistent.\n\nTwo DAGs G and H with the same set of d-separation statements are called Markov equivalent, and the complete set of DAGs that are Markov equivalent to G is called its Markov equivalence class (MEC), denoted [G]. The MEC of G is represented combinatorially by a partially directed graph Ĝ := ([p], D, E), called its CP-DAG or essential graph [1]. The arrows D are precisely those arrows in G that have the same orientation in all members of [G], and the edges E represent those arrows that change direction between distinct members of the MEC. In [2], the authors give a transformational characterization of the members of [G]. An arrow i → j in G is called a covered arrow if PaG(j) = PaG(i) ∪ {i}. 
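The covered-arrow condition just defined is mechanical to check. The following sketch is our own illustration (not code from the paper); the dict-of-parent-sets representation and the function names are assumptions made for exposition:

```python
# Sketch: checking the covered-arrow condition Pa(j) = Pa(i) ∪ {i}
# on a DAG stored as {child: set_of_parents}. Illustrative only.

def parents(dag, node):
    """Return the parent set of `node` in a DAG given as {child: set_of_parents}."""
    return dag.get(node, set())

def is_covered(dag, i, j):
    """An arrow i -> j is covered iff Pa(j) = Pa(i) ∪ {i}."""
    assert i in parents(dag, j), "i -> j must be an arrow of the DAG"
    return parents(dag, j) == parents(dag, i) | {i}

# Example DAG with arrows 1 -> 2, 1 -> 3, 2 -> 3:
dag = {2: {1}, 3: {1, 2}}
```

Here 2 → 3 is covered (Pa(3) = {1, 2} = Pa(2) ∪ {2}) and 1 → 2 is covered, while 1 → 3 is not, since Pa(3) = {1, 2} differs from Pa(1) ∪ {1} = {1}.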
Two DAGs G and H are Markov equivalent if and only if there exists a sequence of covered arrow reversals transforming G into H [2]. This transformational characterization plays a fundamental role in GES [14], GIES [8], and Greedy SP [23], as well as in the algorithms we introduce in this paper.\n\nLearning from Interventions. In this paper, we consider multiple interventions. Given an ordered list of subsets of [p] denoted by I := {I1, I2, . . . , IK}, for each Ij we generate an interventional distribution, denoted Pj, by forcing the random variables Xi for i ∈ Ij to the value of some independent random variables. We assume throughout that Ij = ∅ for some j, i.e., that we have access to a combination of observational and interventional data. If P is Markov with respect to G = ([p], A), then the intervention DAG of Ij is the subDAG Gj := ([p], Aj), where Aj = {(i, j) ∈ A : j ∉ Ij}; i.e., Gj is given by removing the incoming arrows to all intervened nodes in G. Notice that Pj is always Markov with respect to Gj. This fact allows us to naturally extend the notions of Markov equivalence and essential graphs to the interventional setting, as described in [8]. Two DAGs G and H are I-Markov equivalent for the collection of interventions I if they have the same skeleton and the same set of immoralities, and if Gj and Hj have the same skeleton for all j = 1, . . . , K [8, Theorem 10]. Hence, any two I-Markov equivalent DAGs lie in the same MEC. The I-Markov equivalence class (I-MEC) of G is denoted [G]I. The I-essential graph of G is the partially directed graph ĜI := ([p], ∪_{j=1}^K Dj, ∪_{j=1}^K Ej), where Ĝj = ([p], Dj, Ej). The arrows of ĜI are called I-essential arrows of G.\n\nGreedy Interventional Equivalence Search (GIES). GIES is a three-phase score-based algorithm: In the forward phase, GIES initializes with an empty I-essential graph Ĝ0. Then it sequentially steps from one I-essential graph Ĝi to a larger one Ĝi+1 given by adding a single arrow to Ĝi. In the backward phase, it steps from one essential graph Ĝi to a smaller one Ĝi+1 containing precisely one less arrow than Ĝi. In the turning phase, the algorithm reverses the direction of arrows: it first considers reversals of non-I-essential arrows and then reversals of I-essential arrows, allowing it to move between I-MECs. At each step in all phases the maximal scoring candidate is chosen, and a phase is only terminated when no higher-scoring I-essential graph exists. GIES repeatedly executes the forward, backward, and turning phases, in that order, until no higher-scoring I-essential graph can be found. It is amenable to any score that is constant on an I-MEC, such as the BIC.\n\nThe question of whether GIES is consistent was left open in [8]. We now prove that GIES is in general not consistent; i.e., if nj i.i.d. samples are drawn from the interventional distribution Pj, then even as n1 + · · · + nK → ∞ and under the faithfulness assumption, GIES may not recover the optimal I-MEC with probability 1. Consider the data-generating DAG depicted on the left in Figure 1.\n\nAlgorithm 1:\nInput: Observations X̂, an initial permutation π0, a threshold δn > Σ_{k=1}^K λ_{nk}, and a set of interventional targets I = {I1, . . . , IK}.\nOutput: A permutation π and its minimal I-MAP Gπ.\n1 Set Gπ := argmax_{G consistent with π} Score(G);\n2 Using a depth-first search approach with root π, search for a permutation πs with Score(Gπs) > Score(Gπ) that is connected to π through a sequence of permutations π0 = π, π1, · · · , πs−1, πs, where each permutation πk is produced from πk−1 by a transposition that corresponds to a covered edge in Gπk−1 such that Score(Gπk) > Score(Gπk−1) − δn. If no such Gπs exists, return π and Gπ; else set π := πs and repeat.\n\nSuppose we take interventions I consisting of I1 = ∅, I2 = {4}, I3 = {5}, and that GIES arrives at the DAG G depicted on the right in Figure 1. If the data collected grows as n1 = Cn2 = Cn3 for some constant C > 1, then we can show that the BIC score of G is a local maximum with probability 1/2 as n1 tends to infinity. The proof of this fact relies on the observation that GIES must initialize the turning phase at G, and that G contains precisely one covered arrow 5 → 4, which is colored red in Figure 1. The full proof is given in the Supplementary Material.\n\nGreedy SP. In this paper we adapt the hybrid algorithm Greedy SP to provide consistent algorithms that use both interventional and observational data. Greedy SP is a permutation-based algorithm that associates a DAG to every permutation of the random variables and greedily updates the DAG by transposing elements of the permutation. More precisely, given a set of observed CI relations C and a permutation π = π1 · · · πp, the Greedy SP algorithm assigns a DAG Gπ := ([p], Aπ) to π via the rule\n\nπi → πj ∈ Aπ ⟺ i < j and πi ⊥̸⊥ πj | {π1, . . . 
, πmax(i,j)}\{πi, πj},\n\nfor all 1 ≤ i < j ≤ p. The DAG Gπ is a minimal I-MAP (independence map) with respect to C, since any such DAG Gπ is Markov with respect to C and any proper subDAG of Gπ encodes a CI relation that is not in C [17]. Using a depth-first search approach, the algorithm reverses covered edges in Gπ, takes a linear extension τ of the resulting DAG, and re-evaluates against C to see if Gτ has fewer arrows than Gπ. If so, the algorithm reinitializes at τ and repeats this process until no sparser DAG can be recovered. In the observational setting, Greedy SP is known to be consistent whenever the data-generating distribution is faithful to the sparsest DAG [23].\n\n4 Two Permutation-Based Algorithms with Interventions\n\nWe now introduce our two interventional adaptations of Greedy SP and prove that they are consistent under the faithfulness assumption. In the first algorithm, presented in Algorithm 1, we use the same moves as Greedy SP, but we optimize with respect to a new score function that utilizes interventional data, namely the sum of the interventional BIC scores. To be more precise, for a collection of interventions I = {I1, . . . , IK}, the new score function is\n\nScore(G) := Σ_{k=1}^K ( max_{(A,Ω) ∈ Gk} ℓk(X̂^k; A, Ω) ) − Σ_{k=1}^K λ_{nk} |Gk|,\n\nwhere ℓk denotes the log-likelihood of the interventional distribution Pk, (A, Ω) are any parameters consistent with Gk, |G| denotes the number of arrows in G, and λ_{nk} = log(nk)/nk. When Algorithm 1 has access to observational and interventional data, then uniform consistency follows using similar techniques to those used to prove uniform consistency of Greedy SP in [23]. A full proof of the following consistency result for Algorithm 1 is given in the Supplementary Material.\n\nTheorem 4.1. Suppose P is Markov with respect to an unknown I-MAP Gπ∗. Suppose also that observational and interventional data are drawn from P for a collection of interventional targets I = {I1 := ∅, I2, . . . , IK}. If Pk is faithful to (Gπ∗)k for all k ∈ [K], then Algorithm 1 returns the I-MEC of the data-generating DAG Gπ∗ almost surely as nk → ∞ for all k ∈ [K].\n\nAlgorithm 2: Interventional Greedy SP (IGSP)\nInput: A collection of interventional targets I = {I1, . . . , IK} and a starting permutation π0.\nOutput: A permutation π and its minimal I-MAP Gπ.\n1 Set G := Gπ0;\n2 Using a depth-first-search approach with root π, search for a minimal I-MAP Gτ with |G| > |Gτ| that is connected to G by a list of I-covered edge reversals. Along the search, prioritize the I-covered edges that are also I-contradicting edges. If such a Gτ exists, set G := Gτ, update the number of I-contradicting edges, and repeat this step. If not, output the Gτ with |G| = |Gτ| that is connected to G by a list of I-covered edges and minimizes the number of I-contradicting edges.\n\nA problematic feature of Algorithm 1 from a computational perspective is the slack parameter δn. In fact, if this parameter were not included, then Algorithm 1 would not be consistent. This can be seen via an application of Algorithm 1 to the example depicted in Figure 1. Using the same set-up as the inconsistency example for GIES, suppose that the left-most DAG G in Figure 1 is the data-generating DAG, and that we draw nk i.i.d. samples from the interventional distribution Pk for the collection of targets I = {I1 = ∅, I2 = {4}, I3 = {5}}. Suppose also that n1 = Cn2 = Cn3 for some constant C > 1, and now additionally assume that we initialize Algorithm 1 at the permutation π = 1276543. Then the minimal I-MAP Gπ is precisely the DAG presented on the right in Figure 1. This DAG contains one covered arrow, namely 5 → 4. Reversing it produces the minimal I-MAP Gτ for τ = 1276453. Computing the score difference Score(Gτ) − Score(Gπ) using [16, Lemma 5.1] shows that as n1 tends to infinity, Score(Gτ) < Score(Gπ) with probability 1/2. Hence, Algorithm 1 would not be consistent without the slack parameter δn. This calculation can be found in the Supplementary Material.\n\nOur second interventional adaptation of the Greedy SP algorithm, presented in Algorithm 2, leaves the score function the same (i.e., the number of edges of the minimal I-MAP), but restricts the possible covered arrow reversals that can be queried at each step. In order to describe this restricted set of moves we provide the following definitions.\n\nDefinition 4.2. Let I = {I1, . . . , IK} be a collection of interventions, and for i, j ∈ [p] define the collection of indices\n\nI_{i\j} := {k ∈ [K] : i ∈ Ik and j ∉ Ik}.\n\nFor a minimal I-MAP Gπ we say that a covered arrow i → j ∈ Gπ is I-covered if\n\nI_{i\j} = ∅ or i → j ∉ (Gk)π for all k ∈ I_{i\j}.\n\nDefinition 4.3. We say that an arrow i → j ∈ Gπ is I-contradicting if the following three conditions hold: (a) I_{i\j} ∪ I_{j\i} ≠ ∅, (b) I_{i\j} = ∅ or i ⊥⊥ j in distribution Pk for all k ∈ I_{i\j}, (c) I_{j\i} = ∅ or there exists k ∈ I_{j\i} such that i ⊥̸⊥ j in distribution Pk.\n\nIn the observational setting, GES and Greedy SP utilize covered arrow reversals to transition between members of a single MEC as well as between MECs [2, 3, 23]. 
Since an I-MEC is characterized by the skeleta and immoralities of each of its interventional DAGs, I-covered arrows represent the natural candidate for analogous transitionary moves between I-MECs in the interventional setting. It is possible that reversing an I-covered edge i → j in a minimal I-MAP Gπ results in a new minimal I-MAP Gτ that is in the same I-MEC as Gπ; namely, this happens when i → j is a non-I-essential edge in Gπ. Similar to Greedy SP, Algorithm 2 implements a depth-first-search approach that allows for such I-covered arrow reversals, but it prioritizes those I-covered arrow reversals that produce a minimal I-MAP Gτ that is not I-Markov equivalent to Gπ; these arrows are the I-contradicting arrows. The result of this refined search via I-covered arrow reversals is an algorithm that is consistent under the faithfulness assumption.\n\nTheorem 4.4. Algorithm 2 is consistent under the faithfulness assumption.\n\nThe proof of Theorem 4.4 is given in the Supplementary Material. When only observational data is available, Algorithm 2 boils down to Greedy SP. We remark that the number of queries conducted in a given step of Algorithm 2 is, in general, strictly less than in the purely observational setting; that is to say, I-covered arrows generally constitute a strict subset of the covered arrows in a DAG.\n\n(a) p = 10, K = 1; (b) p = 10, K = 2; (c) p = 20, K = 1; (d) p = 20, K = 2\n\nFigure 2: The proportion of consistently estimated DAGs for 100 Gaussian DAG models on p nodes with K single-node interventions.\n\nAt first glance, keeping track of the I-covered edges may appear computationally inefficient. However, at each step we only need to update this list locally, so the computational complexity of the algorithm is not drastically impacted by this procedure. 
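The bookkeeping behind Definitions 4.2 and 4.3 reduces to simple set operations. The following is a hedged sketch in our own notation (representations and function names are ours, not the authors' implementation), computing the index set I_{i\j} and the intervention DAG obtained by deleting arrows into intervened nodes, using the targets I1 = ∅, I2 = {4}, I3 = {5} from the Figure 1 discussion:

```python
# Sketch (our notation, not the paper's code): the index set I_{i\j} from
# Definition 4.2 and the intervention DAG G_j obtained by deleting all
# arrows whose head is an intervened node.

def index_set(targets, i, j):
    """I_{i\\j} = {k : i in I_k and j not in I_k} for a list of target sets."""
    return {k for k, I in enumerate(targets) if i in I and j not in I}

def intervention_dag(arrows, target):
    """Remove every arrow (a, b) whose head b lies in the intervention target."""
    return {(a, b) for (a, b) in arrows if b not in target}

# Toy example mirroring the interventions used in the Figure 1 discussion:
# targets indexed 0, 1, 2 correspond to I1 = {}, I2 = {4}, I3 = {5}.
targets = [set(), {4}, {5}]
arrows = {(1, 2), (2, 4), (4, 5)}
```

For the arrow 4 → 5, the only target containing 4 but not 5 is I2, so I_{4\5} picks out exactly that intervention; intervening on {4} deletes the arrow 2 → 4 but leaves 4 → 5 intact.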
Hence, access to interventional data is beneficial in two ways: it allows us to reduce the number of search directions at every step, and it often allows us to estimate the true DAG more accurately, since an I-MEC is in general smaller than an MEC. Note that in this paper all theoretical analyses are based on the low-dimensional setting, where p ≪ n. The high-dimensional consistency of Greedy SP is shown in [23], and it is not difficult to see that the same high-dimensional consistency guarantees also apply to IGSP.\n\n5 Evaluation\n\nIn this section, we compare Algorithm 2, which we call Interventional Greedy SP (IGSP), with GIES on both simulated and real data. Algorithm 1 is of interest from a theoretical perspective, but it is computationally inefficient since it requires performing two variable selection procedures per update. Therefore, it will not be analyzed in this section. The code utilized for the following experiments can be found at https://github.com/yuhaow/sp-intervention.\n\n5.1 Simulations\n\nOur simulations are conducted for linear structural equation models with Gaussian noise:\n\n(X1, . . . , Xp)^T = ((X1, . . . , Xp)A)^T + ε,\n\nwhere ε ∼ N(0, 1p) and A = (aij)_{i,j=1}^p is an upper-triangular matrix of edge weights with aij ≠ 0 if and only if i → j is an arrow in the underlying DAG G∗. For each simulation study we generated 100 realizations of an (Erdős-Rényi) random p-node Gaussian DAG model for p ∈ {10, 20} with an expected edge density of 1.5. The collections of interventional targets I = {I0 := ∅, I1, . . . , IK} always consist of the empty set I0 together with K = 1 or 2 further targets. For p = 10, the size of each intervention set was 5 for K = 1 and 4 for K = 2. For p = 20, the sizes were increased to 10 and 8, respectively, to keep the proportion of intervened nodes constant. In each study, we compared GIES with Algorithm 2 using n samples for each intervention, with n = 10^3, 10^4, 10^5. 
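The data-generating process just described can be sketched in a few lines. This is an illustrative sketch assuming numpy, not the authors' simulation code; the seed, edge weights, and dimensions below are our own choices:

```python
# Minimal sketch of the simulation setup: a linear Gaussian SEM with
# upper-triangular weight matrix A, plus a perfect intervention that replaces
# each targeted variable by independent noise. Illustrative values only.
import numpy as np

def sample_sem(A, n, targets=(), rng=None):
    """Draw n samples from the SEM; variables in `targets` are set to pure noise."""
    rng = np.random.default_rng(rng)
    p = A.shape[0]
    X = np.zeros((n, p))
    for j in range(p):  # columns in topological order, since A is upper-triangular
        noise = rng.standard_normal(n)
        X[:, j] = noise if j in targets else X @ A[:, j] + noise
    return X

p = 3
A = np.triu(np.ones((p, p)), k=1) * 0.8   # DAG 0 -> 1 -> 2 with 0 -> 2
X_obs = sample_sem(A, 1000, rng=0)                 # observational samples
X_int = sample_sem(A, 1000, targets={1}, rng=0)    # intervene on node 1
```

Under the intervention on node 1, the dependence between nodes 0 and 1 disappears, which is exactly the information an interventional CI test exploits.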
Figure 2 shows the proportion of consistently estimated DAGs as distributed by choice of cut-off parameter for the partial correlation tests. Interestingly, although GIES is not consistent on random DAGs, in some cases it performs better than IGSP, in particular for smaller sample sizes. However, as implied by the consistency guarantees given in Theorem 4.4, IGSP performs better as the sample size increases.\n\nWe also conducted a focused simulation study on models for which the data-generating DAG G is that depicted on the left in Figure 1, for which GIES is not consistent. In this simulation study, we took 100 realizations of Gaussian models for the data-generating DAG G for which the nonzero edge-weights aij were randomly drawn from [−1, −c) ∪ (c, 1] for c = 0.1, 0.25, 0.5. The interventional targets were I = {I0 = ∅, I1}, where I1 was chosen uniformly at random from {4}, {5}, {4, 5}. Figure 3 shows, for each choice of c, the proportion of times G was consistently estimated as distributed by the choice of cut-off parameter for the partial correlation tests. We see from these plots that, as expected from our theoretical results, GIES recovers G at a lower rate than Algorithm 2.\n\n(a) c = 0.1; (b) c = 0.25; (c) c = 0.5\n\nFigure 3: Proportion of times the DAG G from Figure 1 (left) is consistently estimated under GIES and Algorithm 2 for Gaussian DAG models with edge-weights drawn from [−1, −c) ∪ (c, 1].\n\n5.2 Application to Real Data\n\nIn the following, we report results for studies conducted on two real datasets coming from genomics. The first dataset is the protein signaling dataset of Sachs et al. [21], and the second is the single-cell gene expression data generated using perturb-seq in [4].\n\nAnalysis of protein signaling data. The dataset of Sachs et al. 
[21] consists of 7466 measurements of the abundance of phosphoproteins and phospholipids recorded under different experimental conditions in primary human immune system cells. The different experimental conditions are generated using various reagents that inhibit or activate signaling nodes, and thereby correspond to interventions at different nodes in the protein signaling network. The dataset is purely interventional, and most interventions take place at more than one target. Since some of the experimental perturbations affect receptor enzymes instead of the measured signaling molecules, we consider only the 5846 measurements in which the perturbations of receptor enzymes are identical. In this way, we can define the observational distribution to be that of molecule abundances in the model where only the receptor enzymes are perturbed. This results in 1755 observational measurements and 4091 interventional measurements. Table E.2 in the Supplementary Material summarizes the number of samples as well as the targets for each intervention. For this dataset we compared the GIES results reported in [9] with Algorithm 2 using both a linear Gaussian and a kernel-based independence criterion [6, 24]. A crucial advantage of Algorithm 2 over GIES is that it is nonparametric and does not require Gaussianity. In particular, it supports kernel-based CI tests, which are in general better able to deal with non-linear relationships and non-Gaussian noise, a feature that is typical of datasets such as this one. For the GIES algorithm we present the results of [8], in which the authors varied the number of edge additions, deletions, and reversals as tuning parameters. For the linear Gaussian and kernel-based implementations of IGSP our tuning parameter is the cut-off value for the CI tests, just as in the simulated data studies in Section 5.1. 
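For reference, the kind of partial-correlation CI test whose cut-off is being tuned can be sketched as follows. This is an illustration using the standard Fisher z-transform and assumes numpy; the function names and details are ours, not the implementation used in the experiments (and the kernel-based variant of IGSP uses a different criterion entirely):

```python
# Sketch of a partial-correlation CI test with a tunable significance cut-off.
import math
import numpy as np

def partial_corr(data, i, j, S):
    """Partial correlation of columns i and j given the columns in S."""
    idx = [i, j] + list(S)
    prec = np.linalg.pinv(np.corrcoef(data[:, idx], rowvar=False))
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])

def ci_test(data, i, j, S=(), cutoff=0.01):
    """Return True if X_i is judged independent of X_j given X_S at the cut-off."""
    n = data.shape[0]
    r = partial_corr(data, i, j, S)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(S) - 3)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_value > cutoff

# Toy chain X0 -> X1 -> X2: X0 and X2 are dependent marginally,
# but independent given X1.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(2000)
x1 = 0.9 * x0 + rng.standard_normal(2000)
x2 = 0.9 * x1 + rng.standard_normal(2000)
data = np.column_stack([x0, x1, x2])
```

A larger cut-off makes the test reject independence more readily, yielding denser minimal I-MAPs; sweeping this cut-off traces out the ROC-style curves reported below.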
Figure 4 reports our results for thirteen different cut-off values in [10^{-4}, 0.7], which label the corresponding points in the plots. The linear Gaussian and kernel-based implementations of IGSP are comparable, and generally both outperform GIES. The Supplementary Material contains a comparison of the results obtained by IGSP on this dataset to other recent methods that also allow for latent confounders, such as ACI, COmbINE and ICP.\n\n(a) Directed edge recovery; (b) Skeleton recovery\n\nFigure 4: ROC plot of the models estimated from the data [21] using GIES as reported in [8] and the linear Gaussian and kernel-based versions of IGSP with different cut-off values for the CI tests. The solid line indicates the accuracy achieved by random guessing.\n\nAnalysis of perturb-seq gene expression data. We analyzed the performance of GIES and IGSP on perturb-seq data published by Dixit et al. [4]. The dataset contains observational data as well as interventional data from ~30,000 bone marrow-derived dendritic cells (BMDCs). Each data point contains gene expression measurements of 32,777 genes, and each interventional data point comes from a cell where a single gene has been targeted for deletion using the CRISPR/Cas9 system. After processing the data for quality, the data consists of 992 observational samples and 13,435 interventional samples from eight gene deletions. The number of samples collected under each of the eight interventions is shown in the Supplementary Material. These interventions were chosen based on empirical evidence that the gene deletion was effective1. We used GIES and IGSP to learn causal DAGs over 24 of the measured genes, including the ones targeted by the interventions, using both observational and interventional data. 
We followed [4] in focusing on these 24 genes, as they are\ngeneral transcription factors known to regulate each other as well as numerous other genes [7].\nWe evaluated the learned causal DAGs based on their accuracy in predicting the true effects of each of\nthe interventions (shown in Figure 5(a)) when leaving out the data for that intervention. Speci\ufb01cally,\nif the predicted DAG indicates an arrow from gene A to gene B, we count this as a true positive if\nknocking out gene A caused a signi\ufb01cant change2 in the distribution of gene B, and a false positive\notherwise. For each inference algorithm and for every choice of the tuning parameters, we learned\neight causal DAGs, each one trained with one of the interventional datasets being left out. We then\nevaluated each algorithm based on how well the causal DAGs are able to predict the corresponding\nheld-out interventional data. As seen in Figure 5(b), IGSP predicted the held-out interventional data\nbetter than GIES (as implemented in the R-package pcalg) and random guessing, for a number of\nchoices of the cut-off parameter. The true and reconstructed networks for both genomics datasets are\nshown in the Supplementary Material.\n\n6 Discussion\n\nWe have presented two hybrid algorithms for causal inference using both observational and inter-\nventional data and we proved that both algorithms are consistent under the faithfulness assumption.\nThese algorithms are both interventional adaptations of the Greedy SP algorithm and are the \ufb01rst\nalgorithms of this type that have consistency guarantees. While Algorithm 1 suffers a high level of\ninef\ufb01ciency, IGSP is implementable and competitive with the state-of-the-art, i.e., GIES. Moreover,\nIGSP has the distinct advantage that it is nonparametric and therefore does not require a linear\nGaussian assumption on the data-generating distribution. 
We conducted real data studies for protein signaling and single-cell gene expression datasets, which are typically non-linear with non-Gaussian noise. In general, IGSP outperformed GIES. This supports IGSP as a viable method for analyzing the new high-resolution datasets now being produced by procedures such as perturb-seq.

Footnote 1: An intervention was considered effective if the distribution of the gene expression levels of the deleted gene is significantly different from the distribution of its expression levels without intervention, based on a Wilcoxon rank-sum test with α = 0.05. Ineffective interventions on a gene are typically due to poor targeting ability of the guide-RNA designed for that gene.

Footnote 2: Based on a Wilcoxon rank-sum test with α = 0.05, which is approximately equivalent to a q-value of magnitude ≥ 3 in Figure 5(a).

Figure 5: (a) Heatmap of the true effects of each gene deletion on each measured gene. The q-value has the same magnitude as the log p-value of the Wilcoxon rank-sum test between the distributions of observational data and the interventional data; positive and negative q-values indicate increased and decreased abundance as a result of deletion, respectively. (b) ROC plot of prediction accuracy by the causal DAGs learned by IGSP and GIES. The solid line indicates the accuracy achieved by random guessing.

An important challenge for future work is to make these algorithms scale to 20,000 nodes, i.e., the typical number of genes in such studies. In addition, it would be interesting to extend IGSP to allow for latent confounders. An advantage of not allowing for latent confounders is that a DAG is usually more identifiable.
For example, for a DAG with two observable nodes, the DAG without confounders is fully identifiable by intervening on only one of the two nodes, but the same is not true for a DAG with confounders.

Acknowledgements

Yuhao Wang was supported by DARPA (W911NF-16-1-0551) and ONR (N00014-17-1-2147). Liam Solus was supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship (DMS-1606407). Karren Yang was supported by the MIT Department of Biological Engineering. Caroline Uhler was partially supported by DARPA (W911NF-16-1-0551), NSF (1651995) and ONR (N00014-17-1-2147). We thank Dr. Sofia Triantafillou from the University of Crete for helping us run COmbINE.

References

[1] S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics 25.2 (1997): 505-541.

[2] D. M. Chickering. A transformational characterization of equivalent Bayesian network structures. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1995.

[3] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research 3.Nov (2002): 507-554.

[4] A. Dixit, O. Parnas, B. Li, J. Chen, C. P. Fulco, L. Jerby-Arnon, N. D. Marjanovic, D. Dionne, T. Burks, R. Raychowdhury, B. Adamson, T. M. Norman, E. S. Lander, J. S. Weissman, N. Friedman and A. Regev. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167.7 (2016): 1853-1866.

[5] N. Friedman, M. Linial, I. Nachman and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology 7.3-4 (2000): 601-620.

[6] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems. 2008.

[7] M.
Garber, N. Yosef, A. Goren, R. Raychowdhury, A. Thielke, M. Guttman, J. Robinson, B. Minie, N. Chevrier, Z. Itzhaki, R. Blecher-Gonen, C. Bornstein, D. Amann-Zalcenstein, A. Weiner, D. Friedrich, J. Meldrim, O. Ram, C. Chang, A. Gnirke, S. Fisher, N. Friedman, B. Wong, B. E. Bernstein, C. Nusbaum, N. Hacohen, A. Regev, and I. Amit. A high throughput chromatin immunoprecipitation approach reveals principles of dynamic gene regulation in mammals. Molecular Cell 47.5 (2012): 810-822.

[8] A. Hauser and P. Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13.Aug (2012): 2409-2464.

[9] A. Hauser and P. Bühlmann. Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77.1 (2015): 291-318.

[10] A. Hyttinen, F. Eberhardt, and M. Järvisalo. Constraint-based causal discovery: conflict resolution with answer set programming. UAI. 2014.

[11] S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.

[12] E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, A. R. Bialas, N. Kamitaki, E. M. Martersteck, J. J. Trombetta, D. A. Weitz, J. R. Sanes, A. K. Shalek, A. Regev, and S. A. McCarroll. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161.5 (2015): 1202-1214.

[13] S. Magliacane, T. Claassen, and J. M. Mooij. Ancestral causal inference. Advances in Neural Information Processing Systems. 2016.

[14] C. Meek. Graphical Models: Selecting causal and statistical models. PhD thesis, Carnegie Mellon University, 1997.

[15] N. Meinshausen, A. Hauser, J. M. Mooij, J. Peters, P. Versteeg, and P. Bühlmann.
Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences, USA 113.27 (2016): 7361-7368.

[16] P. Nandy, A. Hauser, and M. H. Maathuis. High-dimensional consistency in score-based and hybrid structure learning. arXiv preprint arXiv:1507.02608 (2015).

[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.

[18] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.

[19] A. Rau, F. Jaffrézic, and G. Nuel. Joint estimation of causal effects from observational and intervention gene expression data. BMC Systems Biology 7.1 (2013): 111.

[20] J. M. Robins, M. A. Hernán and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology 11.5 (2000): 550-560.

[21] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science 308.5721 (2005): 523-529.

[22] P. Spirtes, C. N. Glymour and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, 2001.

[23] L. Solus, Y. Wang, C. Uhler, and L. Matejovicova. Consistency guarantees for permutation-based causal inference algorithms. arXiv preprint arXiv:1702.03530 (2017).

[24] R. E. Tillman, A. Gretton, and P. Spirtes. Nonlinear directed acyclic structure learning with weakly additive noise model. Advances in Neural Information Processing Systems. 2009.

[25] S. Triantafillou and I. Tsamardinos. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research 16 (2015): 2147-2205.

[26] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm.
Machine Learning 65.1 (2006): 31-78.