{"title": "Experimental Design for Learning Causal Graphs with Latent Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 7018, "page_last": 7028, "abstract": "We consider the problem of learning causal structures with latent variables using interventions. Our objective is not only to learn the causal graph between the observed variables, but to locate unobserved variables that could confound the relationship between observables. Our approach is stage-wise: We first learn the observable graph, i.e., the induced graph between observable variables. Next we learn the existence and location of the latent variables given the observable graph. We propose an efficient randomized algorithm that can learn the observable graph using O(d\\log^2 n) interventions where d is the degree of the graph. We further propose an efficient deterministic variant which uses O(log n + l) interventions, where l is the longest directed path in the graph. Next, we propose an algorithm that uses only O(d^2 log n) interventions that can learn the latents between both non-adjacent and adjacent variables. While a naive baseline approach would require O(n^2) interventions, our combined algorithm can learn the causal graph with latents using O(d log^2 n + d^2 log (n)) interventions.", "full_text": "Experimental Design for Learning Causal Graphs\n\nwith Latent Variables\n\nMurat Kocaoglu\u21e4\n\nDepartment of Electrical and Computer Engineering\n\nThe University of Texas at Austin, USA\n\nmkocaoglu@utexas.edu\n\nKarthikeyan Shanmugam\u21e4\nIBM Research NY, USA\n\nkarthikeyan.shanmugam2@ibm.com\n\nElias Bareinboim\n\nDepartment of Computer Science and Statistics\n\nPurdue University, USA\n\neb@purdue.edu\n\nAbstract\n\nWe consider the problem of learning causal structures with latent variables using\ninterventions. 
Our objective is not only to learn the causal graph between the\nobserved variables, but also to locate unobserved variables that could confound the\nrelationship between observables. Our approach is stage-wise: We first learn the\nobservable graph, i.e., the induced graph between observable variables. Next we\nlearn the existence and location of the latent variables given the observable graph.\nWe propose an efficient randomized algorithm that can learn the observable graph\nusing O(d log^2 n) interventions where d is the degree of the graph. We further\npropose an efficient deterministic variant which uses O(log n + l) interventions,\nwhere l is the length of the longest directed path in the graph. Next, we propose an algorithm that\nuses only O(d^2 log n) interventions that can learn the latents between both\nnon-adjacent and adjacent variables. While a naive baseline approach would require\nO(n^2) interventions, our combined algorithm can learn the causal graph with\nlatents using O(d log^2 n + d^2 log n) interventions.\n\n1 Introduction\n\nCausality shapes how we view, understand, and react to the world around us. It is arguably a key\ningredient in building intelligent systems that are autonomous and can act efficiently in complex\nenvironments. Not surprisingly, the task of automating the learning of cause-and-effect relationships\nhas attracted great interest in the artificial intelligence and machine learning communities. This effort\nhas led to a general theoretical and algorithmic understanding of the assumptions under which\ncause-and-effect relationships can be inferred from data. 
These results have started to percolate through the\napplied \ufb01elds ranging from genetics to medicine, from psychology to economics [5, 26, 33, 25].\nThe endeavour of algorithmically learning causal relations may have started from the independent\ndiscovery of the IC [35] and PC algorithms [33], which almost identically, and contrary to previously\nheld beliefs, showed the feasibility of recovering these relations from purely observational, non-\nexperimental data. A plethora of methods followed this breakthrough, and now we understand, at\nleast in principle, the limits of what can be inferred from purely observational data, including (not\nexhaustively) [31, 14, 21, 27, 19, 13]. There are a number of assumptions that have been considered\nabout the data-generating model when attempting to unveil the causal structure. One of the most\n\n\u21e4Equal contribution.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fpopular assumptions is that the data-generating model is causally suf\ufb01cient, which means that no\nlatent (unmeasured) variable affects more than one observed variable. In practice, this is a very\nstringent condition since the existence of latents affecting more than one observed variable, and\ngenerating what is called confounding bias, is one of the main concerns of empirical scientists.\nThe problem of causation is deemed challenging in most of the empirical \ufb01elds because scientists\nrecognize that not all the variables in\ufb02uencing the observed phenomenon can be measured. 
The\ngeneral question that arises is then how much of the observed behavior of the system is truly causal,\nor whether it is due to some external, unobserved forces [26, 5].\nTo account for the latent variables in the context of structural learning, the IC* [35] and FCI [33]\nalgorithms were introduced, which showed the possibility of recovering causal structures even when\nlatent variables may be confounding the observed behavior 2. One of the main challenges faced\nby these algorithms is that although some ancestral relations as well as certain causal edges can be\nlearned [36, 7], many observationally equivalent architectures cannot be distinguished. Despite the\npractical challenges when collecting the data (e.g., \ufb01nite samples, selection bias, missing data), we\nnow have a complete characterization of what structures are recoverable from observational data\nbased on conditional independence constraints [33, 2, 37]. Inferences will be constrained within\nan equivalence class. Initial works leveraged ideas of experimental design and the availability of\ninterventional data to move from the equivalence class to a speci\ufb01c graph, but almost exclusively\nconsidering causally suf\ufb01cient systems [9, 15, 11, 12, 30, 18].\nFor causally insuf\ufb01cient systems, there is a growing interest in identifying experimental quantities\nand structures based on partially observed interventional data [4, 32, 29, 28, 24, 16, 8, 34, 22], but\nwithout the goal of designing the optimal set of interventions. Perhaps the most relevant paper to\nour setup is [23]. Authors identify the experiments needed to learn the causal graph under latents,\ngiven the output of FCI algorithm. However, they are not interested in minimizing the number of\nexperiments.\nIn this paper, we propose the \ufb01rst ef\ufb01cient non-parametric algorithm for learning a causal graph with\nlatent variables. 
It is known that log(n) interventions are necessary (across all graphs) and sufficient\nto learn a causal graph without latent variables [12], and we show, perhaps surprisingly, that there\nexists an algorithm that can learn any causal graph with latent variables which requires poly(log n)\ninterventions when the observable graph is sparse. More specifically, our contributions are as follows:\n\u2022 We introduce a deterministic algorithm (see footnote 3) that can learn any causal graph and the existence and\nlocation of the latent variables using O(d log(n) + l) interventions, where d is the largest node\ndegree and l is the length of the longest directed path of the causal graph.\n\u2022 We design a randomized algorithm that can learn the observable graph and all the latent variables\nusing O(d log^2(n) + d^2 log(n)) interventions with high probability, where d is the largest node\ndegree.\n\nThe first algorithm is useful in practical settings where the longest directed path is short, e.g.,\nO(log(n)). This includes bipartite, time-series, and relational domains where the underlying\ncausal topology is somewhat sparse. As an example application, consider the problem of inferring\nthe causal effect of a set of genes on a set of phenotypes, which could be cast as learning a bipartite\ncausal system. For the more general setting, we introduce a randomized algorithm that with high\nprobability is capable of unveiling the true causal structure.\nBackground\nWe assume for simplicity that all the random variables are discrete. We use the language of Structural\nCausal Models (SCM) [26, pp. 204-207]. 
Formally, an SCM M is a 4-tuple hU,V,F, P (u)i, where\nU is a set of exogenous (unobserved, latent) variables, V is a set of endogenous (measured) variables.\nWe partition the set of exogenous variables into two disjoint sets: Exogenous variables with one\nobservable child, denoted by E, exogenous variables with two observable children, denoted by L.\nF = {fi} is a collection of functions such that each endogenous variable Vi 2V is determined by\na function fi 2 F : Each fi is a mapping from the respective domain of the exogenous variables\nassociated with Vi and a set of observable variables associated with Vi, called P Ai, into Vi. The\n\n2Hereafter, latent variable refers to any unmeasured variable that affects more than one observed variable.\n3We assume access to an oracle that outputs a size-O(d2 log (n)) independent set cover for the non-edges of\na given graph. This oracle can be implemented using another randomized algorithm as we explain in Section 5.\n\n2\n\n\fset of exogenous variables associated with Vi can be divided into two classes, the one with a single\nobservable child, denoted by Ei 2E , and those with two observable children, denoted by Li \u2713L .\nHence fi maps from the domain of Ei [ P Ai [L i to Vi. The entire set F forms a mapping from U to\nV. The uncertainty is encoded through a product probability distribution over the exogenous variables\nP (E,L). For simplicity we refer to L as the set of latents, and E as the set of exogenous variables.\nWithin the structural semantics, performing an action S = s is represented through the do-operator,\ndo(S = s), which encodes the operation of replacing the original equation of S by the constant s\nand induces a submodel MS (also for when S is not a singleton). We denote the post-interventional\ndistribution by PS(\u00b7). For a detailed discussion on the properties of structural models, we refer\nreaders to [5, 23, 24, Ch. 7]. De\ufb01ne D` = (V[L , E`) to be the causal graph with latents. 
We define\nthe observable graph to be the induced subgraph on V which is D = (V, E).\nIn practice, we use an independent random variable Wi taking values uniformly at random in the state\nspace of Vi, to implement an intervention do(Vi). A conditional independence statement, e.g., X is\nindependent of Y given Z ⊂ V with respect to causal model MS, is denoted by (X ⊥⊥ Y | Z)MS,\nor (X ⊥⊥ Y | Z)S when the causal model is clear from the context. These conditional independencies\nare with respect to the post-interventional joint probability distribution PS(\u00b7). In this paper, we\nassume that an oracle for conditional independence (CI) tests is available.\nThe mutilated or post-interventional causal graph, denoted D`[S] = (V ∪ L, E`[S]), is identical to\nD` except that all the incoming edges incident on any vertex in the interventional set S are absent, i.e.,\nE`[S] = E` ∖ {(Y, V) : V ∈ S, (Y, V) ∈ E`}. We define the transitive closure, denoted Dtc, of an\nobservable causal DAG D as follows: If there is a directed path from Vi to Vj in D, there is a directed\nedge from Vi to Vj in Dtc. Essentially, a directed edge in Dtc represents an ancestral relation in D.\nFor any DAG D = (V, E), a set of nodes S ⊂ V d-separates two nodes a and b if and only if S\nblocks all paths between a and b. \u2018Blocking\u2019 is a graphical criterion associated with d-separation (see footnote 4). A\nprobability distribution is said to be faithful (or stable) to a graph if and only if every conditional\nindependence statement can be read off from the graph using d-separation; see [26, Ch. 2] for a\nreview. We assume that faithfulness holds in the observational and post-interventional distributions\nfollowing [12].\nResults and outline of the paper\nThe skeleton of the proposed learning algorithms can be split into 3 steps, namely:\n\n(a)\n\n;\n\n! Transitive Closure (b)\n\n! Observable graph (c)\n\n! 
Observable graph with Latent variables\n\nEach step requires different tools and graph theoretic concepts:\n(a) We use a pairwise independence test under interventions that reveals the ancestral relations. This\nis combined in an ef\ufb01cient manner with separating systems to discover the transitive closure of D\nin O(log n) interventions.\n(b) We rely on the transitive reduction of directed acyclic graphs that can be ef\ufb01ciently computed only\nfrom their transitive closure. A key property we observe is that the transitive reduction reveals a\nsubset of the true edges. For our randomized algorithm, we use a sequence of transitive reductions\ncomputed from transitive closures (obtained using step (a)) of different post-interventional graphs.\n(c) Given the observable graph, it is possible to discover latents between non-adjacent nodes using\nCI tests under suitable interventions. We use an edge-clique cover on the complement graph to\noptimize the number of experiments. For latents between adjacent nodes, we use a relatively\nunknown test called the do-see test, i.e., leveraging the equivalence between observing and\nintervening on the node. We implement it using induced matching cover of the observable graph.\nThe modularity of our approach allows us to solve subproblems: given the ancestral graph, we can\nuse (b) to discover the observable graph D. If D is known, we can learn the latents with (c). Some\npictorial illustrations of the main results in the technical sections are found in the full version [20].\n\nIdentifying the Observable Graph: A simple baseline\n\n2\nWe discuss a natural and a simple deterministic baseline algorithm that \ufb01nds the observable graph\nwith experiments when confounders are present. 
To our knowledge, a provably complete algorithm\n4For convenience, detailed de\ufb01nitions of blocking and non-blocking paths are provided in the full version\n\n[20].\n\n3\n\n\fthat recovers the observable graph under this setting and is superior than this simple baseline in the\nworst case is not known. We start from the following observation. Suppose X ! Y where X, Y\nare observable variables and let L be a latent variable such that L ! X, L ! Y . Consider the\npost interventional graph D`[{X}] where we intervene on X. It is easy to see that, X and Y are\ndependent in the post interventional graph too because of the direct causal relationship. However, if\nX is not a parent of Y , then in the post interventional graph D`[{X}] even with or without the latent\nL between X and Y , X is independent of Y since X is intervened on.\nIt is possible to recreate this condition between any target variable Y and any one of its direct parents\nX when many other observable variables are involved. Simply, we consider the post-interventional\ngraph where we intervene on all observable variables but Y . In D`[V { Y }], Y and X are dependent\nif and only if X ! Y is a directed edge in the observable graph D, because every variable except X\nbecomes independent of all other variables in the post interventional graph. Therefore, one needs n\ninterventions, each of size n 1 to \ufb01nd out the parent set of every node. We basically show in the next\ntwo sections that when the graph D has constant degree, it is enough to do O(log2(n)) interventions\nrepresenting the \ufb01rst provably exponential improvement.\n\n3 Learning Ancestral Relations\n\nIn this section, we show that separating systems can be used to construct sequences of pairwise CI\ntests to discover the transitive closure of the observable causal graph, i.e., the graph that captures all\nancestral relations. 
The following lemma relates post-interventional statistical dependencies to the\nancestral relations in the graph with latents.\nLemma 1. [Pairwise Conditional Independence Test] Consider a causal graph with latents D`.\nConsider an intervention on the set S ⊂ V of observable variables. Then, under the post-interventional\nfaithfulness assumption, for any pair Xi ∈ S, Xj ∈ V ∖ S, (Xi ⊥̸⊥ Xj)D`[S] if and only if Xi is an\nancestor of Xj in the post-interventional observable graph D[S].\n\nLemma 1 constitutes, for any ordered pair of variables (Xi, Xj) in the observable graph D, a test for\nwhether Xi is an ancestor of Xj or not. Note that a single test is not sufficient to discover the ancestral\nrelation between a pair (Xi, Xj), e.g., if Xi → Xk → Xj and Xi, Xk ∈ S, Xj ∉ S, the ancestral\nrelation will not be discovered. This issue can be resolved by using a sequence of interventions\nguided by a separating system, and later finding the transitive closure of the learned graph.\nSeparating systems were first defined by [17], and have subsequently been used in the context of\nexperimental design [10]. A separating system on a ground set S is a collection of subsets of S,\nS = {S1, S2, . . .} such that for every pair (i, j), there is a set that contains only one of them, i.e., ∃k such\nthat i ∈ Sk, j ∉ Sk or j ∈ Sk, i ∉ Sk. We require a stronger notion which is captured by a strongly\nseparating system.\nDefinition 1. An (m, n) strongly separating system is a family of subsets {S1, S2, . . . , Sm} of the\nground set [n] such that for any pair of distinct nodes i and j, there is a set S in the family such that\ni ∈ S, j ∉ S and also another set S' such that i ∉ S', j ∈ S'.\nSimilar to separating systems, one can construct strongly separating systems using O(log(n)) subsets:\nLemma 2. 
An (m, n) strong separating system exists on a ground set [n] where m \uf8ff 2dlog ne.\nWe propose Algorithm 1 to discover the ancestral relations between the observable variables. It uses\nthe subsets of a strongly separating system on the ground set of all observable variables as intervention\nsets, to assure that the ancestral relation between every ordered pair of observable variables is tested.\nThe following theorem shows the number of experiments and the soundness of Algorithm 1.\nTheorem 1. Algorithm 1 requires only 2dlog ne interventions and conditional independence tests on\nsamples obtained from each post-interventional distribution and outputs the transitive closure Dtc.\n\n4 Learning the Observable Graph\nWe introduce a deterministic and a randomized algorithm for learning the observable causal graph D\nfrom ancestral relations. D encodes every direct causal connection between the observable nodes.\n\n4\n\n\fE = ;.\nConsider a strongly sep. system of size \uf8ff 2 log n on the ground set V - {S1, S2..S2dlog ne}.\nfor i in [1 : 2dlog ne] do\n\nAlgorithm 1 LearnAncestralRelations- Given access to a conditional independence testing oracle\n(CI oracle), query access to samples from any post-interventional causal model derived out of M\n(with causal graph D`), outputs all ancestral relationships between observable variables, i.e., Dtc\n1: function LEARNANCESTRALRELATIONS(M)\n2:\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13:\n14: end function\n\nUse samples from MSi and use the CI-oracle to test the following.\nif (X 6?? Y )D`[S] then\nE E [ (X, Y ).\nend if\n\nend for\nreturn The transitive closure of the graph (V, E)\n\nIntervene on the set Si of nodes.\nfor X 2 Si, Y /2 Si, Y 2V do\n\nend for\n\n4.1 A Deterministic Algorithm\nBased on Section 3, assume that we are given the transitive closure of the observable graph. 
We show\nin Lemma 3 that, when the intervention set contains all parents of Xi, the only variables dependent\nwith Xi in the post-interventional observable graph are the parents of Xi in the observable graph.\nLemma 3. For variable Xi, consider an intervention on S where P ai \u21e2 S. Then {Xj 2 S : (Xi 6??\nXj)D[S]} = P ai.\nLet the longest directed path of Dtc be r. Consider the partial order <Dtc implied by Dtc on\nthe vertex set V. De\ufb01ne {Ti : i 2 [r + 1]} as the unique partitioning of vertices of Dtc where\nTi <Dtc Tj,8i < j and each node in Ti is a set of mutually incomparable elements. In other words,\nTi are the set of nodes at layer i of the transitive closure graph Dtc. De\ufb01ne Ti = [i1\nk=1Tk. We have\nthe following observation: P ai \u21e2T i. This paves the way for Algorithm 2 that leverages Lemma 3.\nAlgorithm 2 LearnObservableGraph/Deterministic - Given the ancestral graph, access to a conditional\nindependence testing oracle (CI oracle) and outputs the graph induced on observable nodes.\n1: function LEARNOBSERVABLEGRAPH/DETERIMINISTIC(M)\n2:\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13: end function\n\nE = ;.\nfor i in {r + 1, r, r  1, . . . , 2} do\nIntervene on the set Ti of nodes.\nUse samples from MTi and use the CI-oracle to test the following.\nfor X in Ti do\n\nif (X 6?? Y )D`[Ti] then\nE E [ (X, Y ).\nend if\n\nend for\nreturn Observable graph\n\nend for\n\nThe correctness of Algorithm 2 follows from Lemma 3, which is stated explicitly in the sequel.\nTheorem 2. Let r be the length of the longest directed path in the causal graph D`. Algorithm 2\nrequires only r interventions and conditional independence tests on samples obtained from each one\nof the post-interventional distributions and outputs the observable graph D.\n\n4.2 A Randomized Algorithm\nWe propose a randomized algorithm that repeatedly uses the ancestor graph learning algorithm from\nSection 3 to learn the observable graph 5. 
A key structure that we use is the transitive reduction:\n\n5Note that this algorithm does not require learning the ancestral graph \ufb01rst.\n\n5\n\n\fV1\n\nV2\n\nV3\n\nV4\n\nV1\n\nV2\n\nV3\n\nV4\n\nObservable\tGraph\tD\n\n(a)\n\nTransitive\treduction\tof\tD\n(b)\n\nV1\n\nV2\n\nV3\n\nV4\n\nPost-interventional\tgraph\tD[{V2}]\n\nAfter\tintervention\ton\tV2\n\n(c)\n\nV1\n\nV2\n\nV3\n\nV4\n\nTransitive\treduction\tof\tD[{V2}]\n\n(d)\n\nIllustration of Lemma 5 - (a) An example of an observable graph D without latents\nFigure 1:\n(b): Transitive reduction of D. The highlighted red edge (V1, V3) has not been revealed under the\noperation of transitive reduction. c) Intervention on node V2 and its post interventional graph D[{V2}]\nd) Since all parents of V3 above V1 in the partial order have been intervened on, by Lemma 5, the\nedge (V1, V3) is revealed in the transitive reduction of D[{V2}].\n\nDe\ufb01nition 2 (Transitive Reduction). Given a directed acyclic graph D = (V, E), let its transitive\nclosure be Dtc. Then Tr(D) = (V, Er) is a directed acyclic graph with minimum number of edges\nsuch that its transitive closure is identical to Dtc.\nLemma 4. [1] Tr(D) is known to be unique if D is acyclic. Further, the set of directed edges of\nTr(D) is a subset of the directed edges of D, i.e., Er \u21e2 E. Computing Tr(D) from D takes the same\ntime as transitive closure of a DAG D, which takes time poly(n).\n\nWe note that Tr(D) = Tr(Dtc). Now, we provide an algorithm that outputs an observable graph\nbased on samples from the post-interventional distribution after a sequence of interventions. Let us\nassume an ordering \u21e1 on the observable vertices V that satis\ufb01es the partial order relationships in the\nobservable causal graph D. The key insight behind the algorithm is given by the following Lemma.\nLemma 5. Consider an intervention on a set S \u21e2V of nodes in the observable causal graph D.\nConsider the post-interventional observable causal graph D[S]. 
Suppose for a specific observable\nnode Vi, Vi ∈ S^c. Let Y be a direct parent of Vi in D such that all the direct parents of Vi above Y\nin the partial order π(\u00b7) (see footnote 6) are in S, i.e., {X : π(X) > π(Y), (X, Vi) ∈ D} ⊆ S. Then, Tr(D[S]) will\ncontain the directed edge (Y, Vi) and it can be computed from Tr((D[S])tc).\n\nWe illustrate Lemma 5 through an example in Figure 1. The red edge in Figure 1(a) is not revealed\nin the transitive reduction. The edge is revealed when computing the transitive reduction of the\npost-interventional graph D[{V2}]. This is possible because all parents of V3 above V1 in the partial\norder (in this case node V2) have been intervened on.\nLemma 5 motivates Algorithm 3. The basic idea is to intervene on a randomly chosen set of nodes, then compute the\ntransitive closure of the post-interventional graph using the algorithm in the previous section, compute\nthe transitive reduction, and then accumulate all the edges found in the transitive reduction at every\nstage. We will show in Theorem 3 that with high probability, the observable graph can be recovered.\nTheorem 3. Let dmax be greater than the maximum in-degree in the observable graph D.\nAlgorithm 3 requires at most 8c dmax (log n)^2 interventions and CI tests on samples obtained from\npost-interventional distributions, and outputs the observable graph with probability at least 1 − 1/n^(c−2).\nRemark. The above algorithm takes as input a parameter dmax that needs to be estimated. One\npractical option is to gradually increase dmax and run Algorithm 3.\n\n6The nodes above with respect to the partial order of a graph are those that are closer to the source nodes.\n\nAlgorithm 3 LearnObservable- Given access to a conditional independence testing oracle (CI oracle)\nand a parameter dmax, outputs the induced subgraph between observable variables, i.e. 
D\nfunction LEARNOBSERVABLE/RANDOMIZED(M, dmax)\n    E = ∅.\n    for i in [1 : c * 4 * dmax log n] do\n        S = ∅.\n        for each observable variable Vj ∈ V do\n            S ← S ∪ {Vj} randomly with probability 1 − 1/dmax.\n        end for\n        D̂S = LearnAncestralRelations(M). Let D̂ = (V, Ê).\n        Compute the transitive reduction of D̂S, Tr(D̂S), according to the algorithm in [1].\n        Add the edges of the transitive reduction to the set E if not already there, i.e., E ← E ∪ Ê.\n    end for\n    return The directed graph (V, E).\nend function\n\n5 Learning Latents from the Observable Graph\nThe final stage of our framework is learning the existence and location of latent variables given the\nobservable graph. We divide this problem into two steps: first, we devise an algorithm that can learn\nthe latent variables between any two variables that are non-adjacent in the observable graph; later, we\ndesign an algorithm that learns the latent variables between every pair of adjacent variables.\n\n5.1 Baseline Algorithm for Detecting Latents between Non-edges\nConsider two variables X and Y such that X ← L → Y, where L is a latent variable. Clearly, to\ndistinguish it from the case where X and Y are disconnected and have no latents, one needs to check whether\nX ⊥̸⊥ Y or not. This is a conditional independence test. When the observable graph D is known,\nchecking for a latent between the endpoints of a non-edge (X, Y) in the presence of other\nvariables and possible confounders simply requires intervening on the remaining n − 2\nvariables and performing an independence test between X and Y in the post-interventional graph. This requires\na distinct intervention for every pair of variables. If the observable graph has maximum degree\nd = o(n), this requires Θ(n^2) interventions. 
We will reduce this to O(d^2 log n) interventions, which\nis an exponential improvement for constant degree graphs.\n\n5.2 Latents between Non-adjacent Nodes\n\nWe start by noting the following fact about causal systems with latent variables:\nTheorem 4. Consider two non-adjacent nodes Xi, Xj. Let S be the union of the parents of Xi, Xj,\nS = Pai ∪ Paj. Consider an intervention on S. Then we have (Xi ⊥̸⊥ Xj)MS if and only if there\nexists a latent variable Li,j such that Xj ← Li,j → Xi. The statement also holds under an intervention\non any set S such that Pai ∪ Paj ⊂ S and Xi, Xj ∉ S.\nThe above theorem motivates the following approach: For a set of nodes which forms an independent\nset, an intervention on the union of parents of the nodes of the independent set allows us to learn\nthe latents between any two nodes in the independent set. We leverage this observation using the\nfollowing lemma on the number of such independent sets needed to cover all non-edges.\nLemma 6. Consider a directed acyclic graph D = (V, E) with degree (out-degree+in-degree)\nd. Then there exists a randomized algorithm that returns a family of m = O(4e^2(d + 1)^2 log(n))\nindependent sets I = {I1, I2, . . . , Im} that cover all non-edges of D: ∀i, j such that (Xi, Xj) ∉ E\nand (Xj, Xi) ∉ E, ∃k ∈ [m] such that Xi ∈ Ik and Xj ∈ Ik, with probability at least 1 − 1/n^2.\nNote that this is a randomized construction and we are not aware of any deterministic construction.\nOur deterministic causal learning algorithm requires oracle access to such a family of independent\nsets, whereas our randomized algorithm can directly use this randomized construction. 
Now, we use\nthis observation to construct a procedure to identify latents between non-edges (see Algorithm 4).\nThe following theorem about its performance follows from Lemma 6 and Theorem 4.\n\n7\n\n\fAlgorithm 4 LearnLatentNonEdge- Given access to a CI oracle, observable graph D with max degree\nd (in-degree+out-degree), outputs all latents between non-edges\n1: function LEARNLATENTNONEDGE(M, dmax)\n2:\n3:\n\nL = ;.\nApply the randomized algorithm in Lemma 6 to \ufb01nd a family of independent sets I =\nfor j 2 [1 : m] do\n\n{I1, I2, . . . , Im} that cover all non-edges in D such that m \uf8ff 4e2(d + 1)2 log(n).\n\nIntervene on the parent set of the nodes in Ij.\nfor every pair of nodes X, Y in Ij do\n\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13: end function\n\nend for\n\nif (X 6?? Y )D`[Ij ] then\nL L [{ X, Y }.\nend if\n\nend for\nreturn The set of non-edges L.\n\nU\n\nZ\n\nL\n\nM\n\nX\n\nY\n\nG1: do(PaX) is needed \n\nT\n\nL\n\nY\n\nX\nM\nG2: do(PaY) is needed \n\nZ\n\nFigure 2: Left: A graph where intervention on the parents of X is needed for do-see test to succeed.\nRight: A graph where intervention on the parents of Y is needed for do-see test to succeed.\n\nTheorem 5. Algorithm 4 outputs a list of non-edges L that have latent variables between them, given\nthe observable graph D, with probability at least 1  1\nn2 . The algorithm requires 4e2(d + 1)2 log(n)\ninterventions where d is the max-degree (in-degree+out-degree) of the observable graph.\n\n5.3 Latents between Adjacent Nodes\n\nWe construct an algorithm that can learn latent variables between the variables adjacent in the\nobservable graph. Note that the approach of CIT testing in the post-interventional graph is not helpful.\nConsider the variables X ! Y . To see the effect of the latent path, one needs to cut the direct edge\nfrom X to Y . This requires intervening on Y . However, such an intervention disconnects Y from its\nlatent parent. 
Thus we resort to a different approach compared to the previous stages and exploit a\ndifferent characterization of causal Bayesian networks called a \u2018do-see\u2019 test.\nA do-see test can be described as follows: Consider again a graph where X ! Y . If there are no\nlatents, we have P(Y |X) = P(Y |do(X)). Assume that there is a latent variable Z which causes both\nX and Y , then excepting the pathological cases7, P(Y |X) 6= P(Y |do(X)).\nFigure 2 illustrates the challenges associated with a do-see test in bigger graphs with latents. Graphs\nG1 and G2 are examples where parents of both nodes involved in the test need to be included in the\nintervention set for the Do-see test to work. In G1, suppose we condition on X, as required by the\n\u2018see\u2019 test. This opens up a non-blocking path X  U  T  M  Y . Since X ! Y is not the only\nd-connecting path, it is not necessarily true that P(Y |X) = P(Y |do(X)). Now suppose we perform\nthe do-see test under the intervention do(Z). Then the aforementioned path is closed since X is not a\ndescendant of T in the post interventional graph. Hence we have P(Y |X, do(Z)) = P(Y |do(X, Z)).\nSimilarly G2 shows that intervening on the parent set of Y is also necessary.\nWe have the following theorem, which shows that we can perform the do-see test between X, Y\nunder do(P aX, P aY ):\n\n7These cases are fully identi\ufb01ed in the full version [20].\n\n8\n\n\fTheorem 6. [Interventional Do-see test] Consider a causal graph D on the set of observable\nvariables V = {Vi}i2[n] and latent variables L = {Li}i2[m] with edge set E. If (Vi, Vj) 2 E, then\n\nPr(Vj|Vi = vi, do(P ai = pai, P aj = paj)) = Pr(Vj|do(Vi = vi, P ai = pai, P aj = paj)),\n\niff @k such that (Lk, Vi) 2 E and (Lk, Vj) 2 E, where P ai is the set of parents of Vi in V . 
Quantities\non both sides are invariant irrespective of additional interventions elsewhere.\n\nNext we need a subgraph structure to perform multiple do-see tests at once in order to efficiently\ndiscover the latents between the adjacent nodes. Performing the test separately for every edge would take O(n) interventions\neven in graphs with constant degree. We use strong edge coloring of sparse graphs.\nDefinition 3. A strong edge coloring of an undirected graph with k colors is a map $E \\to [k]$\nsuch that every color class is an induced matching. Equivalently, it is an edge coloring such that any\ntwo nodes incident to distinct edges with the same color are non-adjacent.\n\nLemma 7. [6] A graph of maximum degree d can be strongly edge-colored with at most $2d^2$ colors.\nA simple greedy algorithm that colors edges in sequence achieves this.\n\nNow observe that a color class of the edges forms an induced matching. We show that due to this,\nthe \u2018do\u2019 part (RHS of Theorem 6) of all the do-see tests in a color class can be performed with a\nsingle intervention while the \u2018see\u2019 part (LHS of Theorem 6) can be again performed with another\nintervention. We argue that we need exactly two different interventions per color class. The following\ntheorem uses this property to prove correctness of Algorithm 5.\n\nAlgorithm 5 LearnLatentEdge: Given an observable graph D with max degree d (in-degree+out-degree),\noutputs all latents between edges\n1: function LEARNLATENTEDGE(M, d)\n2:   $L = \\emptyset$.\n3:   Apply the greedy algorithm in Lemma 7 to color the edges of D with $k \\le 2d^2$ colors.\n4:   for $j \\in [1 : k]$ do\n5:     Let $A_j$ be the nodes involved with the edges that form color class j. Let $P_j$ be the union of parents of all nodes in $A_j$ except the nodes in $A_j$.\n6:     Let the set of tail nodes of all edges be $T_j$.\n7:     Following loop requires the intervention on the set $T_j \\cup P_j$, i.e. $do(\\{T_j, P_j\\})$.\n8:     for every directed edge $(V_t, V_h)$ in color class j do\n9:       Calculate $S(V_t, V_h) = P(V_h \\mid do(T_j, P_j))$ using post-interventional samples.\n10:     end for\n11:     Following loop requires the intervention on the set $P_j$.\n12:     for every directed edge $(V_t, V_h)$ in color class j do\n13:       Calculate $S'(V_t, V_h) = P(V_h \\mid V_t, do(P_j))$ using post-interventional samples.\n14:       if $S'(V_t, V_h) \\ne S(V_t, V_h)$ then\n15:         $L \\leftarrow L \\cup \\{(V_t, V_h)\\}$\n16:       end if\n17:     end for\n18:   end for\n19:   return The set of edges L that have latents between them.\n20: end function\n\nTheorem 7. Algorithm 5 requires at most $4d^2$ interventions and outputs all latents between the edges\nin the observable graph.\n\n6 Conclusions\n\nLearning cause-and-effect relations is one of the fundamental challenges in science. We studied the\nproblem of learning causal models with latent variables using experimental data. Specifically, we\nintroduced two efficient algorithms capable of learning direct causal relations (instead of ancestral\nrelations) and finding the existence and location of potential latent variables.\n\nReferences\n[1] Alfred V. Aho, Michael R Garey, and Jeffrey D. Ullman. The transitive reduction of a directed\ngraph. SIAM Journal on Computing, 1(2):131\u2013137, 1972.\n\n[2] Ayesha R. Ali, Thomas S. Richardson, Peter L. Spirtes, and Jiji Zhang. Towards characterizing\nmarkov equivalence classes for directed acyclic graphs with latent variables. In Proc. of the\nUncertainty in Artificial Intelligence, 2005.\n\n[3] Noga Alon. Covering graphs by the minimum number of equivalence relations. Combinatorica,\n6(3):201\u2013206, 1986.\n\n[4] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. 
In\nNando de Freitas and Kevin Murphy, editors, Proceedings of the Twenty-Eighth Conference on\nUncertainty in Artificial Intelligence, pages 113\u2013120, Corvallis, OR, 2012. AUAI Press.\n\n[5] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the\nNational Academy of Sciences, 113:7345\u20137352, 2016.\n\n[6] Julien Bensmail, Marthe Bonamy, and Herv\u00e9 Hocquard. Strong edge coloring sparse graphs.\nElectronic Notes in Discrete Mathematics, 49:773\u2013778, 2015.\n\n[7] Giorgos Borboudakis, Sofia Triantafillou, and Ioannis Tsamardinos. Tools and algorithms for\ncausally interpreting directed edges in maximal ancestral graphs. In Sixth European Workshop\non Probabilistic Graphical Models, 2012.\n\n[8] Tom Claassen and Tom Heskes. Causal discovery in multiple models from different experiments.\nIn Advances in Neural Information Processing Systems, pages 415\u2013423, 2010.\n\n[9] Frederick Eberhardt. Causation and Intervention (Ph.D. Thesis), 2007.\n\n[10] Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of\nScience, 74(5):981\u2013995, 2007.\n\n[11] Alain Hauser and Peter B\u00fchlmann. Characterization and greedy learning of interventional\nmarkov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research,\n13(1):2409\u20132464, 2012.\n\n[12] Alain Hauser and Peter B\u00fchlmann. Two optimal strategies for active learning of causal networks\nfrom interventional data. In Proceedings of Sixth European Workshop on Probabilistic Graphical\nModels, 2012.\n\n[13] Christina Heinze-Deml, Marloes H. Maathuis, and Nicolai Meinshausen. Causal structure\nlearning. Annual Review of Statistics and Its Applications, 2017, To appear.\n\n[14] Patrik O Hoyer, Dominik Janzing, Joris Mooij, Jonas Peters, and Bernhard Sch\u00f6lkopf. Nonlinear\ncausal discovery with additive noise models. 
In Proceedings of NIPS 2008, 2008.\n\n[15] Antti Hyttinen, Frederick Eberhardt, and Patrik Hoyer. Experiment selection for causal discovery.\nJournal of Machine Learning Research, 14:3041\u20133071, 2013.\n\n[16] Antti Hyttinen, Patrik O Hoyer, Frederick Eberhardt, and Matti Jarvisalo. Discovering\ncyclic causal models with latent variables: A general sat-based procedure. arXiv preprint\narXiv:1309.6836, 2013.\n\n[17] Gyula Katona. On separating systems of a finite set. Journal of Combinatorial Theory,\n1(2):174\u2013194, 1966.\n\n[18] Murat Kocaoglu, Alexandros G. Dimakis, and Sriram Vishwanath. Cost-optimal learning of\ncausal graphs. In ICML\u201917, 2017.\n\n[19] Murat Kocaoglu, Alexandros G. Dimakis, Sriram Vishwanath, and Babak Hassibi. Entropic\ncausal inference. In AAAI\u201917, 2017.\n\n[20] Murat Kocaoglu*, Karthikeyan Shanmugam*, and Elias Bareinboim. Experimental design for\nlearning causal graphs with latent variables. Technical Report R-28, AI Lab, Purdue University,\nhttps://www.cs.purdue.edu/homes/eb/r28.pdf, 2017.\n\n[21] Po-Ling Loh and Peter B\u00fchlmann. High-dimensional learning of linear causal networks via\ninverse covariance estimation. Journal of Machine Learning Research, 5:3065\u20133105, 2014.\n\n[22] Sara Magliacane, Tom Claassen, and Joris M Mooij. Joint causal inference on observational\nand experimental datasets. arXiv preprint arXiv:1611.10351, 2016.\n\n[23] Stijn Meganck, Sam Maes, Philippe Leray, and Bernard Manderick. Learning semi-markovian\ncausal models using experiments. In Proceedings of The third European Workshop on Probabilistic\nGraphical Models, PGM 06, 2006.\n\n[24] Pekka Parviainen and Mikko Koivisto. Ancestor relations in the presence of unobserved\nvariables. In Joint European Conference on Machine Learning and Knowledge Discovery in\nDatabases, 2011.\n\n[25] J. Pearl, M. Glymour, and N.P. Jewell. Causal Inference in Statistics: A Primer. Wiley, 2016.\n[26] Judea Pearl. 
Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.\n[27] Jonas Peters and Peter B\u00fchlmann. Identifiability of gaussian structural equation models with\nequal error variances. Biometrika, 101:219\u2013228, 2014.\n\n[28] Jonas Peters, Peter B\u00fchlmann, and Nicolai Meinshausen. Causal inference using invariant\nprediction: identification and confidence intervals. Statistical Methodology, Series B, 78:947\u20131012, 2016.\n\n[29] Bernhard Sch\u00f6lkopf, David W. Hogg, Dun Wang, Daniel Foreman-Mackey, Dominik Janzing,\nCarl-Johann Simon-Gabriel, and Jonas Peters. Removing systematic errors for exoplanet search\nvia latent causes. In Proceedings of the 32nd International Conference on Machine Learning,\n2015.\n\n[30] Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sriram Vishwanath. Learning\ncausal graphs with small interventions. In NIPS 2015, 2015.\n\n[31] S Shimizu, P. O Hoyer, A Hyvarinen, and A. J Kerminen. A linear non-gaussian acyclic model\nfor causal discovery. Journal of Machine Learning Research, 7:2003\u20132030, 2006.\n\n[32] Ricardo Silva, Richard Scheines, Clark Glymour, and Peter Spirtes. Learning the structure of\nlinear latent variable models. Journal of Machine Learning Research, 7:191\u2013246, 2006.\n\n[33] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A\nBradford Book, 2001.\n\n[34] Sofia Triantafillou and Ioannis Tsamardinos. Constraint-based causal discovery from multiple\ninterventions over overlapping variable sets. Journal of Machine Learning Research, 16:2147\u20132205, 2015.\n\n[35] Thomas Verma and Judea Pearl. An algorithm for deciding if a set of observed independencies\nhas a causal explanation. In Proceedings of the Eighth international conference on uncertainty\nin artificial intelligence, 1992.\n\n[36] Jiji Zhang. Causal reasoning with ancestral graphs. J. Mach. Learn. 
Res., 9:1437\u20131474, June 2008.\n\n[37] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of\nlatent confounders and selection bias. Artificial Intelligence, 172(16):1873\u20131896, 2008.\n", "award": [], "sourceid": 3536, "authors": [{"given_name": "Murat", "family_name": "Kocaoglu", "institution": "University of Texas at Austin"}, {"given_name": "Karthikeyan", "family_name": "Shanmugam", "institution": "IBM Research, NY"}, {"given_name": "Elias", "family_name": "Bareinboim", "institution": "Purdue"}]}