{"title": "Sample Efficient Active Learning of Causal Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 14302, "page_last": 14312, "abstract": "We consider the problem of experimental design for learning causal graphs that have a tree structure. We propose an adaptive framework that determines the next intervention based on a Bayesian prior updated with the outcomes of previous experiments, focusing on the setting where observational data is cheap (assumed infinite) and interventional data is expensive.\nWhile information greedy approaches are popular in active learning, we show that in this setting they can be exponentially suboptimal (in the number of interventions required), and instead propose an algorithm that exploits graph structure in the form of a centrality measure.\nIf infinite interventional data is available, we show that the algorithm requires a number of interventions less than or equal to a factor of 2 times the minimum achievable number. We show that the algorithm and the associated theory can be adapted to the setting where each performed intervention yields finitely many samples. 
Several extensions are also presented, to the case where a specified set of nodes cannot be intervened on, to the case where $K$ interventions are scheduled at once, and to the fully adaptive case where each experiment yields only one sample.\nIn the case of finite interventional data, through simulated experiments we show that our algorithms outperform different adaptive baseline algorithms.", "full_text": "Sample Ef\ufb01cient Active Learning of Causal Trees\n\nKristjan Greenewald\n\nIBM Research\n\nMIT-IBM Watson AI Lab\n\nkristjan.h.greenewald@ibm.com\n\nKarthikeyan Shanmugam\n\nIBM Research\n\nMIT-IBM Watson AI Lab\n\nDmitriy Katz\nIBM Research\n\nMIT-IBM Watson AI Lab\ndkatzrog@us.ibm.com\n\nSara Magliacane\n\nIBM Research\n\nMIT-IBM Watson AI Lab\n\nkarthikeyan.shanmugam2@ibm.com\n\nsara.magliacane@ibm.com\n\nMurat Kocaoglu\nIBM Research\n\nMIT-IBM Watson AI Lab\n\nEnric Boix-Adser`a\n\nMIT\n\nMIT-IBM Watson AI Lab\n\nGuy Bresler\n\nMIT\n\nMIT-IBM Watson AI Lab\n\nmurat@ibm.com\n\neboix@mit.edu\n\nguy@mit.edu\n\nAbstract\n\nWe consider the problem of experimental design for learning causal graphs that\nhave a tree structure. We propose an adaptive framework that determines the next\nintervention based on a Bayesian prior updated with the outcomes of previous\nexperiments, focusing on the setting where observational data is cheap (assumed\nin\ufb01nite) and interventional data is expensive. While information greedy approaches\nare popular in active learning, we show that in this setting they can be exponentially\nsuboptimal (in the number of interventions required), and instead propose an\nalgorithm that exploits graph structure in the form of a centrality measure. If each\nintervention yields a very large data sample, we show that the algorithm requires\na number of interventions less than or equal to a factor of 2 times the minimum\nachievable number. 
We show that the algorithm and the associated theory can be adapted to the setting where each performed intervention yields finitely many samples. Several extensions are also presented: to the case where a specified set of nodes cannot be intervened on, to the case where K interventions are scheduled at once, and to the fully adaptive case where each experiment yields only one sample. In the case of finite interventional data, through simulated experiments we show that our algorithms outperform different adaptive baseline algorithms.

1 Introduction
Causal discovery from observational and interventional data is a fundamental problem, prevalent in multiple areas of science and engineering (Pearl, 2009; Spirtes et al., 2000; Peters et al., 2017). Learning the underlying causal mechanisms is essential for policy design. Technological advancements in recent decades have paved the way for the collection of abundant amounts of observational data, i.e., data collected without perturbing the underlying causal mechanisms. However, observational data is generally not sufficient for drawing causal conclusions, and interventional data, i.e., data collected after a perturbation in the system, may be needed. Therefore, many recent methods propose to exploit both observational and interventional data (Triantafillou & Tsamardinos, 2015; Hyttinen et al., 2014; Peters et al., 2016; Zhang et al., 2017; Magliacane et al., 2016).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In the literature, there is growing interest in algorithms for intervention (experimental) design to learn causal graphs (Hyttinen et al., 2013; Shanmugam et al., 2015; Kocaoglu et al., 2017; Lindgren et al., 2018). These algorithms recommend to the practitioner the next experiment to perform, which is a perfect do(X) intervention (Pearl, 2009) on a set of intervention targets X.
Moreover, they provide guarantees that interventional data from these experiments are sufficient to recover the underlying causal graph in the minimum number of experiments. A non-adaptive intervention design is determined a priori, before any of the interventions are performed. We focus on the adaptive intervention design setting, which determines the next experiment after collecting and processing the information gathered from all experiments up until that point.

In many real-world settings, the collection of interventional data is much more difficult and costly than that of its observational counterpart. For example, in many medical settings there is plenty of observational clinical data, while randomized controlled trials are expensive to organize. Therefore it is generally desirable in practice to use as few interventional samples as possible. However, most existing work assumes a perfect conditional independence oracle, which is only realistic when a very large number of samples is available from each experiment. In this work, we focus on removing this constraint: we assume that infinitely many observational samples are available, while only finitely many samples for each intervention can be obtained.

In this paper we assume causal sufficiency, i.e., the absence of latent confounders, and no selection bias. Causal inference using observational data in this setting has been extensively studied in the literature. As an example, in the PC algorithm (Spirtes et al., 2000), causal structure is recovered from conditional independence tests using the rules described by Meek (1995), which are provably complete, i.e., they recover all causal relations that can be identified from the data. The identifiable causal directions are represented as directed edges in the essential graph, while the non-identifiable directions are represented as undirected edges.
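To make the role of v-structures concrete, the following minimal sketch (ours, not from the paper; the example graphs are invented for illustration) checks a directed graph for unshielded colliders. In a directed tree with all edges pointing away from a single root, no node has two parents, so no v-structures exist and every edge of the essential graph stays undirected.

```python
# A v-structure (unshielded collider) is a -> c <- b where a and b are
# not adjacent. Graphs are given as {child: [list of parents]}; any node
# absent from the dict has no parents. Illustrative sketch only.

def unshielded_colliders(parents):
    """Return all v-structures (a, c, b) in the graph."""
    out = []
    for c, ps in parents.items():
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                a, b = ps[i], ps[j]
                # unshielded: the two parents are not adjacent to each other
                if a not in parents.get(b, []) and b not in parents.get(a, []):
                    out.append((a, c, b))
    return out

# A tree oriented away from root 0 (0 -> 1, 0 -> 2, 1 -> 3): no colliders.
tree = {1: [0], 2: [0], 3: [1]}
assert unshielded_colliders(tree) == []

# By contrast, 0 -> 2 <- 1 with 0 and 1 non-adjacent is a v-structure,
# whose orientation is identifiable from observational data alone.
collider = {2: [0, 1]}
assert unshielded_colliders(collider) == [(0, 2, 1)]
```

This is why, on a collider-free tree, observational data alone leaves every edge undirected.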
It can be shown that each undirected component in the essential graph does not give information about the other undirected components, and therefore can be learned separately.

Causal Forest Assumption: In this work, we assume that each of the undirected components of the essential graph is a tree (in general, the essential graph has chordal undirected components), i.e., they form a forest. Under this assumption, the graph can be decomposed into a set of undirected trees in which there are no unshielded colliders. Our assumption is satisfied when the original graph is bipartite, since chordal components of bipartite graphs are forests. Examples of bipartite causal graphs occur in systems biology networks, e.g., gene-protein networks where genes cause protein expressions and expressed proteins block or activate other genes (Kontou et al., 2016). Another motivation for focusing on learning orientations in the undirected tree components is that it gives insights for the general case where the undirected components of the essential graph are chordal, as chordal graphs are trees of cliques. In the remainder of the paper, we design algorithms for orienting each of these tree components individually, since their orientations are not informative of each other and must be determined in sequence.

We consider a Bayesian approach where we assume a prior distribution on the set of all possible causal graphs on a given undirected tree. The problem can be described as follows: given an undirected tree that does not contain any unshielded colliders, design experiments adaptively to learn the underlying causal graph with the minimum expected number of interventions. In this context, the expectation is with respect to a given prior distribution over all causal graphs with the given tree as their essential graph.
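Since each causal graph on a collider-free tree corresponds to a choice of root (as formalized in Section 2), the Bayesian state can be kept as a distribution over root candidates. The sketch below (ours; helper names are invented, and the tree is the six-node example of Figure 1) shows the noiseless update: an intervention on a node reveals which branch around it contains the root, collapsing the posterior onto that branch.

```python
# Noiseless Bayesian update for the root posterior on a tree.
# Tree given as an undirected adjacency dict; an intervention on node v
# reveals the branch around v containing the root (or that v is the root).

def branches(adj, v):
    """Partition nodes into {v} itself and one branch per neighbor of v."""
    parts = {v: {v}}
    for u in adj[v]:
        stack, seen = [u], {v, u}
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        parts[u] = seen - {v}
    return parts

def collapse_posterior(prior, adj, v, true_root):
    """Zero out mass outside the revealed branch, then renormalize."""
    for part in branches(adj, v).values():
        if true_root in part:
            mass = sum(prior[i] for i in part)
            return {i: (prior[i] / mass if i in part else 0.0) for i in prior}

# Tree from Figure 1: edges X2-X1, X1-X3, X2-X4, X2-X5, X5-X6.
adj = {1: [2, 3], 2: [1, 4, 5], 3: [1], 4: [2], 5: [2, 6], 6: [5]}
prior = {i: 1 / 6 for i in adj}          # uniform prior over candidate roots
post = collapse_posterior(prior, adj, 2, true_root=3)
# All mass now sits on the branch {X1, X3} that contains the root.
```

Under finitely many interventional samples, this hard collapse is replaced by the soft likelihood-ratio update of Lemma 1.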
We propose an efficient algorithm for discovering the underlying causal structure that does not require a perfect conditional independence oracle to process interventional data. To illustrate the soundness of our approach, we first assume that infinite observational and interventional data are available and show that the average number of experiments required by our algorithm is within a multiplicative factor of 2 of the optimal algorithm. Extensions are then given to the case where some nodes cannot be intervened on, and to the case where K nodes at a time are requested by the experimenter. We consider two adaptations of this theoretical result to the finite sample case, both based on obtaining a specified number of samples per intervention: (1) a simple union-bound based approach that samples each intervention until the confidence is sufficiently high to apply the noiseless algorithm, and (2) an approach inspired by the results of Emamjomeh-Zadeh et al. (2016) that obtains a small number of samples per intervention and maintains a Bayesian posterior of the root location. This last result requires O(log(n/δ)/ε²) total interventional samples in expectation, in contrast to the result of Emamjomeh-Zadeh et al. (2016) for the structured noisy search problem, which has an additional O(log(d)) dependence for degree d.

1.1 Related Work
Experimental Design: Hyttinen et al. (2013); Eberhardt (2007); Eberhardt et al. (2005) show that in the worst-case scenario O(log(n)) experiments are necessary and sufficient to recover the causal graph, even when the algorithm is adaptive (Shanmugam et al., 2015). Hu et al. (2014) show that O(log log(n)) randomized experiments are sufficient to learn the graph with high probability. Shanmugam et al. (2015) also propose an adaptive algorithm for the setting where at most k nodes can be randomized. Kocaoglu et al.
(2017) consider the problem of minimum-cost intervention design, where each node is associated with an intervention cost, and propose a greedy algorithm which gives a (2 + ε)-approximation (Lindgren et al., 2018). Ghassami et al. (2017b) studied the problem of learning the maximum number of edges for a given number of size-1 interventions. Except for Shanmugam et al. (2015), all of these works operate in the non-adaptive (offline) setting, where experiments are designed before collecting interventional data. Moreover, all assume the existence of a perfect CI oracle after every intervention, which in general requires infinite experimental data. Recently, Agrawal et al. (2019) introduced an experimental selection algorithm for learning a specific target function of the causal graph with a budget on the number of samples, i.e., in the fixed budget regime, whereas we work in the fixed confidence regime (Jamieson et al., 2014). Also recently, Ghassami et al. (2017a) proposed a non-adaptive intervention design to learn as much as possible about the underlying causal graph using at most M experiments. A routine in their algorithm chooses a central node to intervene on. While we also use the concept of a central node, our learning algorithm and analysis are fully adaptive.

Search on Structured Data: When the essential graph is a tree, learning the causal graph becomes equivalent to a structured search problem, since it reduces to identifying the root node. Onak & Parys (2006) consider searching on trees where a query on a node outputs whether the queried node is the marked node and, if not, outputs the branch on which the marked node lies; under infinite interventional data this would be exactly our setup. However, they focus on minimizing the worst case performance, while we focus on the average case. Jacobs et al.
(2010) consider the edge query model on trees: an edge, after being queried, yields the direction in which the marked node lies. Their objective is to minimize the average case number of queries relative to an arbitrary known prior. Although several other variants of the search problem exist (Dereniowski et al., 2017; Emamjomeh-Zadeh et al., 2016; Dereniowski et al., 2018; Cicalese et al., 2010, 2014), even with extensions to the noisy case (Dereniowski et al., 2018), as far as we are aware searching with vertex queries to minimize average case performance has not been studied before. See Dereniowski et al. (2017) for an overview of the literature.

2 Problem Statement
We assume that we have a collection of real-valued (possibly discrete) variables X = {X1, X2, ..., Xn}, and have access to enough observational data to have determined the joint distribution p(X1, X2, ..., Xn) over the n variables. We further assume that the Causal Markov and faithfulness assumptions (Spirtes et al., 2000) hold, implying a one-to-one correspondence between d-separations and conditional independences. Combined with the assumption that the underlying graph is a tree, this implies that we have access to the correct undirected version of the causal graph. Our goal is to learn a causal model over X1, X2, ..., Xn, where, with a slight abuse of notation, we use Xi to refer both to the random variable and to the associated node in the causal graph. Assume that there are no v-structures in the causal graph (otherwise the graph can be decomposed into smaller subgraphs which are non-informative about each other). The possible edge orientations following this constraint correspond to directed graphs G_r with root node R = r and all edges oriented away from r. Let G = {G_r : r ∈ [n]}. In other words, the causal model can be specified completely by the identity of the root node r.

In what follows, we use the following graph-related notation.
For any node Xi, let N_G(i) be the set of neighbors of Xi in the tree G; e.g., in Figure 1, N_G(2) = {X1, X4, X5}. For a node Xi and its neighbor Y ∈ N_G(i), we write B_G^{Xi:Y} to denote the set of nodes that can be reached from Y when the edge between Xi and Y is cut from the graph. Note that node Y is included in B_G^{Xi:Y}. We also define B_G^{Xi:Xi} = {Xi}. As an example, in Figure 1 the branches connected to X2 are B_G^{X2:X1} = {X1, X3}, B_G^{X2:X4} = {X4}, and B_G^{X2:X5} = {X5, X6}. We write the cardinality of a graph G as |G|.

Figure 1: Graph notation example. [A six-node tree with edges X2-X1, X1-X3, X2-X4, X2-X5, X5-X6.]

Our focus in this paper is to apply active learning approaches to adaptively and sequentially choose a series of interventions to best determine the causal graph (from among the set G). For this paper, we assume interventions are single target perfect interventions, i.e., they take the form of the experimenter setting the value of some chosen node Xi. At each time t the algorithm chooses to intervene at node i_t. It observes a sample X(t) ~ P(·|do(X_{i_t} = x_{i_t})) for some x_{i_t}. The algorithm runtime until the root r (and the corresponding causal model G_r) is identified with some desired confidence 1 − δ could be random or deterministic.

Given an interventional sample at node Xi, the posterior update contained therein can be computed via the following lemma, which is proved in Appendix B. The time index is omitted for simplicity.

Lemma 1.
Given an interventional sample x from P(X|do(Xi = 1)), collected after we intervened on Xi by setting it to x = 1,¹ the posterior update for the probability that the root is in the branch B_G^{Xi:Y}, for all Y ∈ N_G(i) ∪ {Xi}, is given by

P(R = X_a | X = x, do(Xi = 1)) ∝ P(R = X_a) · P(Y = y | Xi = 1) / P(Y = y),  if Y ∈ N_G(i),
P(R = X_a | X = x, do(Xi = 1)) ∝ P(R = X_a),  if Y = Xi,

for all X_a ∈ B_G^{Xi:Y}, where the proportionality constant does not depend on Y and y is the observed value of Y.

This result implies that the only relevant interventional values are those of the neighbors N_G(i) of the intervened node Xi. This is a critical observation that informs the development of our approach.

We will also consider the simpler setting where, given the choice of a node Xi on which to intervene, the experimenter returns a large number of interventional samples on that node (assumed to be infinite). In this case, based on Lemma 1, an intervention acts to collapse the posterior distribution onto either Xi (if it is the true root) or one of the adjacent branches B_G^{Xi:Y} for some neighbor Y ∈ N_G(i) of Xi. We call this setting the "noiseless" setting, and use it as a starting point for the development of approaches for the more general setting.

2.1 Suboptimality of naïve algorithms
Nonadaptive (without active learning feedback). In a non-iterative setting where the outcomes of the interventions are not observed until all experiments are complete, any algorithm that wishes to find the root node must take Ω(n/(d + 1)) interventions in expectation (under a uniform prior), where d is the largest degree in the graph. This follows since each intervention only provides information about the d + 1 possible directions of the root from the intervened node. For bounded d, this is exponentially (in the number of interventions) worse than our bound for our adaptive central node algorithm.

Information greedy algorithm is exponentially suboptimal. An information greedy algorithm is
An information greedy algorithm is\none that intervenes at the node Xi that in expectation reduces the entropy of the posterior on R the\nmost. Several works have proposed this approach, including Ness et al. (2017) who applied it to the\nintervention design setting. While attractive from an intuitive standpoint, this counterexample shows\nit can be exponentially suboptimal. Consider the graph in Figure 2 for parameter K = 3. Construct a\n3-ary tree (each non-leaf node has degree 3) of minimimum depth with K leaf nodes `i, i = 1, . . . , K.\nAt each of these leaf nodes, draw edges to de4Ke new nodes (where d\u00b7e denotes rounding up to the\nnearest integer), which become the new leaf nodes of the \ufb01nished graph. Suppose that the true causal\ngraph corresponds to this skeleton, with the directions of the edges emanating away from a root node.\n\n1do(X = 1) is chosen for notational simplicity, as long as there is any value a for which do(X = a) affects\n\nthe effect variables, we can \ufb01nd it from the observational data and the theory will still hold.\n\n4\n\n\f`1\n\n. . .\n\n`2\n\n. . .\n\n`3\n\n. . .\n\nde4Ke nodes\n\nde4Ke nodes\n\nde4Ke nodes\n\nFigure 2: Counterexample for information greedy algorithm, shown for K = 3. The optimal algo-\nrithm can \ufb01nd the root in dlog2 Ke + 1 interventions by a top-down approach, while the information\ngreedy algorithm intervenes on the `k (k = 1, . . . , K) nodes, taking at least K/2 steps in expectation.\n\nSuppose further that the unknown root note has a uniform prior distribution over all nodes in the\ngraph. Let n be the number of nodes in this example, observe that n \uf8ff (K + 1)de4Ke.\nConsidering the \u201cnoiseless\u201d setting where each intervention is performed in\ufb01nitely many times, it is\nclear that the optimal algorithm can identify this root node in at most dlog2 Ke + 1 interventions. To\nsee this, observe that one can \ufb01rst intervene at the center node of the graph. 
This gives the direction\nof the root from this node, so intervene at the adjacent node in that direction. Repeat this process until\nthe root node is identi\ufb01ed. Since the depth of the tree is dlog2 Ke + 2, this algorithm is guaranteed to\n\ufb01nd the root node by dlog2 Ke + 1 interventions.\nFor the information greedy algorithm, we have the following bound proved in Appendix D:\nProposition 1. For the above counterexample, the information greedy algorithm will choose the\nnodes `i before any others, hence it takes a number of interventions with expected value at least K/2.\nThis implies that the information greedy algorithm is exponentially suboptimal in this scenario with\nrespect to the number of interventions required. Note that information greedy will also be at least\nnearly exponentially suboptimal in the noisy setting since the above noiseless-optimal algorithm can\nbe extended to the noisy case via repeating interventions at each node a number of times logarithmic\nin the size of the graph to recover the graph with high probability.\n3 Central node algorithm and variants\nConsider the following algorithm. At time t, there is a prior distribution pt(R = r) over the nodes of\nthe tree which is the posterior probability each node is the root given the intervention history up to but\nnot including time t. The posterior distribution at time t is formed by updating the prior pt(R = r)\nwith the observed data X = x to form qt(R = r) = P(R = r|X = x). Note that the posterior at\ntime t becomes the prior at time t + 1, i.e. pt+1(R = r) = qt(R = r). We call a node vc a central\nnode if it divides the tree into a set of undirected trees, each with total posterior probability less than\n2. Speci\ufb01cally, we have the following de\ufb01nition:\n1\nDe\ufb01nition 1. 
A central node v_c of a tree G with respect to a distribution q over the nodes is one for which max_{j ∈ N(v_c)} q(B_G^{v_c:Xj}) ≤ 1/2. At least one such v_c is guaranteed to exist (Jordan, 1869).

Algorithm 4 in Appendix A gives a simple algorithm for finding such a central node with runtime linear in n.

We next propose the following central node algorithm for discovering the root node, given in Algorithm 1. At each time t, it intervenes on a central node with respect to the current prior and updates the prior using data from this intervention in accordance with the update in Lemma 1. While itself a deterministic procedure, Algorithm 1 is adaptive to the outputs of the interventions and hence produces a stochastic sequence of interventions. Note that intervening on a leaf (e.g., if q(i) > 1/2 for some leaf node i) is never optimal: if the high-probability node is a leaf, one can simply intervene at its (unique, by definition) neighbor and strictly improve the algorithm. We omit this special case from the algorithm and analysis for simplicity.

3.1 Noiseless setting: Adaptive search on a tree
We first consider the simplest case, for which we show that the central node algorithm is within a factor of 2 of the optimal. In this setting, we define the optimal algorithm to be the one that requires the smallest number of interventions, in expectation, to identify the true root. Recalling the "noiseless" setting, we start with a tree G0 such that an intervention on any node Xi provides the direction u ∈ {Xi} ∪ N_{G0}(i) in which the root node lies; in other words, the true root r0 ∈ B_{G0}^{Xi:u}.

Algorithm 1 Central Node Algorithm
input Observational tree G0.
Confidence parameter δ.
1: t ← 0.
2: q_0(i) ← 1/n, for all i = 1, ..., n.
3: while max_i q_t(i) ≤ 1 − δ do
4:   t ← t + 1.
5:   Identify the central node index v_c(t) of G with respect to q_{t−1} (Algorithm 4).
6:   Intervene on node v_c(t) and observe x_1, ..., x_n.
7:   Update the posterior distribution q_t as given in Lemma 1.
8: end while
output argmax_i q_t(i) as the estimated root node of G0.

Considering this noiseless setting allows us to examine the problem in its most basic "search on a tree" form. In subsequent sections we will reintroduce various sources of uncertainty and provide strategies for handling them. Note that in this setting, having a uniform prior p_0(i) = 1/n yields a uniform posterior over G(t) at time t; hence we can compute the central nodes under a uniform distribution (e.g., q(i) = 1/|G(t)|). The extension to non-uniform priors is straightforward but omitted for readability. The resulting central node algorithm is shown in Algorithm 2.

Algorithm 2 Central Node Algorithm (noiseless)
input Observational tree skeleton G0.
1: t ← 0, G(0) ← G0.
2: while G(t) contains more than one node do
3:   t ← t + 1.
4:   Find a central node v_c(t) of G(t − 1) under the current posterior distribution (Algorithm 4).
5:   Intervene on v_c(t) and observe the direction of the root node u_t ∈ {v_c(t)} ∪ N_{G(t−1)}(v_c(t)).
6:   Set G(t) ← B_{G(t−1)}^{v_c(t):u_t}.
7: end while
output Node remaining in G(t) as the root node r0 of G0.

By the definition of a central node, we can show that Algorithm 2 converges exponentially. The question remains as to how close this rate is to that of the optimal algorithm. We prove the following theorem in Appendix E.

Theorem 1. Let G0 be an undirected tree for the causal discovery problem. Consider a sequence of interventions determined by Algorithm 2. Define the running time (total number of interventions) of this algorithm to be T_CN interventions. Let the running time of an optimal (in terms of number of interventions) algorithm that finds r0 be T_opt.
Then

T_CN ≤ ⌈log2 |G0|⌉, and moreover, E[T_CN] ≤ 2 E[T_opt],

where the expectation is with respect to the prior distribution p_0(i) = 1/n over i = 1, ..., n.

Remark 1 (Finite sample extension). While Algorithm 2 is written assuming that the interventions provide noiseless information (the infinite sample case), it is simple to extend to the finite sample case. Specifically, if it is desired to find the correct root node with probability 1 − δ for some δ, by the union bound over all n nodes it is sufficient to repeat the intervention on v_c(t) enough times that the probability that the returned u_t is correct exceeds 1 − δ/n. It can be shown that the number of repeated interventions required to achieve this threshold is O(log(n/δ)). The simplicity of this finite-sample extension stands in contrast to CI-testing based methods.

This algorithm also enjoys the practical advantage of performing batches of interventions on a single node before moving on to the next one. This limits the number of distinct nodes on which interventions need to be run, and opens the door to running multiple interventions in parallel.

3.2 Designing K adaptive interventions per cycle
Consider the case where it is ideal for the experimenter to perform K interventions in sequence before returning to the active learning algorithm for another set of K interventions to run. In this setting, we extend the concept of a central node to a set of K-central nodes that divide the graph into pieces with mass no more than 1/(K + 1) each:

Definition 2 (K-central nodes). A set of up to K nodes v_c^1, ..., v_c^K of a tree G with respect to a distribution q over the nodes is a set of nodes for which

max_{{j_k ∈ N(v_c^k)}_{k=1}^K} q( ∩_{k=1}^K B_G^{v_c^k : X_{j_k}} ) ≤ 1/(K + 1).

Similar to the central node, this set of nodes is guaranteed to exist and can be constructed in a similar fashion.
Using this concept, we propose Algorithm 3, where at each step we intervene on the K-central nodes and update as in Algorithm 2.

Algorithm 3 K-Central Node Algorithm (noiseless)
input Observational tree skeleton G0.
1: t ← 0, G(0) ← G0.
2: while G(t) contains more than one node do
3:   t ← t + 1.
4:   Find a set of K-central nodes {v_c^k(t)} of G(t − 1) under the uniform distribution.
5:   Intervene on each of the v_c^k(t) in sequence and for each observe the direction of the root node u_t^k ∈ {v_c^k(t)} ∪ N_{G(t−1)}(v_c^k(t)).
6:   Set G(t) ← ∩_{k=1}^K B_{G(t−1)}^{v_c^k(t):u_t^k}.
7: end while
output Node remaining in G(t) as the root node r0 of G0.

By the definition of K-central nodes, as before we immediately have that under the uniform prior Algorithm 3 converges exponentially. We also have the following theorem, proven in Appendix F.

Theorem 2. Let G0 be the tree skeleton for the causal discovery problem. Consider a sequence of interventions determined by Algorithm 3. Define the running time (total number of interventions) of this algorithm as T_CN interventions. Let the running time of an optimal algorithm (that also performs K interventions at each time) that finds r0 be T_opt. Then

T_CN ≤ ⌈log_{K+1} |G0|⌉, and moreover, E[T_CN] ≤ (18/7) E[T_opt],

where the expectation is with respect to a uniform prior distribution p_0(i) = 1/n over i = 1, ..., n.

3.3 Central node algorithm under node intervention restrictions
In many real-world applications, some subset P of the nodes in G0 cannot be intervened on (e.g., due to experimental limitations or cost). In Algorithm 5 (Appendix G), we extend the central node algorithm to this setting, modifying it to choose the best unrestricted node whenever the central node is restricted. We have the following theorem, proved in Appendix G.

Theorem 3.
Let G0 be an undirected tree for the causal discovery problem. Let P ⊂ G0 be a subset of nodes that are restricted from intervention. Assume that the probability that the root node is in G0 \ P (where \ denotes set difference) is uniformly distributed. Define the running time (total number of interventions) of Algorithm 5 to be T_CN interventions. Let the running time of an optimal (in terms of number of interventions) algorithm that finds r0 be T_opt. Then

E[T_CN] ≤ 3 E[T_opt],

where the expectation is with respect to a uniform prior distribution p_0(i) = 1/n over i = 1, ..., n.

3.4 Central node algorithm for noisy observations
We now analyze the case in which an intervention on Xi no longer gives noiseless information. This is a setting that arises in many applications. For simplicity, we restrict our discussion to the case in which the Xi are binary variables, although our techniques may be applied to more general settings as well. Note that if an edge from one node to another is too weak, then no learning can occur. Hence we require the following condition on the noise:

Condition 1 (Bounded edge strength). We say that the edge strength of a tree G is lower bounded by ε > 0 if the following holds: for any nodes i, j adjacent in the graph such that i causes j, we have

|P(Xj = 1 | do(Xi = 1)) − P(Xj = 1)| > ε.

Under the bounded edge strength condition, we have the following proposition indicating that repeating an intervention do(Xi = 1) a constant number of times is sufficient to obtain good estimators of whether each branch around Xi contains the root:

Proposition 2. Under Condition 1, let δ0 > 0 be a desired soundness. Suppose Xi has neighbors u_1, ..., u_d ∈ N_G(Xi). Then with ⌈2 log(1/δ0)/ε²⌉ = O(log(1/δ0)/ε²) samples from do(Xi = 1), we may output estimators â_{Xi:u_1}, ...
, â_{Xi:u_d} ∈ {0, 1} such that for each j ∈ [d],

P(â_{Xi:u_j} = 1 | R ∈ B_G^{Xi:u_j}) ≥ 1 − δ0,    P(â_{Xi:u_j} = 0 | R ∉ B_G^{Xi:u_j}) ≥ 1 − δ0;

in other words, the estimators â_{Xi:u_j} successfully identify whether or not the root R is in branch B_G^{Xi:u_j} with the desired soundness.

Using this fact, we propose Algorithm 8 (in Appendix H), which slightly modifies the central node algorithm. For each intervention, we collect ⌈2 log(1/δ0)/ε²⌉ samples and update the posterior. We then add the following step: if the posterior probability of a node is close to 1, we flag it for inspection and temporarily remove it from consideration. Intuitively, this last step ensures that the current central node never has too large a weight, so that for each intervention it is possible to lower-bound the total probability mass that is not contained in the branch containing the root, implying that each intervention always prunes away enough of the remaining probability mass. We then have the following theorem, proved in Appendix H.

Theorem 4. Under Condition 1, Algorithm 8 takes O(log(n/δ)/ε²) steps in expectation, and returns the true root node with probability at least 1 − δ.

Note that if we were to directly apply the existing noisy graph search algorithms of Emamjomeh-Zadeh et al. (2016) to our model, then when applying Proposition 2 we would have to take our soundness parameter to be δ0 ≤ 1/Δ, where Δ is the maximum degree of the tree, and therefore our run-time guarantee would scale as log(Δ) log(n/δ)/ε². Instead, our runtime does not depend on the degree of the tree. Furthermore, the rate is directly comparable to that found for the noiseless case (Theorem 1), up to the incorporation of the ε and δ parameters that control the uncertainty.

4 Empirical results
We consider several experimental settings, and for each setting we simulate 200 random trees of n
Here we present a subset of the results; more are described in Appendix I. We generate an undirected tree with three different strategies: a) sampling uniformly from the space of undirected trees, b) generating power-law trees, and c) generating high-degree d = n/2 random graphs and then creating an undirected version of the BFS tree. In this section we show only results for strategy a), but similar conclusions apply to the other strategies, given in Appendix I. Once we have a tree, we pick the root node uniformly at random. In this section we focus on binary random variables, where each variable is a function of its parent: if X_{Pa(i)} = 0, then Xi ∼ Bern(ε), else Xi ∼ Bern(1 − ε), where for each variable we sample ε uniformly from a range of the form [c, 0.5 − c] (the epsilon ranges reported in the figures). The root node is distributed as Xr ∼ Bern(0.5). We show similar results with discrete variables in Appendix I.

Figure 3 shows the average number of interventions required to find the root node for three finite sample algorithms, all using our posterior update (Appendix B): a baseline algorithm that intervenes on a node randomly selected using the probability of being root in the current prior, the information greedy algorithm, implemented following the sampling strategy presented in Appendix C.2 with N = 50, and our central node algorithm presented in Algorithm 1. In the n = 1000 case, the information greedy algorithm was too computationally intensive and was therefore omitted. Figure 4 shows the performance of the K-central node algorithm for varying K.
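The binary data-generating process described above can be sketched as follows. This is a minimal illustration; the `children`/`eps_of` representation and the function name are our own, and in the experiments each ε is drawn once per variable from the stated range.

```python
import random
from collections import deque

def sample_tree(children, root, eps_of, rng):
    """Draw one observational sample from the binary causal tree model.

    children: dict mapping each node to its list of children (tree rooted
    at `root`); eps_of: dict mapping each non-root node i to its flip
    probability eps_i in (0, 0.5).  The root is Bern(0.5); a child is
    Bern(eps_i) when its parent is 0 and Bern(1 - eps_i) when its parent
    is 1, i.e. it copies its parent with probability 1 - eps_i.
    """
    x = {root: int(rng.random() < 0.5)}
    queue = deque([root])  # BFS order guarantees parents are sampled first
    while queue:
        node = queue.popleft()
        for c in children.get(node, []):
            p_one = 1 - eps_of[c] if x[node] == 1 else eps_of[c]
            x[c] = int(rng.random() < p_one)
            queue.append(c)
    return x
```

Interventional samples from do(Xi = 1) are obtained the same way, except that Xi is clamped to 1 and only its descendants are resampled.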
Figure 5 shows the behavior of the finite sample extension of Algorithm 2 for n = 50 and different ranges of ε, when we vary the number of interventional samples collected for each intervention from 1 (Algorithm 1) to 50. As expected, the behavior of the central node algorithm improves smoothly with the number of interventional samples, quickly converging to the performance of the noiseless Algorithm 2.

Figure 3: Average number of interventions for finite sample algorithms for varying ranges of ε, n = 30 (left) and n = 1000 (right).

Figure 4: Average number of interventions for the K-central node algorithm for K = 1, 2, 3, 5 for n = 50 and varying ranges of ε.

Figure 5: Finite sample extension of central node, varying number of interventional samples, n = 50; the curves represent different ε ranges.

5 Conclusion
In this paper we proposed an active learning framework designed to reduce the number of interventions required to identify the directed tree specifying the causal model for a given set of variables. We presented algorithms for active learning when the observational data admits a tree model and proved that they can identify the causal graph with a number of interventions within a constant factor of the (unknown) optimal procedure. As future work, we plan to extend our active learning approach and algorithms to more complex graph structures, such as chordal graphs. We also plan to extend our approach to the case where it is desirable to optimize the accuracy of regressions and/or predictions computed according to the model, instead of simply the correctness of the learned structure.

References
Agrawal, R., Squires, C., Yang, K., Shanmugam, K., and Uhler, C. ABCD-strategy: Budgeted experimental design for targeted causal structure discovery. AISTATS, 2019.

Ben-Or, M. and Hassidim, A. The Bayesian learner is optimal for noisy binary search (and pretty good for quantum as well). In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pp. 221–230.
IEEE, 2008.

Cicalese, F., Jacobs, T., Laber, E., and Molinaro, M. On greedy algorithms for decision trees. In International Symposium on Algorithms and Computation, pp. 206–217. Springer, 2010.

Cicalese, F., Jacobs, T., Laber, E., and Molinaro, M. Improved approximation algorithms for the average-case tree searching problem. Algorithmica, 68(4):1045–1074, 2014.

Dereniowski, D., Kosowski, A., Uznański, P., and Zou, M. Approximation strategies for generalized binary search in weighted trees. arXiv preprint arXiv:1702.08207, 2017.

Dereniowski, D., Graf, D., Tiegel, S., and Uznański, P. A framework for searching in graphs in the presence of errors. arXiv preprint arXiv:1804.02075, 2018.

Eberhardt, F. Causation and Intervention. Ph.D. thesis, 2007.

Eberhardt, F., Glymour, C., and Scheines, R. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI), pp. 178–184, 2005.

Emamjomeh-Zadeh, E., Kempe, D., and Singhal, V. Deterministic and probabilistic binary search in graphs. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 519–532. ACM, 2016.

Ghassami, A., Salehkaleybar, S., and Kiyavash, N. Optimal experiment design for causal discovery from fixed number of experiments. arXiv preprint arXiv:1702.08567, 2017a.

Ghassami, A., Salehkaleybar, S., Kiyavash, N., and Bareinboim, E. Budgeted experiment design for causal structure learning. arXiv preprint arXiv:1709.03625, 2017b.

Hu, H., Li, Z., and Vetta, A. Randomized experimental design for causal graph discovery. In Proceedings of NIPS 2014, Montreal, CA, December 2014.

Hyttinen, A., Eberhardt, F., and Hoyer, P. O. Experiment selection for causal discovery.
The Journal of Machine Learning Research, 14(1):3041–3071, 2013.

Hyttinen, A., Eberhardt, F., and Järvisalo, M. Constraint-based causal discovery: conflict resolution with answer set programming. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 340–349. AUAI Press, 2014.

Jacobs, T., Cicalese, F., Laber, E., and Molinaro, M. On the complexity of searching in trees: average-case minimization. In International Colloquium on Automata, Languages, and Programming, pp. 527–539. Springer, 2010.

Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pp. 423–439, 2014.

Jordan, C. Sur les assemblages de lignes. J. Reine Angew. Math, 70(185):81, 1869.

Kocaoglu, M., Dimakis, A., and Vishwanath, S. Cost-optimal learning of causal graphs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1875–1884. JMLR.org, 2017.

Kontou, P. I., Pavlopoulou, A., Dimou, N. L., Pavlopoulos, G. A., and Bagos, P. G. Network analysis of genes and their association with diseases. Gene, 590(1):68–78, 2016.

Lindgren, E., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. Experimental design for cost-aware learning of causal graphs. In Advances in Neural Information Processing Systems, pp. 5279–5289, 2018.

Magliacane, S., Claassen, T., and Mooij, J. M. Joint causal inference on observational and experimental datasets. CoRR, abs/1611.10351, 2016. URL http://arxiv.org/abs/1611.10351.

Meek, C. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pp. 403–410, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1-55860-385-9. URL http://dl.acm.org/citation.cfm?id=2074158.2074204.

Ness, R.
O., Sachs, K., Mallick, P., and Vitek, O. A Bayesian active learning experimental design for inferring signaling networks. In International Conference on Research in Computational Molecular Biology, pp. 134–156. Springer, 2017.

Onak, K. and Parys, P. Generalization of binary search: Searching in trees and forest-like partial orders. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pp. 379–388. IEEE, 2006.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009. ISBN 052189560X, 9780521895606.

Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016. doi: 10.1111/rssb.12167. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12167.

Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference - Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.

Shanmugam, K., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pp. 3195–3203, 2015.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. MIT press, 2nd edition, 2000.

Triantafillou, S. and Tsamardinos, I. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16:2147–2205, 2015.

Zhang, K., Huang, B., Zhang, J., Glymour, C., and Schölkopf, B. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI: proceedings of the conference, volume 2017, pp. 1347.
NIH Public Access, 2017.