{"title": "Randomized Experimental Design for Causal Graph Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 2339, "page_last": 2347, "abstract": "We examine the number of controlled experiments required to discover a causal graph. Hauser and Buhlmann showed that the number of experiments required is logarithmic in the cardinality of maximum undirected clique in the essential graph. Their lower bounds, however, assume that the experiment designer cannot use randomization in selecting the experiments. We show that significant improvements are possible with the aid of randomization \u2013 in an adversarial (worst-case) setting, the designer can then recover the causal graph using at most O(log log n) experiments in expectation. This bound cannot be improved; we show it is tight for some causal graphs. We then show that in a non-adversarial (average-case) setting, even larger improvements are possible: if the causal graph is chosen uniformly at random under a Erd\u00f6s-R\u00e9nyi model then the expected number of experiments to discover the causal graph is constant. Finally, we present computer simulations to complement our theoretic results. Our work exploits a structural characterization of essential graphs by Andersson et al. Their characterization is based upon a set of orientation forcing operations. Our results show a distinction between which forcing operations are most important in worst-case and average-case settings.", "full_text": "Randomized Experimental Design for Causal Graph\n\nDiscovery\n\nHuining Hu\n\nSchool of Computer Science, McGill University.\n\nhuining.hu@mail.mcgill.ca\n\nZhentao Li\n\nLIENS, \u00b4Ecole Normale Sup\u00b4erieure\n\nzhentao.li@ens.fr\n\nDepartment of Mathematics and Statistics and School of Computer Science, McGill University.\n\nAdrian Vetta\n\nvetta@math.mcgill.ca\n\nAbstract\n\nWe examine the number of controlled experiments required to discover a causal\ngraph. 
Hauser and Buhlmann [1] showed that the number of experiments required\nis logarithmic in the cardinality of maximum undirected clique in the essential\ngraph. Their lower bounds, however, assume that the experiment designer cannot\nuse randomization in selecting the experiments. We show that signi\ufb01cant improve-\nments are possible with the aid of randomization \u2013 in an adversarial (worst-case)\nsetting, the designer can then recover the causal graph using at most O(log log n)\nexperiments in expectation. This bound cannot be improved; we show it is tight\nfor some causal graphs.\nWe then show that in a non-adversarial (average-case) setting, even larger im-\nprovements are possible: if the causal graph is chosen uniformly at random under\na Erd\u00a8os-R\u00b4enyi model then the expected number of experiments to discover the\ncausal graph is constant. Finally, we present computer simulations to complement\nour theoretic results.\nOur work exploits a structural characterization of essential graphs by Andersson\net al. [2]. Their characterization is based upon a set of orientation forcing opera-\ntions. Our results show a distinction between which forcing operations are most\nimportant in worst-case and average-case settings.\n\nIntroduction\n\n1\nWe are given n random variables V = {V1, V2, . . . , Vn} and would like to learn the causal relations\nbetween these variables. Assume the dependencies between the variables can be represented as a\ndirected acyclic graph G = (V, A), known as the causal graph. In seminal work, Sprites, Glymour,\nand Scheines [3] present methods to obtain structural information on G from passive observational\ndata. In general, however, observational data can be used to discover only a part of the causal graph\nG; speci\ufb01cally, observation data will recover the essential graph E(G). To recover the entire causal\ngraph G we may undertake experiments. 
Here, an experiment is a controlled intervention on a subset S of the variables. A controlled intervention allows us to deduce information about which variables S influences.\nThe focus of this paper is to understand how many experiments are required to discover G. This line of research was initiated in a series of works by Eberhardt, Glymour, and Scheines (see [4, 5, 6]). First, they showed [4] that n − 1 experiments suffice when interventions can only be made upon singleton variables. For general experiments, they proved [5] that ⌈log n⌉ experiments are sufficient and, in the worst case, necessary to discover G. Eberhardt [7] then conjectured that ⌈log(ω(G))⌉ experiments are sufficient and, in the worst case, necessary; here ω(G) is the size of a maximum clique in G.1 Hauser and Bühlmann [1] recently proved (a slight strengthening of) this conjecture. The essential mathematical concepts underlying this result can be traced back to work of Cai [8] on “separating systems” [9]; see also Hyttinen et al. [10].\nEberhardt [11] proposed the use of randomization (mixed strategies) in causal graph discovery. He proved that, if the designer is restricted to single-variable interventions, the worst-case expected number of experiments required is Θ(n). Eberhardt [11] considered multi-variable interventions to be “far more complicated” to analyze, but hypothesized that O(log n) experiments may be sufficient, in that setting, in the worst case.\n\n1.1 Our Results\n\nThe purpose of this paper is to show that the lower bounds of [5] and [1] are not insurmountable. In essence, those lower bounds are based upon the causal graph being constructed by a powerful adversary. This adversary must pre-commit to the causal graph in advance but, before doing so, it has access to the entire list of experiments S = {S1, S2, . . .} that the experiment designer will use; here Si ⊆ V for all i. (This adversary also describes the “separating system” model of causal discovery. In Section 2.4 we will explain how this adversary can also be viewed as adaptive. The adversary may be given the list of experiments in order over time, but at time i it needs only commit to the arcs in δ(Si), the set of edges with exactly one end-vertex in Si.)\nOur first result shows that this powerful adversary can be tricked if the experiment designer uses randomization in selecting the experiments. Specifically, suppose the designer selects the experiments {S1, S2, . . .} from a collection of probability distributions P = {P1, P2, . . .}, respectively, where distribution Pi+1 may depend upon the results of experiments 1, 2, . . . , i. Then, even if the adversary has access to the list of probability distributions P before it commits to the causal graph G, the expected number of experiments required to recover G falls significantly. Specifically, if the designer uses randomization then, in the worst case, at most O(log log n) experiments in expectation are required. This result is given in Section 3, after we have presented the necessary background on causal graphs and experiments in Section 2. We also prove a matching lower bound, so this guarantee is tight.\nThis worst-case result immediately extends to the case where the adversary is also allowed to use randomization in selecting the causal graph. Thus, the O(log log n) bound applies to mixed-strategy equilibria in the game framework [11] where multi-variable interventions are allowed.\nOur second result is that even more dramatic improvements are possible if the causal graph is non-adversarial. For a typical causal graph, only a constant number of experiments is required in expectation! 
Specifically, if the directed acyclic graph is random, based upon an underlying Erdős-Rényi model, then O(1) experiments in expectation are required to discover G. We prove this result in Section 4.\nOur work exploits a structural characterization of essential graphs by Andersson et al. [2]. Their characterization is based upon a set of four operations. One operation is based upon acyclicity, the other three are based upon v-shapes. Our results show that the acyclicity operation is most important in improving worst-case bounds, but the v-shape operations are more important for average-case bounds. This conclusion is highlighted by our simulation results in Section 5. These simulations confirm that, by exploiting the v-shape operations, causal graph discovery is extremely quick in the non-adversarial setting. In fact, the constant in the O(1) average-case guarantee may be even better than our theoretical results suggest. Typically, it takes one or two experiments to discover a causal graph on 15000 vertices!\n\n2 Background\n\nSuppose we want to discover an (unknown) directed acyclic graph G = (V, A) and we are given its observational data. Without experimentation, we may not be able to recover all of G from its observational data. But we can deduce a subgraph of it known as the essential graph E(G). In this section, we describe this process and explain how experiments (deterministic or randomized) can then be used to recover the rest of the graph.\n\n1A directed graph is a clique if its underlying undirected graph is an (undirected) clique.\n\nThroughout this paper, we assume the causal graph and data distribution obey the faithfulness assumption and causal sufficiency [3]. The faithfulness assumption ensures that all independence relationships revealed by the data are results of the causal structure and are not due to some coincidental combinations of parameters. Causal sufficiency means there are no latent (that is, hidden) variables. These assumptions are important as they provide a one-to-one mapping between data and causal structure.\n\n2.1 Observational Equivalence\n\nFirst we may discover the skeleton and all the v-structures of G. To explain this, we begin with some definitions. The skeleton of G is the undirected graph on V with an undirected edge (between the same endpoints) for each arc of A. A v-shape in a graph (directed or undirected) is an ordered set (a, b, c) of three distinct vertices with exactly two edges (arcs), both incident to b. The v-structures, sometimes called immoralities [2], are the set of v-shapes (a, b, c) where ab and cb are arcs. Two directed graphs that are indistinguishable by observational data are said to belong to the same Markov equivalence class. Specifically, Verma and Pearl [12] and Frydenberg [13] showed that the skeleton and the set of v-structures determine which equivalence class G belongs to.\nTheorem 2.1. (Observational Equivalence) G and H are in the same Markov equivalence class if and only if they have the same skeletons and the same sets of v-structures.\n\nBecause of this equivalence, we will think of an observational Markov equivalence class as given by the skeleton and the set of (all) v-structures. From the observational data it is straightforward [12] to obtain the basic graph B(G), a mixed graph2 obtained from the skeleton of G by orienting the edges in each v-structure. For example, to test for an edge {i, j}, simply check that there is no d-separator for i and j; to test for a v-structure (i, k, j), simply check that there is no d-separator for i and j that contains k. (These tests are not polynomial time. However, this is not relevant for the question we address in this paper.)\n\n2.2 The Essential Graph\n\nIn fact, from the observational data we may orient more edges than simply those in the basic graph B(G). 
Speci\ufb01cally we can obtain the essential graph E(G). The essential graph is a mixed graph that\nalso includes every edge orientation that is present in every directed acyclic graph that is compatible\nwith the data. That is, an edge is oriented if and only if it has the same orientation in every graph\nin the equivalence class. For example, an edge {a, b} is forced to be oriented as the arc ab for the\nfollowing reasons.\n\n(F1) The arc ab (and the arc cb) is forced if it belongs to a v-structure (a, b, c).\n(F2) There is a v-shape (b, a, c) but it is not a v-structure. Then arc ab is forced if ca is an arc.\n(F3) The arc ab is forced, by acyclicity, if there is already a directed path P from a to b.\n(F4) There is a v-shape (c1, a, c2) but it is not a v-structure. Then the arc ab is forced if there\n\nare directed paths Q1 and Q2 from c1 to b and from c2 to b, respectively.\n\nThe reader can \ufb01nd illustrations of these forcing mechanisms in Figure 2 of the supplemental mate-\nrial. Andersson et al. [2] showed that these are the only ways to force an edge to become oriented.\nIn fact, they characterize essential graphs and show only local versions of (F3) and (F4) are needed\nto obtain the essential graph \u2013 that is, it suf\ufb01ces to assume the path P has two arcs and the paths Q1\nand Q2 have only one arc each.\nLet U(G) be the subgraph induced by the undirected edges of the essential graph E(G). For sim-\nplicity, we will generally just use the notation B,E and U. From the characterization, it can be\nshown that U is a chordal graph.3 We remark that this chordality property is extremely useful in\nquantitatively analyzing the performance of the experiments we design. In particular, the size of the\nmaximum clique and the chromatic number can be computed in linear time.\nCorollary 2.2. [2] The subgraph U is chordal.\n\n2A mixed graph contains oriented edges and unoriented edges. 
To avoid confusion, we refer to oriented edges as arcs.\n\n3A graph H is chordal if every induced cycle in H contains exactly three vertices. That is, every cycle C on at least four vertices has a chord, an edge not in C that connects two vertices of the cycle.\n\n2.3 Experimental Design\n\nSo observational data (the null experiment) will give us the essential graph E. If we perform experiments then we may recover the entire causal graph G and, in a series of works, Eberhardt, Glymour, and Scheines [5, 4, 6] investigated the number of experiments required to achieve this. An experiment is a controlled intervention that forces a distribution, chosen by the designer, on a set S ⊂ V .\nA key fact is that, given the existence of an edge (a, b) in G, an experiment on S can perform a directional test on (a, b) if (a, b) ∈ δ(S) (that is, if exactly one endpoint of the edge is in S); see [5] for more details. Recall that we already know the skeleton of G from the observational data. Thus, we can determine the existence of every edge in G. It then follows that to recover the entire causal graph it suffices that (Ψ): each edge undergoes one directional test. The separating systems method is based on this sufficiency condition (Ψ). Using this condition, it is known that log n experiments suffice [5]. In fact, this bound can be improved to log ω(U), where ω(U) is the size of the maximum clique in the undirected subgraph U of the essential graph E. For completeness we show this result here; see also [8] and [1].\nTheorem 2.3. We can recover G using log ω(U) experiments.\n\nProof. First use the observational data to obtain the skeleton of G. To find the orientation of each edge, take a vertex colouring c : V (U) → {0, 1, . . . , χ(U) − 1}, where χ(U) is the chromatic number of U. We use this colouring to define our experiments. 
Specifically, for the ith experiment, select all vertices whose colour is 1 in the ith bit. That is, select Si = {v : bini(c(v)) = 1}, where bini extracts the ith bit of a number. Now, if vertices u and v are adjacent in U, they receive different colours and consequently their colours differ at some bit j. Thus, in the jth experiment, one of u, v is selected in Sj and the other is not. This gives a directional test for the edge {u, v}. Therefore, from all the experiments we find the orientation of every edge. The result follows from the fact that chordal graphs are perfect (see, for example, [14]).\n\nBut (Ψ) is just a sufficiency condition for recovering the entire causal graph G; it need not be necessary to perform a directional test on every edge. Indeed, we may already know some edge orientations from the essential graph E via the forcing operations (F1), (F2), (F3) and (F4). Furthermore, the experiments we carry out will force some more edge orientations. But then we may again apply the forcing operations (F1)-(F4), incorporating these new arcs, to obtain even more orientations.\nLet S = {S1, S2, . . . , Sk}, where Si ⊆ V for all 1 ≤ i ≤ k, be a collection of experiments. Then the experimental graph is a mixed graph that includes every edge orientation that is present in every directed acyclic graph that is compatible with the data and the experiments S. We denote the experimental graph by E+S(G). Thus the question Eberhardt, Glymour, and Scheines pose is: how many experiments are needed to ensure that E+S(G) = G? As before, we know how to find the experimental graph.\nTheorem 2.4. The experimental graph E+S(G) is obtained by repeatedly applying rules (F1)–(F4) along with the rule:\n(F0) There is an experiment Si ∈ S and an edge (a, b), with a ∈ Si and b /∈ Si. 
Then either the arc ab or the arc ba is forced depending upon the outcome of the experiment.\n\nWe note that the proof uses the fact that arcs forced by (F0) are the union of edges across a set of cuts; without this property, a fourth forcing rule may be needed [15].\nTheorem 2.4 suggests that it may be possible to improve upon the log ω(U) upper bound. Unfortunately, Hauser and Bühlmann [1] show using an adversarial argument that in the worst case there is a matching lower bound, settling a conjecture of Eberhardt [6].\n\n2.4 Randomized Experimental Design\n\nAs discussed in the introduction, the lower bounds of [5] and [1] are generated via a powerful adversary. The adversary must pre-commit to the causal graph in advance but, before doing so, it has access to the entire list of experiments S = {S1, S2, . . .} that the experiment designer will use. For example, assume that the adversary chooses a clique for G and the experiment designer selects a collection of experiments S = {S1, S2, . . .}. Given the knowledge of S then, for worst-case performance, the adversary will direct every edge in δ(S1) from S1 to V \\ S1. The adversary will then direct every edge in δ(S2) (that has yet to be assigned an orientation) from S2 to V \\ S2, etc. It is not difficult to show that the designer will need to implement at least log n of the experiments.\nWe remark that there is an alternative way to view the adversary. It need commit only to the essential graph in advance but otherwise may adaptively commit to the rest of the graph over time. In particular, at time i, after experiment Si is conducted it must commit only to the arcs in δ(Si) and to any induced forcings. This second adversary is clearly weaker than the first, but the lower bounds of [5] and [1] still apply here. Again, though, even this form of adversary appears unnaturally strong in the context of causal graphs. In particular, given the random variables V the causal relations between them are pre-determined. They are already naturally present before the experimentation begins, and thus it seems appropriate to insist that the adversary pre-commit to the graph rather than construct it adaptively.\nRegardless, both of these adversaries can be countered if the designer uses randomization in selecting the experiments. In particular, in randomized experimental design we allow the designer to select the experiments {S1, S2, . . .} from a collection of probability distributions P = {P1, P2, . . .}, respectively, where distribution Pi+1 may depend upon the results of experiments 1, 2, . . . , i. As an example, consider again the case in which the adversary selects a clique. Suppose now that the designer selects the first experiment S1 uniformly at random from the collection of subsets of cardinality n/2. Even given this knowledge, it is less obvious how the adversary should act against such a design. Indeed, in this article we show the usefulness of the randomized approach. It will allow the designer to require only O(log log n) experiments in expectation. This is the case even if the adversary has access to the entire list of probability distributions P before it commits to the causal graph G. We prove this in Section 3. Thus, by Theorem 2.3, we have that min[O(log log n), log ω(U)] experiments are sufficient. We also prove that this bound is tight; there are graphs for which min[O(log log n), log ω(U)] experiments are necessary.\nStill, our new lower bound only applies to causal graphs selected adversarially. For a typical causal graph we can do even better. Specifically, we prove, in Section 4, that for a random causal graph a constant number of experiments is sufficient in expectation. 
Consequently, for a random causal graph the number of experiments required is independent of the number of vertices in the graph! This surprising result is confirmed by our simulations. For various numbers n of vertices, we construct numerous random causal graphs and compute the average and maximum number of experiments needed to discover them. Simulations confirm this number does not increase with n.\nOur results can be viewed in the game-theoretic framework of Eberhardt [11], where the adversary selects a probability distribution (mixed strategy) over causal graphs and the experiment designer chooses a distribution over which experiments to run. In this zero-sum game, the payoff to the designer is the negative of the number of experiments needed. The worst-case setting corresponds to the situation where the adversary can choose any distribution over causal graphs. Thus, our result implies a worst-case −Θ(log log n) bound on the value of a game with multi-variable interventions and no latent variables. Therefore, the ability to randomize turns out to be much more helpful to the designer than to the adversary. Our average-case O(1) bound corresponds to the situation where the adversary in the game is restricted to choose the uniform distribution over causal graphs.\n\n3 Randomized Experimental Design\n\n3.1 Improving the Upper Bound by Exploiting Acyclicity\n\nWe now show randomization significantly reduces the number of experiments required to find the causal graph. To improve upon the log χ(U) bound, recall that (Ψ) is a sufficient but not necessary condition. In fact, we will not need to apply directional tests to every edge. Given some edge orientations we may obtain other orientations for free by acyclicity or by exploiting the characterization of [2]. 
Here we show that the acyclicity forcing operation (F3) on its own provides for significant speed-ups when we allow randomization.\nTheorem 3.1. To orient a clique on t vertices, O(log log t) experiments suffice in expectation.\n\nProof. Let {x1, x2, . . . , xt} be the true acyclic ordering of the clique G. Now take a random experiment S, where each vertex is independently selected in S with probability 1/2. The experiment S partitions the ordering into runs (streaks) – contiguous segments of {x1, x2, . . . , xt} where either every vertex of the segment is in S or every vertex of the segment is in S̄ = V \\ S. Without loss of generality the first run is in S and we denote it by R0. We denote the second run, which is in S̄, by R̄0, the third run by R1, the fourth run by R̄1, etc. A well-known fact (see, for example, [16]) is that, with high probability, the longest run has length Θ(log t).\nTake any pair of vertices u and v. We claim that edge {u, v} can be oriented provided the two vertices are in different runs. To see this, first observe that the experiment will orient any edge between S and S̄. Thus if u ∈ Ri and v ∈ R̄j, or vice versa, then we may orient {u, v}. Assume u ∈ Ri and v ∈ Rj, where i < j. We know {u, v} must be oriented as the arc uv, but how do we conclude this from our experiment? Well, take any vertex w ∈ R̄i. Because G is a clique there are edges {u, w} and {v, w}. But these edges have already been oriented as uw and wv by the experiment. Thus, by acyclicity the arc uv is forced. A similar argument applies for u ∈ R̄i and v ∈ R̄j, where i < j.\nIt follows that the only edges that cannot be oriented lie between vertices within the same run. Each run induces an undirected clique after the experiment, but each such clique has cardinality O(log t) with high probability. We can now independently and simultaneously apply the deterministic method of Theorem 2.3 to orient the edges in each of these cliques using O(log log t) experiments. Hence the entire graph is oriented using 1 + O(log log t) experiments.\n\nWe note that if any high-probability event does not occur, we simply restart with new random variables, at most doubling the number of experiments (and tripling if it happens again, and so on). The expected number of experiments is then the number we get with no restart multiplied by ∑i i·pi, which is bounded by a constant (usually approaching 1 if p is a decreasing function of t).\nTheorem 3.1 applies to cliques. The same guarantee, however, can be obtained for any graph.\nTheorem 3.2. To construct G, O(log log n) experiments suffice in expectation.\n\nProof. Take any graph G with n vertices. Recall, we only need to orient the edges of the chordal graph U. But a chordal graph contains at most n maximal cliques [14] (each of size t ≤ n). Suppose we perform the randomized experiment where each vertex is independently selected in S with probability 1/2, as in Theorem 3.1. Then any vertex of a maximal clique Q is in S with probability 1/2. Thus, this experiment breaks Q into runs all of cardinality at most O(log n) with high probability.4 Since there are only n maximal cliques, applying the union bound gives that every maximal clique in U is broken up into runs of cardinality O(log n) with high probability. Therefore, since every clique is a subgraph of a maximal clique, after a single randomized experiment, the chordal graph U′ formed by the remaining undirected edges has ω = O(log n). We can now independently apply Theorem 2.3 on U′ to orient the remaining edges using O(log log n) experiments.\n\nWe can also iteratively exploit the essential graph characterization [2] but in the worst case we will have no v-structures and so the expected bound above will not be improved. 
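The single random experiment at the heart of Theorem 3.1 is easy to simulate. The sketch below is hypothetical illustration code, not the authors' implementation: it draws S by independent fair coin flips over a clique in true topological order, and returns the runs, the maximal contiguous blocks whose vertices all agree on membership in S. Every edge between two different runs is orientable after the experiment, so only the short within-run cliques survive.

```python
import random

def random_experiment_on_clique(t, seed=None):
    """Simulate one randomized experiment on a t-vertex clique.

    Vertices 0..t-1 are assumed to be listed in the true acyclic order.
    Returns the runs: maximal contiguous blocks with constant membership
    in S. Edges between different runs are oriented by the experiment
    (directly, or via acyclicity as in Theorem 3.1), so only within-run
    edges remain undirected.
    """
    rng = random.Random(seed)
    in_S = [rng.random() < 0.5 for _ in range(t)]
    runs, current = [], [0]
    for v in range(1, t):
        if in_S[v] == in_S[v - 1]:
            current.append(v)
        else:
            runs.append(current)
            current = [v]
    runs.append(current)
    return runs

runs = random_experiment_on_clique(1024, seed=0)
# The surviving undirected cliques are exactly the runs; with high
# probability the longest has length Theta(log t), so a second phase
# using Theorem 2.3 needs only O(log log t) further experiments.
print(len(runs), max(len(r) for r in runs))
```

Running this for t = 1024 typically yields a longest run of around 10, in line with the Θ(log t) bound quoted in the proof.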
Combining Theorem 2.3 and Theorem 3.2 we obtain\nCorollary 3.3. To construct G, min[O(log log n), log ω(U)] experiments suffice in expectation.\n\n3.2 A Matching Lower Bound\n\nThe bound in Corollary 3.3 cannot be improved. In particular, the bound is tight for unions of disjoint cliques. (Due to space constraints, this proof is given in the supplemental materials.)\nLemma 3.4. If G is a union of disjoint cliques, Ω(min[log log n, log ω(U)]) experiments are necessary in expectation to construct G.\n\nObserve that Lemma 3.4 explains why attempting to recursively partition the runs (used in Theorem 3.1) into sub-runs will not improve worst-case performance. Specifically, a recursive procedure may produce a large number of sub-runs and, with high probability, the trick will fail on one of them.\n\n4Specifically, every run will have cardinality at most k · log n with probability at least 1 − 1/n^(k−1).\n\n4 Random Causal Graphs\n\nIn this section, we go beyond worst-case analysis and consider the number of experiments needed to recover a typical causal graph. To do this, however, we must provide a model for generating a “typical” causal graph. For this task, we use the Erdős-Rényi (E-R) random graph model. Under this model, we show that the expected number of experiments required to discover the causal graph is just a constant. We remark that we chose the E-R model because it is the predominant graph sampling model. We do not claim that the E-R model is the most appropriate random model for every causal graph application. However, we believe the main conclusion we draw, that the expected number of experiments to orient a typical graph is very small, applies much more generally. This is because the vast improvement we obtain for our average-case analysis (over worst-case analysis) is derived from the fact that the E-R model produces many v-shapes. 
Since any other realistic random graph model will also produce numerous v-shapes (or small clique number), the number of experiments required should also be small in those models.\nNow, recall that the standard Erdős-Rényi random graph model generates an undirected graph. The model, though, extends naturally to directed, acyclic graphs as well. Specifically, our graphs Cn,p with parameters n and p are chosen according to the following distribution:\n(1) Pick a random permutation σ of n vertices.\n(2) Pick an edge (i, j) (with 1 ≤ i < j ≤ n) independently with probability p.\n(3) If (i, j) is picked, orient it from i to j if σ(i) < σ(j) and from j to i otherwise.\nNote that since each edge was chosen randomly, we obtain the same distribution of causal graphs if we simply fix σ to be the identity permutation. In other words, Cn,p is just a random undirected graph Gn,p in which we’ve directed all edges from lower to higher indexed vertices. Clearly, this graph is then acyclic. The main result in this section is that the expected number of experiments needed to recover the graph is constant. We prove this in the supplemental materials.\nTheorem 4.1. For p ≤ 4/5 we can recover Cn,p using at most log log 13 experiments in expectation.\nWe remark that the probability 4/5 in Theorem 4.1 can easily be replaced by 1 − δ, for any δ > 0. The resulting expected number of experiments is a constant depending upon δ. Note, also, that the result holds even if δ is a function of n tending to zero. Furthermore, we did not attempt to optimize the constant log log 13 in this bound.\nTheorem 4.1 illustrates an important distinction between worst-case and average-case analyses. Specifically, the bad examples for the worst-case setting are based upon clique-like structures. Cliques have no v-shapes, so to improve upon existing results we had to exploit the acyclicity operation (F3). 
In contrast, for the average-case, the proof of Theorem 4.1 exploits the v-structure operation (F1). The simulations in Section 5 reinforce this point: in practice, the operations (F1, F2, F4) are extremely important as v-shapes are likely to arise in typical causal graphs.\n\n5 Simulation Results\n\nIn this section, we describe the simulations we conducted in MATLAB. The results confirm the theoretical upper bounds of Theorem 4.1; indeed the results suggest that the expected number of experiments required may be even smaller than the constant produced in Theorem 4.1. For example, even in graphs with 15000 vertices, the average cardinality of the maximum clique in the simulations is only just over two! This suggests that the full power of the forcing rules (F1)-(F4) has not been completely measured by the theoretical results we presented in Sections 3 and 4.\nFor the simulations, we first generate a random causal graph G in the E-R model. We then calculate the essential graph E(G). To do this we apply the forcing rules (F1)-(F4) from the characterization of [2]. At this point we examine properties of U(G), the undirected subgraph of E(G). We are particularly interested in the maximum clique size in U because this information is sufficient to upper bound the number of experiments that any reasonable algorithm will require to discover G.\nWe remark that, to speed up the simulations, we represent a random graph G by a symmetric adjacency matrix M. Here, if Mi,j = 1 then there is an arc ij if i < j and an arc ji if i > j. The matrix formulation allows the forcing rules (F1)-(F4) to be implemented more quickly than standard approaches. For example, the natural way to apply the forcing rule (F1) is to search explicitly for each v-structure, of which there may be O(n^3). 
Instead we can find every edge contained in a v-structure using matrix multiplication, which is fast under MATLAB.5 The validity of such an approach can be seen by the following theorem, whose proof is left to the supplemental material.
Theorem 5.1. Given the adjacency matrix M of a causal graph, we can find all edges contained in a v-structure via matrix multiplication.
To speed up computation for smaller values of p and large n, we instead used sparse matrices to apply (F1), storing only a list of non-zero entries ordered by row and column and vice versa. Then matrix multiplication could be performed quickly by looking for common entries in two short lists.
We ran simulations for four choices of probability p, specifically p ∈ {0.8, 0.5, 0.1, 0.01}, and for four choices of graph size n, specifically n ∈ {500, 1000, 5000, 15000}. For each combination pair {n, p} we ran 1000 simulations. For each random graph G, once no more forcing rules can be applied we have obtained the essential graph E(G). We then calculate |E(U)| and ω(U). Our results are summarized in Figure 1.

[Figure 1: Experimental results: number of edges and size of the maximum cliques for Cn,p]

Here average/largest refers to the average/largest over all 1000 simulations for that {n, p} combination. Observe that the lines for AVG-E(G) and AVG-E(F1) illustrate Theorem 4.1: there is a dramatic fall in the expected number of undirected edges remaining by just applying the v-structure forcing operation (F1). The AVG-E(U) and MAX-E(U) lines show that the number of edges falls even more when we apply all the forcing operations to obtain U.
More remarkably, the maximum clique size in U is tiny: AVG-ω(U) is just around two or three for all our choices of p ∈ {0.8, 0.5, 0.1, 0.01}. The largest clique size we ever encountered was just nine.
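To illustrate the idea behind Theorem 5.1, here is a hypothetical numpy rendering (our actual implementation was in MATLAB): with all arcs directed from lower to higher index, the arc i → k lies in a v-structure i → k ← j exactly when some in-neighbour j of k is non-adjacent to i, and a single matrix product counts such j for every pair (i, k) at once.

```python
import numpy as np

def v_structure_edges(M):
    """Given the symmetric 0/1 adjacency matrix M of a causal graph whose
    arcs all point from lower to higher index, return a boolean matrix V
    with V[i, k] = True iff the arc i -> k lies in some v-structure
    i -> k <- j with i and j non-adjacent."""
    n = M.shape[0]
    A = np.triu(M, k=1)                    # A[i, k] = 1 iff arc i -> k
    B = 1 - M - np.eye(n, dtype=M.dtype)   # B[i, j] = 1 iff j != i and j, i non-adjacent
    # (B @ A)[i, k] counts vertices j with arc j -> k that are non-adjacent to i.
    return ((B @ A) > 0) & (A == 1)

# The v-shape 0 -> 2 <- 1 (with 0 and 1 non-adjacent) marks both of its arcs.
M = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
V = v_structure_edges(M)
assert V[0, 2] and V[1, 2]
```

One product of two n × n matrices replaces the explicit search over O(n^3) triples; a triangle, by contrast, contains no v-structures, so there the returned matrix is all False.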
Since the number of experiments required is at most logarithmic in the maximum clique size, none of our simulations would ever require more than five experiments to recover the causal graph, and nearly always just one or two sufficed. Thus, the expected clique size (and hence number of experiments) required appears even smaller than the constant 13 produced in Theorem 4.1.
We emphasize that the simulations do not require the use of a specific algorithm, such as the algorithms associated with the proofs of the worst-case bound (Theorem 3.2) and the average-case bound (Theorem 4.1). In particular, the simulations show that the null experiment applied in conjunction with the forcing operations (F1)-(F4) is typically sufficient to discover most of the causal graph. Since the remaining unoriented edges U have small maximum clique size, any reasonable algorithm will then be able to orient the rest of the graph using a constant number of experiments.

Acknowledgement. We would like to thank the anonymous referees for their remarks that helped us improve this paper.

5 In theory, matrix multiplication can be carried out in time O(n^2.38) [17].

References

[1] A. Hauser and P. Bühlmann.
Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939, 2013.

[2] S. Andersson, D. Madigan, and M. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. Annals of Statistics, 25(2):505–541, 1997.

[3] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.

[4] F. Eberhardt, C. Glymour, and R. Scheines. n − 1 experiments suffice to determine the causal relations among n variables. In D. Holmes and L. Jain, editors, Innovations in Machine Learning, volume 194, pages 97–112. Springer-Verlag, 2006.

[5] F. Eberhardt, C. Glymour, and R. Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI), pages 178–184, 2005.

[6] F. Eberhardt. Causation and Intervention. Ph.D. thesis, Carnegie Mellon University, 2007.

[7] F. Eberhardt. Almost optimal sets for causal discovery. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI), pages 161–168, 2008.

[8] M. Cai. On separating systems of graphs. Discrete Mathematics, 49(1):15–20, 1984.

[9] A. Rényi. On random generating elements of a finite Boolean algebra. Acta Sci. Math. Szeged, 22:75–81, 1961.

[10] A. Hyttinen, F. Eberhardt, and P. Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.

[11] F. Eberhardt. Causal discovery as a game. Journal of Machine Learning Research, 6:87–96, 2010.

[12] T. Verma and J. Pearl. Equivalence and synthesis in causal models. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence (UAI), pages 255–268, 1990.

[13] M. Frydenberg.
The chain graph Markov property. Scandinavian Journal of Statistics, 17:333–353, 1990.

[14] F. Gavril. Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph. SIAM Journal on Computing, 2(1):180–187, 1972.

[15] C. Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI), pages 403–410. Morgan Kaufmann Publishers Inc., 1995.

[16] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. McGraw Hill, 2nd edition, 2001.

[17] V. Williams. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Symposium on Theory of Computing (STOC), pages 887–898, 2012.