{"title": "Rate-Agnostic (Causal) Structure Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3303, "page_last": 3311, "abstract": "Causal structure learning from time series data is a major scientific challenge. Existing algorithms assume that measurements occur sufficiently quickly; more precisely, they assume that the system and measurement timescales are approximately equal. In many scientific domains, however, measurements occur at a significantly slower rate than the underlying system changes. Moreover, the size of the mismatch between timescales is often unknown. This paper provides three distinct causal structure learning algorithms, all of which discover all dynamic graphs that could explain the observed measurement data as arising from undersampling at some rate. That is, these algorithms all learn causal structure without assuming any particular relation between the measurement and system timescales; they are thus rate-agnostic. We apply these algorithms to data from simulations. The results provide insight into the challenge of undersampling.", "full_text": "Rate-Agnostic (Causal) Structure Learning\n\nSergey Plis\n\nThe Mind Research Network,\n\nAlbuquerque, NM\n\ns.m.plis@gmail.com\n\nDavid Danks\n\nCarnegie-Mellon University\n\nPittsburgh, PA\n\nddanks@cmu.edu\n\nCynthia Freeman\n\nThe Mind Research Network,\n\nCS Dept., University of New Mexico\n\nAlbuquerque, NM\n\ncynthiaw2004@gmail.com\n\nVince Calhoun\n\nThe Mind Research Network\n\nECE Dept., University of New Mexico\n\nAlbuquerque, NM\n\nvcalhoun@mrn.org\n\nAbstract\n\nCausal structure learning from time series data is a major scienti\ufb01c challenge. Ex-\ntant algorithms assume that measurements occur suf\ufb01ciently quickly; more pre-\ncisely, they assume approximately equal system and measurement timescales. 
In many domains, however, measurements occur at a significantly slower rate than the underlying system changes, but the size of the timescale mismatch is often unknown. This paper develops three causal structure learning algorithms, each of which discovers all dynamic causal graphs that explain the observed measurement data, perhaps given undersampling. That is, these algorithms all learn causal structure in a “rate-agnostic” manner: they do not assume any particular relation between the measurement and system timescales. We apply these algorithms to data from simulations to gain insight into the challenge of undersampling.

1 Introduction

Dynamic causal systems are a major focus of scientific investigation in diverse domains, including neuroscience, economics, meteorology, and education. One significant limitation in all of these sciences is the difficulty of measuring the relevant variables at an appropriate timescale for the particular scientific domain. This challenge is particularly salient in neuroimaging: standard fMRI experiments sample the brain's bloodflow approximately every one or two seconds, though the underlying neural activity (i.e., the major driver of bloodflow) occurs much more rapidly. Moreover, the precise timescale of the underlying causal system is unknown; it is almost certainly faster than the fMRI measurements, but it is unknown how much faster.

In this paper, we aim to learn the causal structure of a system that evolves at timescale τS, given measurements at timescale τM. We focus on the case in which τS is faster than τM to an unknown degree. We assume that the underlying causal structure can be modeled as a directed graphical model G without simultaneous influence.
There has been substantial work on modeling the statistics of time series, but relatively less on learning causal structure, and almost all of that assumes that the measurement and causal timescales match [1–5]. The problem of causal learning from “undersampled” time series data was explicitly addressed by [6, 7], but they assumed that the degree of undersampling (i.e., the ratio of τS to τM) was both known and small. In contrast, we focus on the significantly harder challenge of causal learning when that ratio is unknown.

We provide a formal specification of the problem and representational framework in Section 2. We then present three different Rate-Agnostic Structure Learning (RASL) algorithms in Section 3. We finish in Section 4 by exploring their performance on synthetic data.

2 Representation and Formalism

A dynamic causal graphical model consists of a graph G over random variables V at the current time t, as well as nodes for V at all previous (relative) timesteps that contain a direct cause of a variable at the current timestep.[1] The Markov order of the system is the largest k such that V^{t−k}_i → V^t_j, where superscripts denote timestep. We assume throughout that the “true” underlying causal system is Markov order 1, and that all causally relevant variables are measured.[2] Finally, we assume that there are no isochronal causal edges V^t_i → V^t_j; causal influences inevitably take time to propagate, and so any apparent isochronal edge will disappear when measured sufficiently finely. Since we do not assume that the causal timescale τS is known, this is a relatively innocuous assumption. G is thus over 2V nodes, where the only edges are V^{t−1}_i → V^t_j, where possibly i = j.
There is additionally a conditional probability distribution or density, P(V^t | V^{t−1}), which we assume to be time-independent. We do not, however, assume stationarity of P(V^t). Finally, we assume appropriate versions of the Markov (“Variable V is independent of non-descendants given parents”) and Faithfulness/Stability (“The only independencies are those implied by Markov”) assumptions, such that the graph and probability distribution/density mutually constrain each other.

Let {t0, t1, ..., tk, ...} be the measurement timesteps. We undersample at rate u when we measure only timesteps {t0, tu, ..., tuk, ...}; the causal timescale is thus “undersampled at rate 1.” We denote the causal graph resulting from undersampling at rate u by G^u. To obtain G^u, we “unroll” G^1 by introducing nodes for V^{t−2} that bear the same graphical and parametric relations to V^{t−1} as those variables bear to V^t, and iterate until we have included V^{t−u}. We then marginalize out all variables except those in V^t and V^{t−u}.

Marginalization yields an Acyclic Directed Mixed Graph (ADMG) G^u containing both directed and bidirected edges [8]. V^{t−u}_i → V^t_j in G^u iff there is a directed path from V^{t−u}_i to V^t_j in the unrolled graph. Define a trek to be a pair of directed paths (π1, π2) such that both have the same start variable. V^t_i ↔ V^t_j in G^u iff there is a trek between V^t_i and V^t_j with length(π1) = length(π2) = k < u. Clearly, if a bidirected edge occurs in G^m, then it occurs in G^u for all u ≥ m.

Unrolling-and-marginalizing can be computationally complex due to duplication of nodes, and so we instead use compressed graphs that encode temporal relations in edges. For an arbitrary dynamic causal graph H, H is its compressed graph representation: (i) H is over non-time-indexed nodes for V; (ii) Vi → Vj in H iff V^{t−1}_i → V^t_j in H; and (iii) Vi ↔ Vj in H iff V^t_i ↔ V^t_j in H. Compressed graphs can be cyclic (Vi ⇄ Vj for V^{t−1}_i → V^t_j and V^{t−1}_j → V^t_i), including self-cycles.
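The path/trek characterization above can be computed directly on compressed graphs with boolean matrix products. A minimal sketch follows (pure Python; the function name `undersample` and the adjacency-matrix representation are our illustrative choices, not the paper's code):

```python
def undersample(A, u):
    """Compute the compressed graph G^u from G^1 (boolean adjacency matrix A).

    Returns (D, B): D[i][j] is True iff there is a directed path of length u
    from V_i to V_j in G^1 (a directed edge of G^u); B[i][j] is True iff some
    common ancestor reaches V_i and V_j by equal-length paths of length k < u
    (a bidirected edge of G^u).
    """
    n = len(A)
    P = [[bool(x) for x in row] for row in A]      # paths of length 1
    B = [[False] * n for _ in range(n)]
    for k in range(1, u):
        # record treks (pi1, pi2) with length(pi1) = length(pi2) = k < u
        for i in range(n):
            for j in range(n):
                if i != j and any(P[s][i] and P[s][j] for s in range(n)):
                    B[i][j] = True
        # extend to paths of length k + 1: boolean matrix product P @ A
        P = [[any(P[i][s] and A[s][j] for s in range(n)) for j in range(n)]
             for i in range(n)]
    return P, B
```

For example, undersampling the 3-cycle V1 → V2 → V3 → V1 at u = 2 reverses the cycle and introduces no bidirected edges, since no node has two children.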
There is clearly a 1-1 mapping between dynamic ADMGs and compressed graphs.

Computationally, the effects of undersampling at rate u can be computed in a compressed graph simply by finding directed paths of length u in G^1. More precisely, V^{t−u}_i → V^t_j in G^u iff there is a directed path of length u in G^1. Similarly, V^t_i ↔ V^t_j in G^u iff there is a trek with length(π1) = length(π2) = k < u in G^1. We thus use compressed graphs going forward.

3 Algorithms

The core question of this paper is: given H = G^u for unknown u, what can be inferred about G^1? Let ⟦H⟧ = {G^1 : ∃u G^u = H} be the equivalence class of G^1 that could, for some undersample rate, yield H. We are thus trying to learn ⟦H⟧ from H. An obvious brute-force algorithm is: for each possible G^1, compute the corresponding graphs for all u, and then output every G^1 for which some G^u = H. Equally obviously, this algorithm will be computationally intractable for any reasonable n, as there are 2^(n^2) possible G^1 and u can (in theory) be arbitrarily large. Instead, we pursue three different constructive strategies that more efficiently “build” the members of ⟦H⟧ (Sections 3.2, 3.3, and 3.4). Because these algorithms make no assumptions about u, we refer to them each as RASL (Rate-Agnostic Structure Learner) and use subscripts to distinguish between different types. First, though, we provide some key theoretical results about forward inference that will be used by all three algorithms.

[1] We use difference equations in our analyses.
The results and algorithms will be applicable to systems of differential equations to the extent that they can be approximated by a system of difference equations.

[2] More precisely, we assume a dynamic variant of the Causal Sufficiency assumption, though it is more complicated than just “no unmeasured common causes.”

3.1 Nonparametric Forward Inference

For given G^1 and u, there is an efficient algorithm [9] for calculating G^u, but it is only useful in learning if we have stopping rules that constrain which G^1 and u should ever be considered. These rules will depend on how G^u changes as u → ∞. A key notion is a strongly connected component (SCC) in G^1: a maximal set of variables S ⊆ V such that, for every X, Y ∈ S (possibly X = Y), there is a directed path from X to Y. Non-singleton SCCs are clearly cyclic and can provably be decomposed into a set of (possibly overlapping) simple loops (i.e., those in which no node is repeated): σ1, ..., σs [10]. Let LS be the set of those simple loop lengths.

One stopping rule must specify, for given G^1, which u to consider. For a single SCC, the greatest common divisor of simple loop lengths (where gcd(LS) = 1 for singleton S) is key: gcd(LS) = 1 iff ∃f s.t. ∀u > f [G^u = G^f]; that is, gcd() determines whether an SCC “converges” to a fixed-point graph as u → ∞. We can constrain u if there is such a fixed-point graph, and Theorem 3.1 generalizes [9, Theorem 5] to provide an upper bound on (interesting) u. (All proofs can be found in the supplement.)

Theorem 3.1. If gcd(LS) = 1, then stabilization occurs at f ≤ nF + γ + d + 1, where nF is the Frobenius number,[3] d is the graph diameter, and γ is the transit number (see supplement).

This is a theoretically useful bound, but it is not practically helpful, since neither γ nor nF has a known analytic expression.
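The gcd condition above can be checked directly: enumerate the simple loop lengths LS of the compressed graph and take their gcd. A small sketch (pure Python; the DFS that starts each loop only at its smallest node is our own simple enumeration strategy, adequate only for small graphs):

```python
from functools import reduce
from math import gcd

def simple_loop_lengths(succ):
    """Lengths of all simple loops of a compressed graph.

    succ: list of successor sets, succ[i] = {j : Vi -> Vj}. Each simple loop
    is counted once by only starting the DFS at the loop's smallest node.
    """
    n = len(succ)
    lengths = set()

    def dfs(start, node, visited, depth):
        for nxt in succ[node]:
            if nxt == start:
                lengths.add(depth)                 # closed a simple loop
            elif nxt > start and nxt not in visited:
                dfs(start, nxt, visited | {nxt}, depth + 1)

    for s in range(n):
        dfs(s, s, {s}, 1)
    return lengths

def converges(succ):
    """gcd(LS) = 1 iff the SCC stabilizes to a fixed-point graph as u grows."""
    ls = simple_loop_lengths(succ)
    return bool(ls) and reduce(gcd, ls) == 1
```

A 3-cycle alone has LS = {3} and does not converge; adding a self-loop gives LS = {1, 3}, gcd 1, and hence a fixed-point graph.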
Moreover, gcd(LS) = 1 is a weak restriction, but a restriction nonetheless. We instead use a functional stopping rule for u (Theorem 3.2) that holds for all G:

Theorem 3.2. If G^u = G^v for u > v, then ∀w > u ∃kw < u [G^w = G^{kw}].

That is, as u increases, if we find a graph that we previously encountered, then there cannot be any new graphs as u → ∞. For a given G^1, we can thus determine all possible corresponding undersampled graphs by computing G^2, G^3, ... until we encounter a previously-observed graph. This stopping rule enables us to (correctly) constrain the u that are considered for each G^1.

We also require a stopping rule for G^1, as we cannot evaluate all 2^(n^2) possible graphs for any reasonable n. The key theoretical result is:

Theorem 3.3. If G^1 ⊆ J^1, then ∀u [G^u ⊆ J^u].

Let G^1_E be the graph resulting from adding the edges in E to G^1. Since this is simply another graph, it can be undersampled at rate u; denote the result (G^1_E)^u. Since G^1_E can always serve as J^1 in Theorem 3.3, we immediately have the following two corollaries:

Corollary 3.4. If G^u ⊈ H, then ∀E [(G^1_E)^u ⊈ H].

Corollary 3.5. If ∀u [G^u ⊈ H], then ∀E, u [(G^1_E)^u ⊈ H].

We thus have a stopping rule for some candidate G^1: if G^u is not an edge-subset of H for all u, then do not consider any edge-superset of G^1. This stopping rule fits very cleanly with “constructive” algorithms that iteratively add edge(s) to candidate G^1. We now develop three such algorithms.

3.2 A recursive edgewise inverse algorithm

The two stopping rules naturally suggest a recursive structure learning algorithm with H as input and ⟦H⟧ as output. Start with an empty graph. For each edge e (of n^2 possible edges), construct G^1 containing only e. If G^u ⊈ H for all u, then reject; else if G^u = H for some u,[4] then add G^1 to ⟦H⟧; else, recurse into non-conflicting graphs in order. Effectively, this is a depth-first search (DFS) algorithm on the solution tree; denote it as RASLre for “recursive edgewise.” Figure 1a provides pseudo-code, and Figure 1b shows how one DFS path in the search tree unfolds. We can prove:

Theorem 3.6. The RASLre algorithm is correct and complete.

One significant drawback of RASLre is that the same graph can be constructed in many different ways, corresponding to different orders of edge addition; the search tree is actually a search lattice.

[3] For a set B of positive integers with gcd(B) = 1, nF is the maximum integer with nF ≠ Σ_i αi Bi for αi ≥ 0.

[4] This check requires at most min(eu, eH) + 1 (fast) operations, where eu, eH are the numbers of edges in G^u and H, respectively. This equality check occurs relatively rarely, since G^u and H must be non-conflicting.

Algorithm RecursiveEqClass
  Input: H
  Output: ⟦H⟧
  initialize empty graph G and set S
  begin EdgeAdder(G*, H, L)
    if L has elements then
      forall the edges in L do
        if the edge creates a conflict then remove it from L
    if L has elements then
      forall the edges in L do
        add the edge to G*
        if ∃G ∈ {(G*)^u} s.t. G = H then add G* to S
        EdgeAdder(G*, H, L \ the edge)
        remove the edge from G*
  put all n^2 edges into list L
  EdgeAdder(G, H, L)
  return S

Figure 1: RASLre algorithm: (a) specification; (b) one branch of the search tree (annotations: candidate edges; constructed graph; pruned conflict-inducing candidate edges; no more non-conflicting candidates: backtrack; generates H; ground truth H).
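For very small n, the brute-force baseline described at the start of Section 3 can be written directly, using the Theorem 3.2 stopping rule to bound u. This is only a sanity-check sketch (our own illustrative code, exponential in n^2), not the pruned RASLre search:

```python
import itertools

def undersample_seq(A):
    """All distinct graphs G^1, G^2, ... reachable by undersampling G^1 = A,
    using the Theorem 3.2 stopping rule: stop once a graph repeats.
    Each graph is a (directed-edges, bidirected-edges) pair of bool tuples."""
    n = len(A)
    P = [[bool(x) for x in row] for row in A]      # length-u paths, u = 1
    B = [[False] * n for _ in range(n)]            # treks with length k < u
    seen = set()
    while True:
        g = (tuple(map(tuple, P)), tuple(map(tuple, B)))
        if g in seen:                              # repeat => no new graphs
            return seen
        seen.add(g)
        B = [[B[i][j] or (i != j and any(P[s][i] and P[s][j] for s in range(n)))
              for j in range(n)] for i in range(n)]
        P = [[any(P[i][s] and A[s][j] for s in range(n)) for j in range(n)]
             for i in range(n)]

def brute_force_eqclass(H, n):
    """All G^1 over n nodes such that G^u = H for some u (feasible for n <= 3)."""
    out = []
    for bits in itertools.product([False, True], repeat=n * n):
        A = [list(bits[i * n:(i + 1) * n]) for i in range(n)]
        if H in undersample_seq(A):
            out.append(A)
    return out
```

For instance, with H the 2-node graph having only the two self-loops (and no bidirected edges), this baseline returns exactly two members: the 2-cycle (whose G^2 is H) and the two-self-loop graph itself.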
The algorithm is thus unnecessarily inefficient, even when we use dynamic programming via memoization of input graphs.

3.3 An iterative edgecentric inverse algorithm

To minimize multiple constructions of the same graph, we can use RASLie (“iterative edgewise”), which generates, at stage i, all not-yet-eliminated G^1 with exactly i edges. More precisely, at stage 0, RASLie starts with the empty graph; if H is also empty, then it adds the empty graph to ⟦H⟧. Otherwise, it moves to stage 1. In general, at stage i + 1, RASLie (a) considers each graph G^1 resulting from a single edge addition to an acceptable graph at stage i; (b) rejects G^1 if it conflicts (for all u) with H; (c) otherwise keeps G^1 as acceptable at i + 1; and (d) if ∃u [G^u = H], then adds G^1 to ⟦H⟧. RASLie continues until there are no more edges to add (or it reaches stage n^2 + 1).

Figure 2 provides the main loop (Figure 2a) and core function of RASLie (Figure 2c), as well as an example of the number of graphs potentially considered at each stage (Figure 2b). RASLie provides significant speed-up and memory gains over RASLre (see Figure 3).

We optimize RASLie by tracking the single edges that could possibly still be added; for example, if a single-edge graph is rejected in stage 1, then we do not consider adding that edge at other stages. Additional conflicts can be derived analytically, further reducing the graphs to consider. In general, absence of an edge in H implies, for the corresponding (unknown) u, absence of length-u paths in all G^1 ∈ ⟦H⟧. Since we do not know u, we cannot directly apply this constraint. However, Lemmas 3.7 and 3.8 provide useful, special-case constraints for u > 1 (implied by a single bidirected edge).

Lemma 3.7. If u > 1, then ∀V ↛ W ∈ H, G^1 cannot contain any of the following paths (where ⟲X denotes a node X with a self-cycle): 1. ⟲V → W; 2. ⟲V → X → W; 3.
V → ⟲X → W; 4. V → X → ⟲W; 5. V → ⟲W.

Lemma 3.8. If u > 1, then ∀V ↮ W ∈ H, ∄T [V ← T → W] ∈ G^1.

3.4 An iterative loopcentric inverse algorithm

RASLie yields results in reasonable time for H with up to 8 nodes, though it is computationally demanding. We can gain further computational advantages if we assume that H is an SCC. This assumption is relatively innocuous, as it requires only that our time series be generated by a system with (appropriate) feedback loops. As noted earlier, any SCC is composed of a set of simple loops, and so we modify RASLie to iteratively add loops instead of edges; call the resulting algorithm

Algorithm IterativeEqClass
  Input: H
  Output: ⟦H⟧
  initialize empty set S
  init d as an empty graph and n^2 edges
  while d do
    d, Si = NextIterationGraphs(d, H)
    S = S ∪ Si
  return S

a: RASLie main algorithm

Procedure NextIterationGraphs
  Input: d (graph:edges structure) and H
  Output: dr and set S ⊆ ⟦H⟧
  initialize empty structure dr and sets S, Si
  forall the graphs G in d do
    forall the edges e in d(G) do
      if e ∈ G then continue
      if e conflicts with G then continue
      add e to G
      if G ∉ Si then
        add G to Si
        if G conflicts with H then continue
        if ∃G̃ ∈ {G^u} s.t.
G̃ = H then add G to S
      remove e from G
  add non-conflicting graphs w/ edges to dr
  return dr, S

b: Three runs of the algorithm
c: Core function of RASLie

Figure 2: RASLie algorithm: (a) main loop; (b) example of graphs considered; and (c) core function.

RASLil for “iterative loopwise.” More precisely, RASLil uses the same algorithm as in Figure 2, but successively attempts to add non-conflicting simple loops, rather than non-conflicting edges. RASLil also incorporates the additional constraints due to Lemmas 3.7 and 3.8.

RASLil is surprisingly much faster than RASLie even though, for n nodes, there are Σ_{i=1}^{n} C(n, i)·(i−1)! simple loops (compared to n^2 edges). The key is that introducing a single simple loop induces multiple constraints simultaneously, and so conflicting graphs are discovered at a much earlier stage. As a result, RASLil checks many fewer graphs in practice. For example, consider the G^1 in Figure 1, with corresponding H for u = 3. RASLre constructs (not counting pruned single edges) 28,661 graphs; RASLie constructs only 249 graphs; and RASLil considers only 47. For u = 2, these numbers are 413, 44, and 7, respectively. Unsurprisingly, these differences in numbers of examined graphs translate directly into wall-clock time differences (Figure 3).

Figure 3: Run-time comparison.

4 Results

All three RASL algorithms take a measurement timescale graph H as input. They are therefore compatible with any structure learning algorithm that outputs a measurement timescale graph, whether Structural Vector Autoregression (SVAR) [11], direct Dynamic Bayes Net search [12], or modifications of standard causal structure learning algorithms such as PC [1, 13] and GES [14].
The problem of learning a measurement timescale graph is a very hard one, but it is also not our primary focus here. Instead, we focus on the performance of the novel RASL algorithms.

First, we abstract away from learning measurement timescale structure and assume that the correct H is provided as input. For these simulated graphs, we focus on SCCs, which are the most scientifically interesting cases. For simplicity (and because within-SCC structure can be learned in parallel for a complex H [9]), we employ single-SCC graphs. To generate random SCCs, we (i) build a single simple loop over n nodes, and (ii) uniformly sample from the other n(n − 1) possible edges until we reach the specified density (i.e., proportion of the n^2 total possible edges). We employ density in order to measure graph complexity in an (approximately) n-independent way.

We can improve the runtime speed of RASLre using memoization, though it is then memory-constrained for n ≥ 6. Figure 3 provides the wall-clock running times for all three RASL algorithms applied to 100 random 5-node graphs at each of three densities. This graph substantiates our earlier claims that RASLil is faster than RASLie, which is faster than RASLre. In fact, each is at least an order of magnitude faster than the previous one.

RASLre would take over a year on the most difficult problems, so we focus exclusively on RASLil. Unsurprisingly, run-time complexity of all RASL algorithms depends on the density of H. For each of three density values (20%, 25%, 30%), we generated 100 random 6-node SCCs, which were then undersampled at rates 2, 3, and 4 before being provided as input to RASLil.
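The SCC generation procedure in (i)–(ii) can be sketched as follows (pure Python; the function name, seeding, and the `max(n, ...)` floor on the edge count are our own choices, not the paper's code):

```python
import random

def random_scc(n, density, seed=0):
    """Random single-SCC G^1: a simple loop over all n nodes plus uniformly
    sampled extra edges until the requested proportion of the n^2 possible
    edges is reached."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    A = [[False] * n for _ in range(n)]
    for i in range(n):
        A[order[i]][order[(i + 1) % n]] = True     # Hamiltonian loop => one SCC
    target = max(n, round(density * n * n))        # total edge count to reach
    rest = [(i, j) for i in range(n) for j in range(n) if not A[i][j]]
    rng.shuffle(rest)
    for i, j in rest[:target - n]:
        A[i][j] = True
    return A
```

The Hamiltonian loop guarantees strong connectivity regardless of which extra edges are sampled, so every output graph is a single SCC by construction.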
Figure 4 summarizes wall-clock computation time as a function of H's density, with different plots based on density of G^1 and undersampling rate. We also show three examples of H with a range of computation runtimes. Unsurprisingly, the most difficult H is quite dense; H with densities below 50% typically require less than one minute.

Figure 4: Run-time behavior.

4.1 Equivalence classes

We first use RASLil to determine ⟦H⟧ size and composition for varying H; that is, we explore the degree of underdetermination produced by undersampling. The worst-case underdetermination occurs if H is a super-clique with every possible edge: ∀X, Y [X → Y & X ↔ Y]. Any SCC with gcd(LS) = 1 becomes a super-clique as u → ∞ [9], so ⟦H⟧ contains all such graphs for super-clique H. We thus note when H is a super-clique, rather than computing the size of ⟦H⟧.

Figure 5: Size of equivalence classes for 100 random SCCs at each density and u ∈ {2, 3, 4}.

Figure 6: Size of equivalence classes for larger graphs, n ∈ {7, 8, 10}, for u ∈ {2, 3}.

Figures 5 and 6 plot equivalence class size as a function of both G^1 density and the true undersampling rate. For each n and density, we (i) generated 100 random G^1; (ii) undersampled each at the indicated u; (iii) passed G^u = H to RASLil; and (iv) computed the size of ⟦H⟧. Interestingly, ⟦H⟧ is typically quite small, sometimes even a singleton. For example, 5-node graphs at u = 2 typically have a singleton ⟦H⟧ up to 40% G^1 density. Even 10-node graphs often have a singleton ⟦H⟧ (though with relatively sparse G^1). Increased undersampling and density both clearly worsen underdetermination, but often not intractably so, particularly since even non-singleton ⟦H⟧ can be valuable if they permit post hoc inspection or analysis.

To focus on the impact of undersampling, we generated 100 random 5-node SCCs with 25% density, each of which was undersampled for u ∈ {2, ..., 11}. Figure 7 plots the size of ⟦H⟧ as a function of u for these graphs. For u ≤ 4, singleton ⟦H⟧ still dominate. Interestingly, even u = 11 still yields some non-superclique H.

Figure 7: Effect of the undersampling rate on equivalence class size.

Finally, G^1 ∈ ⟦H⟧ iff ∃u [G^u = H], but the appropriate u need not be the same for all members of ⟦H⟧. Figure 8 plots the percentages of u-values appropriate for each G^1 ∈ ⟦H⟧, for the H from Figure 5. If actually u_true = 2, then almost all G^1 ∈ ⟦H⟧ are because of G^2; there are rarely G^1 ∈ ⟦H⟧ due to u > 2. If actually u_true > 2, though, then many G^1 ∈ ⟦H⟧ are due to G^u where u ≠ u_true. As density and u_true increase, there is increased underdetermination in both G^1 and u.

Figure 8: Distribution of u for G^u = H, for G^1 ∈ ⟦H⟧, for 5- and 6-node graphs.

4.2 Synthetic data

In practice, we typically must learn H structure from finite sample data. As noted earlier, there are many algorithms for learning H, as it is a measurement timescale structure (though small modifications are required to learn bidirected edges). In pilot testing, we found that structural vector autoregressive (SVAR) model [11] optimization provided the most accurate and stable solutions for H for our simulation regime.
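The data-generation scheme used in this subsection (random stable transition matrix, VAR simulation with unit-variance noise, then discarding samples to reach rate u) can be sketched as follows. This is our own illustrative code: the paper controls the maximal eigenvalue directly, whereas the row-sum rescaling below is a cruder stand-in that merely upper-bounds the spectral radius.

```python
import random

def stabilize(W, target=0.95):
    """Conservatively rescale W so its spectral radius is below 1.

    The max absolute row sum is an upper bound on the spectral radius, so
    scaling it down to `target` guarantees a stable system (assumption: this
    replaces the paper's direct eigenvalue control)."""
    norm = max(sum(abs(w) for w in row) for row in W)
    return [[w * target / norm for w in row] for row in W] if norm > 0 else W

def simulate_var(W, steps, sigma=1.0, seed=0):
    """Simulate X_t = W X_{t-1} + N(0, sigma^2) noise; returns state vectors."""
    rng = random.Random(seed)
    n = len(W)
    x = [0.0] * n
    out = []
    for _ in range(steps):
        x = [sum(W[i][j] * x[j] for j in range(n)) + rng.gauss(0, sigma)
             for i in range(n)]
        out.append(x)
    return out

def undersampled(series, u):
    """Keep every u-th measurement, simulating measurement at rate u."""
    return series[::u]
```

With u = 2, half of the generated datapoints are discarded before the SVAR estimation step, exactly as in the simulation described here.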
We thus employ the SVAR procedure here, though we note that other measurement timescale learning algorithms might work better in different domains.

To test the two-step procedure (SVAR learning passed to RASLil), we generated 20 random 6-node SCCs for each density in {25%, 30%, 35%}. For each random graph, we generated a random transition matrix A by sampling weights for the non-zero elements of the adjacency matrix, and controlling system stability (by keeping the maximal eigenvalue at or below 1). We then generated time series data using a vector auto-regressive (VAR) model [11] with A and random noise (σ = 1). To simulate undersampling, datapoints were removed to yield u = 2. SVAR optimization on the resulting time series yielded a candidate H that was passed to RASLil to obtain ⟦H⟧.

The space of possible H is a factor of 2^(n choose 2) larger than the space of possible G^1, and so SVAR optimization can return an H such that ⟦H⟧ = ∅. If RASLil returns ∅, then we rerun it on all H* that result from a single edge addition or deletion on H. If RASLil returns ∅ for all of those graphs, then we consider the H* that result from two changes to H, then three changes. This search through the 3-step Hamming neighborhood of H essentially always finds an H* with ⟦H*⟧ ≠ ∅.

Figure 9: The estimation and search errors on synthetic data: 6-node graphs, u = 2, 20 graphs per density.

Figure 9 shows the results of the two-step process, where algorithm output is evaluated by two error types. Omission error: the number of omitted edges, normalized by the total number of edges in the ground truth. Commission error: the number of edges not present in the ground truth, normalized by the total possible edges minus the number of those present in the ground truth. We also plot
We also plot\nthe estimation errors of SVAR (on the undersampled data) to capture the dependence of RASLil\nestimation errors on estimation errors for H. Interestingly, RASLil does not signi\ufb01cantly increase\nthe error rates over those produced by the SVAR estimation. In fact, we \ufb01nd the contrary (similarly\nto [6]): the requirement to use an H that could be generated by some undersampled G1 functions as\na regularization constraint that corrects for some SVAR estimation errors.\n5 Conclusion\n\n2\n\nTime series data are widespread in many scienti\ufb01c domains, but if the measurement and system\ntimescales differ, then we can make signi\ufb01cant causal inference errors [9, 15]. Despite this potential\nfor numerous errors, there have been only limited attempts to address this problem [6, 7], and even\nthose methods required strong assumptions about the undersample rate.\nWe here provided the \ufb01rst causal inference algorithms that can reliably learn causal structure from\ntime series data when the system and measurement timescales diverge to an unknown degree. The\nRASL algorithms are complex, but not restricted to toy problems. We also showed that underde-\n\ntermination of G1 is sometimes minimal, given the right methods.(cid:74)H(cid:75) was often small; substantial\nSigni\ufb01cant open problems remain, such as more ef\ufb01cient methods when H has(cid:74)H(cid:75) = \u2205. This paper\n\nsystem timescale causal structure could be learned from undersampled measurement timescale data.\n\nhas, however, expanded our causal inference \u201ctoolbox\u201d to include cases of unknown undersampling.\n\nAcknowledgments\n\nSP & DD contributed equally. This work was supported by awards NIH R01EB005846 (SP); NSF\nIIS-1318759 (SP); NSF IIS-1318815 (DD); & NIH U54HG008540 (DD) (from the National Hu-\nman Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge\n(BD2K) initiative). 
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

[1] A. Moneta, N. Chlaß, D. Entner, and P. Hoyer. Causal search in structural vector autoregressive models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, Causality in Time Series (Proc. NIPS 2009 Mini-Symposium on Causality in Time Series), volume 12, pages 95–114, 2011.

[2] C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.

[3] B. Thiesson, D. Chickering, D. Heckerman, and C. Meek. ARMA time-series modeling with graphical models. In Proceedings of the Twentieth Annual Conference on Uncertainty in Artificial Intelligence (UAI-04), pages 552–560, Arlington, Virginia, 2004. AUAI Press.

[4] Mark Voortman, Denver Dash, and Marek Druzdzel. Learning why things change: The difference-based causality learner. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 641–650, Corvallis, Oregon, 2010. AUAI Press.

[5] Nir Friedman, Kevin Murphy, and Stuart Russell. Learning the structure of dynamic probabilistic networks. In 15th Annual Conference on Uncertainty in Artificial Intelligence, pages 139–147, San Francisco, 1999. Morgan Kaufmann.

[6] Sergey Plis, David Danks, and Jianyu Yang. Mesochronal structure learning. In Proceedings of the Thirty-First Annual Conference on Uncertainty in Artificial Intelligence (UAI-15), Corvallis, Oregon, 2015. AUAI Press.

[7] Mingming Gong, Kun Zhang, Bernhard Schoelkopf, Dacheng Tao, and Philipp Geiger. Discovering temporal causal relations from subsampled data. In Proc. ICML, pages 1898–1906, 2015.

[8] T. Richardson and P. Spirtes.
Ancestral graph Markov models. The Annals of Statistics, 30(4):962–1030, 2002.

[9] David Danks and Sergey Plis. Learning causal structure from undersampled time series. In JMLR: Workshop and Conference Proceedings, volume 1, pages 1–10, 2013.

[10] Donald B. Johnson. Finding all the elementary circuits of a directed graph. SIAM Journal on Computing, 4(1):77–84, 1975.

[11] Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2007.

[12] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, 2002.

[13] Clark Glymour, Peter Spirtes, and Richard Scheines. Causal inference. In Erkenntnis Orientated: A Centennial Volume for Rudolf Carnap and Hans Reichenbach, pages 151–189. Springer, 1991.

[14] David Maxwell Chickering. Optimal structure identification with greedy search. The Journal of Machine Learning Research, 3:507–554, 2003.

[15] Anil K. Seth, Paul Chorley, and Lionel C. Barnett. Granger causality analysis of fMRI BOLD signals is invariant to hemodynamic convolution but not downsampling. NeuroImage, 65:540–555, 2013.

", "award": [], "sourceid": 1831, "authors": [{"given_name": "Sergey", "family_name": "Plis", "institution": "The Mind Research Network"}, {"given_name": "David", "family_name": "Danks", "institution": "Carnegie Mellon University"}, {"given_name": "Cynthia", "family_name": "Freeman", "institution": "The Mind Research Network"}, {"given_name": "Vince", "family_name": "Calhoun", "institution": "MRN"}]}