{"title": "Scalable Structure Learning of Continuous-Time Bayesian Networks from Incomplete Data", "book": "Advances in Neural Information Processing Systems", "page_first": 3746, "page_last": 3756, "abstract": "Continuous-time Bayesian Networks (CTBNs) represent a compact yet powerful framework for understanding multivariate time-series data. Given complete data, parameters and structure can be estimated efficiently in closed-form. However, if data is incomplete, the latent states of the CTBN have to be estimated by laboriously simulating the intractable dynamics of the assumed CTBN. This is a problem, especially for structure learning tasks, where this has to be done for each element of a super-exponentially growing set of possible structures. In order to circumvent this notorious bottleneck, we develop a novel gradient-based approach to structure learning. Instead of sampling and scoring all possible structures individually, we assume the generator of the CTBN to be composed as a mixture of generators stemming from different structures. In this framework, structure learning can be performed via a gradient-based optimization of mixture weights. We combine this approach with a new variational method that allows for a closed-form calculation of this mixture marginal likelihood.\nWe show the scalability of our method by learning structures of previously inaccessible sizes from synthetic and real-world data.", "full_text": "Scalable Structure Learning of Continuous-Time\n\nBayesian Networks from Incomplete Data\n\nDominik Linzner1 Michael Schmidt1 Heinz Koeppl1,2\n\n1Department of Electrical Engineering and Information Technology\n\n{dominik.linzner, michael.schmidt, heinz.koeppl}@bcs.tu-darmstadt.de\n\n2Department of Biology\n\nTechnische Universit\u00e4t Darmstadt\n\nAbstract\n\nContinuous-time Bayesian Networks (CTBNs) represent a compact yet powerful\nframework for understanding multivariate time-series data. 
Given complete data,\nparameters and structure can be estimated ef\ufb01ciently in closed-form. However, if\ndata is incomplete, the latent states of the CTBN have to be estimated by laboriously\nsimulating the intractable dynamics of the assumed CTBN. This is a problem,\nespecially for structure learning tasks, where this has to be done for each element\nof a super-exponentially growing set of possible structures. In order to circumvent\nthis notorious bottleneck, we develop a novel gradient-based approach to structure\nlearning. Instead of sampling and scoring all possible structures individually, we\nassume the generator of the CTBN to be composed as a mixture of generators\nstemming from different structures. In this framework, structure learning can be\nperformed via a gradient-based optimization of mixture weights. We combine this\napproach with a new variational method that allows for a closed-form calculation\nof this mixture marginal likelihood. We show the scalability of our method by\nlearning structures of previously inaccessible sizes from synthetic and real-world\ndata.\n\n1\n\nIntroduction\n\nLearning correlative or causative dependencies in multivariate data is a fundamental problem in\nscience and has application across many disciplines such as natural and social sciences, \ufb01nance and\nengineering [1, 20]. Most statistical approaches consider the case of snapshot or static data, where\none assumes that the data is drawn from an unknown probability distribution. For that case several\nmethods for learning the directed or undirected dependency structure have been proposed, e.g., the PC\nalgorithm [21, 13] or the graphical LASSO [8, 12], respectively. Causality for such models can only\npartially be recovered up to an equivalence class that relates to the preservation of v-structures [21] in\nthe graphical model corresponding to the distribution. 
If longitudinal and especially temporal data is available, structure learning methods need to exploit the temporal ordering of cause and effect that is implicit in the data for determining the causal dependency structure. One assumes that the data are drawn from an unknown stochastic process. Classical approaches such as Granger causality or transfer entropy methods usually require large sample sizes [23]. Dynamic Bayesian networks offer an appealing framework to formulate structure learning for temporal data within the graphical model framework [10]. The fact that the time granularity of the data can often be very different from the actual granularity of the underlying process motivates the extension to continuous-time Bayesian networks (CTBNs) [14], where no time granularity of the unknown process has to be assumed. Learning the structure within the CTBN framework involves a combinatorial search over structures and is hence generally limited to low-dimensional problems, even if one considers variational approaches [11] and/or greedy hill-climbing strategies in structure space [15, 16]. Reminiscent of optimization-based approaches such as the graphical LASSO, where structure scoring is circumvented by performing gradient descent on the edge coefficients of the structure under a sparsity constraint, we here propose the first gradient-based scheme for learning the structure of CTBNs.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Background

2.1 Continuous-time Bayesian Networks

We consider continuous-time Markov chains (CTMCs) {X(t)}_{t \ge 0} taking values in a countable state-space S. A time-homogeneous Markov chain evolves according to an intensity matrix R : S \times S \to \mathbb{R}, whose elements are denoted by R(s, s'), where s, s' \in S. 
A continuous-time Bayesian network [14] is defined as an N-component process over a factorized state-space S = X_1 \times \cdots \times X_N evolving jointly as a CTMC. For local states x_i, x'_i \in X_i, we will drop the states' component index i if it is evident from the context and no ambiguity arises. We impose a directed graph structure G = (V, E), encoding the relationship among the components V \equiv \{V_1, \ldots, V_N\}, which we refer to as nodes. These are connected via an edge set E \subseteq V \times V. This quantity is the structure, which we will later learn. The state of each component is denoted by X_i(t), assuming values in X_i, and depends only on the states of a subset of nodes, called the parent set par_G(i) \equiv \{j \mid (j, i) \in E\}. Conversely, we define the child set ch_G(i) \equiv \{j \mid (i, j) \in E\}. The dynamics of a local state X_i(t) are described by a Markov process conditioned on the current state of all its parents U_i(t), taking values in U_i \equiv \{X_j \mid j \in par_G(i)\}. They can then be expressed by means of the conditional intensity matrices (CIMs) R_i : X_i \times X_i \times U_i \to \mathbb{R}, where u \equiv (u_1, \ldots, u_L) \in U_i denotes the current state of the parents (L = |par_G(i)|). The CIMs are the generators of the dynamics of a CTBN. Specifically, we can express the probability of finding node i in state x' after some small time-step h, given that it was in state x at time t with x, x' \in X_i, as

p(X_i(t+h) = x' \mid X_i(t) = x, U_i(t) = u) = \delta_{x,x'} + h R_i(x, x' \mid u) + o(h),

where R_i(x, x' \mid u) is the rate of the transition x \to x' given the parents' state u \in U_i and \delta_{x,x'} is the Kronecker delta. We further make use of the little-o(h) notation, defined via \lim_{h \to 0} o(h)/h = 0. It holds that R_i(x, x \mid u) = -\sum_{x' \neq x} R_i(x, x' \mid u). 
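As an illustration of how the CIMs drive the joint dynamics, here is a minimal Gillespie-style sampler for a hypothetical two-node binary CTBN; the rate values are arbitrary toy numbers, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy CTBN: node 0 has no parents, node 1 has parent set {0}.
# R0 is node 0's intensity matrix; R1[u] is node 1's CIM given parent state u.
# Each row sums to zero, matching R_i(x, x | u) = -sum_{x' != x} R_i(x, x' | u).
R0 = np.array([[-1.0, 1.0],
               [2.0, -2.0]])
R1 = {0: np.array([[-0.5, 0.5], [3.0, -3.0]]),
      1: np.array([[-3.0, 3.0], [0.5, -0.5]])}

def simulate(T=10.0):
    """Sample one joint trajectory on [0, T]: the total exit rate is the sum of
    the nodes' conditional exit rates, and the jumping node is chosen
    proportionally to its rate."""
    t, x = 0.0, [0, 0]
    traj = [(t, tuple(x))]
    while True:
        exit_rates = np.array([-R0[x[0], x[0]],          # node 0's exit rate
                               -R1[x[0]][x[1], x[1]]])   # node 1's, given parent x[0]
        total = exit_rates.sum()
        t += rng.exponential(1.0 / total)
        if t > T:
            break
        i = rng.choice(2, p=exit_rates / total)  # which node jumps first
        x[i] = 1 - x[i]                          # binary state flip
        traj.append((t, tuple(x)))
    return traj
```

From such trajectories, the complete-data statistics discussed next are read off by counting transitions and accumulating dwell times per parent configuration.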
The CIMs are connected to the joint intensity matrix R of the CTMC via amalgamation -- see, for example, [14].

2.2 Structure Learning for CTBNs

Complete data. The likelihood of a CTBN can be expressed in terms of its sufficient statistics [15]: M_i(x, x' \mid u), the number of transitions of node i from state x to x', and T_i(x \mid u), the amount of time node i spent in state x. In order to avoid clutter, we introduce the sets M \equiv \{M_i(x, x' \mid u) \mid i \in \{1, \ldots, N\}, x, x' \in X, u \in U\} and T \equiv \{T_i(x \mid u) \mid i \in \{1, \ldots, N\}, x \in X, u \in U\}. The likelihood then takes the form

p(M, T \mid G, R) = \prod_{i=1}^{N} \exp\Big\{ \sum_{x, x' \neq x, u} M_i(x, x' \mid u) \ln R_i(x, x' \mid u) - T_i(x \mid u) R_i(x, x' \mid u) \Big\}.   (1)

In [15], and similarly in [22], it was shown that a marginal likelihood for the structure can be calculated in closed form when assuming a gamma prior over the rates, R_i(x, x' \mid u) \sim \mathrm{Gam}(\alpha_i(x, x' \mid u), \beta_i(x \mid u)). In this case, the marginal log-likelihood of a structure takes the form

\ln p(M, T \mid G, \alpha, \beta) \propto \sum_{i=1}^{N} \sum_{u, x, x' \neq x} \{ \ln \Gamma(\bar\alpha_i(x, x' \mid u)) - \bar\alpha_i(x, x' \mid u) \ln \bar\beta_i(x \mid u) \},   (2)

with \bar\alpha_i(x, x' \mid u) \equiv M_i(x, x' \mid u) + \alpha_i(x, x' \mid u) and \bar\beta_i(x \mid u) \equiv T_i(x \mid u) + \beta_i(x \mid u). Structure learning in previous works [16, 22, 11] is then performed by iterating over possible structures and scoring them using the marginal likelihood. The best-scoring structure is then the maximum-a-posteriori estimate of the structure.
Incomplete data. In many cases, the sufficient statistics of a CTBN cannot be provided. 
Instead, data comes in the form of noisy state observations at some points in time. In the following, we will assume data is provided in the form of N_s samples D \equiv \{(t_k, y_k) \mid k \in \{1, \ldots, N_s\}\}, where y_k is some, possibly noisy, measurement of the latent state generated by some observation model y_k \sim p(Y = y_k \mid X(t_k) = s) at time t_k. This data is incomplete, as the sufficient statistics of the underlying latent process have to be estimated before model identification can be performed. In [16], an expectation-maximization scheme for structure learning (SEM) was introduced, in which, given a proposal CTBN, sufficient statistics were first estimated by exact inference, the CTBN parameters were optimized given those expected sufficient statistics and, subsequently, structures were scored via (1). Similarly, in [11] expected sufficient statistics were estimated via variational inference under marginal (parameter-free) dynamics and structures were then scored via (2).
The problem of structure learning from incomplete data has two distinct bottlenecks: (i) latent state estimation, which scales exponentially in the number of nodes, and (ii) structure identification, which scales super-exponentially in the number of nodes. While bottleneck (i) has been tackled in many ways [4, 5, 19, 11], existing approaches [16, 11] employ a combinatorial search over structures; thus, an efficient solution for bottleneck (ii) is still outstanding.
Our approach. We will employ a similar strategy in this manuscript. However, statistics are estimated under a marginal CTBN that no longer depends on rate parameters or a discrete structure. Instead, statistics are estimated given a mixture of different parent-sets. Thus, instead of blindly iterating over possible structures in a hill-climbing procedure, we can update our distribution over structures by a gradient step. 
This allows us to converge directly into regions of high probability. Further, by combining this gradient-based approach with a high-order variational method, we can estimate the expected sufficient statistics in large systems. These two features combined enable us to perform structure learning in large systems. An implementation of our method is available via Git^1.

3 Likelihood of CTBNs Under a Mixture of CIMs

Complete data. In the following, we consider a CTBN over some over-complete^2 graph G. In practice, this graph may be derived from data as prior knowledge. In the absence of prior knowledge, we will choose the full graph. We want to represent its CIMs R_i(x, x' \mid u), here for node i, as a mixture of CIMs of smaller support, and write, using the power-set P(\cdot) (the set of all possible subsets),

R_i(x, x' \mid u) = \sum_{m \in P(par_G(i))} \pi_i(m)\, r_i(x, x' \mid u_m) \equiv E^\pi_i[r_i(x, x' \mid u_m)],   (3)

where u_m denotes the projection of the full parent-state u onto the subset m, i.e. f(u_m) = \sum_{u/u_m} f(u), and the expectation is E^\pi_i[f(\theta_m)] = \sum_{m \in P(par_G(i))} \pi_i(m) f(\theta_m). The mixture weights are given by a distribution \pi_i \in \Delta_i, with \Delta_i being the |P(par_G(i))|-dimensional probability simplex. Corresponding edge probabilities of the graph can be computed via marginalization. The probability that an edge e_{ij} \in E exists is then

p(e_{ij} = 1) = \sum_{m \in P(par_G(j))} \pi_j(m)\, \mathbb{1}(i \in m),   (4)

with \mathbb{1}(\cdot) being the indicator function. In order to arrive at a marginal score for the mixture, we insert (3) into (1) and apply Jensen's inequality, E^\pi_i[\ln(r)] \le \ln(E^\pi_i[r]). This yields a lower bound to the mixture likelihood

p(M, T \mid \pi, r) \ge \prod_{i=1}^{N} \prod_{x, x' \neq x, u_m} e^{E^\pi_i[M_i(x, x' \mid u_m) \ln r_i(x, x' \mid u_m) - T_i(x \mid u_m)\, r_i(x, x' \mid u_m)]}.

For details on this derivation, we refer to the supplementary material A.1. Note that Jensen's inequality, which only provides a poor approximation in general, improves with increasing concentration of probability mass and becomes exact for degenerate distributions. For the task of selecting a CTBN with a specific parent-set, it is useful to marginalize over the rate parameters r of the CTBNs. This allows for a direct estimation of the parent-set, without first estimating the rates. This marginal likelihood can be computed under the assumption of independent gamma prior distributions r_i(x, x' \mid u_m) \sim \mathrm{Gam}(\alpha_i(x, x' \mid u_m), \beta_i(x \mid u_m)) over the rates. The marginal likelihood lower bound can then be computed analytically. Under the assumption of independent Dirichlet priors \pi_i \sim \mathrm{Dir}(\pi_i \mid c_i) with concentration parameters c_i, we arrive at a lower bound to the marginal log-posterior of the mixture weights \pi,

\ln p(\pi \mid M, T, \alpha, \beta) \ge \sum_i F_i[M, T, \pi] + \ln Z,   (5)

F_i[M, T, \pi] \equiv \sum_{m, u_m, x, x' \neq x} \{ \ln \Gamma(\bar\alpha_i(x, x' \mid u_m)) - \bar\alpha_i(x, x' \mid u_m) \ln \bar\beta_i(x \mid u_m) \} + \ln \mathrm{Dir}(\pi_i \mid c_i),

with the updated posterior parameters \bar\alpha_i(x, x' \mid u_m) \equiv \pi_i(m) M_i(x, x' \mid u_m) + \alpha_i(x, x' \mid u_m) and \bar\beta_i(x \mid u_m) \equiv \pi_i(m) T_i(x \mid u_m) + \beta_i(x \mid u_m). For details, we refer to the supplementary material A.2.

^1 https://git.rwth-aachen.de/bcs/ssl-ctbn
^2 An over-complete graph has more edges than the underlying true graph, which generated the data.

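As a concrete reading of (4): an edge's probability is the total mixture weight of all parent sets containing its source node. A small self-contained sketch, with hypothetical weights pi_j over the parent candidates {1, 2} of some node j:

```python
from itertools import chain, combinations

def powerset(parents):
    """All candidate parent sets P(par_G(j)) of a node."""
    parents = list(parents)
    subsets = chain.from_iterable(
        combinations(parents, r) for r in range(len(parents) + 1))
    return [frozenset(s) for s in subsets]

def edge_probability(i, pi_j):
    """p(e_ij = 1): summed mixture weight of parent sets m with i in m, eq. (4)."""
    return sum(w for m, w in pi_j.items() if i in m)

# hypothetical mixture weights over the parent sets {}, {1}, {2}, {1, 2}
pi_j = dict(zip(powerset([1, 2]), [0.1, 0.5, 0.1, 0.3]))
```

A degenerate pi_j (all mass on one subset) reproduces a hard structure, which is exactly the regime in which the Jensen bound above becomes exact.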
The constant log-partition function \ln Z can be ignored in the following analysis. Because (5) decomposes into a sum of node-wise terms, the maximum-a-posteriori estimate of the mixture weights of node i can be calculated as the solution of the following optimization problem:

\pi^*_i = \arg\max_{\pi_i \in \Delta_i} \{ F_i[M, T, \pi] \}.   (6)

By construction, learning the mixture weights \pi of the CIMs corresponds to learning a distribution over parent-sets for each node. We have thus re-expressed the problem of structure learning as an estimation of \pi. Further, we note that for any degenerate \pi, (5) coincides with the exact structure score (2).
Incomplete data. In the case of incomplete noisy data D, the likelihood of the CTBN no longer decomposes into node-wise terms. Instead, the likelihood is that of the full amalgamated CTMC [16]. In order to tackle this problem, approximation methods through sampling [7, 6, 19] or variational approaches [4, 5] have been investigated. These, however, either fail to treat high-dimensional spaces because of sample sparsity, are unsatisfactory in terms of accuracy, or provide only an uncontrolled approximation. Our method is based on a variational approximation, namely the weak-coupling expansion [11]. Under this approximation, we recover by the same calculation an approximate likelihood of the same form as (1), where the sufficient statistics M_i(x, x' \mid u) and T_i(x \mid u) are, however, replaced by their expectations E_q[M_i(x, x' \mid u)] and E_q[T_i(x \mid u)] under a variational distribution q; for details we refer to the supplementary B.1. Subsequently, our optimization objective F_i[M, T, \pi] also becomes dependent on the variational distribution, F_i[D, \pi, q]. 
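Problem (6) is a maximization over the probability simplex \Delta_i. The paper solves this step with a Matlab interior-point method; as a purely illustrative substitute for handling the simplex constraint, here is a sketch of gradient ascent in unconstrained softmax parameters (function and parameter names are ours, not from the paper):

```python
import numpy as np

def maximize_on_simplex(grad, k, steps=2000, lr=0.5, seed=0):
    """Maximize a score over the k-dimensional probability simplex by gradient
    ascent in unconstrained parameters theta, with pi = softmax(theta).
    `grad(pi)` must return the gradient of the score at pi."""
    rng = np.random.default_rng(seed)
    theta = 0.01 * rng.normal(size=k)
    for _ in range(steps):
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()
        g = grad(pi)
        theta += lr * pi * (g - pi @ g)  # gradient pulled back through softmax
    pi = np.exp(theta - theta.max())
    return pi / pi.sum()
```

For a linear toy score F(pi) = pi . c, the maximizer is the vertex of the best component, i.e. the iterate concentrates toward a degenerate mixture, mirroring how the learned weights single out a parent set.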
In the following section, we develop an Expectation-Maximization (EM) algorithm that iteratively estimates the expected sufficient statistics given the mixture weights and subsequently optimizes those mixture weights given the expected sufficient statistics.

4 Incomplete Data: Expected Sufficient Statistics Under a Mixture of CIMs

Short review of the foundational method. In [11], the exact posterior over paths of a CTBN given incomplete data D is approximated by a path measure q(X_{[0,T]}) of a variational time-inhomogeneous Markov process via a higher-order variational inference method. For a CTBN, this path measure is fully described by its node-wise marginals q_i(x', x, u; t) \equiv q_i(X_i(t+h) = x', X_i(t) = x, U_i(t) = u; t). From it, one can compute the marginal probability q_i(x; t) of node i being in state x, the marginal probability of the parents q_i(U_i(t) = u; t) \equiv q^u_i(t), and the marginal transition probability \tau_i(x, x', u; t) \equiv \lim_{h \to 0} q_i(x', x, u; t)/h for x \neq x'. The exact forms of the expected statistics were calculated to be

E_q[T_i(x \mid u)] \equiv \int_0^T dt\, q_i(x; t)\, q^u_i(t), \qquad E_q[M_i(x, x' \mid u)] \equiv \int_0^T dt\, \tau_i(x, x', u; t).   (7)

In the following, we will use the shorthands E_q[M] and E_q[T] to denote the sets of expected sufficient statistics. We note that the variational distribution q has the support of the full over-complete parent-set par_G(i). Via marginalization of q_i(x', x, u; t), the marginal probability and the marginal transition probability can be shown to be connected via the relation

\frac{d}{dt} q_i(x; t) = \sum_{x' \neq x, u} [\tau_i(x', x, u; t) - \tau_i(x, x', u; t)].   (8)

Application to our setting. As discussed in the last section, the objective function in the incomplete-data case has the same form as (5),

F_i[D, \pi, q] \equiv \sum_{m, u_m, x, x' \neq x} \{ \ln \Gamma(\bar\alpha^q_i(x, x' \mid u_m)) - \bar\alpha^q_i(x, x' \mid u_m) \ln \bar\beta^q_i(x \mid u_m) \} + \ln \mathrm{Dir}(\pi_i \mid c_i),   (9)

now, however, with \bar\alpha^q_i(x, x' \mid u_m) \equiv \pi_i(m) E_q[M_i(x, x' \mid u_m)] + \alpha_i(x, x' \mid u_m) and \bar\beta^q_i(x \mid u_m) \equiv \pi_i(m) E_q[T_i(x \mid u_m)] + \beta_i(x \mid u_m). In order to arrive at an approximation to the expected sufficient statistics in our case, we have to maximize (9) with respect to q, while fulfilling the constraint (8). The corresponding Lagrangian becomes

L[D, \pi, q, \lambda] = \sum_{i=1}^{N} \Big[ F_i[D, \pi, q] - \sum_{x, x' \neq x, u} \int_0^T dt\, \lambda_i(x; t) \Big\{ \frac{d}{dt} q_i(x; t) - [\tau_i(x', x, u; t) - \tau_i(x, x', u; t)] \Big\} \Big],

with Lagrange multipliers \lambda_i(x; t). In order to derive Euler-Lagrange equations, we employ Stirling's approximation for the gamma function, \Gamma(z) = \sqrt{2\pi/z}\, (z/e)^z (1 + O(1/z)), which becomes exact asymptotically. In our case, Stirling's approximation is valid if \bar\alpha \gg 1. We thereby assumed that either enough data has been recorded, or a sufficiently strong prior \alpha. Finally, we recover the approximate forward- and backward-equations of the mixture CTBNs as the stationary point of the Lagrangian, the Euler-Lagrange equations

\frac{d}{dt} \rho_i(t) = \tilde\Omega^\pi_i(t) \rho_i(t), \qquad \frac{d}{dt} q_i(t) = q_i(t) \Omega^\pi_i(t),   (10)

with effective rate matrices

\Omega^\pi_i(x, x'; t) \equiv E^u_i[\tilde R^\pi_i(x, x' \mid u)] \frac{\rho_i(x'; t)}{\rho_i(x; t)}, \qquad \tilde\Omega^\pi_i(x, x'; t) \equiv (1 - \delta_{x,x'}) E^u_i[\tilde R^\pi_i(x, x' \mid u)] + \delta_{x,x'} \{ E^u_i[R^\pi_i(x, x' \mid u)] + \Psi_i(x; t) \},

with \rho_i(x; t) \equiv \exp(-\lambda_i(x; t)) and \Psi_i(x; t) as given in the supplementary material B.2. Further, we have introduced the shorthand E^u_i[f(u)] = \sum_u f(u) q^u_i(t) and defined the posterior expected rates

R^\pi_i(x, x' \mid u) \equiv E^\pi_i\Big[ \frac{\bar\alpha^q_i(x, x' \mid u_m)}{\bar\beta^q_i(x \mid u_m)} \Big], \qquad \tilde R^\pi_i(x, x' \mid u) \equiv \prod_m \Big( \frac{\bar\alpha^q_i(x, x' \mid u_m)}{\bar\beta^q_i(x \mid u_m)} \Big)^{\pi_i(m)},

which take the form of an arithmetic and a geometric mean, respectively. For the variational transition matrix we find the algebraic relationship

\tau_i(x, x', u; t) = q_i(x; t)\, q^u_i(t)\, \tilde R^\pi_i(x, x' \mid u)\, \frac{\rho_i(x'; t)}{\rho_i(x; t)}.   (11)

Because the derivation is quite lengthy, we refer to supplementary B.2 for details. In order to incorporate noisy observations into the CTBN dynamics, we need to specify an observation model. In the following we assume that the data likelihood factorizes, p(Y = y^k \mid X(t_k) = s) = \prod_i p_i(Y_i = y^k_i \mid X_i(t_k) = x_i), allowing us to condition on the data by enforcing the jump conditions

\lim_{t \to t_k^-} \rho_i(x; t) = \lim_{t \to t_k^+} p_i(Y_i = y^k_i \mid X_i(t_k) = x)\, \rho_i(x; t).   (12)

The converged solutions of the ODE system can then be used to compute the sufficient statistics via (7). For a full derivation, we refer to the supplementary material B.2. We note that in the limiting case of a degenerate mixture distribution \pi, this set of equations reduces to the marginal dynamics for CTBNs proposed in [11]. The set of ODEs can be solved iteratively as a fixed-point procedure, in the same manner as in previous works [17, 4], in a forward-backward procedure (see Algorithm 1).

Algorithm 1 Stationary points of Euler-Lagrange equation
1: Input: Initial trajectories qi(x; t), boundary conditions qi(x; 0) and ρi(x; T), mixture weights π and data D.
2: repeat
3:   repeat
4:     for all i ∈ {1, . . . , N} do
5:       for all (yk, tk) ∈ D do
6:         Update ρi(t) by backward propagation from tk to tk−1 using (10), fulfilling the jump conditions (12).
7:       end for
8:       Update qi(t) by forward propagation using (10) given ρi(t).
9:     end for
10:   until Convergence
11:   Compute expected sufficient statistics using (7) and (11) from qi(t) and ρi(t).
12: until Convergence of F[D, π, q]
13: Output: Set of expected sufficient statistics Eq[M] and Eq[T].

Exhaustive structure search. As we are now able to calculate expected sufficient statistics given mixture weights \pi, we can design an EM-algorithm for structure learning. For this, we iteratively optimize \pi given the expected sufficient statistics, which we subsequently re-calculate. The EM-algorithm is summarized in Algorithm 2.

Algorithm 2 Gradient-based Structure Learning
1: Input: Initial trajectories qi(x; t), boundary conditions qi(x; 0) and ρi(x; T), initial mixture weights π(0), data D and iterator n = 0
2: repeat
3:   Compute expected sufficient statistics Eq[M] and Eq[T] given π(n) using Algorithm 1.
4:   for all i ∈ {1, . . . , N} do
5:     Maximize (6) with respect to πi, set maximizer πi(n+1) = πi* and n → n + 1.
6:   end for
7: until Convergence of F[D, π, q]
8: Output: Maximum-a-posteriori mixture weights π(n)
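Once the propagation has converged, the expectation step reduces to the one-dimensional time integrals of eq. (7). A sketch of that quadrature on a uniform grid (the array layout is illustrative):

```python
import numpy as np

def _trapezoid(y, dt):
    """Trapezoidal rule on a uniform grid with spacing dt."""
    y = np.asarray(y, float)
    return float((y[:-1] + y[1:]).sum() * dt / 2.0)

def expected_statistics(q_x, q_u, tau, dt):
    """Numerical evaluation of the integrals in eq. (7):
    q_x[k] ~ q_i(x; t_k), q_u[k] ~ q_i^u(t_k), tau[k] ~ tau_i(x, x', u; t_k).
    Returns (E_q[T_i(x | u)], E_q[M_i(x, x' | u)])."""
    ET = _trapezoid(np.asarray(q_x) * np.asarray(q_u), dt)
    EM = _trapezoid(tau, dt)
    return ET, EM
```

In practice this is evaluated for every node, state pair, and parent configuration of the over-complete graph before each M-step.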
In contrast to the exact EM-procedure [16], we preserve structure modularity. We can thus optimize the parent-set of each node independently. This already provides a huge boost in performance, as in our case the search space scales exponentially in the number of components, instead of super-exponentially. In the paragraph "Greedy structure search", we will demonstrate how to further reduce complexity to a polynomial scaling, while preserving most of the prediction accuracy.
Restricted exhaustive search. In many cases, especially for applications in molecular biology, comprehensive databases^3 of putative interactions are available and can be used to construct over-complete yet not fully connected prior networks G_0 of reported gene and protein interactions. In this case we can restrict the search space by excluding possible non-reported parents for every node i, par_G(i) = par_{G_0}(i), allowing for structure learning of large networks.
Greedy structure search. Although we have derived a gradient-based scheme for exhaustive search, the number of possible mixture components still equals the number of all possible parent-sets. However, in many applications, it is reasonable to assume the number of parents to be limited, which corresponds to a sparsity assumption. For this reason, greedy schemes for structure learning have been proposed in previous works [16]. Here, candidate parent-sets were limited to have at most K parents, in which case the number of candidate graphs only grows polynomially in the number of nodes. In order to incorporate a similar scheme in our method, we have to perform an additional approximation to the set of equations (10). The problem lies in the expectation step (Algorithm 1), as the expectation is performed with respect to the full over-complete graph. 
In order to calculate expectations of the geometric mean E^u_i[\tilde R^\pi_i(x, x' \mid u)], we have to consider the over-complete set of parent marginals q^u_i(t) for each node i. However, for the calculation of the arithmetic mean E^u_i[R^\pi_i(x, x' \mid u)], only parent-sets restricted to the considered sub-graphs have to be considered, due to linearity. For this reason, we approximate the geometric mean by the arithmetic mean, \tilde R^\pi_i \approx R^\pi_i, corresponding to the first-order expansion E^\pi_i[\ln(x)] = \ln(E^\pi_i[x]) + O(\mathrm{Var}[x]), which, as before, becomes more valid for more concentrated \pi_i and is exact if \pi_i is degenerate.

^3 e.g. https://string-db.org/ or https://www.ebi.ac.uk/intact/

Figure 1: a) and b) AUROC and AUPR, respectively, for complete observations for different numbers of trajectories. Learning is performed via the graph-score (2) (blue) and gradient-based optimization of the marginal mixture likelihood (5) (red-dashed). c) Relative deviation of the approximate marginal mixture likelihood (5) from the exact marginal likelihood, computed via numerical integration, for mixtures of different entropies given different amounts of trajectories (legend). Confidence intervals are given by 75% and 25% percentiles.

5 Experiments

We demonstrate the effectiveness of our method on synthetic and two real-world data sets. For all experiments, we consider a fixed set of hyper-parameters. We set the Dirichlet concentration parameter c_i = 0.9 for all i \in \{1, \ldots, N\}. Further, we assume a prior for the generators which is uninformative on the structure: \alpha_i(x, x' \mid u) = 5 and \beta_i(x \mid u) = 10 for all x, x' \in X_i, u \in U_i. For the optimization step in Algorithm 2, we use a standard Matlab implementation of the interior-point method with 100 random restarts. 
This is feasible, as the Jacobian of (9) can be calculated analytically.

5.1 Synthetic Data

In this experiment, we consider synthetic data generated by random graphs with a flat degree distribution, truncated at degree two, i.e. each node has a maximal number of two parents. We restrict the state-space of each node to be binary, X = \{-1, 1\}. The generators of each node are chosen such that they undergo Glauber dynamics [9],

R_i(x, \bar{x} \mid u) = \frac{1}{2} + \frac{1}{2} \tanh\Big( \gamma\, \bar{x} \sum_{j \in par_G(i)} u_j \Big),

which is a popular model for benchmarking, also in the CTBN literature [4]. The parameter \gamma denotes the coupling strength of node j to i. With increasing \gamma the dynamics of the network become increasingly deterministic, converging to a logical model for \gamma \to \infty. In order to avoid violating the weak-coupling assumption [11] underlying our method, we choose \gamma = 0.6. We generated a varying number of trajectories, each containing 10 transitions. In order to have a fair evaluation, we generate data from thirty random graphs among five nodes, as described above. By computing the edge probabilities p(e_{ij} = 1) via (4), we can evaluate the performance of our method as an edge classifier by computing the receiver-operating characteristic curve (ROC) and the precision-recall curve (PR) and their areas under the curve (AUROC and AUPR). For an unbiased classifier, both quantities have to approach 1 for increasing amounts of data.
Complete data. In this experiment, we investigate the viability of using the marginal mixture likelihood lower-bound (5) given the complete data in the form of the sufficient statistics M and T. 
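The exact structure score (2), used as the baseline in this comparison, is a plain sum of log-gamma terms over the complete-data statistics. A minimal sketch with an illustrative dictionary layout; the default prior values are the alpha = 5, beta = 10 fixed above:

```python
from math import lgamma, log

def family_score(M, T, alpha=5.0, beta=10.0):
    """Node-wise marginal log-likelihood of eq. (2), up to structure-independent
    constants. M[u][(x, xp)] are transition counts and T[u][x] dwell times of
    one node under a candidate parent set; the layout is illustrative."""
    score = 0.0
    for u, counts in M.items():
        for (x, xp), m in counts.items():
            if xp == x:
                continue
            a = m + alpha        # alpha-bar: posterior shape
            b = T[u][x] + beta   # beta-bar:  posterior rate
            score += lgamma(a) - a * log(b)
    return score
```

Exhaustive scoring evaluates this for every candidate parent set of every node and keeps the argmax, which is the combinatorial search the gradient-based mixture approach replaces.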
In Figure 1 we compare the AUROCs a) and AUPRs b) achieved in an edge classification task using exhaustive scoring of the exact marginal likelihood (2) as in [15] (blue) and gradient ascent in \pi of the mixture marginal likelihood lower-bound (5) (red-dashed). In Figure 1 c) we show via numerical integration that the marginal mixture likelihood lower-bound approaches the exact one (2) for decreasing entropy of \pi and increasing number of trajectories. Small negative deviations are due to the limited accuracy of the numerical integration. Additional synthetic experiments investigating the effect of different concentration parameters c can be found in the supplementary C.1.

Figure 2: a) AUROCs and b) AUPRs for varying number of trajectories. c) ROC and d) PR curve for 40 trajectories. In all plots (red) denotes the exhaustive, (blue/dashed) the greedy algorithm. e) ROC curve and f) PR curve for different initial \pi(0), where (red) denotes heuristic and (grey/dashed) random. Confidence intervals are given by 75% and 25% percentiles of the results from 30 random graphs, generated as explained in the main text.

Incomplete data. Next, we test our method for network inference from incomplete data. Noisy incomplete observations were generated by measuring the state at N_s = 10 uniformly random time-points and adding Gaussian noise with zero mean and variance 0.2. Because the expectation step in Algorithm 1 is only approximate [11], we do not expect a perfect classifier in this experiment. We compare the exhaustive search with a K = 4 parents greedy search, such that both methods have the same search space. We initialized both methods with \pi^{(0)}_i(m) = 1 if m = par_G(i) and 0 else, as a heuristic. 
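The edge-classification metrics can be computed directly from the edge probabilities (4) against the ground-truth adjacency; a minimal sketch of the AUROC via the rank statistic (the AUPR follows analogously from a precision-recall sweep):

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve, computed as the Wilcoxon-Mann-Whitney
    statistic: the fraction of (positive, negative) pairs ranked correctly,
    with ties between scores counting one half."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

Here `scores` would hold the p(e_ij = 1) values and `labels` the true edges; a perfect edge classifier attains 1.0.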
In Figure 2 a) and b), it can be seen that both methods approach AUROCs and AUPRs close to one for increasing amounts of data. However, due to the additional approximation in the greedy algorithm, it performs slightly worse. In Figure 2 c) and d) we plot the corresponding ROC and PR curves for 40 trajectories.

Scalability. We compare the scalability of our gradient-based greedy structure search with a greedy hill-climbing structure search (K = 2) using variational inference as in [11] (we limited this search to one sweep over families). We fixed all parameters as before and the number of trajectories to 40. Results are displayed in Figure 3.

Dependence on initial values. We investigate the performance of our method with respect to different initial values. For this, we draw the initial values of the mixture components uniformly at random, π̃_i^(0)(m) ∼ U(0, 1), and then project them onto the probability simplex via normalization, π_i^(0)(m) = π̃_i^(0)(m) / Σ_n π̃_i^(0)(n). We fixed all parameters as before and the number of trajectories to 40. In Figure 2, we display the ROC e) and PR f) curves for our heuristic and for random initial values.
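The random initialization described above can be sketched in a few lines: uniform draws followed by normalization onto the probability simplex. The helper name is ours, introduced only for illustration.

```python
import random

def random_initial_mixture(n_components, seed=None):
    """Draw raw weights ~ U(0,1) for each mixture component and normalize
    them so they sum to one (projection onto the probability simplex)."""
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(n_components)]
    total = sum(raw)
    return [r / total for r in raw]

# One random initialization over a hypothetical set of 8 candidate parent sets:
pi0 = random_initial_mixture(8, seed=42)
```

Repeating this per node yields the random initial values π_i^(0) compared against the heuristic initialization in Figure 2 e) and f).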
We \ufb01nd, that the\nheuristic performs almost consistently better.\n\nFigure 3: Run-time comparison of hill-\nclimbing structure search with variational in-\nference as in [11] with our gradient-based\nmethod.\n\n(m)/(cid:80)\n\ni \u223c U (0, 1) and \u03c0(0)\n\ni\n\nn \u02dc\u03c0(0)\n\ni\n\ni\n\n8\n\n00.51FPR00.20.40.60.81TPR00.51Recall00.20.40.60.81Precision00.51FPR00.20.40.60.81TPR00.51Recall00.20.40.60.81Precisiona)b)c)d)e)f)10203040Number of Trajectories00.20.40.60.81AUROCexhaustivegreedy10203040Number of Trajectories00.20.40.60.81AUPRexhaustivegreedy345678number of nodes01234computational time [a.u.]gradient-based greedy seachhill-climbing structure search\fTable 1: AUROC (AUPR) of different methods on IRMA-data (top performers in bold).\n\nmethod\nsteady state\nDBN\n\nODE\nNDS\n\nGC\nCTBN\n\nrandom\n\nknockout\nG1DBN\nVBSSM\nTNSI\nGP4GRN\nCSId\nCSIc\nGCCA\nexhaustive\ngreedy K=2\n\nswitch on\n0.68 (0.42)\n0.78 (0.64)\n0.79 (0.70)\n0.68 (0.51)\n0.73 (0.61)\n0.63 (0.46)\n0.64 (0.39)\n0.71 (0.55)\n0.81 (0.86)\n0.88 (0.85)\n0.65 (0.45)\n\nswitch off\n0.81 (0.50)\n0.61 (0.34)\n0.76 (0.60)\n0.68 (0.42)\n0.76 (0.57)\n0.86 (0.72)\n0.73 (0.59)\n0.74 (0.65)\n0.93 (0.92)\n0.91 (0.89)\n0.65 (0.45)\n\n5.2 Real-world data\n\nBritish household dataset. We show scalabil-\nity in a realistic setting, we applied our method\nto the British Household Panel Survey (ESRC\nResearch Centre on Micro-social Change, 2003).\nThis dataset has been collected yearly from 1991\nto 2002, thus consisting of 11 time-points. Each\nof the 1535 participants was questioned about sev-\neral facts of their life. We picked 15 of those, that\nwe deemed interpretable, some of them, \"health\nstatus\", \"job status\" and \"health issues\", having\nnon-binary state-spaces. Because the participants\nhad the option of not answering a question and\nchanges in their lives are unlikely to happen dur-\ning the questionnaire, this dataset is strongly in-\ncomplete. 
Out of the 1535 trajectories, we picked 600 at random and inferred the network presented in Figure 4. In supplementary C.2 we investigate the stability of this result. We performed inference with our greedy algorithm (K = 2). This dataset has been considered in [16], where a network among 4 variables was inferred. Inferring a large network at once is important, as latent variables can create spurious edges in the network [2].

IRMA gene-regulatory network. Finally, we investigate performance on realistic data. For this, we apply our method to the In vivo Reverse-engineering and Modeling Assessment (IRMA) network [3]. It is, to the best of our knowledge, the only molecular biological network with a ground-truth. This gene-regulatory network has been implemented on cultures of yeast as a benchmark for network-reconstruction algorithms. Special care has been taken to isolate this network from crosstalk with other cellular components. The authors of [3] provide time-course data from two perturbation experiments, referred to as "switch on" and "switch off", and attempted reconstruction using different methods. In Table 1, we compare to other methods tested in [18]. For more details on this experiment and on the other methods, we refer to the supplementary C.3.

Figure 4: Learned structure using gradient-based greedy structure learning with maximal K = 2 parents from 600 trajectories.

6 Conclusion

We presented a novel scalable gradient-based approach to structure learning for CTBNs from complete and incomplete data, and demonstrated its usefulness on synthetic and real-world data. In the future, we plan to apply our algorithm to new bio-molecular datasets.
Further, we believe that the mixture likelihood may also be applicable to tasks other than structure learning.

Acknowledgements

We thank the anonymous reviewers for helpful comments on the previous version of this manuscript. Dominik Linzner and Michael Schmidt are funded by the European Union's Horizon 2020 research and innovation programme (iPC–Pediatric Cure, No. 826121). Heinz Koeppl acknowledges support by the European Research Council (ERC) within the CONSYN project, No. 773196, and by the Hessian research priority programme LOEWE within the project CompuGene.

[Figure 4 node labels: disabled, car, smokes, married, hospital, job status, promotion option, looking for work, financial situation, living with partner, health status, childcare]

References

[1] Enzo Acerbi, Teresa Zelante, Vipin Narang, and Fabio Stella. Gene network inference using continuous time Bayesian networks: a comparative study and application to Th17 cell differentiation. BMC Bioinformatics, 15, 2014.

[2] Claudia Battistin, Benjamin Dunn, and Yasser Roudi. Learning with unknowns: Analyzing biological data in the presence of hidden variables. Current Opinion in Systems Biology, 1:122–128, 2017.

[3] Irene Cantone, Lucia Marucci, Francesco Iorio, Maria Aurelia Ricci, Vincenzo Belcastro, Mukesh Bansal, Stefania Santini, Mario Di Bernardo, Diego di Bernardo, and Maria Pia Cosma. A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell, 137(1):172–181, 2009.

[4] Ido Cohn, Tal El-Hay, Nir Friedman, and Raz Kupferman. Mean field variational approximation for continuous-time Bayesian networks. Journal of Machine Learning Research, 11:2745–2783, 2010.

[5] Tal El-Hay, Ido Cohn, Nir Friedman, and Raz Kupferman. Continuous-time belief propagation. Proceedings of the 27th International Conference on Machine Learning, pages 343–350, 2010.

[6] Tal El-Hay, Raz Kupferman, and Nir Friedman.
Gibbs sampling in factorized continuous-time Markov processes. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 2011.

[7] Yu Fan and Christian R. Shelton. Sampling for approximate inference in continuous time Bayesian networks. International Symposium on Artificial Intelligence and Mathematics, 2008.

[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[9] Roy J. Glauber. Time-dependent statistics of the Ising model. Journal of Mathematical Physics, 4:294–307, 1963.

[10] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[11] Dominik Linzner and Heinz Koeppl. Cluster variational approximations for structure learning of continuous-time Bayesian networks from incomplete data. Advances in Neural Information Processing Systems 31, pages 7880–7890, 2018.

[12] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006.

[13] Preetam Nandy, Alain Hauser, and Marloes H. Maathuis. High-dimensional consistency in score-based and hybrid structure learning. Annals of Statistics, 46(6A):3151–3183, 2018.

[14] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Continuous time Bayesian networks. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 378–387, 2002.

[15] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Learning continuous time Bayesian networks. Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, pages 451–458, 2003.

[16] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Expectation maximization and complex duration distributions for continuous time Bayesian networks. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pages 421–430, 2005.

[17] Manfred Opper and Guido Sanguinetti.
Variational inference for Markov jump processes. Advances in Neural Information Processing Systems 20, pages 1105–1112, 2008.

[18] Christopher A. Penfold and David L. Wild. How to infer gene networks from expression profiles, revisited. Interface Focus, 1(6):857–870, 2011.

[19] Vinayak Rao and Yee Whye Teh. Fast MCMC sampling for Markov jump processes and extensions. Journal of Machine Learning Research, 14:3295–3320, 2013.

[20] Eric E. Schadt, John Lamb, Xia Yang, Jun Zhu, Steve Edwards, Debraj Guha Thakurta, Solveig K. Sieberts, Stephanie Monks, Marc Reitman, Chunsheng Zhang, Pek Yee Lum, Amy Leonardson, Rolf Thieringer, Joseph M. Metzger, Liming Yang, John Castle, Haoyuan Zhu, Shera F. Kash, Thomas A. Drake, Alan Sachs, and Aldons J. Lusis. An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics, 37(7):710–717, 2005.

[21] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.

[22] Lukas Studer, Christoph Zechner, Matthias Reumann, Loic Pauleve, Maria Rodriguez Martinez, and Heinz Koeppl. Marginalized continuous time Bayesian networks for network reconstruction from incomplete observations. Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016), pages 2051–2057, 2016.

[23] Cunlu Zou and Jianfeng Feng. Granger causality vs. dynamic Bayesian network inference: a comparative study. BMC Bioinformatics, 10(1):122, 2009.