{"title": "Cluster Variational Approximations for Structure Learning of Continuous-Time Bayesian Networks from Incomplete Data", "book": "Advances in Neural Information Processing Systems", "page_first": 7880, "page_last": 7890, "abstract": "Continuous-time Bayesian networks (CTBNs) constitute a general and powerful framework for modeling continuous-time stochastic processes on networks. This makes them particularly attractive for learning the directed structures among interacting entities. However, if the available data is incomplete, one needs to simulate the prohibitively complex CTBN dynamics. Existing approximation techniques, such as sampling and low-order variational methods, either scale unfavorably in system size, or are unsatisfactory in terms of accuracy. Inspired by recent advances in statistical physics, we present a new approximation scheme based on cluster-variational methods that significantly improves upon existing variational approximations. We can analytically marginalize the parameters of the approximate CTBN, as these are of secondary importance for structure learning. This recovers a scalable scheme for direct structure learning from incomplete and noisy time-series data. Our approach outperforms existing methods in terms of scalability.", "full_text": "Cluster Variational Approximations for Structure\nLearning of Continuous-Time Bayesian Networks\n\nfrom Incomplete Data\n\nDominik Linzner1 and Heinz Koeppl1,2\n\n1Department of Electrical Engineering and Information Technology\n\n2Department of Biology\n\nTechnische Universit\u00e4t Darmstadt\n\n{dominik.linzner, heinz.koeppl}@bcs.tu-darmstadt.de\n\nAbstract\n\nContinuous-time Bayesian networks (CTBNs) constitute a general and powerful\nframework for modeling continuous-time stochastic processes on networks. This\nmakes them particularly attractive for learning the directed structures among inter-\nacting entities. 
However, if the available data is incomplete, one needs to simulate\nthe prohibitively complex CTBN dynamics. Existing approximation techniques,\nsuch as sampling and low-order variational methods, either scale unfavorably in\nsystem size, or are unsatisfactory in terms of accuracy. Inspired by recent advances\nin statistical physics, we present a new approximation scheme based on cluster\nvariational methods that signi\ufb01cantly improves upon existing variational approxi-\nmations. We can analytically marginalize the parameters of the approximate CTBN,\nas these are of secondary importance for structure learning. This recovers a scalable\nscheme for direct structure learning from incomplete and noisy time-series data.\nOur approach outperforms existing methods in terms of scalability.\n\n1\n\nIntroduction\n\nLearning directed structures among multiple entities from data is an important problem with broad\napplicability, especially in biological sciences, such as genomics [1] or neuroscience [20]. With\nprevalent methods of high-throughput biology, thousands of molecular components can be monitored\nsimultaneously in abundance and time. Changes of biological processes can be modeled as transitions\nof a latent state, such as expression or non-expression of a gene, or activation/inactivation of protein\nactivity. However, processes at the bio-molecular level evolve across vastly different time-scales [12].\nHence, tracking every transition between states is unrealistic. Additionally, biological systems are, in\ngeneral, strongly corrupted by measurement- or intrinsic noise.\nIn previous numerical studies, continuous-time Bayesian networks (CTBNs) [13] have been shown\nto outperform competing methods for reconstruction of directed networks, such as ones based on\nGranger causality or the closely related dynamic Bayesian networks [1]. Yet, CTBNs suffer from\nthe curse of dimensionality, prevalent in multi-component systems. 
This becomes problematic if\nobservations are incomplete, as then the latent state of a CTBN has to be laboriously estimated [15].\nIn order to tackle this problem, approximation methods through sampling [8, 7, 19], or variational\napproaches [5, 6] have been investigated. These, however, either fail to treat high-dimensional spaces\nbecause of sample sparsity, are unsatisfactory in terms of accuracy, or provide good accuracy at the\ncost of an only locally consistent description.\nIn this manuscript, we present, to the best of our knowledge, the \ufb01rst direct structure learning method\nfor CTBNs based on variational inference. We extend the framework of variational inference for\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fmulti-component Markov chains by borrowing results from statistical physics on cluster variational\nmethods [23, 22, 17]. Here the previous result in [5] is recovered as a special case. We derive\napproximate dynamics of CTBNs in form of a new set of ordinary differential equations (ODEs).\nWe show that these are more accurate than existing approximations. We derive a parameter-free\nformulation of these equations, that depends only on the observations, prior assumptions, and the\ngraph structure. Lastly, we recover an approximation for the structure score, which we use to\nimplement a scalable structure learning algorithm. The notion of using marginal CTBN dynamics\nfor network reconstruction from noisy and incomplete observations was recently explored in [21]\nto successfully reconstruct networks of up to eleven nodes by sampling from the exact marginal\nposterior of the process, albeit using large computational effort. Yet, the method is sampling-based\nand thus still scales unfavorably in high dimensions. 
In contrast, we can recover the marginal CTBN dynamics at once, using a standard ODE solver.

2 Background

2.1 Continuous-time Bayesian networks

We consider continuous-time Markov chains (CTMCs) {X(t)}_{t>=0} taking values in a countable state-space S. A time-homogeneous Markov chain evolves according to an intensity matrix R : S x S -> R, whose elements are denoted by R(s, s'), where s, s' \in S. A continuous-time Bayesian network [13] is defined as an N-component process over a factorized state-space S = X_1 x ... x X_N evolving jointly as a CTMC. For local states x_n, x'_n \in X_n, we will drop the states' component index n if evident from the context and no ambiguity arises. We impose a directed graph structure G = (V, E), encoding the relationship among the components V \equiv {V_1, ..., V_N}, which we refer to as nodes. These are connected via an edge set E \subseteq V x V. This quantity - the structure - is what we will later learn. The instantaneous state of each component is denoted by X_n(t), assuming values in X_n, and depends only on the states of a subset of nodes, called the parent set pa(n) \equiv {m | (m, n) \in E}. Conversely, we define the child set ch(n) \equiv {m | (n, m) \in E}. The dynamics of a local state X_n(t) are described as a Markov process conditioned on the current state of all its parents U_n(t), taking values in U_n \equiv {X_m | m \in pa(n)}. They can then be expressed by means of the conditional intensity matrices (CIMs) R^u_n : X_n x X_n -> R, where u_n \equiv (u_1, ..., u_L) \in U_n denotes the current state of the parents (L = |pa(n)|). Specifically, we can express the probability of finding node n in state x' after some small time-step h, given that it was in state x at time t with x, x' \in X_n, as

    P(X_n(t + h) = x' | X_n(t) = x, U_n(t) = u) = \delta_{x,x'} + R^u_n(x, x') h + o(h),

where R^u_n(x, x') is the matrix element of R^u_n corresponding to the transition x -> x' given the parents' state u \in U_n. It holds that R^u_n(x, x) = -\sum_{x' \neq x} R^u_n(x, x'). The CIMs are connected to the joint intensity matrix R of the CTMC via amalgamation - see, for example, [13].

2.2 Variational lower bound

The foundation of this work is to derive a variational lower bound on the evidence of the data for a CTMC. Such variational lower bounds are of great practical significance and pave the way to a multitude of approximate inference methods (variational inference). We consider paths X_{[0,T]} \equiv {X(\xi) | 0 <= \xi <= T} of a CTMC with a series of noisy state observations Y \equiv (Y^0, ..., Y^I) at times (t_0, ..., t_I), drawn according to an observation model Y^i ~ P(Y^i | X(t_i)). We consider the posterior Kullback-Leibler (KL) divergence D_KL(Q(X_{[0,T]}) || P(X_{[0,T]} | Y)) given a candidate distribution Q(X_{[0,T]}), which can be decomposed as D_KL(Q(X_{[0,T]}) || P(X_{[0,T]} | Y)) = D_KL(Q(X_{[0,T]}) || P(X_{[0,T]})) - E[ln P(Y | X_{[0,T]})] + ln P(Y), where E[.] denotes the expectation with respect to Q(X_{[0,T]}), unless specified otherwise. As D_KL(Q(X_{[0,T]}) || P(X_{[0,T]} | Y)) >= 0, this recovers a lower bound on the evidence

    ln P(Y) >= F,    (1)

where the variational lower bound F \equiv -D_KL(Q(X_{[0,T]}) || P(X_{[0,T]})) + E[ln P(Y | X_{[0,T]})] is also known as the Kikuchi functional [11]. The Kikuchi functional has recently found heavy use in variational approximations for probabilistic models [23, 22, 17], because of the freedom it provides for choosing clusters in space and time.
We will now make use of this feature.

Figure 1: Sketch of different cluster choices for a CTBN in discretized time: a) star approximation b) naive mean-field.

3 Cluster variational approximations for CTBNs

The idea behind cluster variational approximations, derived subsequently, is to find an approximation of the variational lower bound using M cluster functionals F_j of smaller sub-graphs A_j(t) for a CTBN using its h-discretization (see Figure 1)

    F \approx \int_0^T dt \sum_{j=1}^M F_j(A_j(t)).

Examples for A_j(t) are the completely local naive mean-field approximation A^{mf}_j(t) = {X_j(t + h), X_j(t)}, or the star approximation A^s_j(t) = {X_j(t + h), U_j(t), X_j(t)}, on which our method is based. In order to lighten the notation, we define Q_h(s' | s) \equiv Q(X(t + h) = s' | X(t) = s) and Q(s, t) \equiv Q(X(t) = s), for Q and P respectively. Marginal probabilities of individual nodes carry their node index as a subindex. The formulation of CTBNs imposes structure on the transition matrix

    P_h(s' | s) = \prod_{n=1}^N P^h_n(x'_n | x_n, u_n),    (2)

suggesting a node-wise factorization to be a natural choice. In order to arrive at the variational lower bound in the star approximation, we assume that Q(X_{[0,T]}) describes a CTBN, i.e. its transition matrices satisfy (2). However, to render our approximation tractable, we further restrict the set of approximating processes. Specifically, we require the existence of some expansion in orders of the coupling strength \varepsilon,

    Q^h_n(x' | x, u) = Q^h_n(x' | x) + O(\varepsilon)  for all n \in {1, ..., N},    (3)

where the remainder O(\varepsilon) contains the dependency on the parents.¹ In the following, we derive the star approximation of a factorized stochastic process.
While the star approximation can be constructed according to the rules of cluster variational methods, see [23, 22, 17], we present a novel derivation via a perturbative expansion of the lower bound. This is meaningful as in cluster variational approximations, the assumptions on the approximating process (and similarity measures), and thus the resulting approximation error, cannot be quantified analytically [23]. This new derivation also highlights the difference to conventional mean-field approximations, where only the class of approximating distributions is restricted. The exact expression of the variational lower bound F for a continuous-time Markov process decomposes into time-wise components F = lim_{h -> 0} \int_0^T dt f_h(t),

    f_h(t) = (1/h) \sum_{s',s} Q_h(s' | s) Q(s, t) ln P_h(s' | s) - (1/h) \sum_{s',s} Q_h(s' | s) Q(s, t) ln Q_h(s' | s),

where we identified the time-dependent energy E(t) (the first term) and the entropy H(t) (the second term, including its sign). Following (2), we can write Q_h(s' | s) = \prod_n Q^h_n(x'_n | x_n, u_n). For now, we consider the time-dependent energy

    E(t) = \sum_{s,s'} Q(s, t) \prod_n Q^h_n(x'_n | x_n, u_n) ln \prod_k P^h_k(x'_k | x_k, u_k).

¹An example of a function with such an expansion is a Markov random field with coupling strength \varepsilon.

We start by making use of the assumption in (3).
Subsequently, we arrive at an expansion of the energy by using the formula from Appendix B.1,

    E(t) = \sum_{s,s'} Q(s, t) { \sum_n [ Q^h_n(x'_n | x_n, u_n) / Q^h_n(x'_n | x_n) ] - (N - 1) } \prod_m Q^h_m(x'_m | x_m) \sum_k ln P^h_k(x'_k | x_k, u_k) + O(\varepsilon^2).

For each k, we can sum over x'_n for each n \neq k. This leaves us with

    E(t) = \sum_n \sum_{s,s'} Q(s, t) Q^h_n(x' | x, u) ln P^h_n(x' | x, u) + O(\varepsilon^2).

The exact same treatment can be applied to the entropy term. Finally, assuming marginal independence Q(s, t) = \prod_n Q_n(x_n, t), we arrive at the weak coupling expansion of the variational lower bound

    f_h(t) = \sum_n \sum_{x',x \in X_n} \sum_{u \in U_n} Q^h_n(x' | x, u) Q_n(x, t) Q^u_n ln [ P^h_n(x' | x, u) / Q^h_n(x' | x, u) ] + O(\varepsilon^2),

with the shorthand Q^u_n \equiv \prod_{l \in pa(n)} Q_l(u_l, t). The variational lower bound F in star approximation (up to first order in \varepsilon) thus decomposes, on the h-discretized network spanned by the CTBN process, into local star-shaped terms, see Figure 1. We emphasize that the variational lower bound in star approximation is no longer a lower bound on the evidence but provides an approximation. We note that, in contrast to the naive mean-field approximation employed in [16, 5], we do not have to drop the dependence on the parents' state of the variational transition matrix.
Indeed, if we consider the variational lower bound in star approximation in zeroth order of \varepsilon, we recover exactly their previous result, demonstrating the generality of our method (see Appendix B.3).

3.1 CTBN dynamics in star approximation

We will now derive differential equations governing CTBN dynamics in star approximation. In order to perform the continuous-time limit h -> 0, we define

    \tau^u_n(x, x', t) \equiv lim_{h -> 0} Q^t_n(x', x, u) / h  for x \neq x',

with the variational transition probability Q^t_n(x', x, u) \equiv Q^h_n(x' | x, u) Q_n(x, t) Q^u_n, and \tau^u_n(x, x, t) \equiv -\sum_{x' \neq x} \tau^u_n(x, x', t). The variational transition probability can then be written as an expansion in h,

    Q^t_n(x', x, u) = \delta_{x,x'} Q_n(x, t) Q^u_n + h \tau^u_n(x, x', t) + o(h).

Checking self-consistency of this quantity via marginalization recovers an inhomogeneous master equation

    \dot{Q}_n(x, t) = \sum_{x' \neq x, u} [ \tau^u_n(x', x, t) - \tau^u_n(x, x', t) ].    (4)

Because of the intrinsic asynchronous update constraint on CTBNs, only local probability flow inside the state-space X_n is allowed. This renders the above equation equivalent to a continuity constraint on the global probability distribution.
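To make the forward propagation concrete, the following minimal Python sketch Euler-integrates the marginal master equation (4) for a small chain of binary nodes. The chain topology, the random rates, and the zeroth-order choice \tau^u_n(x, x', t) = Q_n(x, t) Q^u_n R^u_n(x, x') (i.e. the unconditioned prior dynamics) are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

# Hypothetical setup: a chain of N binary nodes (states {0, 1}), where each
# node n > 0 has the single parent n - 1. With the zeroth-order choice
# tau^u_n(x, x', t) = Q_n(x, t) Q^u_n R^u_n(x, x'), equation (4) reduces to
# dQ_n/dt = Q_n E_n[R^u_n], with E_n[.] the neighborhood average.
N, T, h = 3, 1.0, 1e-3
rng = np.random.default_rng(0)
# rates[n][u]: 2x2 intensity matrix of node n given parent state u.
rates = [[np.array([[-r0, r0], [r1, -r1]])
          for r0, r1 in rng.uniform(0.5, 2.0, size=(2, 2))]
         for _ in range(N)]
Q = np.full((N, 2), 0.5)  # initial marginals Q_n(x, 0)

for _ in range(int(T / h)):
    Qnew = Q.copy()
    for n in range(N):
        # Parent mixture Q^u_n; the root node sees a point mass on u = 0.
        Qu = Q[n - 1] if n > 0 else np.array([1.0, 0.0])
        Rbar = sum(Qu[u] * rates[n][u] for u in range(2))  # E_n[R^u_n]
        Qnew[n] = Q[n] + h * (Q[n] @ Rbar)  # Euler step of the master equation
    Q = Qnew

# Each marginal remains a probability distribution under the flow.
assert np.allclose(Q.sum(axis=1), 1.0)
```

Since each row of an intensity matrix sums to zero, the Euler update conserves normalization of every marginal, which is exactly the continuity constraint noted above.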
After plugging in the variational transition probability into the variational lower bound, we arrive at a functional that depends only on the marginal distributions. Performing the limit h -> 0, we recover a sum of node-wise functionals in continuous time (see Appendix B.2),

    F = F_S + O(\varepsilon^2),  F_S \equiv \sum_{n=1}^N (H_n + E_n) + F_0,

where we identified the variational lower bound in star approximation F_S, the entropy H_n and the energy E_n, respectively, as

    H_n = \int_0^T dt \sum_{x,u} \sum_{x' \neq x} \tau^u_n(x, x', t) [ 1 - ln ( \tau^u_n(x, x', t) / (Q_n(x, t) Q^u_n) ) ],

    E_n = \int_0^T dt [ \sum_x Q_n(x, t) E_n[R^u_n(x, x)] + \sum_{x,u} \sum_{x' \neq x} \tau^u_n(x, x', t) ln R^u_n(x, x') ],

and F_0 = E[ln P(Y | X_{[0,T]})]. The neighborhood average is defined as E_n[f^u(x)] \equiv \sum_{u'} Q^{u'}_n f^{u'}(x) for any function f^u(x). In principle, higher-order clusters can be considered [22, 17]. Lastly, we enforce continuity by imposing (4) as a constraint.

Algorithm 1 Stationary points of Euler-Lagrange equation
1: Input: Initial trajectories Q_n(x, t), boundary conditions Q(x, 0) and \rho(x, T), and data Y.
2: repeat
3:   for all n \in {1, ..., N} do
4:     for all Y^i \in Y do
5:       Update \rho_n(x, t) by backward propagation from t_i to t_{i-1} using (5), fulfilling (6).
6:     end for
7:     Update Q_n(x, t) by forward propagation using (4) given \rho_n(x, t).
8:   end for
9: until Convergence
10: Output: Set of Q_n(x, t) and \rho_n(x, t).
We can then derive the Euler-Lagrange equations corresponding to the Lagrangian

    L = F_S - \int_0^T dt \sum_{n,x} \lambda_n(x, t) { \dot{Q}_n(x, t) - \sum_{x' \neq x, u} [ \tau^u_n(x', x, t) - \tau^u_n(x, x', t) ] },

with Lagrange multipliers \lambda_n(x, t). The approximate dynamics of the CTBN can be recovered as stationary points of the Lagrangian, satisfying the Euler-Lagrange equation. Differentiating L with respect to Q_n(x, t), its time-derivative \dot{Q}_n(x, t), \tau^u_n(x, x', t) and the Lagrange multiplier \lambda_n(x, t) yields a closed set of coupled ODEs for the posterior process of the marginal distributions Q_n(x, t) and the transformed Lagrange multipliers \rho_n(x, t) \equiv exp(\lambda_n(x, t)), eliminating \tau^u_n(x, x', t):

    \dot{Q}_n(x, t) = \sum_{x' \neq x} [ Q_n(x', t) E_n[R^u_n(x', x)] \rho_n(x, t)/\rho_n(x', t) - Q_n(x, t) E_n[R^u_n(x, x')] \rho_n(x', t)/\rho_n(x, t) ],

    \dot{\rho}_n(x, t) = { E_n[R^u_n(x, x)] + \psi_n(x, t) } \rho_n(x, t) - \sum_{x' \neq x} E_n[R^u_n(x, x')] \rho_n(x', t),    (5)

with

    \psi_n(y, t) = \sum_{j \in ch(n)} \sum_{x, x' \neq x} Q_j(x, t) { E_j[R^u_j(x, x') | y] \rho_j(x', t)/\rho_j(x, t) + E_j[R^u_j(x, x) | y] },    (6)

where E_j[. | y] for y \in X_n is the neighborhood average with the state of node n fixed to y. Furthermore, we recover the reset condition

    lim_{t -> t_i^-} \rho_n(x, t) = lim_{t -> t_i^+} \rho_n(x, t) exp{ \sum_{s \in S | s_n = x} ln P(Y^i | s) \prod_{k=1, k \neq n}^N Q_k(x_k, t) },    (7)

which incorporates the conditioning of the dynamics on noisy observations. For the full derivation we refer the reader to Appendix B.4. We require boundary conditions for the evolution interval in order to determine a unique solution to the set of equations (5) and (6). We thus set either Q_n(x, 0) = Y^0_n and \rho_n(x, T) = Y^I_n in the case of noiseless observations, or - if the observations have been corrupted by noise - Q_n(x, 0) = 1/2 and \rho_n(x, T) = 1 as boundaries before the first and after the last observation, respectively. The coupled set of ODEs can then be solved iteratively as a fixed-point procedure in a forward-backward manner, as in previous works [16, 5] (see Algorithm 1). In order to incorporate noisy observations into the CTBN dynamics, we need to assume an observation model. In the following we assume that the data likelihood factorizes, P(Y^i | X) = \prod_n P(Y^i_n | X_n), allowing us to condition on the data by enforcing lim_{t -> t_i^-} \rho_n(x, t) = lim_{t -> t_i^+} P_n(Y^i | x) \rho_n(x, t). In Figure 2, we exemplify CTBN dynamics (N = 3) conditioned on observations corrupted by independent Gaussian noise. We find close agreement with the exact posterior dynamics. Because we only need to solve 2N ODEs to approximate the dynamics of an N-component system, we recover a linear complexity in the number of components, rendering our method scalable.

Figure 2: Dynamics in star approximation of a three node CTBN following Glauber dynamics at a = 1 and b = 0.6, conditioned on noisy observations (diamonds). We plotted the expected state (blue) plus variance (grey area). The observation model is the latent state plus Gaussian random noise of variance \sigma = 0.8 and zero mean. The latent state (dashed) is well estimated for X_2, even when no data has been provided. For comparison, we plotted the exact posterior mean (dots).
For better visibility, we did not plot the exact variance, which depends only on the mean.

3.2 Parameter estimation

Maximization of the variational lower bound with respect to the transition rates R^u_n(x, x') yields the expected result for the estimator of the transition rates,

    \hat{R}^u_n(x, x') = E[M^u_n(x, x')] / E[T^u_n(x)],

given the expected sufficient statistics [15]

    E[T^u_n(x)] = \int_0^T dt Q_n(x, t) Q^u_n,  E[M^u_n(x, x')] = \int_0^T dt \tau^u_n(x, x', t),

where E[T^u_n(x)] is the expected dwelling time of the n'th node in state x and E[M^u_n(x, x')] is the expected number of transitions from state x to x', both conditioned on the parents' state u. Following a standard expectation-maximization (EM) procedure, e.g. [16], we can estimate the system's parameters given the underlying network.

3.3 Benchmark

In the following, we compare the accuracy of the star approximation with the naive mean-field approximation. Throughout this section, we will consider a binary local state-space (spins) X_n = {+1, -1}. We consider a system obeying Glauber dynamics [10] with the rates

    R^u_n(x, -x) = (a/2) ( 1 + x tanh( b \sum_{l \in pa(n)} u_l ) ).

Here, b is the inverse temperature of the system. With increasing b, the dynamics of each node depend more strongly on the dynamics of its neighbors. This corresponds to increasing the perturbation parameter \varepsilon. The pre-factor a scales the overall rate of the process. This system is an appropriate toy example for biological networks, as it encodes additive threshold behavior.
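For concreteness, the Glauber rate above can be assembled into a conditional intensity matrix as follows; this is a small illustrative sketch (the function name and defaults are ours, not from the paper), implementing the rate expression exactly as stated in the text:

```python
import numpy as np

def glauber_cim(parent_spins, a=1.0, b=0.6):
    """2x2 conditional intensity matrix over spin states (+1, -1), using the
    rate R^u_n(x, -x) = a/2 * (1 + x * tanh(b * sum_l u_l)) from the text."""
    field = np.tanh(b * np.sum(parent_spins))
    r_plus = 0.5 * a * (1.0 + field)   # rate of leaving x = +1
    r_minus = 0.5 * a * (1.0 - field)  # rate of leaving x = -1
    return np.array([[-r_plus, r_plus],
                     [r_minus, -r_minus]])

R = glauber_cim([+1, -1, +1])
assert np.allclose(R.sum(axis=1), 0.0)  # rows of an intensity matrix sum to 0
```

The tanh of the summed parent spins implements the additive threshold: the two flip rates move in opposite directions as the parent field grows, and the inverse temperature b controls how strongly.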
In Figure 3 a) and c), we show the mean-squared error (MSE) between the expected sufficient statistics and the true ones for a tree network and an undirected chain with periodic boundaries of eight nodes, so that comparison with the exact result is still tractable. In this application, we restrict ourselves to noiseless observations to better connect to previous results, as in [5]. We compare the estimation of the evidence using the variational lower bound in Figure 3 b) and d). We find that, while our estimate using the star approximation is a much closer approximation, it does not provide a lower bound.

Figure 3: We perform inference on a tree network, see a) and b), and an undirected chain, displayed in c) and d). In both plots we consider a CTBN of eight nodes with noiseless evidence as denoted in the sketch inlet (black: x = -1, white: x = 1) in a) and c), obeying Glauber dynamics with a = 8. In a), we plot the mean-squared error (MSE) for the expected dwelling times (dashed) and the expected number of transitions for the naive mean-field (circle, red) and star approximation (diamond, blue) with respect to the predictions of the exact simulation as a function of temperature b. In b) and d), we plot the approximation of the logarithmic evidence as a function of temperature.
We find that both approximations (star approximation in blue, naive mean-field in red dashed, exact result in black) perform better on the tree network, while the star approximation clearly improves upon the naive mean-field approximation in both scenarios.

4 Cluster variational structure learning for CTBNs

For structure learning tasks, knowing the exact parameters of a model is in general unnecessary. For this reason, we will derive a parameter-free formulation of the variational approximation for the evidence lower bound and the latent state dynamics, analogous to the ones in the previous section. We derive an approximate CTBN structure score, for which we need to marginalize over the parameters of the variational lower bound. To this end, we assume that the parameters of the CTBN are random variables distributed according to a product of local and independent Gamma distributions, P(R | \alpha, \beta, G) = \prod_n \prod_{x,u} \prod_{x' \neq x} Gam[ R^u_n(x, x') | \alpha^u_n(x, x'), \beta^u_n(x) ], given a graph structure G. In star approximation, the evidence is approximately given by P(Y | R, G) \approx exp(F_S). By a simple analytical integration, we recover an approximation to the CTBN structure score

    P(G | Y, \alpha, \beta) \approx P(G) \int_0^\infty dR e^{F_S} P(R | \alpha, \beta, G)
    \propto e^H \prod_n \prod_{x,u} \prod_{x' \neq x} ( \beta^u_n(x) )^{\alpha^u_n(x, x')} / ( E[T^u_n(x)] + \beta^u_n(x) )^{E[M^u_n(x, x')] + \alpha^u_n(x, x')} \Gamma( E[M^u_n(x, x')] + \alpha^u_n(x, x') ) / \Gamma( \alpha^u_n(x, x') ),    (8)

with \Gamma being the Gamma function. The approximated CTBN structure score still satisfies structural modularity, if not broken by the structure prior P(G).
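The per-family factor in (8) is straightforward to evaluate in log-space from the expected sufficient statistics. The sketch below (with hypothetical statistics and our own function name) uses the standard-library lgamma and assumes, as in (8), a Gam(\alpha, \beta) prior per transition:

```python
from math import lgamma, log

def log_family_score(E_M, E_T, alpha=5.0, beta=10.0):
    """Log of the product over (x, x'), x' != x, in structure score (8):
    beta^alpha * Gamma(E[M]+alpha) / ((E[T]+beta)^(E[M]+alpha) * Gamma(alpha)).
    E_M[x][x']: expected transition counts, E_T[x]: expected dwelling times."""
    total = 0.0
    for x in range(len(E_T)):
        for xp in range(len(E_T)):
            if xp == x:
                continue  # product runs over x' != x only
            a_post = E_M[x][xp] + alpha
            total += (alpha * log(beta) - a_post * log(E_T[x] + beta)
                      + lgamma(a_post) - lgamma(alpha))
    return total

# Hypothetical expected statistics for one parent configuration of a binary node:
score = log_family_score([[0.0, 3.2], [4.1, 0.0]], [1.5, 2.5])
```

In a structure search, this quantity would be summed over the parent configurations u of a candidate family and combined with the structure prior P(G).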
However, an implementation of a k-learn structure learning strategy as originally proposed in [14] is prohibited, as the latent state estimation depends on the entire network. For a detailed derivation, see Appendix B.5. Finally, we note that, in contrast to the evidence in Figure 3, we have no analytical expression for the structure score (the integral is intractable), so that we cannot compare with the exact result after integration.

4.1 Marginal dynamics of CTBNs

The evaluation of the approximate CTBN structure score requires the calculation of the latent state dynamics of the marginal CTBN. For this, we approximate the Gamma function in (8) via Stirling's approximation. As Stirling's approximation becomes accurate asymptotically, we thereby assume that sufficiently many transitions have been recorded across samples, or have been introduced via a sufficiently strong prior assumption. By extremization of the marginal variational lower bound, we recover a set of integro-differential equations describing the marginal self-exciting dynamics of the CTBN (see Appendix B.6). Surprisingly, the only difference of this parameter-free version compared to (5) and (6) is that the conditional intensity matrix has been replaced by its posterior estimate

    \bar{R}^u_n(x, x') \equiv ( E[M^u_n(x, x')] + \alpha^u_n(x, x') ) / ( E[T^u_n(x)] + \beta^u_n(x) ).    (9)

The rate \bar{R}^u_n(x, x') is thus determined recursively by the dynamics generated by itself, conditioned on the observations and prior information. We notice the similarity of our result to the one recovered

Table 1: Experimental results with datasets generated from random CTBNs (N = 5) with families of up to k_max parents.
To demonstrate that our score prevents over-fitting, we search for families of up to k = 2 parents. When changing one parameter, the other default values are fixed to D = 10, b = 0.6 and \sigma = 0.2.

k_max  Experiment              Variable   AUROC        AUPR
1      Number of trajectories  D = 5      0.78 ± 0.03  0.64 ± 0.01
                               D = 10     0.87 ± 0.03  0.76 ± 0.00
                               D = 20     0.96 ± 0.02  0.92 ± 0.00
       Measurement noise       σ = 0.6    0.81 ± 0.10  0.71 ± 0.00
                               σ = 1.0    0.69 ± 0.07  0.49 ± 0.01
2      Number of trajectories  D = 5      0.64 ± 0.09  0.50 ± 0.17
                               D = 10     0.68 ± 0.12  0.54 ± 0.14
                               D = 20     0.75 ± 0.11  0.68 ± 0.16
       Measurement noise       σ = 0.6    0.71 ± 0.13  0.58 ± 0.20
                               σ = 1.0    0.64 ± 0.11  0.53 ± 0.15

in [21], where, however, the expected sufficient statistics had to be computed self-consistently during each sample path. We employ a fixed-point iteration scheme to solve the integro-differential equation for the marginal dynamics in a manner similar to EM (for the detailed algorithm, see Appendix A.2).

5 Results and discussion

For the purpose of learning, we employ a greedy hill-climbing strategy. We exhaustively score all possible families for each node with up to k parents and set the highest scoring family as the current one. We do this repeatedly until our network estimate converges, which usually takes only two such sweeps. We can transform the scores to probabilities and generate Receiver-Operating-Characteristic (ROC) and Precision-Recall (PR) curves by thresholding the averaged graphs. As a measure of performance, we calculate the averaged Area-Under-Curve (AUC) for both. We evaluate our method using both synthetic and real-world data from molecular biology.²
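One sweep of the exhaustive family scoring described above can be sketched as follows; `score_family` stands in for the (log) structure score of a candidate parent set, and the toy score below is purely illustrative:

```python
import itertools

def best_families(nodes, k, score_family):
    """For each node, score every parent set of size <= k and keep the best."""
    families = {}
    for n in nodes:
        others = [m for m in nodes if m != n]
        candidates = [ps for size in range(k + 1)
                      for ps in itertools.combinations(others, size)]
        families[n] = max(candidates, key=lambda ps: score_family(n, ps))
    return families

# Toy score favoring the chain parent n - 1, with a small complexity penalty.
toy_score = lambda n, ps: (1.0 if ps == (n - 1,) else 0.0) - 0.1 * len(ps)
families = best_families(range(4), k=1, score_family=toy_score)
assert families[1] == (0,) and families[0] == ()
```

In the full procedure, such sweeps are repeated (with the score recomputed from the updated marginal dynamics) until the network estimate converges; note the exhaustive enumeration still scales combinatorially in the number of candidate parents.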
In order to stabilize our method\nin the presence of sparse data, we augment our algorithm with a prior \u03b1 = 5 and \u03b2 = 10, which is\nuninformative of the structure, for both experiments. We want to stress that, while we removed the\nbottleneck of exponential scaling of latent state estimation of CTBNs, Bayesian structure learning via\nscoring still scales super-exponentially in the number of components [9]. Our method can thus not be\ncompared to shrinkage based network inference methods such as fused graphical lasso.\nThe synthetic experiments are performed on CTBNs encoded with Glauber dynamics. For each of\nthe D trajectories, we recorded 10 observations Y i at random time-points ti and corrupted them with\nGaussian noise with variance \u03c3 = 0.6 and zero mean. In Table 1, we apply our method to random\ngraphs consisting of N = 5 nodes and up to kmax parents. We note that \ufb01xing kmax does not \ufb01x the\npossible degree of the node (which can go up to N \u2212 1). For random graphs with kmax = 1, our\nmethod performs best, as expected, and we are able to reliably recover the correct graph if enough\ndata are provided. To demonstrate that our score penalizes over-\ufb01tting, we search for families of up to\nk = 2 parents. For the more challenging scenario of kmax = 2, we \ufb01nd a drop in performance. This\ncan be explained by the presence of strong correlations in more connected graphs and the increased\nmodel dimension with larger kmax. In order to prove that our method outperforms existing methods\nin terms of scalability, we successfully learn a tree-network, with a leaf-to-root feedback, of 14 nodes\nwith a = 1, b = 0.6, see Figure 4 II). 
This is the largest CTBN inferred from incomplete data reported to date (in [21] a CTBN of 11 nodes was learned, albeit with incomparably larger computational effort).
Finally, we apply our algorithm to the In vivo Reverse-engineering and Modeling Assessment (IRMA) network [4], a gene regulatory network that has been implemented in cultures of yeast as a benchmark for network reconstruction algorithms; see Figure 4 I). Special care was taken to isolate this network from crosstalk with other cellular components. It is thus, to the best of our knowledge, the only molecular-biological network with a known ground truth. The authors of [4] provide time-course data from two perturbation experiments, referred to as "switch on" and "switch off", and attempted

² Our toolbox and code for experiments are available at https://github.com/dlinzner-bcs/.

Figure 4: I) Reconstruction of a gene regulatory network (IRMA) from real-world data. On the left we show the inferred networks for the "switch off" and "switch on" datasets. The ground-truth network is displayed by thin black edges; correctly inferred edges are thick (all inferred edges were correct). The red edge was identified only in "switch on", the teal edge only in "switch off". On the right we show a small table summarizing the reconstruction capabilities of our method, TSNI and BANJO (the PPV of a random guess is 0.5). II) Reconstruction of large graphs. We tested our method on a ground-truth graph with 14 nodes, as displayed in a) with node relations sketched in the inset, encoded with Glauber dynamics, and searched for a maximum of k = 1 parents. Although we used relatively few observations that were strongly corrupted, the averaged learned graph b) is visibly close to the ground truth.
We frame the prediction of the highest-scoring graph: correctly learned edges are framed white, incorrect ones red.

reconstruction using different methods. To compare with their results, we adopt their metrics, the Positive Predictive Value (PPV) and the Sensitivity score (SE) [2]. The best-performing method is ODE-based (TSNI [3]) and required additional information on the perturbed genes in each experiment, which may not always be available. As can be seen in Figure 4 I), our method performs accurately on both the "switch off" and the "switch on" datasets with respect to the PPV. The SE is slightly worse than that of TSNI on "switch off". In both cases, we perform better than the other method based on Bayesian networks (BANJO [24]). Lastly, we note that in [1] more correct edges could be inferred using CTBNs, however with parameters tuned with respect to the ground truth in order to reproduce the IRMA network. For details on our processing of the IRMA data, see Appendix C. For a comparison with other methods tested in [18], we refer to Appendix D, where our method is consistently a top performer in terms of AUROC and AUPR.

6 Conclusion

We develop a novel method for learning directed graphs from incomplete and noisy data based on a continuous-time Bayesian network. To this end, we approximate the exact but intractable latent process by a simpler one using cluster variational methods. We recover a closed set of ordinary differential equations that are simple to implement using standard solvers and retain a consistent and accurate approximation of the original process. Additionally, we provide a close approximation to the evidence in the form of a variational lower bound that can be used for learning tasks.
Lastly, we demonstrate how the marginal dynamics of continuous-time Bayesian networks, which depend only on the data, prior assumptions, and the underlying graph structure, can be derived by marginalizing the variational lower bound; this marginalization provides an approximate structure score. We use this score to detect the best-scoring graph via a greedy hill-climbing procedure. Identifying higher-order approximations of the variational lower bound would be beneficial in the future. We test our method on synthetic as well as real data and show that it produces meaningful results while outperforming existing methods in terms of scalability.

Acknowledgements

We thank the anonymous reviewers for helpful comments on the previous version of this manuscript. Dominik Linzner is funded by the European Union's Horizon 2020 research and innovation programme under grant agreement 668858.

[Figure 4 graphic: panel I) shows the IRMA genes (SWI5, CBF1, ASH1, GAL4/GAL80) with a PPV/SE table for TSNI, CTBN and BANJO on the "switch on" and "switch off" datasets; panel II) shows the 14-node ground-truth graph a) and the averaged learned graph b).]

References

[1] Enzo Acerbi, Teresa Zelante, Vipin Narang, and Fabio Stella. Gene network inference using continuous time Bayesian networks: a comparative study and application to Th17 cell differentiation. BMC Bioinformatics, 15, 2014.

[2] Mukesh Bansal, Vincenzo Belcastro, Alberto Ambesi-Impiombato, and Diego di Bernardo. How to infer gene networks from expression profiles. Molecular Systems Biology, 3:78, 2007.

[3] Mukesh Bansal, Giusy Della Gatta, and Diego di Bernardo. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles.
Bioinformatics, 22(7):815–822, 2006.

[4] Irene Cantone, Lucia Marucci, Francesco Iorio, Maria Aurelia Ricci, Vincenzo Belcastro, Mukesh Bansal, Stefania Santini, Mario Di Bernardo, Diego di Bernardo, and Maria Pia Cosma. A Yeast Synthetic Network for In Vivo Assessment of Reverse-Engineering and Modeling Approaches. Cell, 137(1):172–181, 2009.

[5] Ido Cohn, Tal El-Hay, Nir Friedman, and Raz Kupferman. Mean field variational approximation for continuous-time Bayesian networks. Journal of Machine Learning Research, 11:2745–2783, 2010.

[6] Tal El-Hay, Ido Cohn, Nir Friedman, and Raz Kupferman. Continuous-Time Belief Propagation. Proceedings of the 27th International Conference on Machine Learning, pages 343–350, 2010.

[7] Tal El-Hay, Raz Kupferman, and Nir Friedman. Gibbs sampling in factorized continuous-time Markov processes. Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, 2008.

[8] Yu Fan and Christian R. Shelton. Sampling for approximate inference in continuous time Bayesian networks. International Symposium on Artificial Intelligence and Mathematics, 2008.

[9] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning Probabilistic Relational Models. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), 1999.

[10] Roy J. Glauber. Time-Dependent Statistics of the Ising Model. Journal of Mathematical Physics, 4(2):294–307, 1963.

[11] Ryoichi Kikuchi. A theory of cooperative phenomena. Physical Review, 81(6):988–1003, 1951.

[12] Michael Klann and Heinz Koeppl. Spatial Simulations in Systems Biology: From Molecules to Cells. International Journal of Molecular Sciences, 13(6):7798–7827, 2012.

[13] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Continuous Time Bayesian Networks. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 378–387, 2002.

[14] Uri Nodelman, Christian R.
Shelton, and Daphne Koller. Learning continuous time Bayesian networks. Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, pages 451–458, 2003.

[15] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Expectation Maximization and Complex Duration Distributions for Continuous Time Bayesian Networks. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pages 421–430, 2005.

[16] Manfred Opper and Guido Sanguinetti. Variational inference for Markov jump processes. Advances in Neural Information Processing Systems 20, pages 1105–1112, 2008.

[17] Alessandro Pelizzola and Marco Pretti. Variational approximations for stochastic dynamics on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2017(7):1–28, 2017.

[18] Christopher A. Penfold and David L. Wild. How to infer gene networks from expression profiles, revisited. Interface Focus, 1(6):857–870, 2011.

[19] Vinayak Rao and Yee Whye Teh. Fast MCMC sampling for Markov jump processes and extensions. Journal of Machine Learning Research, 14:3295–3320, 2013.

[20] Eric E. Schadt, John Lamb, Xia Yang, Jun Zhu, Steve Edwards, Debraj Guha Thakurta, Solveig K. Sieberts, Stephanie Monks, Marc Reitman, Chunsheng Zhang, Pek Yee Lum, Amy Leonardson, Rolf Thieringer, Joseph M. Metzger, Liming Yang, John Castle, Haoyuan Zhu, Shera F. Kash, Thomas A. Drake, Alan Sachs, and Aldons J. Lusis. An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics, 37(7):710–717, 2005.

[21] Lukas Studer, Christoph Zechner, Matthias Reumann, Loic Pauleve, Maria Rodriguez Martinez, and Heinz Koeppl. Marginalized Continuous Time Bayesian Networks for Network Reconstruction from Incomplete Observations.
Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016), pages 2051–2057, 2016.

[22] Eduardo Domínguez Vázquez, Gino Del Ferraro, and Federico Ricci-Tersenghi. A simple analytical description of the non-stationary dynamics in Ising spin systems. Journal of Statistical Mechanics: Theory and Experiment, 2017(3):033303, 2017.

[23] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Advances in Neural Information Processing Systems 13, pages 657–663, 2000.

[24] Jing Yu, V. Anne Smith, Paul P. Wang, Alexander J. Hartemink, and Erich D. Jarvis. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics, 20(18):3594–3603, 2004.