NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2044 Scalable Structure Learning of Continuous-Time Bayesian Networks from Incomplete Data

### Reviewer 1

I like the approach. This paper describes something new for continuous-time structure estimation. While mixtures have been explored for structure estimation in other domains, they have not been applied here, and non-trivial hurdles were overcome to do so. The method seems to be scalable, but this is not tested in the paper. It would be good to see experiments that demonstrate it, particularly as the BHPS data was down-sampled to 200 trajectories (from 1535) and it isn't clear why. The variational method used seems very similar to that of [4]; the paper should make the connection clearer. Finally, structural EM (SEM) from BNs (see Section 19.4.3 of Koller & Friedman or Friedman's original 1997 paper) has been applied to CTBNs before (see [16], for example, and it seems to be implemented in CTBN-RLE). While exact inference is not scalable, SEM could be used with the variational inference of [4]. This would make a natural comparison, as it is a scalable existing alternative that also employs variational inference.

Minor notes:
- I think Equation 2 needs a "+1" inside the Gamma function, as well.
- The last equation on page 2 of the supplementary material does not seem to account for the sqrt(2pi/z) part of Stirling's approximation (which has an apostrophe, please note).
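To make the Stirling remark concrete, here is a minimal sketch comparing the standard approximation Gamma(z) ≈ sqrt(2pi/z) (z/e)^z with and without the sqrt(2pi/z) prefactor. The function name `log_gamma_stirling` is hypothetical and chosen for illustration; nothing here reproduces the equation in the supplementary material, it only shows the size of the term in question.

```python
import math


def log_gamma_stirling(z, include_prefactor=True):
    """Stirling approximation to log Gamma(z) for z > 0.

    Uses Gamma(z) ~ sqrt(2*pi/z) * (z/e)**z. Setting include_prefactor=False
    drops the sqrt(2*pi/z) factor (illustrative switch, not from the paper).
    """
    value = z * math.log(z) - z
    if include_prefactor:
        value += 0.5 * math.log(2.0 * math.pi / z)
    return value


for z in (5.0, 50.0, 500.0):
    exact = math.lgamma(z)
    err_full = exact - log_gamma_stirling(z)
    err_crude = exact - log_gamma_stirling(z, include_prefactor=False)
    print(f"z={z:6.1f}  error with prefactor={err_full:.4f}  without={err_crude:.4f}")
```

With the prefactor, the error in the log shrinks as z grows; without it, the error grows in magnitude like 0.5 log(z), so whether the term can be dropped depends on what the approximation is used for.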

### Reviewer 2

Originality: the work sheds new light on CTBN models and how to learn them from data when big data has to be managed. Furthermore, the formulation of the structure learning algorithm for complete and incomplete data is a relevant step towards improving the effectiveness of CTBNs.

Quality: the submission is formal and sound. However, I have the following concerns:
- Page 2, formula (1): please explain why the likelihood omits the part related to permanence in any given state.
- Page 3: I read "The integral ca ..."; which one? It would be better to clarify this point, even though I know the answer.
- Page 5: the assumption that alpha >> 1 is quite strong, and I would kindly ask the authors to better motivate, investigate, and analyze its impact on the solutions. I think this could strongly limit the applicability of the proposed approach in cases where there are few observations relative to the number of variables.

I found some minor typos. I would also like to know something about inference on the CTBN once it has been learnt from data, i.e. how are filtering, smoothing, etc. computed?

Clarity: the paper is in general quite clear, although more details and examples introducing the idea of mixtures would have helped the reader to better understand the proposed approach.

Significance: the contribution, in my humble opinion, is relevant with specific reference to the research area of CTBNs. Furthermore, it can help improve results achieved in relevant application domains such as finance, medicine, and biology.
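For context on the formula (1) concern, the standard complete-data likelihood of a homogeneous continuous-time Markov chain has two parts: a transition term (one rate factor per observed jump) and a dwell-time term exp(-q_x T[x]) for the time spent in each state. The sketch below assumes a single node without parents; `ctmc_log_likelihood`, the rate dictionary, and the toy trajectory are hypothetical and only illustrate which term is being asked about, not the paper's own equation.

```python
import math


def ctmc_log_likelihood(rates, trajectory, end_time):
    """Log-likelihood of one fully observed CTMC path.

    rates:      dict mapping (state, next_state) -> transition rate
    trajectory: list of (entry_time, state), sorted by time
    end_time:   time at which observation stops
    """
    # Exit rate q_x is the sum of all outgoing rates of state x.
    exit_rate = {}
    for (x, _), r in rates.items():
        exit_rate[x] = exit_rate.get(x, 0.0) + r

    log_lik = 0.0
    for k, (t, x) in enumerate(trajectory):
        is_last = k == len(trajectory) - 1
        t_next = end_time if is_last else trajectory[k + 1][0]
        # Dwell-time ("permanence") term: -q_x * time spent in x.
        log_lik -= exit_rate.get(x, 0.0) * (t_next - t)
        if not is_last:
            # Transition term: log rate of the observed jump.
            x_next = trajectory[k + 1][1]
            log_lik += math.log(rates[(x, x_next)])
    return log_lik


# Toy two-state example (made-up numbers).
rates = {(0, 1): 2.0, (1, 0): 0.5}
trajectory = [(0.0, 0), (0.4, 1), (1.9, 0)]
print(ctmc_log_likelihood(rates, trajectory, end_time=2.0))
```

Dropping the dwell-time line leaves only the jump counts, which is why an expression without it would look incomplete unless the term has been absorbed or integrated out elsewhere.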

### Reviewer 3

Summary: Within the manuscript, the authors extend continuous-time Bayesian networks by incorporating a mixture prior over the conditional intensity matrices, thereby allowing for a larger class compared to the gamma prior usually employed over these. My main concerns are with clarity / quality, as the manuscript is quite densely written and quite some material has either been omitted or shifted to the appendix. For a non-expert in continuous-time Bayesian networks, it is quite hard to read. Additionally, there are quite a few minor mistakes (see below) that make the manuscript harder to understand. As it stands,

Originality: The authors combine the variational inference method from Linzner et al. [11] with the new prior over the dependency structure (mixture). By replacing sufficient statistics with expected (according to the variational distribution) sufficient statistics, the authors derive a gradient-based scheme according to the approximation to the (marginal) likelihood.

Quality/Clarity: As said, my main concern is about clarity and, to some degree, therefore also quality. My main confusion arises from Section 4 (partly also Section 3), as the overall scheme is opaque to me. This is mainly because part of the derivation is shifted to the appendix. As a result, it is unclear to me how the expected moments can be computed from \rho_i, q_i. It is said that this can be done from (7), but there I need \tau_i; how to obtain it is not explained. Also, the final solution to (9) in (10, 11) does not depend on the observations Y anymore; how is this possible?

Some minor things contributing to my confusion:
- Line 75: "a posteriori estimate": This is not a posterior over structures, but a maximum marginal likelihood estimate.
- Eq (5), line 114: I was wondering about the 'normalization constant'. First, I think it should be mentioned that it is constant w.r.t. \pi. Second, Z is not necessarily the normalization constant of the true posterior but the approximation to the normalization constant that one would obtain if the lower bound of line 105 were used as the likelihood, correct?
- Algorithm 1: it is only mentioned two pages later, and the references to equations don't make sense. Also, this algorithm is not explained at all.
- Line 127: ref [5] is actually EP, not VI.
- Line 149: the shorthand is used later, not there.
- Line 161: psi(x,t): I guess this should depend on Y. As stated, the overall inference scheme does not depend on the observations Y, which does not make sense.
- Line 168: why should the constraint ensure that noisy observations are incorporated? The whole section is opaque to me.
- Figure 1: subfigure labeling is wrong.
- Experiment British household: the authors report ROC scores, but do not mention the classification problem they are trying to solve; what was the ground truth? Also, it seems odd to me that childcare is not linked to children.

Significance: The proposed method does improve the scaling of inferring the dependency structure (reported from 4 nodes to 11). However, other approaches as in were discarded as not being sufficiently accurate or being too data hungry. The quality of the likelihood approximation, for example, could be evaluated on a small toy example and compared against sampling-based approaches, or [11].
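To illustrate why the ROC question needs a stated ground truth: one common convention in structure-learning evaluation (an assumption here, not a description of what the authors did) treats every ordered node pair as a binary "edge present" decision and scores it against a known adjacency matrix. The sketch below computes the resulting AUC as a rank statistic; `edge_roc_auc` and the toy matrices are hypothetical.

```python
def edge_roc_auc(true_adjacency, edge_scores):
    """AUC for edge recovery over all ordered pairs (i, j) with i != j.

    true_adjacency[i][j] is 1 if the edge i -> j exists in the reference graph.
    edge_scores[i][j] is the learner's score for that edge (higher = more likely).
    """
    n = len(true_adjacency)
    positives, negatives = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-loops are not candidate edges
            target = positives if true_adjacency[i][j] else negatives
            target.append(edge_scores[i][j])
    if not positives or not negatives:
        raise ValueError("AUC needs at least one true edge and one non-edge")

    # AUC = probability that a true edge outscores a non-edge (ties count 1/2).
    wins = 0.0
    for p in positives:
        for s in negatives:
            if p > s:
                wins += 1.0
            elif p == s:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy 3-node example with made-up scores: reference edges 0 -> 1 and 1 -> 2.
truth = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
scores = [[0.0, 0.9, 0.2],
          [0.1, 0.0, 0.4],
          [0.6, 0.05, 0.0]]
print(edge_roc_auc(truth, scores))
```

On real survey data no such reference graph exists by default, so the reported ROC score is interpretable only once the authors state which graph played the role of `truth`.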