{"title": "A Model for Temporal Dependencies in Event Streams", "book": "Advances in Neural Information Processing Systems", "page_first": 1962, "page_last": 1970, "abstract": "We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events.", "full_text": "A Model for Temporal Dependencies\n\nin Event Streams\n\nAsela Gunawardana\nMicrosoft Research\nOne Microsoft Way\nRedmond, WA 98052\n\naselag@microsoft.com\n\nChristopher Meek\nMicrosoft Research\nOne Microsoft Way\nRedmond, WA 98052\n\nmeek@microsoft.com\n\nPuyang Xu\n\nECE Dept. & CLSP\n\nJohns Hopkins University\n\nBaltimore, MD 21218\npuyangxu@jhu.edu\n\nAbstract\n\nWe introduce the Piecewise-Constant Conditional Intensity Model, a model for\nlearning temporal dependencies in event streams. We describe a closed-form\nBayesian approach to learning these models, and describe an importance sampling\nalgorithm for forecasting future events using these models, using a proposal distri-\nbution based on Poisson superposition. We then use synthetic data, supercomputer\nevent logs, and web search query logs to illustrate that our learning algorithm can\nef\ufb01ciently learn nonlinear temporal dependencies, and that our importance sam-\npling algorithm can effectively forecast future events.\n\n1\n\nIntroduction\n\nThe problem of modeling temporal dependencies in temporal streams of discrete events arises in\na wide variety of applications. 
For example, system error logs [14], web search query logs, the\n\ufb01ring patterns of neurons [18] and gene expression data [8], can all be viewed as streams of events\nover time. Events carry both information about their timing and their type (e.g., the web query\nissued or the type of error logged), and the dependencies between events can be due to both their\ntiming and their types. Modeling these dependencies is valuable for forecasting future events in\napplications such as system failure prediction for preemptive maintenance or forecasting web users\u2019\nfuture interests for targeted advertising.\nWe introduce the Piecewise-Constant Conditional Intensity Model (PCIM), which is a class of\nmarked point processes [4] that can model the types and timing of events. This model captures\nthe dependencies of each type of event on events in the past through a set of piecewise-constant\nconditional intensity functions. We use decision trees to represent these dependencies and give a\nconjugate prior for this model, allowing for closed-form computation of the marginal likelihood and\nparameter posteriors. Model selection then becomes a problem of choosing a decision tree. Decision\ntree induction can be done ef\ufb01ciently because of the closed form for the marginal likelihood. Fore-\ncasting can be carried out using forward sampling for arbitrary \ufb01nite duration queries. For episodic\nsequence queries, that is, queries that specify particular sequences of events in given future time\nintervals, we develop a novel approach for estimating the probability of rare queries, which we call\nthe Poisson Superposition Importance Sampler (PSIS).\nWe validate our learning and inference procedures empirically. Using synthetic data we show that\nPCIMs can correctly learn the underlying dependency structure of event streams, and that the PSIS\nleads to effective forecasting. 
We then use real supercomputer event log data to show that PCIMs\ncan be learned more than an order of magnitude faster than Poisson Networks [15, 18], and that they\nhave better test set likelihood. Finally, we show that PCIMs and the PSIS are useful in forecasting\nfuture interests of real web search users.\n\n1\n\n\f2 Related Work\n\nWhile graphical models such as Bayesian networks [2] and dependency networks [10] are widely\nused to model the dependencies between variables, they do not model temporal dependencies (see\ne.g., [8]). Dynamic Bayesian Networks (DBN) [5, 9] allow modeling of temporal dependencies in\ndiscrete time. It is not clear how timestamps in our data should be discretized in order to apply the\nDBN approach. At a minimum, too slow a sampling rate results in poor representation of the data,\nand too fast a sampling rate increases the number of samples making learning and inference more\ncostly. In addition, allowing long term dependencies requires conditioning on multiple steps into\nthe past, and choosing too fast a sampling rate increases the number of such steps that need to be\nconditioned on.\nRecent progress in modeling continuous time processes include Continuous Time Bayesian Net-\nworks (CTBNs) [12, 13], Continuous Time Noisy-Or (CT-NOR) [16], Poisson Cascades [17], and\nPoisson Networks [15, 18]. CTBNs are homogeneous Markov models of the joint trajectories of\ndiscrete \ufb01nite variables, rather than models of event streams in continuous time [15]. In contrast,\nCT-NOR and Poisson Cascades model event streams, but require the modeler to choose a parametric\nform for temporal dependencies. Simma et al [16, 17] describe how this choice signi\ufb01cantly impacts\nmodel performance, and depends strongly on the domain. In particular, the problem of model selec-\ntion for CT-NOR and Poisson Cascades is unaddressed. 
PCIMs, in contrast to CT-NOR and Poisson Cascades, perform structure learning to learn how different events in the past affect future events. Poisson Networks, described in more detail below, are closely related to PCIMs, but PCIMs are over an order of magnitude faster to learn and can model nonlinear temporal dependencies.\n\n3 Conditional Intensity Models\n\nIn this section, we define Conditional Intensity Models, introduce the class of Piecewise-Constant Conditional Intensity Models, and describe Poisson Networks. We assume that events of different types are distinguished by labels $l$ drawn from a finite set $L$. An event is then composed of a non-negative timestamp $t$ and a label $l$. An event sequence is $x = \\{(t_i, l_i)\\}_{i=1}^{n}$, where $0 < t_1 < \\cdots < t_n$. The history at time $t$ of event sequence $x$ is the sub-sequence $h(t, x) = \\{(t_i, l_i) \\mid (t_i, l_i) \\in x, t_i \\le t\\}$. We write $h_i$ for $h(t_{i-1}, x)$ when it is clear from context which $x$ is meant. By convention $t_0 = 0$. We define the ending time $t(x)$ of an event sequence $x$ as the time of the last event in $x$: $t(x) = \\max(\\{t : (t, l) \\in x\\})$, so that $t(h_i) = t_{i-1}$.\nA Conditional Intensity Model (CIM) is a set of non-negative conditional intensity functions indexed by label, $\\{\\lambda_l(t \\mid x; \\theta)\\}_{l \\in L}$. The data likelihood for this model is\n\n$$p(x \\mid \\theta) = \\prod_{l \\in L} \\prod_{i=1}^{n} \\lambda_l(t_i \\mid h_i; \\theta)^{\\mathbf{1}_l(l_i)} e^{-\\Lambda_l(t_i \\mid h_i; \\theta)} \\qquad (1)$$\n\nwhere $\\Lambda_l(t \\mid x; \\theta) = \\int_{-\\infty}^{t} \\lambda_l(\\tau \\mid x; \\theta) d\\tau$ for each event sequence $x$, and the indicator function $\\mathbf{1}_l(l')$ is one if $l' = l$ and zero otherwise. The conditional intensities are assumed to satisfy $\\lambda_l(t \\mid x; \\theta) = 0$ for $t \\le t(x)$ to ensure that $t_i > t_{i-1} = t(h_i)$. These modeling assumptions are quite weak. In fact, any distribution for $x$ in which the timestamps are continuous random variables can be written in this form. For more details see [4, 6]. Although the modeling assumptions are weak, these models offer a powerful approach for decomposing the dependencies of different event types on the past. In particular, this per-label conditional factorization allows one to model detailed label-specific dependence on past events.\n\n3.1 Piecewise-Constant Conditional Intensity Models\n\nPiecewise-Constant Conditional Intensity Models (PCIMs) are Conditional Intensity Models where the conditional intensity functions are assumed to be piecewise-constant. As described below, this assumption allows efficient learning and inference. PCIMs are defined in terms of local structures $S_l$ for each label $l$, which specify regions in time where the corresponding conditional intensity function is constant, and local parameters $\\theta_l$ for each label, which specify the values taken in those regions. Formally, PCIMs are defined by local structures $S_l = (\\Sigma_l, \\sigma_l(t, x))$ and local parameters $\\theta_l = \\{\\lambda_{ls}\\}_{s \\in \\Sigma_l}$, where $\\Sigma_l$ denotes a set of discrete states, $\\lambda_{ls}$ are non-negative constants, and $\\sigma_l$ denotes a state function that maps a time and an event sequence to $\\Sigma_l$ and is piecewise constant in time for every event sequence. The conditional intensity functions are defined as $\\lambda_l(t \\mid x) = \\lambda_{ls}$ with $s = \\sigma_l(t, x)$, and thus are piecewise constant. 
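
As a minimal sketch of the definition above, the following Python fragment evaluates a piecewise-constant conditional intensity by looking up the rate attached to the current state. The state function `sigma_a`, the one-unit window, and the rates in `LAMBDA_A` are hypothetical choices made for illustration only; they are not taken from the models learned in this paper.

```python
# Hypothetical PCIM local structure for a single label 'A' (illustration only).
# States: 'recent' if an 'A' event occurred in [t - 1, t), otherwise 'quiet'.

def sigma_a(t, x):
    # State function: maps a time t and an event sequence x to a state.
    # x is a list of (timestamp, label) pairs sorted by timestamp.
    for ti, li in x:
        if li == 'A' and t - 1.0 <= ti < t:
            return 'recent'
    return 'quiet'

# Local parameters: one non-negative rate per state (assumed values).
LAMBDA_A = {'recent': 5.0, 'quiet': 0.1}

def intensity_a(t, x):
    # Conditional intensity of label 'A' at time t: the rate of the state
    # returned by the state function, hence piecewise constant in t.
    return LAMBDA_A[sigma_a(t, x)]

events = [(0.5, 'A'), (0.8, 'B')]
rates = [intensity_a(t, events) for t in (0.4, 1.0, 1.6)]
```

Here the intensity switches to the higher rate only while an 'A' event lies inside the trailing window, which is the kind of dependence on the past that a state function encodes.
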
The resulting data likelihood can be written as\n\n$$p(x \\mid S, \\theta) = \\prod_{l \\in L} \\prod_{s \\in \\Sigma_l} \\lambda_{ls}^{c_{ls}(x)} e^{-\\lambda_{ls} d_{ls}(x)} \\qquad (2)$$\n\nwhere $S = \\{S_l\\}_{l \\in L}$, $\\theta = \\{\\theta_l\\}_{l \\in L}$, $c_{ls}(x)$ is the number of times label $l$ occurs in $x$ when the state function for $l$ maps to state $s$ (i.e., $\\sum_i \\mathbf{1}_l(l_i) \\mathbf{1}_s(\\sigma_l(t_i, h_i))$), and $d_{ls}(x)$ is the total duration during which the state function for $l$ maps to state $s$ in the data $x$ (i.e., $\\int_0^{t(x)} \\mathbf{1}_s(\\sigma_l(\\tau, h(\\tau, x))) d\\tau$).\n\n3.2 Poisson Networks\n\nPoisson Networks [15, 18] are closely related to PCIMs. Given a basis set $B$ of piecewise-constant real-valued feature functions $f(t, x)$, a feature vector $\\sigma_l(t, x)$ is defined for each $l$ by selecting component feature functions from $B$. The resulting $\\sigma_l(t, x)$ are piecewise-constant in time. The conditional intensity for $l$ is given by the regression $\\lambda_l(t \\mid x, \\theta) = e^{w_l \\cdot \\sigma_l(t, x)}$ with parameter $w_l$. By convention, the component $\\sigma_{l,0}(t, x) = 1$ so that $w_{l,0}$ is a bias parameter.\nThe resulting likelihood does not have a conjugate prior, and in our experiments we use iterative MAP parameter estimates under a Gaussian prior, and use a Laplace approximation of the marginal likelihood for structure learning (i.e., feature selection) [15]. In our experiments, each $f \\in B$ is specified by a label $l$ and a pair of time offsets $0 \\le d_1 < d_2$, and takes on the value $\\log\\left(1 + \\frac{c_{l,d_1,d_2}(t, x)}{d_2 - d_1}\\right)$, where $c_{l,d_1,d_2}(t, x)$ is the number of times $l$ occurs in $x$ in the interval $[t - d_2, t - d_1)$.\n\n4 Learning PCIMs\n\nIn this section, we present an efficient learning algorithm for PCIMs. We give a conjugate prior for the parameters $\\theta$ which yields closed-form formulas for the parameter posteriors and the marginal likelihood of the data given a structure $S$. 
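
The closed forms mentioned here can be illustrated with the standard Gamma-Poisson conjugacy. The sketch below assumes a Gamma prior with shape `alpha` and rate `beta` on a single rate, and a likelihood term proportional to lam**count * exp(-lam * duration) for one label and state; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def posterior_params(alpha, beta, count, duration):
    # A Gamma(alpha, beta) prior combined with a likelihood proportional to
    # lam**count * exp(-lam * duration) gives Gamma(alpha + count, beta + duration).
    return alpha + count, beta + duration

def posterior_mean(alpha, beta, count, duration):
    # Mean of the Gamma posterior: shape divided by rate.
    a, b = posterior_params(alpha, beta, count, duration)
    return a / b

def log_marginal(alpha, beta, count, duration):
    # Log of beta**alpha / Gamma(alpha)
    #        * Gamma(alpha + count) / (beta + duration)**(alpha + count),
    # obtained by integrating the rate out against the Gamma prior.
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(alpha + count)
            - (alpha + count) * math.log(beta + duration))

# Assumed numbers: prior mean rate alpha / beta = 0.1 with beta = 1 time unit,
# then 3 events observed over 9 time units spent in the state.
mean_rate = posterior_mean(0.1, 1.0, 3, 9.0)
lm_empty = log_marginal(0.1, 1.0, 0, 0.0)
```

Working in log space with `math.lgamma` avoids overflow when counts are large, which matters when many per-state terms are multiplied together.
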
We then give a decision-tree-based learning algorithm that uses the closed-form marginal likelihood formula to learn the local structure $S_l$ for each label.\n\n4.1 Closed-Form Parameter Posterior and Marginal Likelihood\n\nIn general, computing parameter posteriors for likelihoods of the form of equation (1) is complicated. However, in the case of PCIMs, the Gamma distribution is a conjugate prior for $\\lambda_{ls}$, despite the fact that the data likelihood of equation (2) is not a product of exponential densities (i.e., when $c_{ls}(x) \\neq 1$). The corresponding prior and posterior densities are given by\n\n$$p(\\lambda_{ls} \\mid \\alpha_{ls}, \\beta_{ls}) = \\frac{\\beta_{ls}^{\\alpha_{ls}}}{\\Gamma(\\alpha_{ls})} \\lambda_{ls}^{\\alpha_{ls} - 1} e^{-\\beta_{ls} \\lambda_{ls}}; \\qquad p(\\lambda_{ls} \\mid \\alpha_{ls}, \\beta_{ls}, x) = p(\\lambda_{ls} \\mid \\alpha_{ls} + c_{ls}(x), \\beta_{ls} + d_{ls}(x))$$\n\nAssuming the prior over $\\theta$ is a product of such $p(\\lambda_{ls} \\mid \\alpha_{ls}, \\beta_{ls})$, the marginal likelihood is\n\n$$p(x \\mid S) = \\prod_{l \\in L} \\prod_{s \\in \\Sigma_l} \\gamma_{ls}(x); \\qquad \\gamma_{ls}(x) = \\frac{\\beta_{ls}^{\\alpha_{ls}}}{\\Gamma(\\alpha_{ls})} \\frac{\\Gamma(\\alpha_{ls} + c_{ls}(x))}{(\\beta_{ls} + d_{ls}(x))^{\\alpha_{ls} + c_{ls}(x)}}$$\n\nIn our experiments, we use the point estimate $\\hat{\\lambda}_{ls} = \\frac{\\alpha_{ls} + c_{ls}(x)}{\\beta_{ls} + d_{ls}(x)}$, which is $E[\\lambda_{ls} \\mid x]$.\n\n4.2 Structure Learning with Decision Trees\n\nIn this section, we specify the set of possible structures in terms of a set of basis state functions, a set of decision trees built from them, and a greedy Bayesian model selection procedure for learning a structure. Finally, we describe the particular set of basis state functions we use in our experiments.\nWe use $B$ to denote the set of basis state functions $f(t, x)$, each taking values in a basis state set $\\Sigma_f$. Given $B$, we specify $S_l$ through a decision tree whose interior nodes each have an associated $f \\in B$ and a child corresponding to each value in $\\Sigma_f$. The per-label state set $\\Sigma_l$ is then the set of leaves in the tree. The state function $\\sigma_l(t, x)$ is computed by recursively applying the basis state functions in the tree until a leaf is reached. Note that the resulting mapping is a valid state function by construction.\nIn order to carry out Bayesian model selection, we use a factored structural prior $p(S) \\propto \\prod_{l \\in L} \\prod_{s \\in \\Sigma_l} \\kappa_{ls}$. Since the prior and the marginal likelihood both factor over $l$, the local structures $S_l$ can be chosen independently. We search for each $S_l$ as follows. We begin with $S_l$ being the trivial decision tree that maps all event sequences and times to the root. In this case, $\\lambda_l(t \\mid x) = \\lambda_l$. Given the current $S_l$, we consider $S'_l$ specified by choosing a leaf $s \\in \\Sigma_l$ and a basis state function $f \\in B$, and assigning $f$ to $s$ to get a set of new child leaves $\\{s_1, \\cdots, s_m\\}$ where $m = |\\Sigma_f|$. Because the marginal likelihood factors over states, the gain in the posterior of the structure due to this split is\n\n$$\\frac{p(S'_l \\mid x)}{p(S_l \\mid x)} = \\frac{\\kappa_{ls_1} \\gamma_{ls_1}(x) \\cdots \\kappa_{ls_m} \\gamma_{ls_m}(x)}{\\kappa_{ls} \\gamma_{ls}(x)}.$$\n\nThe next structure $S'_l$ is chosen by selecting the $s$ and $f$ with the largest gain. The search terminates if there is no gain larger than one. We note that the local structure representation and search can be extended from decision trees to decision graphs in a manner analogous to [3].\nIn our experiments, we wish to learn how events depend on the timing and type of prior events. We therefore use a set of time- and label-specific basis state functions. In particular, we use binary basis state functions $f_{l',d_1,d_2,\\tau}$ indexed by a label $l' \\in L$, two time offsets $0 \\le d_1 < d_2$, and a threshold $\\tau > 0$. Such an $f$ encodes whether or not the event sequence $x$ contains at least $\\tau$ events with label $l'$ with timestamps in the window $[t - d_2, t - d_1)$. Examples of decision trees that use such basis state functions are shown in Figure 1.\n\n5 Forecasting\n\nIn this section, we describe how to use PCIMs to forecast whether a sequence of target labels will occur in a given order and in given time intervals. For example, we may wish to know the probability that a computer system will experience a system failure in the next week and again in the following week, or that an internet user will be shown a particular display ad and then visit the advertising merchant\u2019s website in the next month. We call such a sequence and set of associated intervals an episodic sequence and denote it by $e = \\{(l^*_j, [a^*_j, b^*_j))\\}_{j=1}^{k}$. We call $(l^*_j, [a^*_j, b^*_j))$ the $j$th episode. We say that the episodic sequence $e$ occurs in an event sequence $x$ if $\\exists i_1 < \\cdots < i_k$ such that, for every $j$, $(t_{i_j}, l_{i_j}) \\in x$, $l_{i_j} = l^*_j$, and $t_{i_j} \\in [a^*_j, b^*_j)$. The set of event sequences $x$ in which $e$ occurs is denoted $X_e$.\nGiven an event sequence $h$ and a time $t^* \\ge t(h)$, we term any event sequence $x$ whose history up to $t^*$ agrees with $h$ (i.e., $h(t^*, x) = h$) an extension of $h$ from $t^*$. Our forecasting problem is, given an observed sequence $h$ at time $t^* \\ge t(h)$, to compute the probability that $e$ occurs in extensions of $h$ from $t^*$. This probability is $p(X \\in X_e \\mid h(t^*, X) = h)$ and will be denoted using the shorthand $p(X_e \\mid h, t^*)$. Computing $p(X_e \\mid h, t^*)$ is hard in general because the probability of episodes of interest can depend on arbitrary numbers of intervening events. We therefore give Monte Carlo estimates for $p(X_e \\mid h, t^*)$, first describing a forward sampling procedure for forecasting episodic sequences (also applicable to other forecasting problems), and then introducing an importance sampling scheme specifically designed for forecasting episodic sequences.
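
The definition of an episodic sequence occurring in an event sequence can be made concrete with a short check. The following is only a sketch under stated assumptions: the function name `occurs` and the tuple layouts are hypothetical, events are assumed sorted by time, and a greedy scan is used that matches each episode to the earliest admissible event after the previous match (which suffices because later episodes only require a larger index).

```python
def occurs(episodes, x):
    # episodes: list of (label, a, b) triples with half-open interval [a, b).
    # x: list of (timestamp, label) pairs sorted by increasing timestamp.
    # Returns True if indices i_1 < ... < i_k exist such that event i_j has
    # the j-th target label and a timestamp inside the j-th interval.
    start = 0
    for label, a, b in episodes:
        match = None
        for i in range(start, len(x)):
            ti, li = x[i]
            if li == label and a <= ti < b:
                match = i
                break
        if match is None:
            return False
        start = match + 1
    return True

x = [(0.2, 'C'), (0.4, 'B'), (0.7, 'A')]
hit = occurs([('C', 0.0, 1.0), ('B', 0.0, 1.0)], x)
miss = occurs([('B', 0.0, 1.0), ('C', 0.0, 1.0)], x)
```

The second query fails because no 'C' event follows the matched 'B' event, even though both labels appear inside the interval.
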
\n\n5.1 Forward Sampling\n\nThe probability of an episodic sequence can be estimated using a forward sampling approach by sampling $M$ extensions $\\{x^{(m)}\\}_{m=1}^{M}$ of $h$ from $t^*$ and using the estimate $\\hat{p}_{\\mathrm{Fwd}}(X_e \\mid h, t^*; M) = \\frac{1}{M} \\sum_{m=1}^{M} \\mathbf{1}_{X_e}(x^{(m)})$. By Hoeffding\u2019s inequality, $P(|\\hat{p}_{\\mathrm{Fwd}}(X_e \\mid h, t^*; M) - p(X_e \\mid h, t^*)| > \\epsilon) \\le 2 e^{-2 \\epsilon^2 M}$. Thus, the error in $\\hat{p}_{\\mathrm{Fwd}}(X_e \\mid h, t^*; M)$ falls as $O(1/\\sqrt{M})$. It is important to note that $\\mathbf{1}_{X_e}(x)$ only depends on $x$ up to $b^*_k$, and thus we need only sample finite extensions $x$ such that $t(x) < b^*_k$ from $p(x \\mid h(t^*, x) = h, t_{|x|+1} \\ge b^*_k)$.\nThe forward sampling algorithm for Poisson Networks [15] can be easily adapted for PCIMs. Here we outline how to forward sample an extension $x$ of $h$ from $t^*$ to $b^*_k$ given a general CIM. Forward sampling consists of iteratively obtaining a sample sequence $x_i$ of length $i$ by sampling $(t_i, l_i)$ and appending it to a previously sampled sequence $x_{i-1}$ of length $i - 1$. The CIM likelihood (equation (1)) of an arbitrary event sequence $x$ can be written as $\\prod_{i=1}^{n} p(t_i, l_i \\mid h_i; \\theta)$. Thus, we begin with $x_{|h|} = h$, and iteratively sample $(t_i, l_i)$ from $p(t_i, l_i \\mid h_i = x_{i-1}; \\theta)$ and append it to $x_{i-1}$ to obtain $x_i$. Note that one needs to use rejection sampling during the first iteration to ensure $t_{|h|+1} > t^*$. The finite extension up to $b^*_k$ is obtained by terminating when $t_i > b^*_k$ and rejecting $t_i$. To sample $(t_i, l_i)$ we note that $p(t_i, l_i \\mid h_i; \\theta) = \\lambda_{l_i}(t_i \\mid h_i; \\theta) e^{-\\Lambda_{l_i}(t_i \\mid h_i; \\theta)} \\prod_{l \\neq l_i} e^{-\\Lambda_l(t_i \\mid h_i; \\theta)}$ has a competing risks form [1, 11], so that we can sample $|L|$ candidate times $t^l_i$ independently from the non-homogeneous exponential densities $\\lambda_l(t^l_i \\mid h_i; \\theta) e^{-\\Lambda_l(t^l_i \\mid h_i; \\theta)}$ and then let $t_i$ be the smallest of these candidate times and $l_i$ be the corresponding $l$. A more detailed description of sampling $t^l_i$ from piecewise-constant conditional intensities is given in [15]. Finally, we note that the basic sampling procedure can be made more efficient using the techniques described in [15] and [7].\n\n5.2 Importance Sampling\n\nWhen using a forward sampling approach to forecast unlikely episodic sequences, the episodes of interest will not occur in most of the sampled extensions and our estimate of $p(X_e \\mid h, t^*)$ will be noisy. In fact, because the absolute error in $\\hat{p}_{\\mathrm{Fwd}}$ falls as the inverse square root of the number of sequences sampled, we would need $O(1/p(X_e \\mid h, t^*)^2)$ sample sequences to get non-trivial lower bounds on $p(X_e \\mid h, t^*)$ using a forward sampling approach. To mitigate this problem we develop an importance sampling approach, which draws sequences from a proposal distribution $q(\\cdot)$ that has an increased likelihood of generating extensions in which $e$ occurs, and then uses a weighted empirical estimate. 
In particular, we will sample extensions $x^{(m)}$ of $h$ from $t^*$ from $q(x \\mid h(t^*, x) = h, t_{|x|+1} \\ge b^*_k)$ instead of $p(x \\mid h(t^*, x) = h, t_{|x|+1} \\ge b^*_k)$, and will estimate $p(X_e \\mid h, t^*)$ through\n\n$$\\hat{p}_{\\mathrm{Imp}}(X_e \\mid h, t^*; M) = \\frac{1}{\\sum_{m=1}^{M} w(x^{(m)})} \\sum_{m=1}^{M} w(x^{(m)}) \\mathbf{1}_{X_e}(x^{(m)}), \\qquad w(x) = \\frac{p(x \\mid h(t^*, x) = h, t_{|x|+1} \\ge b^*_k)}{q(x \\mid h(t^*, x) = h, t_{|x|+1} \\ge b^*_k)}.$$\n\nThe Poisson Superposition Importance Sampler (PSIS) is an importance sampler whose proposal distribution $q$ is based on Poisson superposition. This proposal distribution is defined to be a CIM whose conditional intensity functions are given by $\\lambda_l(t \\mid x; \\theta) + \\lambda^*_l(t \\mid x)$, where $\\lambda_l(t \\mid x; \\theta)$ is the conditional intensity function of $l$ under the model and $\\lambda^*_l(t \\mid x)$ is given by\n\n$$\\lambda^*_l(t \\mid x) = \\begin{cases} \\frac{1}{b^*_{j(x)} - a_{j(x)}(x)} & \\text{for } l = l^*_{j(x)}, t \\in [a_{j(x)}(x), b^*_{j(x)}), \\text{ and } j(x) \\neq 0, \\\\ 0 & \\text{otherwise,} \\end{cases}$$\n\nwhere the active episode $j(x)$ is $0$ if $t(x) \\ge b_j(x)$ for all $j = 1, \\cdots, k$, and is $\\min(\\{j : b_j(x) > t(x)\\})$ otherwise. The time $b_j(x)$ when the $j$th episode ceases to be active is the time at which the $j$th episode occurs in $x$, or $b^*_j$ if it does not occur. If the episodic intervals $[a^*_j, b^*_j)$ do not overlap, $a_j(x) = a^*_j$. In general $a_j(x)$ and $b_j(x)$ are given by the recursion\n\n$$a_j(x) = \\max(\\{a^*_j, b_{j-1}(x)\\})$$\n$$b_j(x) = \\min(\\{b^*_j\\} \\cup \\{t_i : (t_i, l_i) \\in x, l_i = l^*_j, t_i \\in [a_j(x), b^*_j)\\}).$$\n\nThis choice of $q$ makes it likely that the $j$th episode will occur after the $(j-1)$th episode.\nAs the proposal distribution is also a CIM, importance sampling can be done using the forward sampling procedure above. 
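
The weighted empirical estimate used in this section is a self-normalized importance sampling estimate, which can be sketched in a few lines. The function name `p_hat_imp` and the numeric weights below are hypothetical; in practice the weights would come from the ratio of model and proposal likelihoods for each sampled extension.

```python
def p_hat_imp(weights, indicators):
    # Self-normalized importance sampling estimate:
    # (sum of weights of samples in which e occurs) / (sum of all weights).
    total = sum(weights)
    if total == 0:
        raise ValueError('all importance weights are zero')
    hit = sum(w for w, ind in zip(weights, indicators) if ind)
    return hit / total

# Assumed values only: four sampled extensions, two of which contain e.
estimate = p_hat_imp([0.5, 2.0, 1.0, 0.5], [True, False, True, False])

# With equal weights the estimate reduces to the fraction of hits,
# which is the forward sampling estimate.
uniform = p_hat_imp([1.0, 1.0, 1.0, 1.0], [True, False, False, False])
```

Normalizing by the sum of the weights rather than by the number of samples keeps the estimate in the unit interval even when the weights are only known up to a constant factor.
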
If the model is a PCIM, the proposal distribution is also a PCIM, since the $\\lambda^*_l(t \\mid x)$ are piecewise constant in $t$. In practice the computation of $j(x)$, $a_j(x)$, and $b_j(x)$ can be done during forward sampling.\nThe importance weight corresponding to our proposal distribution is\n\n$$w(x) = \\prod_{j=1}^{k} \\left[ \\exp\\left( \\frac{b_j(x) - a_j(x)}{b^*_j - a_j(x)} \\right) \\prod_{(t_i, l_i) \\in x : t_i = b_j(x), l_i = l^*_j} \\frac{\\lambda_{l^*_j}(t_i \\mid x_i)}{\\lambda_{l^*_j}(t_i \\mid x_i) + \\frac{1}{b^*_j - a_j(x)}} \\right].$$\n\nIn many problems, the importance weight $w(x)$ of a sequence $x$ of length $n$ is a product of $n$ small terms. When $n$ is large, this can cause the importance weights to become degenerate, and this problem is often solved using particle filtering [7]. Note that the second product in $w(x)$ above has at most one term for each $j$, so that $w(x)$ has $k$ terms corresponding to the $k$ episodes, which is independent of $n$. Thus, we do not experience the problem of degenerate weights when $k$ is small, regardless of the number of events sampled.\n\n6 Experimental Results\n\nWe first validate that PCIMs can learn temporal dependencies and that the PSIS gives faster forecasting than forward sampling using a synthetic data set. We then show that PCIMs are more than an order of magnitude faster to train than Poisson Networks, and better model unseen test data using real supercomputer log data. Finally we show that PCIMs and the PSIS allow forecasting the future interests of web search users using real log data from a major commercial search engine.\n\n6.1 Validation on Synthetic Data\n\nIn order to evaluate the ability of PCIMs to learn nonlinear temporal dependencies we sampled data from a known model and verified that the dependencies learned were correct. Data was sampled from a PCIM with $L = \\{A, B, C\\}$. 
The known model is shown in Figure 1.\n\n(a) Event type A\n\n(b) Event type B\n\n(c) Event type C\n\nFigure 1: Decision trees representing $S$ and $\\theta$ for events of type A, B and C.\n\nWe sampled 100 time units of data, observing 97 instances of A, 58 instances of B, and 71 instances of C. We then learned a PCIM from the sampled data. We used basis state functions that tested for the presence of each label in windows with boundaries at $t - 0, 1, 2, \\cdots, 10$, and $+\\infty$ time units. We used a common prior with a mean rate of 0.1 and an equivalent sample size of one time unit for all $\\lambda_{ls}$, and the structural prior described above with $\\kappa_{ls} = 0.1$ for all $s$.\nThe learned PCIM perfectly recovered the correct model structure. We repeated the experiment by sampling data from a model with fifteen labels, consisting of five independent copies of the model above. That is, $L = \\{A_1, B_1, C_1, \\cdots, A_5, B_5, C_5\\}$ with each triple $A_i, B_i, C_i$ independent of other labels, and dependent on each other as specified by Figure 1. Once again, the model structure was recovered perfectly.\nWe evaluated the PSIS in forecasting event sequences with the model shown in Figure 1. The convergence of importance sampling is compared with that of forward sampling in Figure 2. We give results for forecasting three different episodic sequences, consisting of the label sequences {C}, {C, B}, and {C, B, A}, all in the interval [0, 1], given an empty history. The three queries are given in order of decreasing probability, so that inference becomes harder. We show how estimates of the probabilities of given episodic sequences vary as a function of the number of sequences sampled, giving the mean and variance of the trajectories of the estimates computed over ten runs. For all three queries, importance sampling converges faster and has lower variance. 
Since exact inference is infeasible for this model, we forward sample 4,000,000 event sequences and display this estimate. Note that despite the large sample size the Hoeffding bound gives a 95% confidence interval of $\\pm 0.0006$ for this estimate, which is large relative to the probabilities estimated. This further suggests the need for importance sampling for rare label sequences.\n\n[Figure 1 node labels; tree layout not recoverable. (a) tests: A in [t-1,t), A in [t-2,t-1); leaf rates: 10.0, 0.0, 0.1. (b) tests: B in [t-1,t), B in [t-2,t-1), A in [t-5,t); leaf rates: 10.0, 0.0, 0.002, 0.2. (c) tests: C in [t-1,t), C in [t-2,t-1), B in [t-5,t); leaf rates: 10.0, 0.0, 0.002, 0.2.]\n\n(a) Label C in [0, 1]\n\n(b) Labels C, B in [0, 1]\n\n(c) Labels C, B, A in [0, 1]\n\nFigure 2: Trajectories of $\\hat{p}_{\\mathrm{Imp}}$ and $\\hat{p}_{\\mathrm{Fwd}}$ vs. the number of sequences sampled for three different queries. The dashed and dotted lines show the empirical mean and standard deviation over ten runs of $\\hat{p}_{\\mathrm{Imp}}$ and $\\hat{p}_{\\mathrm{Fwd}}$. The solid line shows $\\hat{p}_{\\mathrm{Fwd}}$ based on 4 million event sequences.\n\n6.2 Modeling Supercomputer Event Logs\n\nWe compared PCIM and Poisson Nets on the task of modeling system event logs from the BlueGene/L supercomputer at Lawrence Livermore National Laboratory [14], available at the USENIX Computer Failure Data Repository. We filtered out informational (non-alert) messages from the logs, and randomly split the events by node into a training set with 311,060 alerts from 21,962 nodes, and a test set with 68,502 alerts from 9,412 nodes. We learned dependencies between the 38 alert types in the data. We treat the events from each node as separate sequences, and use a product of the per-sequence likelihoods given in equation (1).\nFor both models, we used window boundaries at $t - 1/60, 1, 60, 3600$, and $\\infty$ seconds. The PCIM used count threshold basis state functions with thresholds of 1, 4, 16 and 64 while the Poisson Net used log count feature vectors as described above. 
Both models used priors with a mean rate of an event every 100 days, no dependencies, and an equivalent sample size of one second. Both used a structural prior with $\\kappa_{ls} = 0.1$. Table 1 shows the test set likelihood and the run time for the two approaches. PCIM achieves better test set likelihood and is more than an order of magnitude faster.\n\nModel          Test Log Likelihood    Training Time\nPCIM           -85.3                  11 min\nPoisson Net    -88.8                  3 hr 33 min\n\nTable 1: A comparison of the PCIM and Poisson Net in modeling supercomputer event logs. The test set log likelihood reported has been divided by the number of test nodes (9,412). The training time for the PCIM and Poisson Net are also shown.\n\n6.3 Forecasting Future Interests of Web Search Users\n\nWe used the query logs of a major internet search engine to investigate the use of PCIMs in forecasting the future interests of web search users. All queries are mapped to one of 36 different interest categories using an automatic classifier. Thus, $L$ contains 36 labels, such as \u201cTravel\u201d or \u201cHealth & Wellness.\u201d Our training set contains event sequences for approximately 23k users consisting of about 385k timestamped labels recorded over a two month period. The test set contains event sequences for approximately 11k users of about 160k timestamped labels recorded over the next month.\nWe trained a PCIM on the training data using window boundaries at $t - 1$ hour, $t - 1$ day, and $t - 1$ week, and basis state functions that tested for the presence of one or more instances of each label in each window, treating users as i.i.d. The prior had a mean rate of an event every year and an equivalent sample size of one day. The structural prior had $\\kappa_{ls} = 0.1$. The model took 1 day and 18 hours to train on a 3 GHz workstation. 
We did not compare to a Poisson network on this data since, as shown\nabove, Poisson networks take an order of magnitude longer to learn.\n\n7\n\n\fFigure 3: Precision-recall curves for forecasting future Health & Wellness queries using a full PCIM,\na restricted PCIM that conditions only on past Health & Wellness queries, a baseline that takes into\naccount only past Health & Wellness queries and not their timing, and random guessing.\n\nGiven the \ufb01rst week of each test user\u2019s event sequence, we forecasted whether they would issue a\nquery in a chosen target category in the second week. We used the PSIS with 100 sample sequences\nfor forecasting. Figure 3 shows the precision recall curve for one target category label. Also shown\nis the result for restricted PCIMs that only model dependencies on prior occurrences of the target\ncategory. This is compared to a baseline where the conditional intensity depends only on whether the\ntarget label appeared in the history. This shows that modeling the temporal aspect of dependencies\ndoes provide a large improvement. Modeling dependencies on past occurrences of other labels also\nprovides an improvement in the right-hand region of the precision-recall curve.\nTo better understand the performance of PCIMs we also examined the problem of predicting the \ufb01rst\noccurrence of the target label. As Figure 3 suggests (but doesn\u2019t show), the PCIM can model cross-\nlabel dependencies to forecast the \ufb01rst occurrence of the target label. Forecasting new interests is\nvaluable in a variety of applications including advertising and the fact that PCIMs are able to forecast\n\ufb01rst occurrences is promising. Results similar to Figure 3 were obtained for other target labels.\n\n7 Discussion\n\nWe presented the Piecewise-Constant Conditional Intensity Model, which is a model of temporal\ndependencies in continuous time event streams. 
We gave a conjugate prior and a greedy tree build-\ning procedure that allow for ef\ufb01cient learning of these models. Dependencies on the history are\nrepresented through automatically learned combinations of a given set of basis state functions. One\nof the key bene\ufb01ts of PCIMs is that they allow domain knowledge to be encoded in these basis\nstate functions. This domain knowledge is incorporated into the model during structure search in\nsituations where it is supported by the data. The fact that we use decision trees allows us to easily\ninterpret the learned dependencies.\nIn this paper, we focused on basis state functions indexed by a \ufb01xed set of time windows and labels.\nExploring alternative types of basis state functions is an area for future research. For example, basis\nstate functions could encode the most recent events that have occurred in the history rather than the\nevents that occurred in windows of interest. The capacity of the resulting model class depends on\nthe set of basis state functions chosen. Understanding how to choose the basis state functions and\nhow to adapt our learning procedure to control the resulting capacity is another open topic. We also\npresented the Poisson Superposition Importance Sampler for forecasting episodic sequences with\nPCIMs. Developing forecasting algorithms for more general queries is of interest.\nFinally, we demonstrated the value of PCIMs in modeling the temporal behavior of web search users\nand of supercomputer nodes. In many applications, we have access to richer event streams such as\nspatio-temporal event streams and event streams with structured labels. It would be interesting to\nextend PCIMs to handle such rich event streams.\n\n8\n\n\fReferences\n[1] Simeon M. Berman. Note on extreme values, competing risks and semi-Markov processes.\n\nAnn. Math. Stat., 34(3):1104\u20131106, 1963.\n\n[2] W. Buntine. Theory re\ufb01nement on Bayesian networks. 
In UAI, 1991.\n[3] David Maxwell Chickering, David Heckerman, and Christopher Meek. A Bayesian approach to learning Bayesian networks with local structure. In UAI, 1997.\n[4] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Elementary Theory and Methods, volume I. Springer, 2nd edition, 2003.\n[5] Thomas Dean and Keiji Kanazawa. Probabilistic temporal reasoning. In AAAI, 1988.\n[6] Vanessa Didelez. Graphical models for marked point processes based on local independence. J. Roy. Stat. Soc., Ser. B, 70(1):245\u2013264, 2008.\n[7] Yu Fan and Christian R. Shelton. Sampling for approximate inference in continuous time Bayesian networks. In AI & M, 2008.\n[8] N. Friedman, I. Nachman, and D. Pe\u2019er. Using Bayesian networks to analyze expression data. J. Comp. Bio., 7:601\u2013620, 2000.\n[9] Nir Friedman, Kevin Murphy, and Stuart Russell. Learning the structure of dynamic probabilistic networks. In UAI, 1998.\n[10] David Heckerman, David Maxwell Chickering, Christopher Meek, Robert Rounthwaite, and Carl Kadie. Dependency networks for inference, collaborative filtering, and data visualization. JMLR, 1:49\u201375, October 2000.\n[11] A. A. J. Marley and Hans Colonius. The \u201chorse race\u201d random utility model for choice probabilities and reaction times, and its competing risks interpretation. J. Math. Psych., 36:1\u201320, 1992.\n[12] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Continuous time Bayesian networks. In UAI, 2002.\n[13] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Expectation Maximization and complex duration distributions for continuous time Bayesian networks. In UAI, 2005.\n[14] Adam Oliner and Jon Stearley. What supercomputers say - an analysis of five system logs. In IEEE/IFIP Conf. Dep. Sys. Net., 2007.\n[15] Shyamsundar Rajaram, Thore Graepel, and Ralf Herbrich. Poisson-networks: A model for structured point processes. 
In AIStats, 2005.\n[16] Aleksandr Simma, Moises Goldszmidt, John MacCormick, Paul Barham, Richard Black, Rebecca Isaacs, and Richard Mortier. CT-NOR: Representing and reasoning about events in continuous time. In UAI, 2008.\n[17] Aleksandr Simma and Michael I. Jordan. Modeling events with cascades of Poisson processes. In UAI, 2010.\n[18] Wilson Truccolo, Uri T. Eden, Matthew R. Fellows, John P. Donoghue, and Emery N. Brown. A point process framework relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. J. Neurophysiol., 93:1074\u20131089, 2005.\n", "award": [], "sourceid": 1106, "authors": [{"given_name": "Asela", "family_name": "Gunawardana", "institution": null}, {"given_name": "Christopher", "family_name": "Meek", "institution": null}, {"given_name": "Puyang", "family_name": "Xu", "institution": null}]}