{"title": "Predicting User Activity Level In Point Processes With Mass Transport Equation", "book": "Advances in Neural Information Processing Systems", "page_first": 1645, "page_last": 1655, "abstract": "Point processes are powerful tools to model user activities and have a plethora of applications in social sciences. Predicting user activities based on point processes is a central problem. However, existing works are mostly problem specific, use heuristics, or simplify the stochastic nature of point processes. In this paper, we propose a framework that provides an unbiased estimator of the probability mass function of point processes. In particular, we design a key reformulation of the prediction problem, and further derive a differential-difference equation to compute a conditional probability mass function. Our framework is applicable to general point processes and prediction tasks, and achieves superb predictive and efficiency performance in diverse real-world applications compared to the state of the art.", "full_text": "Predicting User Activity Level In Point Processes

With Mass Transport Equation

Yichen Wang⇧, Xiaojing Ye⇤, Hongyuan Zha⇧, Le Song⇧†

⇧College of Computing, Georgia Institute of Technology
⇤School of Mathematics, Georgia State University
†Ant Financial

{yichen.wang}@gatech.edu, xye@gsu.edu, {zha,lsong}@cc.gatech.edu

Abstract

Point processes are powerful tools to model user activities and have a plethora of applications in social sciences. Predicting user activities based on point processes is a central problem. However, existing works are mostly problem specific, use heuristics, or simplify the stochastic nature of point processes. In this paper, we propose a framework that provides an efficient estimator of the probability mass function of point processes. 
In particular, we design a key reformulation of the prediction problem, and further derive a differential-difference equation to compute a conditional probability mass function. Our framework is applicable to general point processes and prediction tasks, and achieves superb predictive and efficiency performance in diverse real-world applications compared to the state of the art.

1 Introduction

Online social platforms, such as Facebook and Twitter, enable users to post opinions, share information, and influence peers. Recently, user-generated event data archived at fine-grained temporal resolutions are becoming increasingly available, which calls for expressive models and algorithms to understand, predict, and distill knowledge from the complex dynamics of these data. In particular, temporal point processes are well-suited to model the event patterns of user behaviors and have been successfully applied to modeling event sequence data [6, 10, 12, 21, 23, 24, 25, 26, 27, 28, 33].

A fundamental task in social networks is to predict user activity levels based on learned point process models. Mathematically, the goal is to compute E[f(N(t))], where N(·) is a given point process learned from user behaviors, t is a fixed future time, and f is an application-dependent function. A framework for doing this is critically important. 
For example, for social networking services, an accurate inference of the number of reshares of a post enables the network moderator to detect trending posts and improve its content delivery networks [13, 32]; an accurate estimate of the change of network topology (the number of new followers of a user) helps the moderator identify influential users and suppress the spread of terrorist propaganda and cyber-attacks [12]; an accurate inference of the activity level (the number of posts in the network) allows us to gain fundamental insight into the predictability of collective behaviors [22]. Moreover, for online merchants such as Amazon, an accurate estimate of the number of future purchases of a product helps optimize future advertisement placements [10, 25].

Despite the prevalence of prediction problems, accurate prediction is very challenging for two reasons. First, the function f is arbitrary. For instance, to evaluate the homogeneity of user activities, we set f(x) = x log(x) to compute the Shannon entropy; to measure the distance between a predicted activity level and a target x*, we set f(x) = (x − x*)². However, most works [8, 9, 13, 30, 31, 32] are problem specific and only designed for the simple task with f(x) = x; hence these works are

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Figure 1 panels: (a) Samples of a Hawkes process: two simulated event sequences on [0, t]; (b) the corresponding conditional intensity functions; (c) mass transport of the conditional probability mass from time 0 to t; (d) the averaged, unbiased estimator.]

Figure 1: An illustration of HYBRID using Hawkes process (Eq. 1). 
Our method first generates two samples {H^i_t} of events; then it constructs intensity functions; with these inputs, it computes conditional probability mass functions π̃^i(x, s) := P[N(s) = x | H^i_s] using a mass transport equation. Panel (c) shows the transport of conditional mass at four different times (the initial probability mass π̃(x, 0) is an indicator function I[x = 0], as there is no event with probability one). Finally, the average of the conditional mass functions yields our estimator of the probability mass.

not generalizable. Second, point process models typically have intertwined stochasticity and can co-evolve over time [12, 25]; e.g., in the influence propagation problem, the information diffusion over networks can change the structure of the networks, which adversely influences the diffusion process [12]. However, previous works often ignore parts of the stochasticity in the intensity function [29] or make heuristic approximations [13, 32]. Hence, there is an urgent need for a method that is applicable to an arbitrary function f and keeps all the stochasticity in the process; such a method is largely nonexistent to date.

We propose HYBRID, a generic framework that provides an efficient estimator of the probability mass of point processes. Figure 1 illustrates our framework. We also make the following contributions:
• Unifying framework. Our framework is applicable to general point processes and does not depend on a specific parameterization of the intensity functions. It incorporates all the stochasticity in point processes and is applicable to prediction tasks with an arbitrary function f.
• Technical challenges. We reformulate the prediction problem and design a random variable with reduced variance. To derive an analytical form of this random variable, we also propose a mass transport equation to compute the conditional probability mass of point processes. 
We further transform this equation into an Ordinary Differential Equation and provide a scalable algorithm.
• Superior performance. Our framework significantly reduces the sample size needed to estimate the probability mass function of point processes in real-world applications. For example, to infer the number of tweeting and retweeting events of users in the co-evolution model of information diffusion and social link creation [12], our method needs 10^3 samples and 14.4 minutes, while Monte Carlo needs 10^6 samples and 27.8 hours to achieve the same relative error of 0.1.

2 Background and preliminaries

Point processes. A temporal point process [1] is a random process whose realization consists of a set of discrete events {tk}, localized in time. It has been successfully applied to model user behaviors in social networks [16, 17, 19, 23, 24, 25, 28, 30]. It can be equivalently represented as a counting process N(t), which records the number of events on [0, t]. The counting process is a right-continuous step function, i.e., if an event happens at t, then N(t) − N(t−) = 1.

Let Ht = {tk | tk < t} be the history of events that happened up to time t. An important way to characterize point processes is via the conditional intensity function λ(t) := λ(t|Ht), a stochastic model for the time of the next event given the history. Formally, λ(t)dt is the conditional probability of observing an event in [t, t + dt) given the events on [0, t), i.e., P{event in [t, t + dt) | Ht} = E[dN(t) | Ht] := λ(t)dt, where dN(t) ∈ {0, 1}.

The intensity function is designed to capture the phenomena of interest. 
Some useful forms include (i) the Poisson process, whose intensity is a deterministic function, and (ii) the Hawkes process [15], which captures the mutual excitation between events and whose intensity is parameterized as

λ(t) = η + α Σ_{tk ∈ Ht} κ(t − tk),   (1)

where η > 0 is the baseline intensity; the triggering kernel κ(t) = exp(−t) models the decay of past events' influence over time; and α > 0 quantifies the strength of influence from each past event. Here, the occurrence of each historical event increases the intensity by a certain amount determined by κ(t) and α, making λ(t) history-dependent and a stochastic process by itself.

Monte Carlo (MC). To compute the probability mass of a point process, MC simulates n realizations of the history {H^i_t} using the thinning algorithm [20]. The number of events in sample i is defined as N^i(t) = |H^i_t|. Let π(x, t) := P[N(t) = x], where x ∈ N, be the probability mass. Then its estimator π̂^mc_n(x, t) and the estimator μ̂^mc_n(t) of μ(t) := E[f(N(t))] are defined as π̂^mc_n(x, t) = (1/n) Σ_i I[N^i(t) = x] and μ̂^mc_n(t) = (1/n) Σ_i f(N^i(t)). The root mean square error (RMSE) is defined as

ε(μ̂^mc_n(t)) = sqrt(E[μ̂^mc_n(t) − μ(t)]²) = sqrt(VAR[f(N(t))]/n).   (2)

3 Solution overview

Given an arbitrary point process N(t) that is learned from data, existing prediction methods for computing E[f(N(t))] have three major limitations:
• Generalizability. Most methods [8, 9, 13, 30, 31, 32] only predict E[N(t)] and are not generalizable to an arbitrary function f. Moreover, they typically rely on specific parameterizations of the intensity functions, such as the reinforced Poisson process [13] and the Hawkes process [5, 32]; hence they are not applicable to general point processes.
• Approximation and heuristics. 
These works also ignore parts of the stochasticity in the intensity functions [29] or make heuristic approximations to the point process [13, 32]. Hence the accuracy is limited by the approximations and heuristic corrections.
• Large sample size. The MC method overcomes the above limitations since it has an unbiased estimator of the probability mass. However, the high stochasticity in point processes leads to a large value of VAR[f(N(t))], which requires a large number of samples to achieve a small error.

To address these challenges, we propose a generic framework with a novel estimator of the probability mass, which needs a smaller sample size than MC. Our framework has the following key steps.

I. New random variable. We design a random variable g(Ht), a conditional expectation given the history. Its variance is guaranteed to be smaller than that of f(N(t)). For a fixed number of samples, the error of MC is determined by the variance of the random variable of interest, as shown in (2). Hence, to achieve the same error, applying MC to estimate the new objective E_Ht[g(Ht)] requires a smaller number of samples than the procedure that directly estimates E[f(N(t))].

II. Mass transport equation. To compute g(Ht), we derive a differential-difference equation that describes the evolutionary dynamics of the conditional probability mass P[N(t) = x | Ht]. We further formulate this equation as an Ordinary Differential Equation and provide a scalable algorithm.

4 Hybrid inference machine with probability mass transport

In this section, we present technical details of our framework. 
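The variance-reduction idea behind step I can be seen in a few lines of code. The sketch below uses a toy doubly stochastic Poisson model (an illustrative assumption, not the paper's model), where the conditioning information is a random intensity level L and the conditional expectation g = E[N(t) | L] is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Cox (doubly stochastic Poisson) model, used only for illustration:
# the intensity level L is random, and given L, N(t) ~ Poisson(L * t).
t, n = 2.0, 100_000
L = rng.uniform(0.5, 2.0, size=n)   # random intensity level, one per sample
N = rng.poisson(L * t)              # realized event counts; take f(x) = x

f_N = N.astype(float)               # crude variable: f(N(t))
g = L * t                           # conditional expectation E[N(t) | L]

# Both estimate E[N(t)] = E[L] * t = 2.5, but the conditional expectation
# has a strictly smaller variance (the argument behind Theorem 1).
print(f_N.mean(), g.mean())         # nearly identical means
print(f_N.var(), g.var())           # g's variance is much smaller
```

The two means agree while g's variance is far smaller, which is exactly why estimating E[g] by sampling needs fewer samples than estimating E[f(N(t))] directly.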
We first design a new random variable for prediction; then we propose a mass transport equation to compute this random variable analytically. Finally, we combine the mass transport equation with the sampling scheme to compute the probability mass function of general point processes and solve prediction tasks with an arbitrary function f.

4.1 New random variable with reduced variance

We reformulate the problem and design a new random variable g(Ht), which has a smaller variance than f(N(t)) and the same expectation. To do this, we express E[f(N(t))] as an iterated expectation

E[f(N(t))] = E_Ht[ E_{N(t)|Ht}[f(N(t)) | Ht] ] = E_Ht[g(Ht)],   (3)

where E_Ht is w.r.t. the randomness of the history and E_{N(t)|Ht} is w.r.t. the randomness of the point process given the history. We define the random variable as the conditional expectation given the history: g(Ht) = E_{N(t)|Ht}[f(N(t)) | Ht]. Theorem 1 shows that it has a smaller variance.

Theorem 1. For time t > 0 and an arbitrary function f, we have VAR[g(Ht)] < VAR[f(N(t))].

Theorem 1 extends the Rao-Blackwell (RB) theorem [3] to point processes. RB says that if θ̂ is an estimator of a parameter θ and T is a sufficient statistic for θ, then VAR[E[θ̂ | T]] ≤ VAR[θ̂], i.e., the sufficient statistic reduces the uncertainty of θ̂. However, RB is not applicable to point processes, since it studies a different problem (improving the estimator of a distribution's parameter), while we focus on the prediction problem for general point processes, which introduces two new technical challenges:

(i) Is there a notion in point processes whose role is similar to the sufficient statistic in RB? Our first contribution shows that the history Ht contains all the necessary information in a point process and reduces the uncertainty of N(t). 
Hence, g(Ht) is an improved variable for prediction. Moreover, in contrast to the RB theorem, the inequality in Theorem 1 is strict, because the counting process N(t) is right-continuous in time t and not predictable [4] (a predictable process is measurable w.r.t. Ht, such as a process that is left-continuous). Appendix C contains details of the proof.

(ii) Is g(Ht) computable for general point processes and an arbitrary function f? An efficient computation will enable us to estimate E_Ht[g(Ht)] using the sampling method. Specifically, let μ̂_n(t) = (1/n) Σ_i g(H^i_t) be the estimator computed from n samples; then, from the definition of the RMSE in (2), this estimator has a smaller error than MC: ε(μ̂_n(t)) < ε(μ̂^mc_n(t)). However, the challenge in our new formulation is that it seems very hard to compute this conditional expectation, as one typically needs another round of sampling, which is undesirable as it will increase the variance of the estimator. To address this challenge, we next propose a mass transport equation.

4.2 Transport equation for conditional probability mass function

We present a novel mass transport equation that computes the conditional probability mass π̃(x, t) := P[N(t) = x | Ht] of general point processes. With this definition, we derive an analytical expression for the conditional expectation: g(Ht) = Σ_x f(x) π̃(x, t). The transport equation is as follows.

Theorem 2 (Mass Transport Equation for Point Processes). 
Let λ(t) := λ(t|Ht) be the conditional intensity function of the point process N(t) and π̃(x, t) := P[N(t) = x | Ht] be its conditional probability mass function; then π̃(x, t) satisfies the following differential-difference equation:

π̃_t(x, t) := ∂π̃(x, t)/∂t = −λ(t) π̃(x, t),                        if x = 0,
π̃_t(x, t) := ∂π̃(x, t)/∂t = −λ(t) π̃(x, t) + λ(t) π̃(x − 1, t),    if x = 1, 2, 3, · · ·   (4)

Here π̃_t is the rate of change in the conditional mass; −λ(t) π̃(x, t) is the loss in mass, at rate λ(t); and λ(t) π̃(x − 1, t) is the gain in mass, at rate λ(t).

Proof sketch. For simplicity of notation, we set the right-hand side of (4) to be F[π̃], where F is a functional operator on π̃. We also define the inner product between functions u : N → R and v : N → R as (u, v) := Σ_x u(x)v(x). The main idea of our proof is to show that the equality (v, π̃_t) = (v, F[π̃]) holds for any test function v; then π̃_t = F[π̃] follows from the fundamental lemma of the calculus of variations [14]. Specifically, the proof contains two parts. We first prove (v, π̃_t) = (B[v], π̃), where B[v] is a functional operator defined as B[v] = (v(x + 1) − v(x))λ(t). This equality can be proved by the properties of point processes and the definition of the conditional mass. Second, we show (B[v], π̃) = (v, F[π̃]) using a variable substitution technique. Mathematically, this equality means B and F are adjoint operators on the function space. Combining these two equalities yields the mass transport equation. Appendix A contains details of the proof.

Mass transport dynamics. This differential-difference equation describes the time evolution of the conditional mass. Specifically, the differential term π̃_t, i.e., the instantaneous rate of change in the probability mass, is equal to a first-order difference equation on the right-hand side. 
This difference equation is the summation of two terms: (i) the negative loss of its own probability mass π̃(x, t) at rate λ(t), and (ii) the positive gain of probability mass π̃(x − 1, t) from the last state x − 1 at rate λ(t). Moreover, since initially no event happens with probability one, we have π̃(x, 0) = I[x = 0]. Solving this transport equation on [0, t] essentially transports the initial mass to the mass at time t.

Algorithm 1: CONDITIONAL MASS FUNCTION
Input: Ht = {tk}, k = 1, · · · , K, step size τ; set tK+1 = t
Output: Conditional probability mass function π̃(t)
for k = 0, · · · , K do
  Construct λ(s) and Q(s) on [tk, tk+1];
  π̃(tk+1) = ODE45[π̃(tk), Q(s), τ] (RK algorithm);
end
Set π̃(t) = π̃(tK+1)

Algorithm 2: HYBRID MASS TRANSPORT
Input: Sample size n, time t, step size τ
Output: μ̂_n(t), π̂_n(x, t)
Generate n samples of the point process: {H^i_t}, i = 1, · · · , n;
for i = 1, · · · , n do
  π̃^i(x, t) = COND-MASS-FUNC(H^i_t, τ);
end
π̂_n(x, t) = (1/n) Σ_i π̃^i(x, t); μ̂_n(t) = Σ_x f(x) π̂_n(x, t)

4.3 Mass transport as a banded linear Ordinary Differential Equation (ODE)

To efficiently solve the mass transport equation, we reformulate it as a banded linear ODE. Specifically, we set the upper bound for x to be M, and set π̃(t) to be the vector that collects the value of π̃(x, t) for each integer x: π̃(t) = (π̃(0, t), π̃(1, t), · · · , π̃(M, t))ᵀ. With this representation of the conditional mass, the mass transport equation in (4) can be expressed as a simple banded linear ODE:

π̃(t)′ = Q(t) π̃(t),   (5)

where π̃(t)′ = (π̃_t(0, t), · · · , π̃_t(M, t))ᵀ, and the matrix Q(t) is a sparse bi-diagonal matrix with Q_{i,i} = −λ(t) and Q_{i+1,i} = λ(t). 
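As an illustration of how the banded structure is used, the following sketch solves the ODE in (5) piecewise between event times with a fixed-step classical 4-stage Runge-Kutta scheme (a simplification of the adaptive ODE45 solver discussed in Section 4.4; `lam` is any user-supplied conditional intensity evaluated along a fixed sampled history):

```python
import numpy as np

def transport_mass(history, t, lam, M, n_steps=200):
    """Sketch of Algorithm 1: solve the banded ODE pi~'(s) = Q(s) pi~(s)
    (Eq. 5) piecewise between event times, where the intensity is
    continuous, using the classical 4-stage Runge-Kutta method."""
    def rhs(s, p):
        l = lam(s)
        dp = -l * p                # loss at rate lambda(s)
        dp[1:] += l * p[:-1]       # gain from state x - 1 at rate lambda(s)
        return dp

    p = np.zeros(M + 1)
    p[0] = 1.0                     # initial mass pi~(x, 0) = I[x = 0]
    knots = [0.0] + sorted(s for s in history if s < t) + [t]
    for a, b in zip(knots[:-1], knots[1:]):
        h = (b - a) / n_steps
        s = a
        for _ in range(n_steps):   # RK4 steps on [a, b]
            k1 = rhs(s, p)
            k2 = rhs(s + h / 2, p + h / 2 * k1)
            k3 = rhs(s + h / 2, p + h / 2 * k2)
            k4 = rhs(s + h, p + h * k3)
            p = p + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            s += h
    return p                       # p[x] approximates P[N(t) = x | H_t]
```

For a Hawkes sample, `lam` would be s ↦ η + α Σ_{tk < s} exp(−(s − tk)) with the sampled events plugged in; note that the right-hand side applies the bi-diagonal Q(s) in O(M) operations without ever forming the matrix.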
The following equation visualizes the ODE in (5) when M = 2:

( π̃_t(0, t) )   ( −λ(t)   0      0     ) ( π̃(0, t) )
( π̃_t(1, t) ) = (  λ(t)  −λ(t)   0     ) ( π̃(1, t) )   (6)
( π̃_t(2, t) )   (  0      λ(t)  −λ(t)  ) ( π̃(2, t) )

This dynamic ODE is a compact representation of the transport equation in (4), and M decides the dimension of the ODE in (5). In theory, M can be unbounded. However, the conditional probability mass tends to zero as M becomes large. Hence, in practice we choose a finite support {0, 1, · · · , M} for the conditional probability mass function. To choose a proper M, we generate samples from the point process. Suppose the largest number of events in the samples is L; we set M = 2L so that it is reasonably large. Next, with the initial probability mass π̃(t0) = (1, 0, · · · , 0)ᵀ, we present an efficient algorithm to solve the ODE.

4.4 Scalable algorithm for solving the ODE

Figure 2: Illustration of Algorithm 1 using a Hawkes process. The intensity is updated after each event tk. Within [tk, tk+1], we use π̃(tk) and the intensity λ(s) to solve the ODE and obtain π̃(tk+1).

We present the algorithm that transports the initial mass π̃(t0) to π̃(t) by solving the ODE. Since the intensity function is history-dependent and has a discrete jump when an event happens at time tk, the matrix Q(t) in the ODE is discontinuous at tk. Hence we split [0, t] into intervals [tk, tk+1]. On each interval, the intensity is continuous and we can use the classic numerical Runge-Kutta (RK) method [7] to solve the ODE. Figure 2 illustrates the overall algorithm.

Our algorithm works as follows. First, with the initial intensity on [0, t1] and π̃(t0) as input, the RK method solves the ODE on [0, t1] and outputs π̃(t1). 
Since an event happens at t1, the intensity is updated on [t1, t2]. Next, with the updated intensity and π̃(t1) as the initial value, the RK method solves the ODE on [t1, t2] and outputs π̃(t2). This procedure repeats for each [tk, tk+1] until time t.

Now we present the RK method that solves the ODE on each interval [tk, tk+1]. RK divides this interval into equally-spaced subintervals [τi, τi+1], for i = 0, · · · , I and τ = τi+1 − τi. It then conducts linear extrapolation on each subinterval. It starts from τ0 = tk and uses π̃(τ0) and the approximation of the gradient π̃(τ0)′ to compute π̃(τ1). Next, π̃(τ1) is taken as the initial value and the process is repeated until τI = tk+1. Appendix D contains details of this method.

The RK method approximates the gradient π̃(t)′ with different levels of accuracy, called stages s. When s = 1, it is the Euler method, which uses the first-order approximation π̃(t)′ ≈ (π̃(τi+1) − π̃(τi))/τ. We use the ODE45 solver in MATLAB and choose the stage s = 4 for RK. Moreover, the main computation in the RK method comes from the matrix-vector product. Since the matrix Q(t) is sparse and bi-diagonal with O(M) non-zero elements, the cost of this operation is only O(M).

4.5 Hybrid inference machine with mass transport equation

With the conditional probability mass, we are now ready to express g(Ht) in closed form and estimate E_Ht[g(Ht)] using the MC sampling method. 
We present our framework HYBRID:
(i) Generate n samples {H^i_t} from a point process N(t) with a stochastic intensity λ(t).
(ii) For each sample H^i_t, compute the value of the intensity function λ(s|H^i_s) for each s ∈ [0, t]; then solve (5) to compute the conditional probability mass π̃^i(x, t).
(iii) Obtain the estimators of the probability mass function π(x, t) and of μ(t) by taking the average: π̂_n(x, t) = (1/n) Σ_{i=1}^n π̃^i(x, t), μ̂_n(t) = Σ_x f(x) π̂_n(x, t).

Algorithm 2 summarizes the above procedure. Next, we discuss two properties of HYBRID.

First, our framework efficiently uses all the event information in each sample. In fact, each event tk influences the transport rate of the conditional probability mass (Figure 2). This feature is in sharp contrast to MC, which only uses the information of the total number of events and neglects the differences in event times. For instance, the two samples in Figure 1(a) both have three events and MC treats them equally; hence its estimator is an indicator function π̂^mc_n(x, t) = I[x = 3]. However, for HYBRID, these samples have different event information and conditional probability mass functions, and our estimator in Figure 1(d) is much more informative than an indicator function.

Moreover, our estimator for the probability mass is unbiased if we can solve the mass transport equation in (4) exactly. To prove this property, we show that the following equality holds for an arbitrary function f: (f, π) = E[f(N(t))] = E_Ht[g(Ht)] = (f, E_Ht[π̃]). Then E_Ht[π̂_n] = π follows from the fundamental lemma of the calculus of variations [14]. Appendix B contains detailed derivations. In practice, we choose a reasonable finite support for the conditional probability mass in order to solve the mass transport ODE in (5). 
Hence our estimator is nearly unbiased.

5 Applications and extensions to multi-dimensional point processes

In this section, we present two real-world applications, where the point process models have intertwined stochasticity and co-evolving intensity functions.

Predicting the activeness and popularity of users in social networks. The co-evolution model [12] uses a Hawkes process Nus(t) to model information diffusion (tweets/retweets), and a survival process Aus(t) to model the dynamics of network topology (the link creation process). The intensity of Nus(t) depends on the network topology Aus(t), and the intensity of Aus(t) also depends on Nus(t); hence these processes co-evolve over time. We focus on two tasks in this model: (i) inferring the activeness of a user by E[Σ_u Nus(t)], which is the number of tweets and retweets from user s; and (ii) inferring the popularity of a user by E[Σ_u Aus(t)], which is the number of new links created to the user.

Predicting the popularity of items in recommender systems. Recent works on recommender systems [10, 25] use a point process Nui(t) to model user u's sequential interactions with item i. The intensity function λui(t) denotes the user's interest in the item. As users interact with items over time, the user latent feature u_u(t) and item latent feature i_i(t) co-evolve over time, and are mutually dependent [25]. The intensity is parameterized as λui(t) = ηui + u_u(t)ᵀ i_i(t), where ηui is a baseline term representing the long-term preference, and the tendency for u to interact with i depends on the compatibility of their instantaneous latent features u_u(t)ᵀ i_i(t). With this model, we can infer an item's popularity by evaluating E[Σ_u Nui(t)], which is the number of events happened to item i.

To solve these prediction tasks, we extend the transport equation to the multivariate case. 
Specifically, we create a new stochastic process x(t) = Σ_u Nus(t) and compute its conditional mass function.

Theorem 3 (Mass Transport for Multidimensional Point Processes). Let Nus(t) be the point process with intensity λus(t), x(t) = Σ_{u=1}^U Nus(t), and π̃(x, t) = P[x(t) = x | Ht] be the conditional probability mass of x(t); then π̃ satisfies: π̃_t = −Σ_u λus(t) π̃(x, t) + Σ_u λus(t) π̃(x − 1, t).

To compute the conditional probability mass, we also solve the ODE in (5), where the diagonal and off-diagonal entries of Q(t) are now the negative and positive sums of the intensities over all dimensions.

[Figure 3 panels (MAPE curves for HYBRID, MC-1e6, MC-1e3, SEISMIC, RPP, FPE): (a) MAPE vs. test time and (b) MAPE vs. train size for user activeness; (c) MAPE vs. test time and (d) MAPE vs. train size for user popularity.]

Figure 3: Prediction results for user activeness and user popularity. (a,b) user activeness: predicting the number of posts per user; (c,d) user popularity: predicting the number of new links per user. Test times are the relative times after the end of train time. 
The train data is fixed with 70% of the total data.

[Figure 4 panels (MAPE curves for HYBRID, MC-1e6, MC-1e3, SEISMIC, RPP, FPE): (a) MAPE vs. test time and (b) MAPE vs. train size on IPTV; (c) MAPE vs. test time and (d) MAPE vs. train size on Reddit.]

Figure 4: Prediction results for item popularity. (a,b) predicting the number of watching events per program on IPTV; (c,d) predicting the number of discussions per group on Reddit.

6 Experiments

In this section, we evaluate the predictive performance of HYBRID on the two real-world applications of Section 5 and on a synthetic dataset. We use the following metrics:
(i) Mean Average Percentage Error (MAPE). Given a prediction time t, we compute the MAPE |μ̂_n(t) − μ(t)|/μ(t) between the estimated value and the ground truth.
(ii) Rank correlation. For all users/items, we obtain two lists of ranks according to the true and estimated values of user activeness/user popularity/item popularity. The accuracy is evaluated by the Kendall-τ rank correlation [18] between the two lists.

6.1 Experiments on real world data

We show that HYBRID improves both accuracy and efficiency in predicting the activeness and popularity of users in social networks and in predicting the popularity of items in recommender systems.

Competitors. We use 10^3 samples for HYBRID and compare it with the following state of the art.
• SEISMIC [32]. It defines a self-exciting process with a post infectiousness factor. It uses the branching property of the Hawkes process and heuristic corrections for prediction.
• RPP [13]. 
It adds a reinforcement coefficient to the Poisson process that depicts the self-excitation phenomena. It sets dN(t) = λ(t)dt and solves a deterministic equation for prediction.
• FPE [29]. It uses a deterministic function to approximate the stochastic intensity function.
• MC-1E3. It is the MC sampling method with 10^3 samples (the same as those for HYBRID), and MC-1E6 uses 10^6 samples.

6.1.1 Predicting the activeness and popularity of users in social networks

We use a Twitter dataset [2] that contains 280,000 users with 550,000 tweet, retweet, and link creation events during Sep. 21-30, 2012. This data was previously used to validate the network co-evolution model [12]. The parameters for the tweeting/retweeting processes and the link creation process are learned using maximum likelihood estimation [12]. SEISMIC and RPP are not designed for the popularity prediction task since they do not consider the evolution of network topology. We use a proportion p of the total data as the training data to learn the parameters of all methods, and the rest as test data. We make predictions for each user and report the averaged results.

[Figure 5 panels (computation time in seconds vs. MAPE): (a) user activeness and (b) item popularity on IPTV, comparing HYBRID and MC; (c) user activeness and (d) item popularity on IPTV, for HYBRID alone.]

Figure 5: Scalability analysis: computation time as a function of error. 
(a,b) comparison between HYBRID and MC in different problems; (c,d) scalability plots for HYBRID. Panels: (a,c) user activeness; (b,d) item popularity, IPTV.

Figure 6: Rank correlation results in different problems: (a) user activeness; (b) user popularity; (c) item popularity, IPTV; (d) item popularity, Reddit. We vary the proportion p of training data from 0.6 to 0.8, and the error bars represent the variance over different training sets.

Predictive performance. Figure 3(a) shows that MAPE increases with the test time, since the model's stochasticity increases; HYBRID has the smallest error. Figure 3(b) shows that MAPE decreases as the training data grows, since the learned model parameters become more accurate. Moreover, HYBRID with only 60% of the data for training is more accurate than SEISMIC and FPE with 80%. Thus, we can make accurate predictions by observing users only in the early stage. This feature is important for network moderators to identify malicious users and suppress the propagation of undesired content. Moreover, the consistent performance improvement conveys two messages: (i) considering all the randomness is important. HYBRID is 2× more accurate than SEISMIC and FPE because HYBRID naturally accounts for all the stochasticity, whereas SEISMIC, FPE, and RPP rely on heuristics or approximations that discard part of it; (ii) sampling efficiently is important. Accounting for all the stochasticity requires a sampling scheme, and HYBRID needs a much smaller sample size. Specifically, HYBRID uses the same 10^3 samples but achieves a 4× error reduction compared with MC-1E3. MC-1E6 matches the predictive performance of HYBRID, but needs 1,000× more samples.
Scalability. How does the reduction in sample size improve the speed? Figure 5(a) shows that as the error decreases from 0.5 to 0.1, the computation cost of MC grows much faster, since it needs far more samples than HYBRID to achieve the same error.
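As a rough, standalone illustration of this sample-size gap (a toy homogeneous Poisson process with made-up parameters, not the paper's HYBRID estimator or its learned Hawkes models): a plain MC estimate of E[N(t)] averages the counts of simulated sample paths, so its standard error decays only as O(1/√n), which is why MC needs orders of magnitude more samples to reach a small error.

```python
import random

def poisson_count(lam, t, rng):
    """Simulate a homogeneous Poisson process with rate `lam` on [0, t]
    via exponential inter-arrival times; return the event count N(t)."""
    count, now = 0, 0.0
    while True:
        now += rng.expovariate(lam)
        if now > t:
            return count
        count += 1

def mc_estimate(lam, t, n, seed=0):
    """Plain Monte Carlo estimate of E[N(t)] from n simulated paths."""
    rng = random.Random(seed)
    return sum(poisson_count(lam, t, rng) for _ in range(n)) / n

# For a Poisson process, E[N(t)] = lam * t exactly, which lets us watch
# how slowly the MC error shrinks as the sample size grows.
lam, t = 1.2, 10.0          # toy parameters, not from the paper
truth = lam * t
for n in (10, 1_000, 100_000):
    est = mc_estimate(lam, t, n)
    print(n, abs(est - truth) / truth)  # relative error, roughly O(1/sqrt(n))
```

HYBRID sidesteps this scaling by averaging smooth conditional mass functions rather than raw sampled counts (cf. the Rao-Blackwell idea [3]), which shrinks the per-sample variance and hence the required sample size.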
We include the plots for HYBRID alone in (c). In particular, to achieve an error of 0.1, MC needs 10^6 samples and 27.8 hours, while HYBRID needs only 14.4 minutes with 10^3 samples. We use a machine with 16 cores, a 2.4 GHz Intel Core i5 CPU, and 64 GB of memory.
Rank correlation. We rank all users according to the predicted levels of activeness and popularity separately. Figure 6(a,b) shows that HYBRID performs the best, with an accuracy of around 80%, and it consistently identifies around 30% more items correctly than FPE on both tasks.

6.1.2 Predicting the popularity of items in recommender systems

In the recommender system setting, we use two datasets from [25]. The IPTV dataset contains 7,100 users' watching history of 436 TV programs over 11 months, with around 2M events. The Reddit dataset contains online discussions of 1,000 users in 1,403 groups, with 10,000 discussion events. The predictive and scalability performance is consistent with the social network application. Figure 4 shows that HYBRID is 15% more accurate than FPE and 20% more accurate than SEISMIC. Figure 5 also shows that HYBRID needs far less computation time than MC-1E6: to achieve an error of 0.1, HYBRID takes 9.8 minutes while MC-1E6 takes 7.5 hours.
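The Kendall-τ rank correlation [18] behind these comparisons can be computed directly from its pairwise definition. A minimal sketch, assuming no tied values (the τ-a variant; `kendall_tau` and the example lists are illustrative, not from the paper's code):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a rank correlation between two equal-length lists,
    assuming no ties: (concordant - discordant) / (n choose 2)."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

# Identical rankings give +1, fully reversed rankings give -1.
true_rank = [1, 2, 3, 4, 5]
print(kendall_tau(true_rank, true_rank))        # 1.0
print(kendall_tau(true_rank, true_rank[::-1]))  # -1.0
```

In the experiments, the two lists are the true and predicted rankings of users/items; for data with ties, a tie-corrected variant such as `scipy.stats.kendalltau` (τ-b) is the usual choice.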
Figure 6(c,d) shows that HYBRID achieves a rank correlation accuracy of 77%, a 20% improvement over FPE.

Figure 7: Error of E[f(N(t))] as a function of sample size (log-log scale), for different choices of f: (a) f(x) = x; (b) f(x) = x log(x); (c) f(x) = x^2; (d) f(x) = exp(x).

Figure 8: Comparison of the estimators of the probability mass function in HYBRID and MC. (a,b) estimators with the same 1,000 samples; (c,d) estimators based on a single sample in HYBRID.

6.2 Experiments on synthetic data

We compare HYBRID with MC in two respects: (i) the significance of the reduction in error and sample size, and (ii) the estimators of the probability mass function. We study a Hawkes process and set the parameters of its intensity function to η = 1.2 and α = 0.5. We fix the prediction time to t = 30. The ground truth is computed with 10^8 samples from MC simulations.
Error vs. number of samples. In four tasks with different f, Figure 7 shows that, given the same number of samples, HYBRID has a smaller error. Moreover, to achieve the same error, HYBRID needs 100× fewer samples than MC. In particular, to achieve an error of 0.01, (a) shows that HYBRID needs 10^3 samples while MC needs 10^5, and (b) shows that HYBRID needs 10^4 while MC needs 10^6.
Probability mass functions. We compare our estimator of the probability mass with that of MC. Figure 8(a,b) shows that our estimator is much smoother, because it is the average of conditional probability mass functions, each computed by solving the mass transport equation. Moreover, our estimator centers around 85, which is the ground truth of E[N(t)], while that of MC centers around 80; hence HYBRID is more accurate. We also plot two conditional mass functions in (c,d). Averaging 1,000 such conditional mass functions yields (a); this averaging procedure in HYBRID thus adjusts the shape of the estimated probability mass. In contrast, given one sample, the MC estimator is just an indicator function and cannot capture the shape of the probability mass.

7 Conclusions

We have proposed HYBRID, a generic framework with a new formulation of the prediction problem in point processes and a novel mass transport equation. The equation efficiently uses the event information to update the transport rate and compute the conditional mass function.
Moreover, HYBRID is applicable to general point processes and prediction tasks with an arbitrary function f. Hence it can take any point process model as input, and the predictive performance of our framework can improve further with advances in point process models. Experiments on real-world and synthetic data demonstrate that HYBRID outperforms the state of the art in both accuracy and efficiency. There are many interesting directions for future research. For example, HYBRID can be generalized to marked point processes [4], where a mark is observed along with the timing of each event.

Acknowledgements. This project was supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, DMS-1620342, CMMI-1745382, IIS-1639792, IIS-1717916, NVIDIA, Intel ISTC and Amazon AWS.

References
[1] O. Aalen, O. Borgan, and H. Gjessing. Survival and event history analysis: a process point of view. Springer, 2008.
[2] D. Antoniades and C. Dovrolis. Co-evolutionary dynamics in social networks: A case study of Twitter. arXiv preprint arXiv:1309.6001, 2013.
[3] D. Blackwell. Conditional expectation and unbiased sequential estimation. The Annals of Mathematical Statistics, pages 105–110, 1947.
[4] P. Brémaud. Point processes and queues. 1981.
[5] J. Da Fonseca and R. Zaatour. Hawkes process: Fast calibration, application to trade clustering, and diffusive limit. Journal of Futures Markets, 34(6):548–579, 2014.
[6] H. Dai, Y. Wang, R. Trivedi, and L. Song. Deep coevolutionary network: Embedding user and item features for recommendation. arXiv preprint arXiv:1609.03675, 2016.
[7] J. R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19–26, 1980.
[8] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. In NIPS, 2013.
[9] N. Du, L. Song, A. J. Smola, and M. Yuan. Learning networks of heterogeneous influence. In NIPS, 2012.
[10] N. Du, Y. Wang, N. He, and L. Song. Time sensitive recommendation from recurrent user activities. In NIPS, pages 3492–3500, 2015.
[11] R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.
[12] M. Farajtabar, Y. Wang, M. Gomez-Rodriguez, S. Li, H. Zha, and L. Song. Coevolve: A joint point process model for information diffusion and network co-evolution. In NIPS, pages 1954–1962, 2015.
[13] S. Gao, J. Ma, and Z. Chen. Modeling and predicting retweeting dynamics on microblogging platforms. In WSDM, 2015.
[14] I. M. Gelfand, R. A. Silverman, et al. Calculus of variations. Courier Corporation, 2000.
[15] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
[16] N. He, Z. Harchaoui, Y. Wang, and L. Song. Fast and simple optimization for Poisson likelihood models. arXiv preprint arXiv:1608.01264, 2016.
[17] X. He, T. Rekatsinas, J. Foulds, L. Getoor, and Y. Liu. HawkesTopic: A joint model for network inference and topic modeling from text-based cascades. In ICML, pages 871–880, 2015.
[18] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[19] W. Lian, R. Henao, V. Rao, J. E. Lucas, and L. Carin. A multitask point process predictive model. In ICML, pages 2030–2038, 2015.
[20] Y. Ogata. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.
[21] J. Pan, V. Rao, P. Agarwal, and A. Gelfand. Markov-modulated marked Poisson processes for check-in data. In ICML, pages 2244–2253, 2016.
[22] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani. Epidemic processes in complex networks. Reviews of Modern Physics, 87(3):925, 2015.
[23] X. Tan, S. A. Naqvi, A. Y. Qi, K. A. Heller, and V. Rao. Content-based modeling of reciprocal relationships using Hawkes and Gaussian processes. In UAI, pages 726–734, 2016.
[24] R. Trivedi, H. Dai, Y. Wang, and L. Song. Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs. In ICML, 2017.
[25] Y. Wang, N. Du, R. Trivedi, and L. Song. Coevolutionary latent feature processes for continuous-time user-item interactions. In NIPS, pages 4547–4555, 2016.
[26] Y. Wang, E. Theodorou, A. Verma, and L. Song. A stochastic differential equation framework for guiding online user activities in closed loop. arXiv preprint arXiv:1603.09021, 2016.
[27] Y. Wang, G. Williams, E. Theodorou, and L. Song. Variational policy for guiding point processes. In ICML, 2017.
[28] Y. Wang, B. Xie, N. Du, and L. Song. Isotonic Hawkes processes. In ICML, pages 2226–2234, 2016.
[29] Y. Wang, X. Ye, H. Zha, and L. Song. Predicting user activity level in point processes with mass transport equation. In NIPS, 2017.
[30] S.-H. Yang and H. Zha. Mixture of mutually exciting processes for viral diffusion. In ICML, pages 1–9, 2013.
[31] L. Yu, P. Cui, F. Wang, C. Song, and S. Yang. From micro to macro: Uncovering and predicting information cascading process with behavioral dynamics. In ICDM, 2015.
[32] Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec. SEISMIC: A self-exciting point process model for predicting tweet popularity. In KDD, 2015.
[33] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, volume 31, pages 641–649, 2013.